Both streams are frozen for the first 5 epochs (to retain generic facial priors) and then fine‑tuned jointly. For each level ℓ ∈ {1, 2, 3}, we compute an attention map A⁽ℓ⁾ that modulates the contribution of the two streams:
\[ \mathbf{A}^{(\ell)} = \sigma\big( \mathrm{Conv}_{1\times 1}\big([\, \mathbf{F}_G^{(\ell)};\, \mathbf{F}_D^{(\ell)} \,]\big) \big), \]
where σ denotes the sigmoid activation and [·;·] denotes channel‑wise concatenation. The fused feature is the attention‑weighted combination of the two streams:

\[ \mathbf{F}^{(\ell)} = \mathbf{A}^{(\ell)} \odot \mathbf{F}_G^{(\ell)} + \big(1 - \mathbf{A}^{(\ell)}\big) \odot \mathbf{F}_D^{(\ell)}. \]
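The fusion step above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the channel count, the module name `AttentionGatedFusion`, and the convex‑combination gating rule are assumptions consistent with the sigmoid attention map defined above.

```python
import torch
import torch.nn as nn

class AttentionGatedFusion(nn.Module):
    """Sketch of one per-level attention-gated fusion step.

    Assumption: both streams produce feature maps with the same
    number of channels; the actual AGSC block may differ in detail.
    """

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv over the channel-wise concatenation of the two
        # streams; a sigmoid then yields the attention map A^(l).
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_g: torch.Tensor, f_d: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.conv(torch.cat([f_g, f_d], dim=1)))
        # Convex combination: A weights one stream, (1 - A) the other.
        return a * f_g + (1.0 - a) * f_d

# Example: fuse 64-channel feature maps at one pyramid level.
fuse = AttentionGatedFusion(channels=64)
f_g = torch.randn(2, 64, 32, 32)  # features from one stream
f_d = torch.randn(2, 64, 32, 32)  # features from the other stream
out = fuse(f_g, f_d)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because the sigmoid keeps A⁽ℓ⁾ in (0, 1), the fused output is a per‑pixel, per‑channel interpolation between the two streams, so neither stream is ever discarded entirely.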
| # | Contribution | Impact |
|---|--------------|--------|
| 1 | Dual‑stream multi‑scale architecture with AGSC | Improves robustness to pose/occlusion (↑ 8.7 % IoU) |
| 2 | Cheek‑specific Dice loss + Perceptual Aesthetic loss | Aligns predictions with human perception (↑ 12.4 % correlation) |
| 3 | CheekWILD‑2 dataset (45 k images, 23 k masks, 22 k scores) | Provides the largest public resource for cheek‑centric research |
| 4 | Open‑source implementation (PyTorch, GPL‑3) | Facilitates reproducibility and downstream applications |