Friday, 2 May 2025

The Sleeve and Neck Recognition Paper: Spatial-Aware Non-Local Attention for Fashion Landmark Detection by Li et al.


🔍 Problem Addressed:

Fashion landmark detection is challenging due to large variations in clothing appearance and non-rigid deformations. Existing deep learning approaches struggle to capture global context effectively and to focus on the relevant clothing regions.


🧠 Proposed Solution:

The authors propose a new module called the Spatial-Aware Non-Local (SANL) block:

  • Built on the non-local neural network framework, which captures global dependencies.

  • Enhances spatial awareness by incorporating attention maps generated using Grad-CAM from a pretrained classifier.

  • Helps the model focus on key garment regions (e.g., sleeves, hems) to better detect landmarks.
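
The core idea can be sketched numerically: a standard non-local affinity matrix is biased by a spatial attention map before global aggregation. The NumPy sketch below is a deliberate simplification under stated assumptions; the function name and shapes are illustrative, and the paper's actual block additionally uses learned 1×1-conv embeddings, operates on conv feature maps, and follows the non-local residual design more closely.

```python
import numpy as np

def sanl_attention(x, spatial_attn):
    """Illustrative Spatial-Aware Non-Local (SANL) operation.

    x:            (N, C) feature vectors, one per spatial position.
    spatial_attn: (N,) attention weights (e.g. from Grad-CAM), in [0, 1].

    Sketch only: the paper's block uses learned embeddings and a
    residual connection on conv feature maps.
    """
    # Pairwise affinities between all positions (scaled dot product).
    scores = x @ x.T / np.sqrt(x.shape[1])
    # Spatial awareness: bias affinities toward salient positions.
    scores = scores * spatial_attn[None, :]
    # Row-wise softmax, then global aggregation with a residual add.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ x
```

The multiplication by `spatial_attn` is what distinguishes this from a plain non-local block: positions that Grad-CAM marks as salient (sleeves, hems) contribute more to every aggregated feature.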


๐Ÿ—️ Model Architecture:

  1. Backbone: Feature Pyramid Network (FPN) with ResNet-101.

  2. SANL blocks: Inserted at four feature levels (strides 4, 8, 16, 32).

  3. Attention maps: Generated using Grad-CAM on a ResNet-18 classifier pretrained on DeepFashion categories.

  4. Coarse-to-Fine Design:

    • CoarseNet: Learns from broader (larger-σ) heatmaps.

    • FineNet: Refines predictions using narrower (smaller-σ) heatmaps.
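
The coarse-to-fine supervision can be illustrated with Gaussian target heatmaps. The sizes and σ values below are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def gaussian_heatmap(size, center, sigma):
    """Ground-truth landmark heatmap: a 2-D Gaussian centered on the
    landmark location. A larger sigma gives a broader training target
    (CoarseNet), a smaller sigma a sharper one (FineNet).
    """
    ys, xs = np.mgrid[0:size, 0:size]
    cy, cx = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# Illustrative values: a 64x64 map with the landmark at its center.
coarse = gaussian_heatmap(64, (32, 32), sigma=8.0)  # broad supervision
fine = gaussian_heatmap(64, (32, 32), sigma=2.0)    # sharp refinement
```

Both targets peak at 1.0 on the landmark; the coarse map spreads probability mass over a wider neighborhood, which makes the initial regression easier before FineNet tightens the prediction.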


🧪 Experiments:

  • Datasets: DeepFashion-C and FLD (Fashion Landmark Dataset)

  • Metric: Normalized Error (NE) – lower is better.

  • The proposed method achieves state-of-the-art performance with NE:

    • 0.0299 (DeepFashion-C)

    • 0.0396 (FLD)

  • Outperforms previous methods like FashionNet, DFA, DLAN, and AFGN.
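
The NE metric is, in essence, the mean Euclidean distance between predicted and ground-truth landmarks in normalized image coordinates. A minimal sketch (the optional visibility mask mirrors the visible-landmark evaluation noted in the ablations; exact normalization conventions may differ from the paper's):

```python
import numpy as np

def normalized_error(pred, gt, visible=None):
    """Normalized Error (NE): mean L2 distance between predicted and
    ground-truth landmarks, with coordinates already normalized to
    [0, 1]. `visible` optionally masks out occluded landmarks.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    if visible is not None:
        dist = dist[visible]
    return dist.mean()
```

For example, a single prediction offset by (0.3, 0.4) from its ground truth contributes a distance of 0.5, so reported scores around 0.03 correspond to average misses of roughly 3% of the image side.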


🔬 Ablation Studies:

  • SANL blocks improve performance over base and non-local models.

  • Using spatial attention from corresponding feature strides yields better accuracy than using only high-level maps.

  • Training only on visible landmarks and using SANL and Coarse-to-Fine together gives the best results.


🔄 Generalization:

  • SANL blocks also improved performance on fine-grained classification tasks:

    • CUB-200 (Birds) and FoodNet (Chinese dishes)

    • Gains in Top-1 and Top-3 accuracy over baseline and non-local versions.
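
Top-k accuracy, the metric behind these classification results, can be computed as follows. This is a generic sketch, not the paper's evaluation code:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """A sample counts as correct if its true label is among the k
    highest-scoring classes. scores: (n_samples, n_classes),
    labels: (n_samples,) integer class indices.
    """
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of k largest scores
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))
```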


Conclusion:

  • SANL blocks effectively blend spatial attention and global context.

  • They improve landmark detection and generalize well to other spatially-sensitive tasks.

  • No extra annotation is needed; only Grad-CAM-based attention maps from a pretrained classifier are used.
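
Those Grad-CAM attention maps reduce to a few lines at the array level: each activation channel is weighted by its global-average-pooled gradient, and only positive evidence is kept. This is a generic sketch with assumed names and shapes; the paper derives the activations and gradients from a ResNet-18 classifier rather than raw arrays.

```python
import numpy as np

def grad_cam_map(activations, gradients):
    """Minimal Grad-CAM sketch.

    activations: (K, H, W) conv feature maps A_k.
    gradients:   (K, H, W) gradients of the class score w.r.t. A_k.
    """
    # Channel importance: global-average-pool the gradients.
    alpha = gradients.mean(axis=(1, 2))
    # Weighted sum over channels, ReLU to keep positive evidence only.
    cam = np.maximum((alpha[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()  # normalize to [0, 1]
    return cam
```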


