Problem Addressed:
Fashion landmark detection is challenging due to large variation in clothing appearances and non-rigid deformations. Existing deep learning approaches struggle to effectively capture global context and focus on relevant clothing areas.
Proposed Solution:
The authors propose a new module called the Spatial-Aware Non-Local (SANL) block:
- Built on the non-local neural network framework, which captures global dependencies.
- Enhances spatial awareness by incorporating attention maps generated with Grad-CAM from a pretrained classifier.
- Helps the model focus on key garment regions (e.g., sleeves, hems) to better detect landmarks.
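To make the idea concrete, here is a minimal PyTorch sketch of what such a block could look like: a standard embedded-Gaussian non-local block whose keys and values are reweighted by an external spatial attention map. The exact way the paper injects the attention map is not spelled out in this summary, so the `x * attn` modulation below is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SANLBlock(nn.Module):
    """Illustrative Spatial-Aware Non-Local block (a sketch, not the paper's code).

    An embedded-Gaussian non-local block in which the key/value branches are
    modulated by an external spatial attention map (e.g. from Grad-CAM).
    """
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # query
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # key
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # value
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x, attn):
        # x: (B, C, H, W); attn: (B, 1, H, W), spatial attention in [0, 1]
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)        # (B, HW, C')
        # Emphasize attended regions before computing affinities (assumed design)
        k = self.phi(x * attn).flatten(2)                   # (B, C', HW)
        v = self.g(x * attn).flatten(2).transpose(1, 2)     # (B, HW, C')
        affinity = torch.softmax(q @ k, dim=-1)             # (B, HW, HW)
        y = (affinity @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                              # residual connection
```

Because the block is residual and shape-preserving, it can be dropped into a backbone at any stride, which is how the paper places it at multiple FPN levels.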
Model Architecture:
- Backbone: Feature Pyramid Network (FPN) with ResNet-101.
- SANL blocks: inserted at four feature levels (strides 4, 8, 16, 32).
- Attention maps: generated with Grad-CAM on a ResNet-18 classifier pretrained on DeepFashion categories.
- Coarse-to-fine design:
  - CoarseNet: learns from broader (larger-σ) heatmaps.
  - FineNet: refines predictions using narrower (smaller-σ) heatmaps.
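The coarse-to-fine supervision above is typically implemented by rendering each ground-truth landmark as a 2D Gaussian whose σ sets the target's spread. The snippet below sketches this with NumPy; the specific σ values (8 for coarse, 2 for fine) are placeholders, not the paper's settings.

```python
import numpy as np

def landmark_heatmap(h, w, cx, cy, sigma):
    """Gaussian target heatmap peaked at landmark (cx, cy); sigma sets spread."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# CoarseNet target: broad Gaussian; FineNet target: narrow Gaussian.
# (sigma values are illustrative assumptions)
coarse = landmark_heatmap(64, 64, 20, 30, sigma=8.0)
fine = landmark_heatmap(64, 64, 20, 30, sigma=2.0)
```

A broad target gives gradient signal over a wide neighborhood early on, while the narrow target forces precise localization in the refinement stage.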
Experiments:
- Datasets: DeepFashion-C and FLD (Fashion Landmark Dataset).
- Metric: Normalized Error (NE) – lower is better.
- The proposed method achieves state-of-the-art performance:
  - NE 0.0299 on DeepFashion-C
  - NE 0.0396 on FLD
- Outperforms previous methods such as FashionNet, DFA, DLAN, and AFGN.
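For readers unfamiliar with the metric, NE is usually computed as the mean Euclidean distance between predicted and ground-truth landmarks, normalized by image size and averaged over visible landmarks. The sketch below assumes coordinates already normalized to [0, 1]; check the benchmark papers for the exact normalization convention.

```python
import numpy as np

def normalized_error(pred, gt, visible):
    """Mean L2 distance over visible landmarks.

    pred, gt: (N, 2) landmark coordinates, assumed normalized to [0, 1]
    by the image size. visible: (N,) binary visibility mask.
    """
    d = np.linalg.norm(pred - gt, axis=-1)        # per-landmark distance
    return (d * visible).sum() / max(visible.sum(), 1)

pred = np.array([[0.50, 0.50], [0.10, 0.20]])
gt = np.array([[0.53, 0.54], [0.10, 0.20]])
vis = np.array([1, 1])
# first landmark is off by sqrt(0.03^2 + 0.04^2) = 0.05, second is exact,
# so NE = (0.05 + 0.0) / 2 = 0.025
```

Under this convention, the reported 0.0299 on DeepFashion-C means predictions are off by roughly 3% of the image size on average.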
Ablation Studies:
- SANL blocks improve performance over both the base model and plain non-local blocks.
- Using spatial attention from the corresponding feature stride yields better accuracy than using only high-level attention maps.
- Training only on visible landmarks, combined with SANL and the coarse-to-fine design, gives the best results.
Generalization:
- SANL blocks also improve performance on fine-grained classification tasks: CUB-200 (birds) and FoodNet (Chinese dishes).
- Gains in Top-1 and Top-3 accuracy over the baseline and non-local versions.
✅ Conclusion:
- SANL blocks effectively blend spatial attention and global context.
- They improve landmark detection and generalize well to other spatially sensitive tasks.
- No extra annotation is needed; only Grad-CAM-based attention maps are used.