Problem Addressed:
Fashion landmark detection is challenging due to large variation in clothing appearances and non-rigid deformations. Existing deep learning approaches struggle to effectively capture global context and focus on relevant clothing areas.
Proposed Solution:
The authors propose a new module called the Spatial-Aware Non-Local (SANL) block:
- Built on the non-local neural network framework, which captures global dependencies.
- Enhances spatial awareness by incorporating attention maps generated with Grad-CAM from a pretrained classifier.
- Helps the model focus on key garment regions (e.g., sleeves, hems) to better detect landmarks.
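To make the idea concrete, here is a minimal PyTorch sketch of what such a block could look like: a standard embedded-Gaussian non-local block whose keys and values are reweighted by an external spatial attention map. The exact way the paper injects the attention map is not spelled out in this summary, so the `x * attn` modulation below is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SANLBlock(nn.Module):
    """Illustrative Spatial-Aware Non-Local block (a sketch, not the paper's code).

    An embedded-Gaussian non-local block in which the key/value branches are
    modulated by an external spatial attention map (e.g. from Grad-CAM).
    """
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)  # query
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)    # key
        self.g = nn.Conv2d(channels, inter, kernel_size=1)      # value
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x, attn):
        # x: (B, C, H, W); attn: (B, 1, H, W), spatial attention in [0, 1]
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)        # (B, HW, C')
        # Emphasize attended regions before computing affinities (assumed design)
        k = self.phi(x * attn).flatten(2)                   # (B, C', HW)
        v = self.g(x * attn).flatten(2).transpose(1, 2)     # (B, HW, C')
        affinity = torch.softmax(q @ k, dim=-1)             # (B, HW, HW)
        y = (affinity @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                              # residual connection
```

Because the block is residual and shape-preserving, it can be dropped into a backbone at any stride, which is how the paper places it at multiple FPN levels.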
Model Architecture:
- Backbone: Feature Pyramid Network (FPN) with ResNet-101.
- SANL blocks: inserted at four feature levels (strides 4, 8, 16, 32).
- Attention maps: generated with Grad-CAM on a ResNet-18 classifier pretrained on DeepFashion categories.
- Coarse-to-fine design:
  - CoarseNet: learns from broader (larger-σ) heatmaps.
  - FineNet: refines predictions using narrower (smaller-σ) heatmaps.
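The coarse-to-fine supervision above is typically implemented by rendering each ground-truth landmark as a 2D Gaussian whose σ sets the target's spread. The snippet below sketches this with NumPy; the specific σ values (8 for coarse, 2 for fine) are placeholders, not the paper's settings.

```python
import numpy as np

def landmark_heatmap(h, w, cx, cy, sigma):
    """Gaussian target heatmap peaked at landmark (cx, cy); sigma sets spread."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

# CoarseNet target: broad Gaussian; FineNet target: narrow Gaussian.
# (sigma values are illustrative assumptions)
coarse = landmark_heatmap(64, 64, 20, 30, sigma=8.0)
fine = landmark_heatmap(64, 64, 20, 30, sigma=2.0)
```

A broad target gives gradient signal over a wide neighborhood early on, while the narrow target forces precise localization in the refinement stage.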
Experiments:
- Datasets: DeepFashion-C and FLD (Fashion Landmark Dataset).
- Metric: Normalized Error (NE) – lower is better.
- The proposed method achieves state-of-the-art performance:
  - NE 0.0299 on DeepFashion-C
  - NE 0.0396 on FLD
- Outperforms previous methods such as FashionNet, DFA, DLAN, and AFGN.
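For readers unfamiliar with the metric, NE is usually computed as the mean Euclidean distance between predicted and ground-truth landmarks, normalized by image size and averaged over visible landmarks. The sketch below assumes coordinates already normalized to [0, 1]; check the benchmark papers for the exact normalization convention.

```python
import numpy as np

def normalized_error(pred, gt, visible):
    """Mean L2 distance over visible landmarks.

    pred, gt: (N, 2) landmark coordinates, assumed normalized to [0, 1]
    by the image size. visible: (N,) binary visibility mask.
    """
    d = np.linalg.norm(pred - gt, axis=-1)        # per-landmark distance
    return (d * visible).sum() / max(visible.sum(), 1)

pred = np.array([[0.50, 0.50], [0.10, 0.20]])
gt = np.array([[0.53, 0.54], [0.10, 0.20]])
vis = np.array([1, 1])
# first landmark is off by sqrt(0.03^2 + 0.04^2) = 0.05, second is exact,
# so NE = (0.05 + 0.0) / 2 = 0.025
```

Under this convention, the reported 0.0299 on DeepFashion-C means predictions are off by roughly 3% of the image size on average.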
Ablation Studies:
- SANL blocks improve performance over both the base model and plain non-local blocks.
- Using spatial attention from the corresponding feature stride yields better accuracy than using only high-level attention maps.
- Training only on visible landmarks, combined with SANL and the coarse-to-fine design, gives the best results.
Generalization:
- SANL blocks also improve performance on fine-grained classification tasks: CUB-200 (birds) and FoodNet (Chinese dishes).
- Gains in Top-1 and Top-3 accuracy over the baseline and non-local versions.
✅ Conclusion:
- SANL blocks effectively blend spatial attention and global context.
- They improve landmark detection and generalize well to other spatially sensitive tasks.
- No extra annotation is needed; only Grad-CAM-based attention maps are used.