Wednesday, 30 April 2025

The FashionNet Paper: "DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations" (CVPR 2016) by Ziwei Liu et al.


🔍 Objective

To overcome limitations in clothing recognition tasks due to fragmented, small, or weakly labeled datasets by introducing a large-scale, richly annotated dataset—DeepFashion—and a novel model called FashionNet.


🗂️ DeepFashion Dataset

  • Scale: >800,000 images

  • Annotations:

    • 50 fine-grained clothing categories

    • 1,000 clothing attributes (texture, fabric, shape, part, style)

    • 4–8 clothing landmarks per image (e.g., collar, sleeve ends, hems)

    • 300,000 cross-pose/cross-domain image pairs (e.g., shop vs. street)

  • Sources: Online shops (Forever21, Mogujie) and Google Images

  • Benchmarks Supported:

    1. Attribute Prediction

    2. In-shop Clothes Retrieval

    3. Consumer-to-Shop Clothes Retrieval


🧠 FashionNet Architecture

  • Based on VGG-16, with three branches:

    1. Global Appearance Branch

    2. Local Landmark-Guided Branch

    3. Pose Estimation Branch (predicts landmark locations & visibility)

  • Landmark Pooling Layer: Pools/gates features using predicted landmarks, improving robustness to deformation and occlusion.
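The landmark pooling idea can be sketched in a few lines: extract a small window of conv features around each predicted landmark, max-pool it, and gate the result by the landmark's predicted visibility so occluded parts contribute nothing. This is a minimal numpy sketch, not the paper's implementation; the function name and window size are my own choices.

```python
import numpy as np

def landmark_pool(feature_map, landmarks, visibility, window=3):
    """Gate-and-pool local features around predicted landmarks.

    feature_map: (C, H, W) conv features.
    landmarks:   (K, 2) integer (row, col) landmark positions.
    visibility:  (K,) 1.0 if the landmark is visible, else 0.0.
    Returns a (K * C,) vector of max-pooled local features,
    zeroed out for occluded landmarks (the "gating").
    """
    C, H, W = feature_map.shape
    half = window // 2
    pooled = []
    for (r, c), v in zip(landmarks, visibility):
        r0, r1 = max(0, r - half), min(H, r + half + 1)
        c0, c1 = max(0, c - half), min(W, c + half + 1)
        patch = feature_map[:, r0:r1, c0:c1]        # local window
        pooled.append(v * patch.max(axis=(1, 2)))   # max-pool, gate by visibility
    return np.concatenate(pooled)

# Toy example: 2 channels, 8x8 map, 2 landmarks (second one occluded)
fmap = np.zeros((2, 8, 8))
fmap[0, 2, 2] = 5.0
lms = np.array([[2, 2], [6, 6]])
vis = np.array([1.0, 0.0])
vec = landmark_pool(fmap, lms, vis)
print(vec.shape)  # (4,)
```

Because occluded landmarks are zeroed out, clutter or self-occlusion at those positions cannot corrupt the final feature, which is the robustness argument the paper makes.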


🔁 Training Approach

  • Multi-task loss optimization:

    • Softmax loss for categories and visibility

    • Cross-entropy loss for attribute prediction

    • Regression loss for landmark localization

    • Triplet loss for retrieval learning

  • Iterative Training: Alternates stages — the pose branch is trained first to predict landmarks, then those predictions drive the landmark-pooled features used for attribute and category learning.
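The four losses above can be combined as a weighted sum. The sketch below uses plain numpy stand-ins for each term (function names and loss weights are mine, not the paper's); the point is just to show the shape of each loss.

```python
import numpy as np

def softmax_ce(logits, label):
    """Softmax cross-entropy, used for the category head."""
    z = logits - logits.max()           # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def attribute_bce(scores, targets):
    """Per-attribute binary cross-entropy (1,000 attributes in the paper)."""
    p = 1.0 / (1.0 + np.exp(-scores))
    return float(-(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean())

def landmark_l2(pred, gt, vis):
    """Squared-error regression, counted only over visible landmarks."""
    return float((vis[:, None] * (pred - gt) ** 2).sum() / max(vis.sum(), 1.0))

def triplet_loss(anchor, pos, neg, margin=1.0):
    """Hinge triplet loss pulling same-item pairs together for retrieval."""
    d_ap = np.sum((anchor - pos) ** 2)
    d_an = np.sum((anchor - neg) ** 2)
    return max(0.0, d_ap - d_an + margin)

# Toy forward pass with made-up predictions; equal weights for illustration
cat = softmax_ce(np.array([2.0, 0.5, -1.0]), label=0)
att = attribute_bce(np.array([3.0, -3.0]), np.array([1.0, 0.0]))
lmk = landmark_l2(np.array([[1.0, 1.0]]), np.array([[1.5, 1.0]]), np.array([1.0]))
trp = triplet_loss(np.zeros(4), np.zeros(4), np.ones(4))  # negative is far: hinge inactive
total = cat + att + lmk + trp
print(total)
```

In practice the relative loss weights are tuned per stage of the iterative schedule rather than fixed at 1 as here.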


📊 Key Results

  • FashionNet outperforms prior methods like WTBI and DARN:

    • Category classification: Top-3 accuracy of 82.58%

    • Attribute prediction: Best across all five attribute groups

    • In-shop retrieval: Top-20 accuracy of 76.4% (vs. 67.5% for DARN)

    • Consumer-to-shop retrieval: Top-20 accuracy of 18.8% (70% higher than DARN)

  • Ablation studies show:

    • Pooling over clothing landmarks outperforms pooling over human joints or poselets

    • Using more attributes improves model performance
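The top-k retrieval numbers above are computed by checking, for each query, whether its ground-truth item appears among the k nearest gallery images in embedding space. A minimal sketch of that metric (function name and distance choice are my assumptions):

```python
import numpy as np

def topk_retrieval_accuracy(queries, query_ids, gallery, gallery_ids, k=20):
    """Fraction of queries whose true item id appears among the k nearest
    gallery embeddings under squared Euclidean distance."""
    hits = 0
    for q, qid in zip(queries, query_ids):
        d = np.sum((gallery - q) ** 2, axis=1)   # distance to every gallery image
        topk_ids = gallery_ids[np.argsort(d)[:k]]
        hits += int(qid in topk_ids)
    return hits / len(query_ids)

# Toy 2-D embeddings: each query sits next to its own gallery item
gallery = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
gallery_ids = np.array([0, 1, 2])
queries = np.array([[0.1, 0.0], [0.9, 0.1]])
query_ids = np.array([0, 1])
acc = topk_retrieval_accuracy(queries, query_ids, gallery, gallery_ids, k=1)
print(acc)  # 1.0
```

The paper's top-20 figures (76.4% in-shop, 18.8% consumer-to-shop) are exactly this statistic with k=20; the much lower consumer-to-shop number reflects the domain gap between street photos and shop images.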


🧩 Contributions

  1. DeepFashion Dataset: Largest and most comprehensively annotated fashion dataset to date.

  2. FashionNet: A deep model integrating attribute and landmark learning for robust clothing feature extraction.

  3. Benchmarks and Protocols: Defined for consistent evaluation in classification and retrieval tasks.

