Understanding the CosFace Paper
The paper “CosFace: Large Margin Cosine Loss for Deep Face Recognition” proposes a new loss function called Large Margin Cosine Loss, abbreviated as LMCL. The model trained using this loss is called CosFace. The main goal of the paper is to learn highly discriminative face features, so that images of the same person are placed close together and images of different people are placed far apart in the learned feature space.
Although the paper is written for face recognition, its core idea belongs to the broader family of metric learning. It is about teaching a neural network what “similar” and “different” should mean for a specific recognition task.
1. The Problem Addressed by the Paper
In face recognition, a model usually has to solve two related tasks. The first is face verification, where the system decides whether two face images belong to the same person. The second is face identification, where the system identifies a person from among many known identities.
Deep convolutional neural networks can extract powerful face features. However, the traditional softmax loss is not always sufficient for face recognition. Softmax can help the model classify training images correctly, but it may not create a feature space where unseen faces are separated clearly.
Put simply, ordinary softmax may learn to classify the training faces, but it may not learn a similarity space strong enough to separate faces it has never seen.
This becomes important because face recognition systems often compare two face embeddings using cosine similarity. Therefore, the authors argue that the training objective should also be aligned with cosine similarity.
2. The Main Idea of CosFace
The main idea of CosFace is to normalize the features and the class weight vectors, and then require the correct class to score higher than every wrong class by a fixed cosine margin.
In a normal classification layer, the logit for class \(j\) can be written as:
\[ f_j = W_j^T x \]
This dot product can also be expressed as:
\[ f_j = \|W_j\| \|x\| \cos \theta_j \]
Here, \(W_j\) is the weight vector of class \(j\), \(x\) is the feature vector of the input image, and \(\theta_j\) is the angle between \(W_j\) and \(x\). The term \(\cos \theta_j\) measures angular similarity between the feature and the class weight.
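The equivalence of the two forms of the logit can be checked numerically. The following is a minimal sketch in plain Python; the vectors `w_j` and `x` are made-up 2D values chosen only for illustration and do not come from the paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

# Hypothetical values, not taken from the paper.
w_j = [3.0, 4.0]   # class weight vector W_j
x = [1.0, 2.0]     # feature vector x

cos_theta = dot(w_j, x) / (norm(w_j) * norm(x))

logit_dot = dot(w_j, x)                         # W_j^T x
logit_polar = norm(w_j) * norm(x) * cos_theta   # ||W_j|| ||x|| cos(theta_j)

print(logit_dot, logit_polar, cos_theta)
```

Both forms give the same logit, which makes it visible that the value can grow either because the angle shrinks or because one of the norms grows.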
The authors point out that ordinary softmax depends on both the vector magnitude and the angle. But in face recognition, the final comparison is usually done through cosine similarity. Therefore, the angle is more important than the magnitude.
CosFace normalizes the class weight vectors:
\[ \|W_j\| = 1 \]
It also normalizes the feature vector to a fixed scale:
\[ \|x\| = s \]
After this normalization, the logit reduces to \(s \cos \theta_j\), so classification depends only on cosine similarity.
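A small sketch of this normalization step is given below. The helper names `l2_normalize` and `normalized_logits`, the example vectors, and the scale `s = 64.0` are all assumptions made for illustration. The point is only that the raw lengths of the weights and the feature no longer influence the logits.

```python
import math

def l2_normalize(v, scale=1.0):
    """Rescale v so that its Euclidean norm equals `scale`."""
    length = math.sqrt(sum(a * a for a in v))
    return [scale * a / length for a in v]

def normalized_logits(weights, feature, s):
    """Return s * cos(theta_j) for every class weight vector W_j."""
    x = l2_normalize(feature, scale=s)            # ||x|| = s
    unit_weights = [l2_normalize(w) for w in weights]  # ||W_j|| = 1
    return [sum(a * b for a, b in zip(w, x)) for w in unit_weights]

# Hypothetical weights with deliberately different norms.
weights = [[2.0, 0.0], [0.0, 0.5]]
feature = [3.0, 4.0]
s = 64.0  # assumed scale, for illustration only

logits = normalized_logits(weights, feature, s)
# Making the raw feature ten times longer leaves the logits unchanged.
logits_longer = normalized_logits(weights, [10.0 * a for a in feature], s)

print(logits, logits_longer)
```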
3. From Softmax to Normalized Softmax
The ordinary softmax loss is:
\[ L_s = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{f_{y_i}}} {\sum_{j=1}^{C} e^{f_j}} \]
After normalization, the logit becomes based on cosine similarity. The normalized softmax loss can be written as:
\[ L_{ns} = \frac{1}{N} \sum_i -\log \frac{e^{s \cos(\theta_{y_i,i})}} {\sum_j e^{s \cos(\theta_{j,i})}} \]
The paper refers to this as Normalized Softmax Loss, or NSL. However, NSL only requires the correct class to have a higher cosine value than every wrong class.
That is, for a sample with correct class \(y_i\) and any other class \(j \neq y_i\), it only requires:
\[ \cos(\theta_{y_i}) > \cos(\theta_j) \]
This means the correct class only needs to win slightly. The authors argue that this is not enough for highly discriminative face recognition.
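The normalized softmax loss can be sketched directly from its formula. The code below is a minimal plain-Python version under stated assumptions: the function names, the cosine values, and the scale `s = 30.0` are hypothetical, and the cosines are given directly instead of being produced by a network.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def nsl_loss(cosines, labels, s):
    """Normalized softmax loss averaged over a batch.

    cosines[i][j] is cos(theta_{j,i}); labels[i] is the correct class y_i.
    """
    total = 0.0
    for row, y in zip(cosines, labels):
        logits = [s * c for c in row]
        total += log_sum_exp(logits) - logits[y]
    return total / len(labels)

# Hypothetical cosines for two samples and three classes.
cosines = [[0.9, 0.1, -0.2],   # clear win for class 0
           [0.3, 0.4, 0.2]]    # narrow win for class 1
labels = [0, 1]

loss = nsl_loss(cosines, labels, s=30.0)
predictions = [row.index(max(row)) for row in cosines]

print(loss, predictions)
```

Both samples are classified correctly here, even though the second one wins only narrowly. Nothing in this loss asks for more than that.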
4. The CosFace / LMCL Equation
CosFace introduces a fixed cosine margin \(m\) into the cosine value of the correct class. The Large Margin Cosine Loss is written as:
\[ L_{lmc} = \frac{1}{N} \sum_i -\log \frac{ e^{s(\cos(\theta_{y_i,i}) - m)} }{ e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i} e^{s\cos(\theta_{j,i})} } \]
This is the key equation of the paper. Its meaning is simple: the correct class should not merely be greater than the wrong class. It should be greater by a margin.
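As an illustration of the equation, here is a minimal plain-Python sketch of LMCL. It is not the authors' implementation. The function name `lmcl_loss`, the cosine values, and the scale `s = 30.0` are assumptions made for the example. The only change relative to a normalized softmax is that \(m\) is subtracted from the cosine of the correct class before scaling, so setting `m=0.0` recovers the loss without a margin.

```python
import math

def lmcl_loss(cosines, labels, s, m):
    """Large Margin Cosine Loss averaged over a batch.

    cosines[i][j] is cos(theta_{j,i}); labels[i] is the correct class y_i.
    """
    total = 0.0
    for row, y in zip(cosines, labels):
        # Subtract the margin from the correct class only, then scale by s.
        logits = [s * (c - m) if j == y else s * c for j, c in enumerate(row)]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - logits[y]
    return total / len(labels)

# Hypothetical cosines for two samples and three classes.
cosines = [[0.9, 0.1, -0.2],
           [0.3, 0.4, 0.2]]
labels = [0, 1]
s = 30.0  # assumed scale, for illustration only

loss_without_margin = lmcl_loss(cosines, labels, s, m=0.0)
loss_with_margin = lmcl_loss(cosines, labels, s, m=0.35)

print(loss_without_margin, loss_with_margin)
```

For the same cosines, the loss with a margin is larger, so the network receives a training signal even on samples that are already classified correctly.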
In normalized softmax, a sample belonging to class 1 in a two-class problem is classified correctly when:
\[ \cos(\theta_1) > \cos(\theta_2) \]
CosFace makes the condition stricter:
\[ \cos(\theta_1) - m > \cos(\theta_2) \]
This can also be written as:
\[ \cos(\theta_1) > \cos(\theta_2) + m \]
The correct class does not merely have to win. It has to win by a clear cosine margin.
5. Simple Example
Suppose a face image belongs to Person A. The model compares this face with Person A and Person B.
| Person | Cosine Similarity |
|---|---|
| Person A | 0.71 |
| Person B | 0.70 |
With a softmax that compares cosine values without a margin, Person A wins, so the model may be satisfied. But the difference is only 0.01. CosFace says this is not enough.
If the margin is \(m = 0.35\), then CosFace wants:
\[ \cos(\theta_A) - 0.35 > \cos(\theta_B) \]
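With the values in the table, the condition fails: \(0.71 - 0.35 = 0.36\), which is well below 0.70. The loss therefore keeps pushing the network to raise the similarity to Person A and lower the similarity to Person B, which forces much stronger separation between identities.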
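The effect on the loss can be made concrete for this two-person example. The sketch below uses the cosine values from the table and the margin \(m = 0.35\). The scale `s = 64.0` and the function name `two_class_loss` are assumptions introduced only for illustration.

```python
import math

def two_class_loss(cos_correct, cos_wrong, s, m):
    """Negative log softmax probability of the correct class, with margin m."""
    gap = s * (cos_correct - m) - s * cos_wrong
    # Equals log(1 + exp(-gap)), written to stay stable for large |gap|.
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))

cos_a, cos_b = 0.71, 0.70   # values from the table
m = 0.35                    # margin used in the example
s = 64.0                    # assumed scale, for illustration only

margin_satisfied = cos_a - m > cos_b
loss_no_margin = two_class_loss(cos_a, cos_b, s, m=0.0)
loss_with_margin = two_class_loss(cos_a, cos_b, s, m=m)

print(margin_satisfied, loss_no_margin, loss_with_margin)
```

Under these assumptions the loss without a margin is small, while the loss with the margin is large. Since a cosine cannot exceed 1, the condition can only be met once the similarity to Person B drops below 0.65.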
6. Why Normalization is Important
The paper strongly emphasizes normalization. Without normalization, the model can reduce the loss on a correctly classified sample simply by increasing the length of its feature vector, without improving its angle to the class weight. As a result, feature norms can vary widely between easy and hard samples.
This weakens angular discrimination. By normalizing the weights and features, CosFace removes radial variation. The feature vectors are placed on a hypersphere, and the model must learn better angular separation.
The model cannot simply change the length of the feature vector. It must learn better angles.
This is important because face recognition usually compares two faces using cosine similarity. Therefore, CosFace makes the training objective more consistent with the testing method.
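The shortcut that normalization removes can be demonstrated in a few lines. In this sketch, the weights and the feature are hypothetical 2D values, and the loss is an ordinary softmax cross-entropy on unnormalized dot-product logits. Only the length of the feature changes between runs. Its direction, and therefore every angle, stays the same.

```python
import math

def softmax_loss(logits, y):
    """Cross-entropy of a softmax over `logits` for the correct class y."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(v - peak) for v in logits)) - logits[y]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical unit-length class weights and a correctly classified feature.
weights = [[1.0, 0.0], [0.0, 1.0]]
feature = [0.8, 0.6]
label = 0

losses = []
for scale in (1.0, 2.0, 10.0):
    scaled = [scale * a for a in feature]          # same direction, longer vector
    logits = [dot(w, scaled) for w in weights]     # unnormalized logits W_j^T x
    losses.append(softmax_loss(logits, label))

print(losses)
```

The loss falls as the feature grows longer, even though the angular separation has not improved at all. Fixing \(\|x\| = s\) closes this route.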
7. Difference Between Softmax, NSL, A-Softmax, and LMCL
| Loss Function | Main Idea | Limitation or Advantage |
|---|---|---|
| Softmax | Classify samples correctly | Not discriminative enough for face similarity |
| Normalized Softmax Loss | Normalize weights and features | Improves angular learning but has no margin |
| A-Softmax / SphereFace | Adds angular margin | Harder to optimize because the cosine of the multiplied angle is non-monotonic |
| LMCL / CosFace | Adds fixed cosine margin | Simpler and directly aligned with cosine similarity |
The authors argue that LMCL is effective because it adds the margin directly in cosine space:
\[ \cos(\theta_y) - m \]
This creates a clearer and more stable decision boundary.
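The decision rules in the table can be compared side by side for a two-class case. The sketch below is a simplification under stated assumptions: the function names are hypothetical, the A-Softmax rule is written as \(\cos(k\theta_1) > \cos(\theta_2)\) with an assumed multiplier `k = 4` and is only meaningful for \(\theta_1\) between 0 and \(\pi / k\), and the angles are made-up examples.

```python
import math

def nsl_accepts(theta_1, theta_2):
    """Normalized softmax: class 1 wins if its cosine is larger."""
    return math.cos(theta_1) > math.cos(theta_2)

def a_softmax_accepts(theta_1, theta_2, k=4):
    """Simplified A-Softmax rule, valid for theta_1 in [0, pi / k]."""
    return math.cos(k * theta_1) > math.cos(theta_2)

def lmcl_accepts(theta_1, theta_2, m=0.35):
    """LMCL: class 1 must win by a fixed cosine margin m."""
    return math.cos(theta_1) - m > math.cos(theta_2)

# Hypothetical angle pairs in degrees: (angle to class 1, angle to class 2).
results = {}
for deg_1, deg_2 in [(40, 50), (10, 80)]:
    t1, t2 = math.radians(deg_1), math.radians(deg_2)
    results[(deg_1, deg_2)] = (
        nsl_accepts(t1, t2),
        a_softmax_accepts(t1, t2),
        lmcl_accepts(t1, t2),
    )

print(results)
```

The narrow case (40 and 50 degrees) passes only the rule without a margin, while the well-separated case passes all three. The LMCL rule is a plain subtraction in cosine space, whereas the A-Softmax rule changes the angle inside the cosine.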
8. What Figure 1 Shows
Figure 1 presents the overall CosFace framework. During training, face images are passed through a convolutional neural network. The LMCL loss guides the network to learn features with a large margin between different identities.
During testing, the trained network extracts face features. These features are then compared using cosine similarity for face verification or face identification.
The important point is that both training and testing are based on cosine similarity.
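The testing stage described above can be sketched as follows. The embeddings, the identity names, the threshold of 0.5, and the helper names `verify` and `identify` are all hypothetical. In practice the embeddings would come from the trained network and the threshold would be tuned on validation data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def verify(emb_a, emb_b, threshold):
    """Face verification: do two embeddings belong to the same person?"""
    return cosine_similarity(emb_a, emb_b) >= threshold

def identify(probe, gallery):
    """Face identification: return the gallery identity most similar to the probe."""
    return max(gallery, key=lambda name: cosine_similarity(probe, gallery[name]))

# Hypothetical 3D embeddings; real face embeddings have far more dimensions.
gallery = {"person_a": [0.9, 0.1, 0.0], "person_b": [0.0, 1.0, 0.2]}
probe = [0.8, 0.2, 0.1]

same_as_a = verify(probe, gallery["person_a"], threshold=0.5)
same_as_b = verify(probe, gallery["person_b"], threshold=0.5)
best_match = identify(probe, gallery)

print(same_as_a, same_as_b, best_match)
```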
9. What Figure 2 Shows
Figure 2 compares the decision margins of different loss functions. Because the softmax decision boundary depends on the weight norms as well as the angles, its decision regions can overlap in cosine space. Normalized Softmax Loss improves the situation by normalizing the vectors, but it still has zero margin.
A-Softmax creates a margin in angular space, but the margin is not uniform. LMCL creates a clear margin directly in cosine space. This visual comparison supports the paper’s main argument that CosFace provides a cleaner and more consistent decision boundary.
10. What Figure 4 Shows
Figure 4 presents a toy experiment using 8 identities and 2D features. The authors show that ordinary softmax produces more ambiguous feature distributions. As the LMCL margin \(m\) increases, the classes become more clearly separated in angular space.
This supports the idea that the cosine margin makes the learned features more discriminative.
11. Experimental Results
The authors test CosFace on major face recognition benchmarks, including LFW, YTF, MegaFace Challenge 1, and MegaFace Challenge 2. The results show that CosFace achieves state-of-the-art or highly competitive performance.
In the paper’s comparisons, LMCL performs better than several earlier losses such as ordinary softmax, triplet loss, center loss, L-Softmax, and A-Softmax on multiple benchmarks.
One important finding is that the margin parameter \(m\) improves performance up to a point. In the paper, performance improves as \(m\) increases and saturates around:
\[ m = 0.35 \]
If \(m\) becomes too large, training becomes difficult and the model may fail to converge.
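A rough geometric argument shows why the margin cannot grow without limit. The sketch below rests on idealized assumptions that go beyond the text above: 2D features, class centers spaced evenly around a circle, and a sample lying exactly on its own class center. Such a sample has cosine 1 to its own center and \(\cos(2\pi / C)\) to each adjacent center, so the margin condition can hold only if \(m\) stays below \(1 - \cos(2\pi / C)\). The function name `max_margin_2d` is hypothetical.

```python
import math

def max_margin_2d(num_classes):
    """Upper limit on m for evenly spaced class centers on a circle.

    Assumes 2D features and a sample that sits exactly on its class center.
    """
    return 1.0 - math.cos(2.0 * math.pi / num_classes)

limits = {c: max_margin_2d(c) for c in (4, 8, 16)}
print(limits)
```

Under these assumptions the limit shrinks as the number of classes grows, because the centers crowd closer together. Higher-dimensional features leave more room between centers, so this 2D limit should not be read as a bound on the margin used in the paper's experiments.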
12. Main Contributions of the Paper
The first contribution of the paper is the proposal of Large Margin Cosine Loss, a simple and effective loss function for face recognition.
The second contribution is the use of normalization for both feature vectors and class weight vectors. This helps the model focus on angular discrimination instead of feature magnitude.
The third contribution is the strong experimental performance on standard face recognition benchmarks.
13. One-Sentence Summary
CosFace teaches a deep face-recognition model to produce embeddings where the correct identity is not just closer than other identities, but closer by a fixed cosine margin.
14. Relevance to Saree Provenance Classification
Although this paper is about face recognition, the idea is highly relevant for fine-grained saree classification. In saree provenance classification, many classes can look visually similar. Two sarees may share similar colours, borders, zari work, or motifs, but belong to different craft traditions.
A normal classifier may learn broad visual differences. A margin-based metric learning method such as CosFace can help create a more disciplined embedding space, where sarees from the same craft cluster are placed close together and sarees from different clusters are pushed farther apart.
For saree provenance classification, the equivalent idea would be:
\[ \text{Same saree cluster} \Rightarrow \text{closer embeddings} \]
\[ \text{Different saree clusters} \Rightarrow \text{farther embeddings with margin} \]
Therefore, CosFace is useful not only as a face recognition method, but also as a general idea for learning stronger separation between visually similar categories.