My Research Notes: Understanding the CosFace Paper

The paper “CosFace: Large Margin Cosine Loss for Deep Face Recognition” proposes a new loss function called Large Margin Cosine Loss, abbreviated as LMCL. The model trained using this loss is called CosFace. The main goal of the paper is to learn highly discriminative face features, so that images of the same person are placed close together and images of different people are placed far apart in the learned feature space.

Although the paper is written for face recognition, its core idea belongs to the broader family of metric learning. It is about teaching a neural network what “similar” and “different” should mean for a specific recognition task.

1. The Problem Addressed by the Paper

In face recognition, a model usually has to solve two related tasks. The first is face verification, where the system decides whether two face images belong to the same person. The second is face identification, where the system identifies a person from among many known identities.

Deep convolutional neural networks can extract powerful face features. However, the traditional softmax loss is not always sufficient for face recognition. Softmax can help the model classify training images correctly, but it may not create a feature space where unseen faces are separated clearly.

In simple terms, ordinary softmax learns to classify the training identities, but it does not necessarily learn a face similarity space that separates unseen faces well.

This becomes important because face recognition systems often compare two face embeddings using cosine similarity. Therefore, the authors argue that the training objective should also be aligned with cosine similarity.

2. The Main Idea of CosFace

The main idea of CosFace is to normalize the features and the class weight vectors, and then require the cosine score of the correct class to exceed the scores of the wrong classes by a fixed cosine margin.

In a normal classification layer, the logit for class \(j\) can be written as:

\[ f_j = W_j^T x \]

This dot product can also be expressed as:

\[ f_j = \|W_j\| \|x\| \cos \theta_j \]

Here, \(W_j\) is the weight vector of class \(j\), \(x\) is the feature vector of the input image, and \(\theta_j\) is the angle between \(W_j\) and \(x\). The term \(\cos \theta_j\) measures angular similarity between the feature and the class weight.
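As a quick numerical check of this identity, here is a minimal Python sketch. The 2-D values of W_j and x are made up for illustration and do not come from the paper. It also shows that scaling the feature changes the logit but leaves the angle untouched.

```python
import math

# Hypothetical 2-D class weight and feature vector (illustrative values only).
W_j = [3.0, 4.0]
x = [1.0, 2.0]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def angle_between(a, b):
    """Angle between two 2-D vectors, computed from their polar angles."""
    return math.atan2(a[1], a[0]) - math.atan2(b[1], b[0])

theta_j = angle_between(W_j, x)
logit = dot(W_j, x)                                  # W_j^T x
factored = norm(W_j) * norm(x) * math.cos(theta_j)   # ||W_j|| ||x|| cos(theta_j)

# Doubling the feature doubles the logit but leaves the angle unchanged.
x_doubled = [2.0 * c for c in x]
logit_doubled = dot(W_j, x_doubled)
theta_doubled = angle_between(W_j, x_doubled)

print(logit, factored, logit_doubled)
```

Both forms give the same logit, and only the magnitude-dependent part changes when x is rescaled.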

The authors point out that ordinary softmax depends on both the vector magnitude and the angle. But in face recognition, the final comparison is usually done through cosine similarity. Therefore, the angle is more important than the magnitude.

CosFace normalizes the class weight vectors:

\[ \|W_j\| = 1 \]

It also normalizes the feature vector to a fixed scale:

\[ \|x\| = s \]

After this normalization, the logit for class \(j\) reduces to \(s \cos \theta_j\), so classification depends only on cosine similarity.
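The effect of the two normalization steps can be sketched in a few lines of plain Python. The helper names (l2_normalize, cosine_logits), the example vectors, and the scale s = 30 are assumptions made for this illustration, not values taken from the paper.

```python
import math

def l2_normalize(v):
    """Rescale a vector to unit L2 norm."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def cosine_logits(x, W, s):
    """Return s * cos(theta_j) for every class weight vector W[j]."""
    x_hat = l2_normalize(x)
    logits = []
    for w in W:
        w_hat = l2_normalize(w)
        logits.append(s * sum(wc * xc for wc, xc in zip(w_hat, x_hat)))
    return logits

# Hypothetical class weights and feature (illustrative values only).
W = [[3.0, 4.0], [0.0, 2.0]]
x = [1.0, 2.0]
s = 30.0  # assumed scale

logits = cosine_logits(x, W, s)
# Making the feature ten times longer does not change the normalized logits.
logits_scaled = cosine_logits([10.0 * c for c in x], W, s)

print(logits, logits_scaled)
```

Every logit is bounded by s in absolute value, and the length of the input feature no longer has any influence.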

3. From Softmax to Normalized Softmax

The ordinary softmax loss is:

\[ L_s = \frac{1}{N} \sum_{i=1}^{N} -\log \frac{e^{f_{y_i}}} {\sum_{j=1}^{C} e^{f_j}} \]

After normalization, the logit becomes based on cosine similarity. The normalized softmax loss can be written as:

\[ L_{ns} = \frac{1}{N} \sum_i -\log \frac{e^{s \cos(\theta_{y_i,i})}} {\sum_j e^{s \cos(\theta_{j,i})}} \]

The paper refers to this as Normalized Softmax Loss, or NSL. However, NSL only requires the correct class to have a higher cosine value than every wrong class.

For the correct class \(y_i\), it only requires:

\[ \cos(\theta_{y_i}) > \cos(\theta_j) \quad \text{for all } j \neq y_i \]

This means the correct class only needs to win slightly. The authors argue that this is not enough for highly discriminative face recognition.
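To make the NSL formula concrete, here is a minimal sketch in plain Python. It assumes the cosines have already been computed, with cosines[i][j] standing for \(\cos(\theta_{j,i})\). The function name nsl_loss, the sample values, and the scale s = 30 are assumptions for illustration only.

```python
import math

def nsl_loss(cosines, labels, s):
    """Normalized softmax loss, averaged over samples.

    cosines[i][j] is cos(theta_{j,i}); labels[i] is the correct class y_i.
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        logits = [s * c for c in cos_row]
        # log-sum-exp with the max subtracted for numerical stability
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[y]
    return total / len(labels)

s = 30.0  # assumed scale

# A sample whose correct class (index 0) wins by only 0.01 in cosine.
narrow_win_loss = nsl_loss([[0.71, 0.70]], [0], s)
# A sample whose correct class wins by a wide gap.
wide_win_loss = nsl_loss([[0.90, 0.10]], [0], s)

print(narrow_win_loss, wide_win_loss)
```

The narrow win still produces a loss below \(\log 2\), the value for a tie between two classes. NSL already treats this sample as classified correctly, even though the two cosines are almost equal.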

4. The CosFace / LMCL Equation

CosFace introduces a fixed cosine margin \(m\) into the cosine value of the correct class. The Large Margin Cosine Loss is written as:

\[ L_{lmc} = \frac{1}{N} \sum_i -\log \frac{ e^{s(\cos(\theta_{y_i,i}) - m)} }{ e^{s(\cos(\theta_{y_i,i}) - m)} + \sum_{j \neq y_i} e^{s\cos(\theta_{j,i})} } \]

This is the key equation of the paper. Its meaning is simple: the correct class should not merely be greater than the wrong class. It should be greater by a margin.

In ordinary normalized softmax, for a sample belonging to class 1, the condition may be:

\[ \cos(\theta_1) > \cos(\theta_2) \]

CosFace makes the condition stricter:

\[ \cos(\theta_1) - m > \cos(\theta_2) \]

This can also be written as:

\[ \cos(\theta_1) > \cos(\theta_2) + m \]

The correct class does not merely have to win. It has to win by a clear cosine margin.
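The LMCL equation can be sketched the same way. The margin is subtracted from the cosine of the correct class only, before scaling. As before, cosines[i][j] stands for \(\cos(\theta_{j,i})\). The function name lmcl_loss, the sample values, and s = 30 are assumptions for illustration; this is a sketch of the formula above, not the authors' implementation.

```python
import math

def lmcl_loss(cosines, labels, s, m):
    """Large margin cosine loss, averaged over samples.

    The margin m is subtracted only from the cosine of the correct class.
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        logits = [s * (c - m) if j == y else s * c
                  for j, c in enumerate(cos_row)]
        # log-sum-exp with the max subtracted for numerical stability
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[y]
    return total / len(labels)

s = 30.0                    # assumed scale
cosines = [[0.71, 0.70]]    # correct class (index 0) wins by only 0.01
labels = [0]

loss_without_margin = lmcl_loss(cosines, labels, s, m=0.0)   # same as normalized softmax
loss_with_margin = lmcl_loss(cosines, labels, s, m=0.35)

print(loss_without_margin, loss_with_margin)
```

With m = 0 the function reduces to the normalized softmax loss. With m = 0.35 the same narrow win is penalized heavily, because the margin condition is violated.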

5. Simple Example

Suppose a face image belongs to Person A. The model compares this face with Person A and Person B.

Person	Cosine Similarity
Person A	0.71
Person B	0.70

With a softmax-style loss that has no margin, Person A wins, so the model may be satisfied. But the difference is only 0.01. CosFace says this is not enough.

If the margin is \(m = 0.35\), then CosFace wants:

\[ \cos(\theta_A) - 0.35 > \cos(\theta_B) \]

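Here \(0.71 - 0.35 = 0.36\), which is well below \(0.70\), so the condition is violated and the loss keeps penalizing this sample. Because a cosine can never exceed 1, raising the similarity to Person A alone cannot satisfy the condition while Person B stays at 0.70. The similarity to Person B must also come down. This forces the network to create much stronger separation between identities.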
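The arithmetic of this example can be checked with a short Python sketch. The helper name satisfies_margin and the "after training" cosines (0.90 and 0.50) are hypothetical values added for illustration.

```python
def satisfies_margin(cos_target, cos_other, m):
    """True when cos_target - m > cos_other, the LMCL condition for one rival class."""
    return cos_target - m > cos_other

m = 0.35

# Values from the table above: Person A wins, but only by 0.01.
wins_plain = 0.71 > 0.70
wins_with_margin = satisfies_margin(0.71, 0.70, m)

# Even a perfect cos(theta_A) = 1.0 fails while cos(theta_B) stays at 0.70.
best_case_with_b_fixed = satisfies_margin(1.0, 0.70, m)

# Hypothetical values after further training: A moves closer, B moves away.
wins_after_training = satisfies_margin(0.90, 0.50, m)

print(wins_plain, wins_with_margin, best_case_with_b_fixed, wins_after_training)
```

The plain comparison passes. The margin condition fails for the table values, and it still fails with a perfect match to Person A. It passes only once the similarity to Person B has dropped as well.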

6. Why Normalization is Important

The paper strongly emphasizes normalization. Without normalization, the model can use the length of the feature vector to solve the classification problem. For instance, scaling up the feature of an already correctly classified sample lowers its softmax loss without changing its angle, so easy and hard samples can end up with very different feature norms.

This weakens angular discrimination. By normalizing the weights and features, CosFace removes radial variation. The feature vectors are placed on a hypersphere, and the model must learn better angular separation.

The model cannot simply change the length of the feature vector. It must learn better angles.

This is important because face recognition usually compares two faces using cosine similarity. Therefore, CosFace makes the training objective more consistent with the testing method.
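A small experiment shows the point about feature length. In this plain Python sketch, the same feature direction is fed in at two different lengths. The unnormalized softmax loss drops when the feature is simply made longer, while the normalized version does not move. The vectors, the helper names, and s = 30 are assumptions for illustration.

```python
import math

def softmax_loss(logits, y):
    """Cross-entropy of a single sample given its logits and correct class y."""
    top = max(logits)
    return top + math.log(sum(math.exp(l - top) for l in logits)) - logits[y]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def l2_normalize(v):
    n = math.sqrt(dot(v, v))
    return [c / n for c in v]

def raw_logits(x, W):
    return [dot(w, x) for w in W]

def cosine_logits(x, W, s):
    x_hat = l2_normalize(x)
    return [s * dot(l2_normalize(w), x_hat) for w in W]

# Hypothetical unit-norm class weights and a feature closer to class 0.
W = [[1.0, 0.0], [0.0, 1.0]]
x_short = [0.8, 0.6]
x_long = [5.0 * c for c in x_short]   # same direction, five times longer
s = 30.0                               # assumed scale

plain_short = softmax_loss(raw_logits(x_short, W), 0)
plain_long = softmax_loss(raw_logits(x_long, W), 0)
normalized_short = softmax_loss(cosine_logits(x_short, W, s), 0)
normalized_long = softmax_loss(cosine_logits(x_long, W, s), 0)

print(plain_short, plain_long, normalized_short, normalized_long)
```

With the plain logits, the loss falls even though the angle to each class is unchanged. With the normalized logits, the only way to lower the loss is to improve the angles.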

7. Difference Between Softmax, NSL, A-Softmax, and LMCL

Loss Function	Main Idea	Limitation or Advantage
Softmax	Classify samples correctly	Not discriminative enough for face similarity
Normalized Softmax Loss	Normalize weights and features	Improves angular learning but has no margin
A-Softmax / SphereFace	Adds angular margin	Optimization is harder because of the angular formulation
LMCL / CosFace	Adds fixed cosine margin	Simpler and directly aligned with cosine similarity

The authors argue that LMCL is effective because it adds the margin directly in cosine space:

\[ \cos(\theta_y) - m \]

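Because \(m\) is a constant subtracted from the cosine itself, the decision boundary shifts by the same amount whatever the angle, which gives a clearer and more stable separation.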

8. What Figure 1 Shows

Figure 1 presents the overall CosFace framework. During training, face images are passed through a convolutional neural network. The LMCL loss guides the network to learn features with a large margin between different identities.

During testing, the trained network extracts face features. These features are then compared using cosine similarity for face verification or face identification.
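The test-time comparison can be sketched as follows. The 3-D embeddings, the gallery names, and the decision threshold of 0.5 are hypothetical values chosen for illustration. In practice the threshold would be tuned on validation data.

```python
import math

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def verify(emb_a, emb_b, threshold):
    """Face verification: are the two embeddings from the same person?"""
    return cosine_similarity(emb_a, emb_b) >= threshold

def identify(probe, gallery):
    """Face identification: return the gallery identity most similar to the probe."""
    return max(gallery, key=lambda name: cosine_similarity(probe, gallery[name]))

# Hypothetical embeddings (real ones would come from the trained network).
gallery = {
    "person_a": [0.9, 0.1, 0.2],
    "person_b": [0.1, 0.8, 0.3],
}
probe = [0.8, 0.2, 0.1]
threshold = 0.5  # assumed value, not from the paper

same_as_a = verify(probe, gallery["person_a"], threshold)
same_as_b = verify(probe, gallery["person_b"], threshold)
best_match = identify(probe, gallery)

print(same_as_a, same_as_b, best_match)
```

Verification is a thresholded pairwise comparison, while identification picks the most similar identity from a gallery. Both rely on the same cosine similarity that the loss optimizes during training.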

The important point is that both training and testing are based on cosine similarity.

9. What Figure 2 Shows

Figure 2 compares the decision margins of different loss functions. Softmax has a weak or even overlapping margin in cosine space. Normalized Softmax Loss improves the situation by normalizing the vectors, but it still has zero margin.

A-Softmax creates a margin in angular space, but the margin is not consistent across angles: it shrinks as \(\theta\) decreases. LMCL creates a clear margin directly in cosine space. This visual comparison supports the paper’s main argument that CosFace provides a cleaner and more consistent decision boundary.

10. What Figure 4 Shows

Figure 4 presents a toy experiment using 8 identities and 2D features. The authors show that ordinary softmax produces more ambiguous feature distributions. As the LMCL margin \(m\) increases, the classes become more clearly separated in angular space.

This supports the idea that the cosine margin makes the learned features more discriminative.

11. Experimental Results

The authors test CosFace on major face recognition benchmarks, including LFW, YTF, MegaFace Challenge 1, and MegaFace Challenge 2. The results show that CosFace achieves state-of-the-art or highly competitive performance.

In the paper’s comparisons, LMCL performs better than several earlier losses such as ordinary softmax, triplet loss, center loss, L-Softmax, and A-Softmax on multiple benchmarks.

One important finding is that the margin parameter \(m\) improves performance up to a point. In the paper, performance improves as \(m\) increases and saturates around:

\[ m = 0.35 \]

If \(m\) becomes too large, training becomes difficult and the model may fail to converge.

12. Main Contributions of the Paper

The first contribution of the paper is the proposal of Large Margin Cosine Loss, a simple and effective loss function for face recognition.

The second contribution is the use of normalization for both feature vectors and class weight vectors. This helps the model focus on angular discrimination instead of feature magnitude.

The third contribution is the strong experimental performance on standard face recognition benchmarks.

13. One-Sentence Summary

CosFace teaches a deep face-recognition model to produce embeddings where the correct identity is not just closer than other identities, but closer by a fixed cosine margin.

14. Relevance to Saree Provenance Classification

Although this paper is about face recognition, the idea is highly relevant for fine-grained saree classification. In saree provenance classification, many classes can look visually similar. Two sarees may share similar colours, borders, zari work, or motifs, but belong to different craft traditions.

A normal classifier may learn broad visual differences. A margin-based metric learning method such as CosFace can help create a more disciplined embedding space, where sarees from the same craft cluster are placed close together and sarees from different clusters are pushed farther apart.

For saree provenance classification, the equivalent idea would be:

\[ \text{Same saree cluster} \Rightarrow \text{closer embeddings} \]

\[ \text{Different saree clusters} \Rightarrow \text{farther embeddings with margin} \]

Therefore, CosFace is useful not only as a face recognition method, but also as a general idea for learning stronger separation between visually similar categories.

My Research Notes

Thursday, 4 June 2026
