My Research Notes: Understanding the GCN Paper

Understanding the GCN Paper

The paper “Semi-Supervised Classification with Graph Convolutional Networks” was written by Thomas N. Kipf and Max Welling and published at ICLR 2017. It is one of the foundational papers on Graph Convolutional Networks, commonly called GCNs.

The paper proposes a simple and scalable neural network model that can classify nodes in a graph by using both node features and graph structure. Its central idea is that a node should not be classified only by its own features. It should also learn from the features of its neighbouring nodes.

1. What Problem is This Paper Solving?

The paper focuses on semi-supervised node classification. This means we have a graph where only a few nodes have labels, while many nodes are unlabeled. The task is to predict the labels of the unlabeled nodes.

For example, in a citation network, the graph may look like this:

Graph Element	Example
Node	Research paper
Edge	Citation link between papers
Node feature	Words appearing in the paper
Node label	Research topic or category

So the central question is:

Given a citation network where only some papers have topic labels, can we predict the topic of the remaining papers using both their text features and citation links?

This is different from ordinary classification because the data points are not independent. They are connected to one another through graph edges.

2. Why Graphs are Important Here

In ordinary machine learning, we may classify a document using only its own features, such as the words in the document. But in a graph, we also know how documents are connected.

For example, if a paper cites many machine-learning papers, or is cited by many machine-learning papers, that graph structure gives useful information. Therefore, the model should use both:

\[ X = \text{node feature matrix} \]

and:

\[ A = \text{adjacency matrix of the graph} \]

Symbol	Meaning
\(X\)	Feature matrix of all nodes
\(A\)	Adjacency matrix showing graph connections
\(Y\)	Labels for some nodes
\(N\)	Number of nodes

The goal is to learn a function:

\[ f(X,A) \]

This function predicts node labels using both node features and graph connections.

3. What Earlier Methods Did

Earlier graph-based semi-supervised learning methods often used graph regularization. A common assumption was that connected nodes should have similar labels.

This can be written using a graph Laplacian regularization term:

\[ L = L_0 + \lambda L_{\text{reg}} \]

where:

\[ L_{\text{reg}} = \sum_{i,j} A_{ij}\|f(X_i)-f(X_j)\|^2 \]

This says that if node \(i\) and node \(j\) are connected, their predictions should be close. However, the authors point out a limitation. Graph edges do not always mean “same label.” Sometimes an edge may represent relationship, influence, dependency, citation, or other structural information.

Therefore, instead of merely forcing connected nodes to have similar labels, the paper proposes using the graph structure directly inside a neural network.

4. The Main Idea of GCN

The core idea of a Graph Convolutional Network is that a node should update its representation by combining information from its neighbours.

In a normal neural network, each data point is processed separately. In a GCN, each node receives information from nearby nodes. If node \(i\) is connected to nodes \(j\), \(k\), and \(l\), then the representation of node \(i\) should be influenced by the features of \(j\), \(k\), and \(l\).

In simple words, a GCN lets each node “listen” to its neighbours before making a prediction.

5. The Famous GCN Propagation Rule

The most important equation in the paper is the layer-wise propagation rule:

\[ H^{(l+1)} = \sigma \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right) \]

This is the heart of the GCN paper.

Symbol	Meaning
\(H^{(l)}\)	Node representations at layer \(l\)
\(H^{(0)} = X\)	The first layer input is the original node feature matrix
\(W^{(l)}\)	Trainable weight matrix at layer \(l\)
\(\sigma\)	Activation function, such as ReLU
\(A\)	Original adjacency matrix
\(\tilde{A} = A + I_N\)	Adjacency matrix with self-connections added
\(\tilde{D}\)	Degree matrix of \(\tilde{A}\)

This equation means that the model takes each node’s features, aggregates information from its neighbours, normalizes the aggregation, applies a trainable transformation, and then applies a non-linear activation.

6. Why Add Self-Connections?

The paper defines:

\[ \tilde{A} = A + I_N \]

This means the graph adjacency matrix is modified by adding self-loops. The reason is simple: when a node updates its representation, it should not only use its neighbours’ information. It should also keep its own information.

A node should listen to its neighbours, but it should also remember itself.

7. Why Normalize the Adjacency Matrix?

The paper uses the normalized adjacency form:

\[ \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \]

Normalization is needed because some nodes may have many neighbours, while others may have only a few. Without normalization, high-degree nodes could dominate the feature aggregation.

Normalization prevents highly connected nodes from overwhelming the message-passing process.

8. Two-Layer GCN Model

For semi-supervised node classification, the paper uses a two-layer GCN:

\[ Z = f(X,A) = \text{softmax} \left( \hat{A} \, \text{ReLU} \left( \hat{A}XW^{(0)} \right) W^{(1)} \right) \]

where:

\[ \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \]

This means the model performs two rounds of graph-based information mixing. After one GCN layer, each node has information from its immediate neighbours. After two GCN layers, each node can receive information from two-hop neighbours.

In simple form:

\[ \text{node} \rightarrow \text{neighbour} \rightarrow \text{neighbour's neighbour} \]

9. The Loss Function

Although the graph contains many nodes, only some nodes have labels. Therefore, the loss is calculated only on labeled nodes:

\[ L = - \sum_{l \in Y_L} \sum_{f=1}^{F} Y_{lf}\ln Z_{lf} \]

Symbol	Meaning
\(Y_L\)	Set of labeled nodes
\(F\)	Number of classes
\(Y_{lf}\)	True label indicator
\(Z_{lf}\)	Predicted probability

This is standard cross-entropy loss, but evaluated only on labeled nodes. The clever part is that even though the loss is calculated only on labeled nodes, the graph convolution spreads information through the graph. Therefore, unlabeled nodes also influence the learned representations.

10. What Figure 1 Shows

Figure 1 in the paper is very useful. The left side shows a multi-layer GCN where the same graph structure is shared across layers. The input nodes pass through hidden layers and produce output predictions.

The right side shows a t-SNE visualization of hidden-layer activations on the Cora dataset. The colors represent document classes. The visualization shows that the GCN learns hidden representations that group documents by class.

In simple words, Figure 1 shows that after training, the GCN creates node representations where similar classes become visually clustered.

11. Where the “Convolution” Idea Comes From

In images, a convolution combines information from nearby pixels. In graphs, there is no regular grid like an image. So the question is: what is the equivalent of nearby pixels in a graph?

The answer is:

\[ \text{nearby pixels in images} \approx \text{neighbouring nodes in graphs} \]

So graph convolution means combining a node’s feature with features from its graph neighbourhood.

12. Spectral Motivation in Simple Terms

The authors begin from spectral graph convolutions, which involve the graph Laplacian:

\[ L = I_N - D^{-\frac{1}{2}}AD^{-\frac{1}{2}} \]

In spectral graph theory, convolution can be defined using eigenvectors of the graph Laplacian. However, directly computing this is expensive:

\[ O(N^2) \]

Therefore, the authors use approximations and simplifications to get a much more scalable model. The final GCN layer has complexity linear in the number of edges:

\[ O(|E|) \]

This is important because real graphs can be very large.

13. The Renormalization Trick

One important technical contribution is the renormalization trick. The authors begin with a simplified graph convolution, but repeated application can cause numerical instability.

So they replace:

\[ I_N + D^{-\frac{1}{2}}AD^{-\frac{1}{2}} \]

with:

\[ \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \]

where:

\[ \tilde{A} = A + I_N \]

This trick makes training more stable and became the standard formulation of GCNs.

14. Datasets Used

The paper tests the model on citation networks and a knowledge graph dataset.

Dataset	Type	Nodes	Edges	Classes
Citeseer	Citation network	3,327	4,732	6
Cora	Citation network	2,708	5,429	7
Pubmed	Citation network	19,717	44,338	3
NELL	Knowledge graph	65,755	266,144	210

In the citation networks, nodes are documents and edges are citation links. For training, the authors use only a small number of labels, such as 20 labels per class for citation datasets.

15. Results

The paper compares GCN with methods such as Label Propagation, DeepWalk, ICA, Planetoid, manifold regularization, and semi-supervised embedding. The GCN performs better than the baselines on the main datasets.

Method	Citeseer	Cora	Pubmed	NELL
Planetoid	64.7	75.7	77.2	61.9
GCN	70.3	81.5	79.0	66.0

These results show that GCN improves both classification accuracy and efficiency.

16. Why GCN Works

GCN works because it learns from two sources at the same time:

\[ \text{node features} + \text{graph structure} \]

For a paper in a citation network, the model uses the words in the paper, the papers it cites, the papers that cite it, the features of neighbouring papers, and the broader local graph structure.

This makes the learned representation richer than using document text alone.

17. Relation to the Weisfeiler-Lehman Algorithm

In the appendix, the paper connects GCNs to the Weisfeiler-Lehman graph isomorphism algorithm. The Weisfeiler-Lehman algorithm updates a node label by aggregating labels from neighbouring nodes.

Similarly, a GCN updates a node representation by aggregating feature vectors from neighbouring nodes:

\[ h_i^{(l+1)} = \sigma \left( \sum_{j \in N_i} \frac{1}{c_{ij}} h_j^{(l)} W^{(l)} \right) \]

This gives a useful interpretation: a GCN is like a differentiable, trainable version of neighbourhood aggregation.

18. Limitations of the Paper

The authors also discuss some limitations. First, the model uses full-batch gradient descent, so memory grows with graph size. Second, the model is mainly designed for undirected graphs. Directed edges and edge features are not naturally handled, although the NELL experiment shows one workaround.

Third, deeper GCNs can become harder to train. In Appendix B, the paper shows that 2-layer or 3-layer models work best for the datasets studied. Very deep GCNs may over-smooth representations or become difficult to optimize.

19. Relevance to Saree Provenance Classification

This paper is very relevant to saree provenance classification, especially if visual models are combined with structured textile knowledge.

For example, a saree knowledge graph could contain nodes such as:

Node Type	Example
Saree image	Image of a Kanjivaram saree
Craft cluster	Kanjivaram, Banarasi, Gadwal
Motif	Peacock, paisley, temple border
Weave structure	Brocade, jamdani, ikat
Region	Tamil Nadu, Varanasi, Telangana
Material	Silk, cotton, zari

Edges could represent relationships such as:

Edge	Meaning
saree-has-motif	A saree contains a motif
saree-from-region	A saree belongs to a region
craft-uses-technique	A craft uses a weaving technique
motif-associated-with-cluster	A motif is associated with a craft cluster

A GCN could learn from both:

\[ \text{visual features} \]

and:

\[ \text{relational textile knowledge} \]

This is exactly why GCNs matter for provenance classification. They help the model reason beyond image pixels.

20. One-Sentence Summary

This paper introduces Graph Convolutional Networks, a neural network method that classifies nodes by repeatedly combining each node’s own features with normalized information from its graph neighbours.

In very simple words, a GCN lets every node learn from its neighbours, so even with few labels, the model can use the structure of the whole graph to make better predictions.

```

My Research Notes

Thursday, 4 June 2026

Understanding the GCN Paper

Understanding the GCN Paper

1. What Problem is This Paper Solving?

2. Why Graphs are Important Here

3. What Earlier Methods Did

4. The Main Idea of GCN

5. The Famous GCN Propagation Rule

6. Why Add Self-Connections?

7. Why Normalize the Adjacency Matrix?

8. Two-Layer GCN Model

9. The Loss Function

10. What Figure 1 Shows

11. Where the “Convolution” Idea Comes From

12. Spectral Motivation in Simple Terms

13. The Renormalization Trick

14. Datasets Used

15. Results

16. Why GCN Works

17. Relation to the Weisfeiler-Lehman Algorithm

18. Limitations of the Paper

19. Relevance to Saree Provenance Classification

20. One-Sentence Summary

No comments:

Post a Comment

Understading the Paper: Fine Grained Image Analysis with Deep Learning