Understanding the GCN Paper
The paper “Semi-Supervised Classification with Graph Convolutional Networks” was written by Thomas N. Kipf and Max Welling and published at ICLR 2017. It is one of the foundational papers on Graph Convolutional Networks, commonly called GCNs.
The paper proposes a simple and scalable neural network model that can classify nodes in a graph by using both node features and graph structure. Its central idea is that a node should not be classified only by its own features. It should also learn from the features of its neighbouring nodes.
1. What Problem is This Paper Solving?
The paper focuses on semi-supervised node classification. This means we have a graph where only a few nodes have labels, while many nodes are unlabeled. The task is to predict the labels of the unlabeled nodes.
For example, in a citation network, the graph may look like this:
| Graph Element | Example |
|---|---|
| Node | Research paper |
| Edge | Citation link between papers |
| Node feature | Words appearing in the paper |
| Node label | Research topic or category |
So the central question is:
Given a citation network where only some papers have topic labels, can we predict the topic of the remaining papers using both their text features and citation links?
This is different from ordinary classification because the data points are not independent. They are connected to one another through graph edges.
2. Why Graphs are Important Here
In ordinary machine learning, we may classify a document using only its own features, such as the words in the document. But in a graph, we also know how documents are connected.
For example, if a paper cites many machine-learning papers, or is cited by many machine-learning papers, that graph structure gives useful information. Therefore, the model should use both:
\[ X = \text{node feature matrix} \]
and:
\[ A = \text{adjacency matrix of the graph} \]
| Symbol | Meaning |
|---|---|
| \(X\) | Feature matrix of all nodes |
| \(A\) | Adjacency matrix showing graph connections |
| \(Y\) | Labels for some nodes |
| \(N\) | Number of nodes |
The goal is to learn a function:
\[ f(X,A) \]
This function predicts node labels using both node features and graph connections.
3. What Earlier Methods Did
Earlier graph-based semi-supervised learning methods often used graph regularization. A common assumption was that connected nodes should have similar labels.
This can be written using a graph Laplacian regularization term:
\[ L = L_0 + \lambda L_{\text{reg}} \]
where:
\[ L_{\text{reg}} = \sum_{i,j} A_{ij}\|f(X_i)-f(X_j)\|^2 \]
This says that if node \(i\) and node \(j\) are connected, their predictions should be close. However, the authors point out a limitation. Graph edges do not always mean “same label.” Sometimes an edge may represent relationship, influence, dependency, citation, or other structural information.
Therefore, instead of merely forcing connected nodes to have similar labels, the paper proposes using the graph structure directly inside a neural network.
4. The Main Idea of GCN
The core idea of a Graph Convolutional Network is that a node should update its representation by combining information from its neighbours.
In a normal neural network, each data point is processed separately. In a GCN, each node receives information from nearby nodes. If node \(i\) is connected to nodes \(j\), \(k\), and \(l\), then the representation of node \(i\) should be influenced by the features of \(j\), \(k\), and \(l\).
In simple words, a GCN lets each node “listen” to its neighbours before making a prediction.
5. The Famous GCN Propagation Rule
The most important equation in the paper is the layer-wise propagation rule:
\[ H^{(l+1)} = \sigma \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right) \]
This is the heart of the GCN paper.
| Symbol | Meaning |
|---|---|
| \(H^{(l)}\) | Node representations at layer \(l\) |
| \(H^{(0)} = X\) | The first layer input is the original node feature matrix |
| \(W^{(l)}\) | Trainable weight matrix at layer \(l\) |
| \(\sigma\) | Activation function, such as ReLU |
| \(A\) | Original adjacency matrix |
| \(\tilde{A} = A + I_N\) | Adjacency matrix with self-connections added |
| \(\tilde{D}\) | Degree matrix of \(\tilde{A}\) |
This equation means that the model takes each node’s features, aggregates information from its neighbours, normalizes the aggregation, applies a trainable transformation, and then applies a non-linear activation.
6. Why Add Self-Connections?
The paper defines:
\[ \tilde{A} = A + I_N \]
This means the graph adjacency matrix is modified by adding self-loops. The reason is simple: when a node updates its representation, it should not only use its neighbours’ information. It should also keep its own information.
A node should listen to its neighbours, but it should also remember itself.
7. Why Normalize the Adjacency Matrix?
The paper uses the normalized adjacency form:
\[ \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \]
Normalization is needed because some nodes may have many neighbours, while others may have only a few. Without normalization, high-degree nodes could dominate the feature aggregation.
Normalization prevents highly connected nodes from overwhelming the message-passing process.
8. Two-Layer GCN Model
For semi-supervised node classification, the paper uses a two-layer GCN:
\[ Z = f(X,A) = \text{softmax} \left( \hat{A} \, \text{ReLU} \left( \hat{A}XW^{(0)} \right) W^{(1)} \right) \]
where:
\[ \hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \]
This means the model performs two rounds of graph-based information mixing. After one GCN layer, each node has information from its immediate neighbours. After two GCN layers, each node can receive information from two-hop neighbours.
In simple form:
\[ \text{node} \rightarrow \text{neighbour} \rightarrow \text{neighbour's neighbour} \]
9. The Loss Function
Although the graph contains many nodes, only some nodes have labels. Therefore, the loss is calculated only on labeled nodes:
\[ L = - \sum_{l \in Y_L} \sum_{f=1}^{F} Y_{lf}\ln Z_{lf} \]
| Symbol | Meaning |
|---|---|
| \(Y_L\) | Set of labeled nodes |
| \(F\) | Number of classes |
| \(Y_{lf}\) | True label indicator |
| \(Z_{lf}\) | Predicted probability |
This is standard cross-entropy loss, but evaluated only on labeled nodes. The clever part is that even though the loss is calculated only on labeled nodes, the graph convolution spreads information through the graph. Therefore, unlabeled nodes also influence the learned representations.
10. What Figure 1 Shows
Figure 1 in the paper is very useful. The left side shows a multi-layer GCN where the same graph structure is shared across layers. The input nodes pass through hidden layers and produce output predictions.
The right side shows a t-SNE visualization of hidden-layer activations on the Cora dataset. The colors represent document classes. The visualization shows that the GCN learns hidden representations that group documents by class.
In simple words, Figure 1 shows that after training, the GCN creates node representations where similar classes become visually clustered.
11. Where the “Convolution” Idea Comes From
In images, a convolution combines information from nearby pixels. In graphs, there is no regular grid like an image. So the question is: what is the equivalent of nearby pixels in a graph?
The answer is:
\[ \text{nearby pixels in images} \approx \text{neighbouring nodes in graphs} \]
So graph convolution means combining a node’s feature with features from its graph neighbourhood.
12. Spectral Motivation in Simple Terms
The authors begin from spectral graph convolutions, which involve the graph Laplacian:
\[ L = I_N - D^{-\frac{1}{2}}AD^{-\frac{1}{2}} \]
In spectral graph theory, convolution can be defined using eigenvectors of the graph Laplacian. However, directly computing this is expensive:
\[ O(N^2) \]
Therefore, the authors use approximations and simplifications to get a much more scalable model. The final GCN layer has complexity linear in the number of edges:
\[ O(|E|) \]
This is important because real graphs can be very large.
13. The Renormalization Trick
One important technical contribution is the renormalization trick. The authors begin with a simplified graph convolution, but repeated application can cause numerical instability.
So they replace:
\[ I_N + D^{-\frac{1}{2}}AD^{-\frac{1}{2}} \]
with:
\[ \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} \]
where:
\[ \tilde{A} = A + I_N \]
This trick makes training more stable and became the standard formulation of GCNs.
14. Datasets Used
The paper tests the model on citation networks and a knowledge graph dataset.
| Dataset | Type | Nodes | Edges | Classes |
|---|---|---|---|---|
| Citeseer | Citation network | 3,327 | 4,732 | 6 |
| Cora | Citation network | 2,708 | 5,429 | 7 |
| Pubmed | Citation network | 19,717 | 44,338 | 3 |
| NELL | Knowledge graph | 65,755 | 266,144 | 210 |
In the citation networks, nodes are documents and edges are citation links. For training, the authors use only a small number of labels, such as 20 labels per class for citation datasets.
15. Results
The paper compares GCN with methods such as Label Propagation, DeepWalk, ICA, Planetoid, manifold regularization, and semi-supervised embedding. The GCN performs better than the baselines on the main datasets.
| Method | Citeseer | Cora | Pubmed | NELL |
|---|---|---|---|---|
| Planetoid | 64.7 | 75.7 | 77.2 | 61.9 |
| GCN | 70.3 | 81.5 | 79.0 | 66.0 |
These results show that GCN improves both classification accuracy and efficiency.
16. Why GCN Works
GCN works because it learns from two sources at the same time:
\[ \text{node features} + \text{graph structure} \]
For a paper in a citation network, the model uses the words in the paper, the papers it cites, the papers that cite it, the features of neighbouring papers, and the broader local graph structure.
This makes the learned representation richer than using document text alone.
17. Relation to the Weisfeiler-Lehman Algorithm
In the appendix, the paper connects GCNs to the Weisfeiler-Lehman graph isomorphism algorithm. The Weisfeiler-Lehman algorithm updates a node label by aggregating labels from neighbouring nodes.
Similarly, a GCN updates a node representation by aggregating feature vectors from neighbouring nodes:
\[ h_i^{(l+1)} = \sigma \left( \sum_{j \in N_i} \frac{1}{c_{ij}} h_j^{(l)} W^{(l)} \right) \]
This gives a useful interpretation: a GCN is like a differentiable, trainable version of neighbourhood aggregation.
18. Limitations of the Paper
The authors also discuss some limitations. First, the model uses full-batch gradient descent, so memory grows with graph size. Second, the model is mainly designed for undirected graphs. Directed edges and edge features are not naturally handled, although the NELL experiment shows one workaround.
Third, deeper GCNs can become harder to train. In Appendix B, the paper shows that 2-layer or 3-layer models work best for the datasets studied. Very deep GCNs may over-smooth representations or become difficult to optimize.
19. Relevance to Saree Provenance Classification
This paper is very relevant to saree provenance classification, especially if visual models are combined with structured textile knowledge.
For example, a saree knowledge graph could contain nodes such as:
| Node Type | Example |
|---|---|
| Saree image | Image of a Kanjivaram saree |
| Craft cluster | Kanjivaram, Banarasi, Gadwal |
| Motif | Peacock, paisley, temple border |
| Weave structure | Brocade, jamdani, ikat |
| Region | Tamil Nadu, Varanasi, Telangana |
| Material | Silk, cotton, zari |
Edges could represent relationships such as:
| Edge | Meaning |
|---|---|
| saree-has-motif | A saree contains a motif |
| saree-from-region | A saree belongs to a region |
| craft-uses-technique | A craft uses a weaving technique |
| motif-associated-with-cluster | A motif is associated with a craft cluster |
A GCN could learn from both:
\[ \text{visual features} \]
and:
\[ \text{relational textile knowledge} \]
This is exactly why GCNs matter for provenance classification. They help the model reason beyond image pixels.
20. One-Sentence Summary
This paper introduces Graph Convolutional Networks, a neural network method that classifies nodes by repeatedly combining each node’s own features with normalized information from its graph neighbours.
In very simple words, a GCN lets every node learn from its neighbours, so even with few labels, the model can use the structure of the whole graph to make better predictions.
No comments:
Post a Comment