Understanding the Graph Attention Networks Paper
The paper “Graph Attention Networks” was written by Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. It introduces Graph Attention Networks, commonly called GATs.
The key idea of the paper is simple but powerful: when a node receives information from its neighbours, not all neighbours should be treated equally. The model should learn which neighbours deserve more attention.
1. The Problem This Paper Addresses
Earlier graph neural networks, especially Graph Convolutional Networks or GCNs, aggregate information from neighbouring nodes. In a GCN, a node collects information from its neighbours using a normalized average-like operation.
This is useful, but it has one limitation:
All neighbours are treated in a largely fixed, structurally determined way.
In many graphs, different neighbours may have different importance. For example, in a citation network, a paper may cite ten papers. Some citations may be central to the paper’s topic, while others may be only background references.
In a saree knowledge graph, a saree may be connected to many attributes: colour, motif, region, material, border type, pallu type, weaving technique, and price range. For provenance classification, some neighbours may matter more than others. A motif or weave technique may matter more than colour.
So the paper asks:
Can the model learn how much attention to give to each neighbour?
Graph Attention Networks answer this question with a clear yes.
2. The Main Idea of Graph Attention Networks
The main idea is:
\[ \text{Node representation} = \text{weighted combination of neighbour representations} \]
Unlike GCN, the weights are not fixed only by graph degree normalization. They are learned using an attention mechanism.
For a node \(i\), GAT learns attention weights:
\[ \alpha_{ij} \]
Here, \(\alpha_{ij}\) means:
How important is neighbour \(j\) for updating node \(i\)?
This is the central concept of the paper.
3. Why Attention is Useful in Graphs
Attention helps because graph neighbourhoods are irregular. One node may have two neighbours, while another node may have two hundred neighbours. Also, neighbours do not have a natural order.
Attention solves this by assigning a learned importance score to each neighbour.
A GAT lets a node decide which neighbours to listen to more carefully.
This makes the model more flexible than a simple average.
4. Input and Output of a GAT Layer
Suppose we have \(N\) nodes. Each node has a feature vector:
\[ h_i \in \mathbb{R}^{F} \]
The input to a GAT layer is:
\[ h = \{h_1, h_2, \ldots, h_N\} \]
The output is a new set of node features:
\[ h' = \{h'_1, h'_2, \ldots, h'_N\} \]
where:
\[ h'_i \in \mathbb{R}^{F'} \]
So the layer transforms each node from \(F\)-dimensional features to \(F'\)-dimensional features.
5. Linear Transformation of Node Features
First, each node feature vector is transformed using a shared weight matrix:
\[ W \in \mathbb{R}^{F' \times F} \]
So each node becomes:
\[ Wh_i \]
This is similar to a normal neural network layer. The same \(W\) is applied to every node, which gives weight sharing.
6. Attention Score Between Two Nodes
For node \(i\) and its neighbour \(j\), GAT computes a raw attention score:
\[ e_{ij} = a(Wh_i, Wh_j) \]
This score tells us how important node \(j\) is to node \(i\).
In the paper, the attention mechanism uses a small feedforward neural network. Expanded, it becomes:
\[ e_{ij} = \text{LeakyReLU} \left( a^T[Wh_i \Vert Wh_j] \right) \]
| Symbol | Meaning |
|---|---|
| \(h_i\) | Feature vector of node \(i\) |
| \(h_j\) | Feature vector of neighbour node \(j\) |
| \(W\) | Shared learnable weight matrix |
| \(a^T\) | Learnable attention vector |
| \(\Vert\) | Concatenation |
| \(e_{ij}\) | Raw attention score from node \(i\) to node \(j\) |
The concatenation:
\[ [Wh_i \Vert Wh_j] \]
means the model looks at node \(i\) and node \(j\) together before deciding how important \(j\) is for \(i\).
7. Masked Attention
In a fully connected attention model, every node could attend to every other node. But in a graph, GAT uses masked attention. This means node \(i\) attends only to its graph neighbours:
\[ j \in N_i \]
where \(N_i\) is the neighbourhood of node \(i\).
The graph structure acts as a mask. The model does not directly consider unrelated nodes. This keeps the model efficient and respects the graph structure.
8. Softmax Normalization of Attention Scores
The raw attention scores \(e_{ij}\) are normalized using softmax:
\[ \alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})} {\sum_{k \in N_i}\exp(e_{ik})} \]
This makes the attention scores comparable across all neighbours of node \(i\).
After softmax:
\[ \sum_{j \in N_i} \alpha_{ij} = 1 \]
So \(\alpha_{ij}\) behaves like an importance weight.
9. Updating the Node Representation
Once attention weights are computed, GAT updates node \(i\) by taking a weighted sum of transformed neighbour features:
\[ h'_i = \sigma \left( \sum_{j \in N_i} \alpha_{ij}Wh_j \right) \]
This is the core GAT update equation.
Node \(i\)'s new representation is built from its neighbours, but important neighbours contribute more.
10. Simple Example
Suppose a paper node is connected to three cited papers:
| Neighbour | Relevance to Paper Topic | Attention Weight |
|---|---|---|
| Paper A | Very relevant | \(0.70\) |
| Paper B | Somewhat relevant | \(0.20\) |
| Paper C | Weakly relevant | \(0.10\) |
A GCN may average them structurally. A GAT can learn that Paper A should contribute more strongly.
Similarly, in a saree graph:
| Neighbour | Example | Possible Importance |
|---|---|---|
| Motif | Temple border | High |
| Technique | Brocade weaving | High |
| Colour | Red | Medium or low |
| Price band | ₹10,000–₹15,000 | Low |
| Region | Tamil Nadu | High |
A GAT can learn that some attributes are more useful for provenance classification than others.
11. Multi-Head Attention
The paper also uses multi-head attention. Instead of using one attention mechanism, GAT uses several independent attention heads.
For \(K\) attention heads, the output can be concatenated:
\[ h'_i = \mathbin\Vert_{k=1}^{K} \sigma \left( \sum_{j \in N_i} \alpha_{ij}^{k}W^{k}h_j \right) \]
Here, each head \(k\) has its own attention weights and transformation matrix.
Different attention heads can learn different kinds of relationships.
For example, in a saree knowledge graph:
| Attention Head | May Learn to Focus On |
|---|---|
| Head 1 | Motif relationships |
| Head 2 | Weave relationships |
| Head 3 | Regional relationships |
| Head 4 | Material relationships |
In the final prediction layer, the paper averages the heads instead of concatenating them:
\[ h'_i = \sigma \left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k}W^{k}h_j \right) \]
12. What Figure 1 Shows
Figure 1 in the paper is the key visual explanation. The left side shows how the attention mechanism computes \(\alpha_{ij}\) using transformed node features \(Wh_i\) and \(Wh_j\). The right side shows multi-head attention, where node 1 attends to its neighbourhood through multiple independent attention heads.
Figure 1 shows how a node looks at its neighbours through multiple attention lenses.
13. Why GAT Improves Over GCN
GCN and GAT both aggregate neighbourhood information, but they differ in how neighbour importance is decided.
| Aspect | GCN | GAT |
|---|---|---|
| Neighbour weighting | Mostly fixed by graph degree normalization | Learned through attention |
| Importance of neighbours | Similar structural treatment | Different learned importance |
| Interpretability | Limited | Attention weights can be inspected |
| Inductive use | Possible but less natural in original spectral framing | More naturally applicable |
| Need for expensive spectral operations | No in simplified GCN, but derived from spectral theory | No spectral operation needed |
GAT’s main advantage is that it can learn:
\[ \text{Neighbour } j \text{ is more important than neighbour } k \]
for a particular node \(i\).
14. Computational Efficiency
The paper emphasizes that GAT does not require costly matrix operations such as graph Laplacian eigendecomposition.
A single attention head has time complexity:
\[ O(|V|FF' + |E|F') \]
| Symbol | Meaning |
|---|---|
| \(|V|\) | Number of nodes |
| \(|E|\) | Number of edges |
| \(F\) | Input feature dimension |
| \(F'\) | Output feature dimension |
This makes GAT suitable for graph-structured data without requiring expensive spectral computations.
15. Transductive and Inductive Learning
The paper evaluates GAT in both transductive and inductive settings.
| Setting | Meaning |
|---|---|
| Transductive learning | The graph is known during training, but only some node labels are used |
| Inductive learning | The model must generalize to new graphs that were not seen during training |
The paper tests GAT on:
| Dataset | Task Type |
|---|---|
| Cora | Transductive |
| Citeseer | Transductive |
| Pubmed | Transductive |
| PPI | Inductive |
In the PPI dataset, the test graphs are completely unseen during training. This shows that GAT is not limited to one fixed graph.
16. Dataset Details
The paper summarizes the datasets as follows:
| Dataset | Nodes | Edges | Features per Node | Classes |
|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 1,433 | 7 |
| Citeseer | 3,327 | 4,732 | 3,703 | 6 |
| Pubmed | 19,717 | 44,338 | 500 | 3 |
| PPI | 56,944 across 24 graphs | 818,716 | 50 | 121 multilabel |
For citation datasets, nodes are papers and edges are citation links. For PPI, nodes are proteins and edges represent protein-protein interactions.
17. Results
The paper reports strong performance. For transductive node classification, GAT achieves:
| Dataset | GCN | GAT |
|---|---|---|
| Cora | \(81.5\%\) | \(83.0 \pm 0.7\%\) |
| Citeseer | \(70.3\%\) | \(72.5 \pm 0.7\%\) |
| Pubmed | \(79.0\%\) | \(79.0 \pm 0.3\%\) |
For the inductive PPI task, GAT achieves:
\[ 0.973 \pm 0.002 \]
micro-F1, compared with:
\[ 0.934 \pm 0.006 \]
for the constant-attention version and:
\[ 0.612 \]
for the best reported original GraphSAGE variant in the table.
The improvement on PPI is especially important because it shows that attention is very useful when generalizing to unseen graphs.
18. What Figure 2 Shows
Figure 2 on page 9 shows a t-SNE visualization of feature representations learned by the first hidden layer of a GAT model on the Cora dataset. The node colours represent classes, and the edge thickness represents the strength of aggregated attention coefficients.
Figure 2 shows that GAT learns a graph representation where nodes of the same class tend to group together.
19. Why Attention Gives Interpretability
One interesting advantage of GAT is that attention weights can be inspected. If node \(i\) gives high attention to node \(j\), then we can ask why that neighbour mattered.
For saree provenance work, this is important. A GAT could reveal whether the model is attending more to motif, border, weave, pallu, region, material, or similar saree examples.
For example:
\[ \alpha_{\text{saree, motif}} = 0.42 \]
\[ \alpha_{\text{saree, region}} = 0.31 \]
\[ \alpha_{\text{saree, colour}} = 0.05 \]
This would suggest that the model found motif and region more important than colour for that prediction.
20. Relation to GCN and GraphSAGE
GCN, GraphSAGE, and GAT are closely related, but each paper solves a different problem.
| Paper | Core Idea | Main Limitation Addressed |
|---|---|---|
| GCN | Aggregate normalized neighbour information | Uses fixed structural weighting |
| GraphSAGE | Sample and aggregate neighbours for inductive learning | Handles unseen nodes |
| GAT | Learn attention weights over neighbours | Learns which neighbours matter more |
GCN says “learn from neighbours,” GraphSAGE says “sample neighbours for unseen nodes,” and GAT says “learn which neighbours are important.”
21. Relevance to Saree Provenance Classification
GAT is highly relevant to saree provenance classification if image embeddings are combined with a structured textile knowledge graph.
Imagine a graph where nodes include:
| Node Type | Example |
|---|---|
| Saree image | Image of a Kanjivaram saree |
| Craft cluster | Kanjivaram, Banarasi, Paithani |
| Region | Tamil Nadu, Varanasi, Maharashtra |
| Motif | Peacock, paisley, temple, floral |
| Technique | Brocade, ikat, jamdani |
| Material | Silk, cotton, zari |
| Border type | Korvai, temple border, zari border |
Edges may include:
| Edge Type | Meaning |
|---|---|
| saree-has-motif | A saree contains a motif |
| saree-from-region | A saree belongs to a region |
| craft-uses-technique | A craft uses weaving technique |
| saree-has-border | A saree has a border type |
| motif-associated-with-craft | A motif is common in a craft |
For a new saree, GAT can learn which neighbouring nodes are important for classification.
For example, for one saree:
\[ \alpha_{\text{weave}} > \alpha_{\text{colour}} \]
For another saree:
\[ \alpha_{\text{motif}} > \alpha_{\text{material}} \]
This is useful because saree identity is not determined by one feature alone. Different craft clusters may require attention to different cues.
22. Main Limitations
The paper notes some limitations and future directions. First, practical implementation with sparse matrix operations can limit batching, especially for datasets with multiple graphs.
Second, the model’s receptive field is still limited by depth. A two-layer GAT only reaches two-hop neighbourhoods.
Third, the paper does not incorporate edge features directly. This matters because in many knowledge graphs, the edge type itself carries meaning. For example:
\[ \text{saree} \rightarrow \text{motif} \]
and:
\[ \text{saree} \rightarrow \text{region} \]
are not the same kind of relation. Later graph attention models and relational GNNs address this more directly.
23. One-Sentence Summary
Graph Attention Networks learn node representations by allowing each node to assign different attention weights to different neighbours, so the model can decide which neighbouring information is most important for prediction.
In very simple words, GAT lets every node listen more carefully to the neighbours that matter most.
No comments:
Post a Comment