My Research Notes: Understanding the Graph Attention Networks Paper

Understanding the Graph Attention Networks Paper

The paper “Graph Attention Networks” was written by Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. It introduces Graph Attention Networks, commonly called GATs.

The key idea of the paper is simple but powerful: when a node receives information from its neighbours, not all neighbours should be treated equally. The model should learn which neighbours deserve more attention.

1. The Problem This Paper Addresses

Earlier graph neural networks, especially Graph Convolutional Networks or GCNs, aggregate information from neighbouring nodes. In a GCN, a node collects information from its neighbours using a normalized average-like operation.

This is useful, but it has one limitation:

All neighbours are treated in a largely fixed, structurally determined way.

In many graphs, different neighbours may have different importance. For example, in a citation network, a paper may cite ten papers. Some citations may be central to the paper’s topic, while others may be only background references.

In a saree knowledge graph, a saree may be connected to many attributes: colour, motif, region, material, border type, pallu type, weaving technique, and price range. For provenance classification, some neighbours may matter more than others. A motif or weave technique may matter more than colour.

So the paper asks:

Can the model learn how much attention to give to each neighbour?

Graph Attention Networks answer this question with a clear yes.

2. The Main Idea of Graph Attention Networks

The main idea is:

\[ \text{Node representation} = \text{weighted combination of neighbour representations} \]

Unlike GCN, the weights are not fixed only by graph degree normalization. They are learned using an attention mechanism.

For a node \(i\), GAT learns attention weights:

\[ \alpha_{ij} \]

Here, \(\alpha_{ij}\) means:

How important is neighbour \(j\) for updating node \(i\)?

This is the central concept of the paper.

3. Why Attention is Useful in Graphs

Attention helps because graph neighbourhoods are irregular. One node may have two neighbours, while another node may have two hundred neighbours. Also, neighbours do not have a natural order.

Attention solves this by assigning a learned importance score to each neighbour.

A GAT lets a node decide which neighbours to listen to more carefully.

This makes the model more flexible than a simple average.

4. Input and Output of a GAT Layer

Suppose we have \(N\) nodes. Each node has a feature vector:

\[ h_i \in \mathbb{R}^{F} \]

The input to a GAT layer is:

\[ h = \{h_1, h_2, \ldots, h_N\} \]

The output is a new set of node features:

\[ h' = \{h'_1, h'_2, \ldots, h'_N\} \]

where:

\[ h'_i \in \mathbb{R}^{F'} \]

So the layer transforms each node from \(F\)-dimensional features to \(F'\)-dimensional features.

5. Linear Transformation of Node Features

First, each node feature vector is transformed using a shared weight matrix:

\[ W \in \mathbb{R}^{F' \times F} \]

So each node becomes:

\[ Wh_i \]

This is similar to a normal neural network layer. The same \(W\) is applied to every node, which gives weight sharing.

6. Attention Score Between Two Nodes

For node \(i\) and its neighbour \(j\), GAT computes a raw attention score:

\[ e_{ij} = a(Wh_i, Wh_j) \]

This score tells us how important node \(j\) is to node \(i\).

In the paper, the attention mechanism uses a small feedforward neural network. Expanded, it becomes:

\[ e_{ij} = \text{LeakyReLU} \left( a^T[Wh_i \Vert Wh_j] \right) \]

Symbol	Meaning
\(h_i\)	Feature vector of node \(i\)
\(h_j\)	Feature vector of neighbour node \(j\)
\(W\)	Shared learnable weight matrix
\(a^T\)	Learnable attention vector
\(\Vert\)	Concatenation
\(e_{ij}\)	Raw attention score from node \(i\) to node \(j\)

The concatenation:

\[ [Wh_i \Vert Wh_j] \]

means the model looks at node \(i\) and node \(j\) together before deciding how important \(j\) is for \(i\).

7. Masked Attention

In a fully connected attention model, every node could attend to every other node. But in a graph, GAT uses masked attention. This means node \(i\) attends only to its graph neighbours:

\[ j \in N_i \]

where \(N_i\) is the neighbourhood of node \(i\).

The graph structure acts as a mask. The model does not directly consider unrelated nodes. This keeps the model efficient and respects the graph structure.

8. Softmax Normalization of Attention Scores

The raw attention scores \(e_{ij}\) are normalized using softmax:

\[ \alpha_{ij} = \text{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})} {\sum_{k \in N_i}\exp(e_{ik})} \]

This makes the attention scores comparable across all neighbours of node \(i\).

After softmax:

\[ \sum_{j \in N_i} \alpha_{ij} = 1 \]

So \(\alpha_{ij}\) behaves like an importance weight.

9. Updating the Node Representation

Once attention weights are computed, GAT updates node \(i\) by taking a weighted sum of transformed neighbour features:

\[ h'_i = \sigma \left( \sum_{j \in N_i} \alpha_{ij}Wh_j \right) \]

This is the core GAT update equation.

Node \(i\)'s new representation is built from its neighbours, but important neighbours contribute more.

10. Simple Example

Suppose a paper node is connected to three cited papers:

Neighbour	Relevance to Paper Topic	Attention Weight
Paper A	Very relevant	\(0.70\)
Paper B	Somewhat relevant	\(0.20\)
Paper C	Weakly relevant	\(0.10\)

A GCN may average them structurally. A GAT can learn that Paper A should contribute more strongly.

Similarly, in a saree graph:

Neighbour	Example	Possible Importance
Motif	Temple border	High
Technique	Brocade weaving	High
Colour	Red	Medium or low
Price band	₹10,000–₹15,000	Low
Region	Tamil Nadu	High

A GAT can learn that some attributes are more useful for provenance classification than others.

11. Multi-Head Attention

The paper also uses multi-head attention. Instead of using one attention mechanism, GAT uses several independent attention heads.

For \(K\) attention heads, the output can be concatenated:

\[ h'_i = \mathbin\Vert_{k=1}^{K} \sigma \left( \sum_{j \in N_i} \alpha_{ij}^{k}W^{k}h_j \right) \]

Here, each head \(k\) has its own attention weights and transformation matrix.

Different attention heads can learn different kinds of relationships.

For example, in a saree knowledge graph:

Attention Head	May Learn to Focus On
Head 1	Motif relationships
Head 2	Weave relationships
Head 3	Regional relationships
Head 4	Material relationships

In the final prediction layer, the paper averages the heads instead of concatenating them:

\[ h'_i = \sigma \left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{k}W^{k}h_j \right) \]

12. What Figure 1 Shows

Figure 1 in the paper is the key visual explanation. The left side shows how the attention mechanism computes \(\alpha_{ij}\) using transformed node features \(Wh_i\) and \(Wh_j\). The right side shows multi-head attention, where node 1 attends to its neighbourhood through multiple independent attention heads.

Figure 1 shows how a node looks at its neighbours through multiple attention lenses.

13. Why GAT Improves Over GCN

GCN and GAT both aggregate neighbourhood information, but they differ in how neighbour importance is decided.

Aspect	GCN	GAT
Neighbour weighting	Mostly fixed by graph degree normalization	Learned through attention
Importance of neighbours	Similar structural treatment	Different learned importance
Interpretability	Limited	Attention weights can be inspected
Inductive use	Possible but less natural in original spectral framing	More naturally applicable
Need for expensive spectral operations	No in simplified GCN, but derived from spectral theory	No spectral operation needed

GAT’s main advantage is that it can learn:

\[ \text{Neighbour } j \text{ is more important than neighbour } k \]

for a particular node \(i\).

14. Computational Efficiency

The paper emphasizes that GAT does not require costly matrix operations such as graph Laplacian eigendecomposition.

A single attention head has time complexity:

\[ O(|V|FF' + |E|F') \]

Symbol	Meaning
\(\|V\|\)	Number of nodes
\(\|E\|\)	Number of edges
\(F\)	Input feature dimension
\(F'\)	Output feature dimension

This makes GAT suitable for graph-structured data without requiring expensive spectral computations.

15. Transductive and Inductive Learning

The paper evaluates GAT in both transductive and inductive settings.

Setting	Meaning
Transductive learning	The graph is known during training, but only some node labels are used
Inductive learning	The model must generalize to new graphs that were not seen during training

The paper tests GAT on:

Dataset	Task Type
Cora	Transductive
Citeseer	Transductive
Pubmed	Transductive
PPI	Inductive

In the PPI dataset, the test graphs are completely unseen during training. This shows that GAT is not limited to one fixed graph.

16. Dataset Details

The paper summarizes the datasets as follows:

Dataset	Nodes	Edges	Features per Node	Classes
Cora	2,708	5,429	1,433	7
Citeseer	3,327	4,732	3,703	6
Pubmed	19,717	44,338	500	3
PPI	56,944 across 24 graphs	818,716	50	121 multilabel

For citation datasets, nodes are papers and edges are citation links. For PPI, nodes are proteins and edges represent protein-protein interactions.

17. Results

The paper reports strong performance. For transductive node classification, GAT achieves:

Dataset	GCN	GAT
Cora	\(81.5\%\)	\(83.0 \pm 0.7\%\)
Citeseer	\(70.3\%\)	\(72.5 \pm 0.7\%\)
Pubmed	\(79.0\%\)	\(79.0 \pm 0.3\%\)

For the inductive PPI task, GAT achieves:

\[ 0.973 \pm 0.002 \]

micro-F1, compared with:

\[ 0.934 \pm 0.006 \]

for the constant-attention version and:

\[ 0.612 \]

for the best reported original GraphSAGE variant in the table.

The improvement on PPI is especially important because it shows that attention is very useful when generalizing to unseen graphs.

18. What Figure 2 Shows

Figure 2 on page 9 shows a t-SNE visualization of feature representations learned by the first hidden layer of a GAT model on the Cora dataset. The node colours represent classes, and the edge thickness represents the strength of aggregated attention coefficients.

Figure 2 shows that GAT learns a graph representation where nodes of the same class tend to group together.

19. Why Attention Gives Interpretability

One interesting advantage of GAT is that attention weights can be inspected. If node \(i\) gives high attention to node \(j\), then we can ask why that neighbour mattered.

For saree provenance work, this is important. A GAT could reveal whether the model is attending more to motif, border, weave, pallu, region, material, or similar saree examples.

For example:

\[ \alpha_{\text{saree, motif}} = 0.42 \]

\[ \alpha_{\text{saree, region}} = 0.31 \]

\[ \alpha_{\text{saree, colour}} = 0.05 \]

This would suggest that the model found motif and region more important than colour for that prediction.

20. Relation to GCN and GraphSAGE

GCN, GraphSAGE, and GAT are closely related, but each paper solves a different problem.

Paper	Core Idea	Main Limitation Addressed
GCN	Aggregate normalized neighbour information	Uses fixed structural weighting
GraphSAGE	Sample and aggregate neighbours for inductive learning	Handles unseen nodes
GAT	Learn attention weights over neighbours	Learns which neighbours matter more

GCN says “learn from neighbours,” GraphSAGE says “sample neighbours for unseen nodes,” and GAT says “learn which neighbours are important.”

21. Relevance to Saree Provenance Classification

GAT is highly relevant to saree provenance classification if image embeddings are combined with a structured textile knowledge graph.

Imagine a graph where nodes include:

Node Type	Example
Saree image	Image of a Kanjivaram saree
Craft cluster	Kanjivaram, Banarasi, Paithani
Region	Tamil Nadu, Varanasi, Maharashtra
Motif	Peacock, paisley, temple, floral
Technique	Brocade, ikat, jamdani
Material	Silk, cotton, zari
Border type	Korvai, temple border, zari border

Edges may include:

Edge Type	Meaning
saree-has-motif	A saree contains a motif
saree-from-region	A saree belongs to a region
craft-uses-technique	A craft uses weaving technique
saree-has-border	A saree has a border type
motif-associated-with-craft	A motif is common in a craft

For a new saree, GAT can learn which neighbouring nodes are important for classification.

For example, for one saree:

\[ \alpha_{\text{weave}} > \alpha_{\text{colour}} \]

For another saree:

\[ \alpha_{\text{motif}} > \alpha_{\text{material}} \]

This is useful because saree identity is not determined by one feature alone. Different craft clusters may require attention to different cues.

22. Main Limitations

The paper notes some limitations and future directions. First, practical implementation with sparse matrix operations can limit batching, especially for datasets with multiple graphs.

Second, the model’s receptive field is still limited by depth. A two-layer GAT only reaches two-hop neighbourhoods.

Third, the paper does not incorporate edge features directly. This matters because in many knowledge graphs, the edge type itself carries meaning. For example:

\[ \text{saree} \rightarrow \text{motif} \]

and:

\[ \text{saree} \rightarrow \text{region} \]

are not the same kind of relation. Later graph attention models and relational GNNs address this more directly.

23. One-Sentence Summary

Graph Attention Networks learn node representations by allowing each node to assign different attention weights to different neighbours, so the model can decide which neighbouring information is most important for prediction.

In very simple words, GAT lets every node listen more carefully to the neighbours that matter most.

```

My Research Notes

Thursday, 4 June 2026

Understanding the Graph Attention Networks Paper

Understanding the Graph Attention Networks Paper

1. The Problem This Paper Addresses

2. The Main Idea of Graph Attention Networks

3. Why Attention is Useful in Graphs

4. Input and Output of a GAT Layer

5. Linear Transformation of Node Features

6. Attention Score Between Two Nodes

7. Masked Attention

8. Softmax Normalization of Attention Scores

9. Updating the Node Representation

10. Simple Example

11. Multi-Head Attention

12. What Figure 1 Shows

13. Why GAT Improves Over GCN

14. Computational Efficiency

15. Transductive and Inductive Learning

16. Dataset Details

17. Results

18. What Figure 2 Shows

19. Why Attention Gives Interpretability

20. Relation to GCN and GraphSAGE

21. Relevance to Saree Provenance Classification

22. Main Limitations

23. One-Sentence Summary

No comments:

Post a Comment

Understading the Paper: Fine Grained Image Analysis with Deep Learning