Thursday, 4 June 2026

Understanding the GraphSAGE Paper

Understanding the GraphSAGE Paper

The paper “Inductive Representation Learning on Large Graphs” was written by William L. Hamilton, Rex Ying, and Jure Leskovec. It introduces GraphSAGE, which stands for Graph SAmple and aggreGatE.

The main idea of the paper is to learn a function that can generate embeddings for new, unseen nodes, instead of learning a separate embedding for every node in a fixed graph.

1. The Main Problem

Before GraphSAGE, many graph embedding methods such as DeepWalk, node2vec, LINE, and matrix-factorization methods were mostly transductive. This means they learned embeddings for nodes that were already present during training.

For example, if a graph contains nodes:

\[ v_1, v_2, v_3, \ldots, v_n \]

a transductive method learns a separate embedding for each node:

\[ z_{v_1}, z_{v_2}, z_{v_3}, \ldots, z_{v_n} \]

This works when the graph is fixed. However, in real life, graphs constantly change.

Real-World Graph New Nodes May Be
Reddit graph New posts
YouTube graph New users or videos
Citation graph New research papers
Protein interaction graph New proteins or new organisms
Product graph New products

The problem is simple:

What happens when a new node appears after training?

Traditional transductive methods cannot immediately generate a good embedding for that new node. They often need retraining or extra optimization. GraphSAGE solves this problem.

2. What GraphSAGE Does Differently

GraphSAGE does not learn one embedding per node. Instead, it learns an embedding-generating function.

In simple words, GraphSAGE learns how to create an embedding for a node by looking at that node’s features and the features of its neighbours.

So instead of storing:

\[ \text{node} \rightarrow \text{fixed embedding} \]

GraphSAGE learns:

\[ \text{node features + neighbourhood features} \rightarrow \text{embedding} \]

This is why GraphSAGE is called an inductive method. It can generalize to:

\[ \text{unseen nodes} \]

and even:

\[ \text{unseen graphs} \]

provided that similar kinds of node features are available.

3. The Meaning of Inductive Learning

The word inductive is very important in this paper.

A transductive method answers:

“Given this fixed graph, what are the embeddings of these known nodes?”

An inductive method answers:

“Can I learn a rule for generating embeddings, so that I can apply it to new nodes later?”

Method Type What It Learns Can Handle Unseen Nodes Easily?
Transductive embedding Embedding of each existing node No
Inductive embedding Function to generate embeddings Yes

GraphSAGE belongs to the second category.

4. The Core Idea: Sample and Aggregate

The name GraphSAGE comes from:

\[ \text{SAmple and aggreGatE} \]

Figure 1 of the paper gives the visual summary of the method. It shows three steps: first, GraphSAGE samples a node’s local neighbourhood; second, it aggregates feature information from neighbours; third, it uses the aggregated information to predict graph context or labels.

The process is:

Step What Happens
1 Pick a target node
2 Sample some neighbours
3 Collect their feature vectors
4 Aggregate neighbour information
5 Combine it with the target node’s own feature
6 Produce an embedding

A node’s identity is partly determined by its own features and partly by the company it keeps.

5. The GraphSAGE Embedding Generation Algorithm

Let the initial feature vector of node \(v\) be:

\[ h_v^{0} = x_v \]

Symbol Meaning
\(v\) A node
\(x_v\) Original feature vector of node \(v\)
\(h_v^k\) Representation of node \(v\) at layer or depth \(k\)
\(N(v)\) Neighbourhood of node \(v\)
\(K\) Number of aggregation layers or depths

At each layer \(k\), GraphSAGE first aggregates information from neighbours:

\[ h_{N(v)}^{k} = \text{AGGREGATE}_{k} \left( \{h_u^{k-1}, \forall u \in N(v)\} \right) \]

Then it combines the node’s previous representation with the aggregated neighbour representation:

\[ h_v^{k} = \sigma \left( W^k \cdot \text{CONCAT} \left( h_v^{k-1}, h_{N(v)}^{k} \right) \right) \]

Then it normalizes the representation:

\[ h_v^{k} = \frac{h_v^{k}}{\|h_v^{k}\|_2} \]

After \(K\) layers, the final embedding is:

\[ z_v = h_v^{K} \]

This means the final embedding of a node contains information from its multi-hop neighbourhood.

6. One-Hop and Two-Hop Understanding

If:

\[ K = 1 \]

then the node receives information from its immediate neighbours.

If:

\[ K = 2 \]

then the node receives information from neighbours and neighbours-of-neighbours.

So GraphSAGE gradually expands the receptive field of a node:

\[ \text{node} \rightarrow \text{neighbours} \rightarrow \text{neighbours of neighbours} \]

The paper reports that \(K = 2\) works well in practice. Increasing \(K\) beyond 2 gives only small gains but greatly increases runtime.

7. Why Sampling is Needed

In large graphs, a node may have thousands or millions of neighbours. Aggregating all neighbours would be expensive. So GraphSAGE samples a fixed number of neighbours.

For example, instead of using all neighbours, it may sample:

\[ S_1 = 25 \]

neighbours at the first depth and:

\[ S_2 = 10 \]

neighbours at the second depth.

This keeps computation manageable. Without sampling, the neighbourhood can explode. With sampling, the computation remains controlled.

GraphSAGE does not try to listen to everyone. It listens to a fixed sample of neighbours.

8. Aggregator Functions

The aggregator is the heart of GraphSAGE. Since a node’s neighbours are an unordered set, the aggregator should ideally work regardless of the order in which neighbours are listed.

The paper studies three main aggregator types.

Aggregator Main Idea
Mean aggregator Take the average of neighbour vectors
LSTM aggregator Process neighbour vectors through an LSTM
Pooling aggregator Transform neighbour vectors and apply max-pooling

9. Mean Aggregator

The mean aggregator simply averages the neighbour representations:

\[ h_{N(v)}^{k} = \text{MEAN} \left( \{h_u^{k-1}, \forall u \in N(v)\} \right) \]

This is simple and efficient. The paper notes that this is close to the idea of Graph Convolutional Networks, but GraphSAGE makes it inductive and uses sampled neighbourhoods.

10. LSTM Aggregator

The LSTM aggregator is more expressive because an LSTM can learn a more complex function over inputs.

However, there is one issue. LSTMs are sequence models, and node neighbours do not have a natural order. So the paper applies the LSTM to a random permutation of neighbours.

This makes it more powerful than mean aggregation, but also less naturally suited to unordered graph neighbourhoods.

11. Pooling Aggregator

The pooling aggregator first applies a neural network to each neighbour vector:

\[ \sigma(W_{\text{pool}}h_{u}^{k} + b) \]

Then it applies element-wise max pooling:

\[ \text{AGGREGATE}_{k}^{\text{pool}} = \max \left( \{\sigma(W_{\text{pool}}h_{u}^{k} + b), \forall u_i \in N(v)\} \right) \]

This lets the model detect important features from the neighbourhood.

The pooling aggregator asks: “What is the strongest signal present among this node’s neighbours?”

The paper finds that the pooling and LSTM aggregators often perform better than the simple mean aggregator.

12. Supervised and Unsupervised Learning

GraphSAGE can be trained in two ways. First, it can be trained in a supervised way using labels, such as node categories:

\[ \text{node embedding} \rightarrow \text{class label} \]

Second, it can be trained in an unsupervised way using graph context. Nearby nodes should have similar embeddings, while unrelated nodes should have different embeddings.

The paper uses an unsupervised loss:

\[ J_G(z_u) = -\log(\sigma(z_u^T z_v)) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log(\sigma(-z_u^T z_{v_n})) \]

Symbol Meaning
\(z_u\) Embedding of node \(u\)
\(z_v\) Embedding of a nearby or context node
\(z_{v_n}\) Embedding of a negative sample
\(Q\) Number of negative samples
\(\sigma\) Sigmoid function

The intuition is:

\[ \text{nearby nodes} \Rightarrow \text{similar embeddings} \]

\[ \text{negative sampled nodes} \Rightarrow \text{different embeddings} \]

13. How GraphSAGE Differs from GCN

The previous paper, Kipf and Welling’s GCN, is mainly transductive. It assumes the graph is available during training. GraphSAGE extends this idea into an inductive setting.

GCN GraphSAGE
Usually transductive Inductive
Uses full adjacency matrix Samples neighbourhoods
Learns using fixed graph structure Learns aggregation functions
Harder to apply directly to unseen nodes Designed for unseen nodes
Uses normalized graph convolution Uses sample-and-aggregate framework

GCN learns on a graph. GraphSAGE learns how to generate embeddings from neighbourhoods.

14. Experiments

The paper tests GraphSAGE on three inductive node classification tasks.

Dataset Task
Citation graph Predict paper subject categories
Reddit graph Predict subreddit or community of posts
Protein-protein interaction graphs Predict protein functions

The key point is that the test nodes are unseen during training. In the protein-protein interaction experiment, the model is tested on entirely unseen graphs. This is a strong test of inductive learning.

15. Results

The paper reports that GraphSAGE performs better than strong baselines such as raw features, DeepWalk, and DeepWalk plus features.

Table 1 reports micro-F1 scores across Citation, Reddit, and PPI datasets. The supervised GraphSAGE variants perform strongly. Examples include:

GraphSAGE Variant Dataset Reported Supervised F1
GraphSAGE-pool Citation \(0.839\)
GraphSAGE-LSTM Reddit \(0.954\)
GraphSAGE-LSTM PPI \(0.612\)

The table also shows that GraphSAGE improves over raw features by:

\[ 39\% \text{ to } 63\% \]

depending on dataset and training mode.

16. Runtime Advantage

DeepWalk becomes slow for unseen nodes because it must generate new random walks and perform optimization again. GraphSAGE is much faster at inference because it simply applies the learned aggregator functions.

The paper reports that DeepWalk can be:

\[ 100\times \text{ to } 500\times \]

slower at test time for unseen nodes.

This is a major practical advantage of GraphSAGE.

17. Why GraphSAGE is Useful

GraphSAGE is useful because it solves a real deployment problem. Many graph systems are dynamic, and new nodes appear all the time.

Domain New Node
Social media New user or post
E-commerce New product or customer
Citation database New research paper
Biology New protein or organism
Textile knowledge graph New saree image or motif

GraphSAGE can produce embeddings for these new nodes without retraining the entire graph embedding model.

18. Relation to Weisfeiler-Lehman

The paper connects GraphSAGE to the Weisfeiler-Lehman graph isomorphism test. The Weisfeiler-Lehman method repeatedly updates a node’s representation by aggregating information from neighbouring nodes.

GraphSAGE does something similar, but with trainable neural aggregators. So we can think of GraphSAGE as:

A neural, trainable version of neighbourhood aggregation.

This is similar to the appendix idea in the GCN paper, but GraphSAGE emphasizes inductive embedding generation.

19. Relevance to Saree Provenance Classification

This paper is very relevant for saree provenance classification if one builds a graph-based saree knowledge system.

Imagine a graph with nodes such as:

Node Type Example
Saree image Image of a Kanjivaram saree
Motif Peacock, mango, temple border
Region Tamil Nadu, Varanasi, Telangana
Weave technique Ikat, brocade, jamdani
Material Silk, cotton, zari
Craft cluster Kanjivaram, Banarasi, Gadwal

Edges could represent:

Edge Meaning
saree-has-motif A saree contains a motif
saree-from-region A saree belongs to a region
craft-uses-technique A craft uses a technique
motif-associated-with-region A motif is linked to a region

Now suppose a new saree image enters the system. A transductive graph method may struggle because the new node was not present during training. GraphSAGE is useful because it can generate an embedding for this new saree node using:

\[ \text{its visual features} \]

and:

\[ \text{its graph neighbourhood} \]

For example:

\[ z_{\text{new saree}} = \text{GraphSAGE} ( \text{image features}, \text{motif neighbours}, \text{region neighbours}, \text{weave neighbours} ) \]

This makes GraphSAGE attractive for a live saree classification or recommendation system where new products keep coming.

20. Difference from the Previous GCN Paper

The GCN paper gave the foundation:

How can a node learn from its neighbours?

The GraphSAGE paper asks the next practical question:

How can we do this for new nodes and large evolving graphs?

Question Paper
How to perform graph convolution for semi-supervised node classification? GCN
How to generate embeddings for unseen nodes in large graphs? GraphSAGE

21. Main Limitation

GraphSAGE depends heavily on the quality of node features. If a new node has poor or missing features, the generated embedding may also be weak.

Neighbour sampling also introduces approximation. It improves scalability but may miss important neighbours. Another limitation is that the paper mainly studies fixed-size uniform neighbour sampling. The authors themselves mention that learning better non-uniform sampling strategies could be a future direction.

22. One-Sentence Summary

GraphSAGE is an inductive graph representation learning method that learns how to generate node embeddings by sampling and aggregating feature information from a node’s local neighbourhood.

In very simple words, GraphSAGE teaches the model how to create an embedding for a new node by looking at the node itself and the neighbours around it.

```

No comments:

Post a Comment

Understanding the Paper: Drishtikon

DRISHTIKON: A Multimodal Multilingual Benchmark for Indian Cultural Understanding The paper “DRISHTIKON: A Multimodal Multilingual Benchm...