Understanding the GraphSAGE Paper
The paper “Inductive Representation Learning on Large Graphs” was written by William L. Hamilton, Rex Ying, and Jure Leskovec. It introduces GraphSAGE, which stands for Graph SAmple and aggreGatE.
The main idea of the paper is to learn a function that can generate embeddings for new, unseen nodes, instead of learning a separate embedding for every node in a fixed graph.
1. The Main Problem
Before GraphSAGE, many graph embedding methods such as DeepWalk, node2vec, LINE, and matrix-factorization methods were mostly transductive. This means they learned embeddings for nodes that were already present during training.
For example, if a graph contains nodes:
\[ v_1, v_2, v_3, \ldots, v_n \]
a transductive method learns a separate embedding for each node:
\[ z_{v_1}, z_{v_2}, z_{v_3}, \ldots, z_{v_n} \]
This works when the graph is fixed. However, in real life, graphs constantly change.
| Real-World Graph | New Nodes May Be |
|---|---|
| Reddit graph | New posts |
| YouTube graph | New users or videos |
| Citation graph | New research papers |
| Protein interaction graph | New proteins or new organisms |
| Product graph | New products |
The problem is simple:
What happens when a new node appears after training?
Traditional transductive methods cannot immediately generate a good embedding for that new node. They often need retraining or extra optimization. GraphSAGE solves this problem.
2. What GraphSAGE Does Differently
GraphSAGE does not learn one embedding per node. Instead, it learns an embedding-generating function.
In simple words, GraphSAGE learns how to create an embedding for a node by looking at that node’s features and the features of its neighbours.
So instead of storing:
\[ \text{node} \rightarrow \text{fixed embedding} \]
GraphSAGE learns:
\[ \text{node features + neighbourhood features} \rightarrow \text{embedding} \]
This is why GraphSAGE is called an inductive method. It can generalize to:
\[ \text{unseen nodes} \]
and even:
\[ \text{unseen graphs} \]
provided that similar kinds of node features are available.
3. The Meaning of Inductive Learning
The word inductive is very important in this paper.
A transductive method answers:
“Given this fixed graph, what are the embeddings of these known nodes?”
An inductive method answers:
“Can I learn a rule for generating embeddings, so that I can apply it to new nodes later?”
| Method Type | What It Learns | Can Handle Unseen Nodes Easily? |
|---|---|---|
| Transductive embedding | Embedding of each existing node | No |
| Inductive embedding | Function to generate embeddings | Yes |
GraphSAGE belongs to the second category.
4. The Core Idea: Sample and Aggregate
The name GraphSAGE comes from:
\[ \text{SAmple and aggreGatE} \]
Figure 1 of the paper gives the visual summary of the method. It shows three steps: first, GraphSAGE samples a node’s local neighbourhood; second, it aggregates feature information from neighbours; third, it uses the aggregated information to predict graph context or labels.
The process is:
| Step | What Happens |
|---|---|
| 1 | Pick a target node |
| 2 | Sample some neighbours |
| 3 | Collect their feature vectors |
| 4 | Aggregate neighbour information |
| 5 | Combine it with the target node’s own feature |
| 6 | Produce an embedding |
A node’s identity is partly determined by its own features and partly by the company it keeps.
5. The GraphSAGE Embedding Generation Algorithm
Let the initial feature vector of node \(v\) be:
\[ h_v^{0} = x_v \]
| Symbol | Meaning |
|---|---|
| \(v\) | A node |
| \(x_v\) | Original feature vector of node \(v\) |
| \(h_v^k\) | Representation of node \(v\) at layer or depth \(k\) |
| \(N(v)\) | Neighbourhood of node \(v\) |
| \(K\) | Number of aggregation layers or depths |
At each layer \(k\), GraphSAGE first aggregates information from neighbours:
\[ h_{N(v)}^{k} = \text{AGGREGATE}_{k} \left( \{h_u^{k-1}, \forall u \in N(v)\} \right) \]
Then it combines the node’s previous representation with the aggregated neighbour representation:
\[ h_v^{k} = \sigma \left( W^k \cdot \text{CONCAT} \left( h_v^{k-1}, h_{N(v)}^{k} \right) \right) \]
Then it normalizes the representation:
\[ h_v^{k} = \frac{h_v^{k}}{\|h_v^{k}\|_2} \]
After \(K\) layers, the final embedding is:
\[ z_v = h_v^{K} \]
This means the final embedding of a node contains information from its multi-hop neighbourhood.
6. One-Hop and Two-Hop Understanding
If:
\[ K = 1 \]
then the node receives information from its immediate neighbours.
If:
\[ K = 2 \]
then the node receives information from neighbours and neighbours-of-neighbours.
So GraphSAGE gradually expands the receptive field of a node:
\[ \text{node} \rightarrow \text{neighbours} \rightarrow \text{neighbours of neighbours} \]
The paper reports that \(K = 2\) works well in practice. Increasing \(K\) beyond 2 gives only small gains but greatly increases runtime.
7. Why Sampling is Needed
In large graphs, a node may have thousands or millions of neighbours. Aggregating all neighbours would be expensive. So GraphSAGE samples a fixed number of neighbours.
For example, instead of using all neighbours, it may sample:
\[ S_1 = 25 \]
neighbours at the first depth and:
\[ S_2 = 10 \]
neighbours at the second depth.
This keeps computation manageable. Without sampling, the neighbourhood can explode. With sampling, the computation remains controlled.
GraphSAGE does not try to listen to everyone. It listens to a fixed sample of neighbours.
8. Aggregator Functions
The aggregator is the heart of GraphSAGE. Since a node’s neighbours are an unordered set, the aggregator should ideally work regardless of the order in which neighbours are listed.
The paper studies three main aggregator types.
| Aggregator | Main Idea |
|---|---|
| Mean aggregator | Take the average of neighbour vectors |
| LSTM aggregator | Process neighbour vectors through an LSTM |
| Pooling aggregator | Transform neighbour vectors and apply max-pooling |
9. Mean Aggregator
The mean aggregator simply averages the neighbour representations:
\[ h_{N(v)}^{k} = \text{MEAN} \left( \{h_u^{k-1}, \forall u \in N(v)\} \right) \]
This is simple and efficient. The paper notes that this is close to the idea of Graph Convolutional Networks, but GraphSAGE makes it inductive and uses sampled neighbourhoods.
10. LSTM Aggregator
The LSTM aggregator is more expressive because an LSTM can learn a more complex function over inputs.
However, there is one issue. LSTMs are sequence models, and node neighbours do not have a natural order. So the paper applies the LSTM to a random permutation of neighbours.
This makes it more powerful than mean aggregation, but also less naturally suited to unordered graph neighbourhoods.
11. Pooling Aggregator
The pooling aggregator first applies a neural network to each neighbour vector:
\[ \sigma(W_{\text{pool}}h_{u}^{k} + b) \]
Then it applies element-wise max pooling:
\[ \text{AGGREGATE}_{k}^{\text{pool}} = \max \left( \{\sigma(W_{\text{pool}}h_{u}^{k} + b), \forall u_i \in N(v)\} \right) \]
This lets the model detect important features from the neighbourhood.
The pooling aggregator asks: “What is the strongest signal present among this node’s neighbours?”
The paper finds that the pooling and LSTM aggregators often perform better than the simple mean aggregator.
12. Supervised and Unsupervised Learning
GraphSAGE can be trained in two ways. First, it can be trained in a supervised way using labels, such as node categories:
\[ \text{node embedding} \rightarrow \text{class label} \]
Second, it can be trained in an unsupervised way using graph context. Nearby nodes should have similar embeddings, while unrelated nodes should have different embeddings.
The paper uses an unsupervised loss:
\[ J_G(z_u) = -\log(\sigma(z_u^T z_v)) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log(\sigma(-z_u^T z_{v_n})) \]
| Symbol | Meaning |
|---|---|
| \(z_u\) | Embedding of node \(u\) |
| \(z_v\) | Embedding of a nearby or context node |
| \(z_{v_n}\) | Embedding of a negative sample |
| \(Q\) | Number of negative samples |
| \(\sigma\) | Sigmoid function |
The intuition is:
\[ \text{nearby nodes} \Rightarrow \text{similar embeddings} \]
\[ \text{negative sampled nodes} \Rightarrow \text{different embeddings} \]
13. How GraphSAGE Differs from GCN
The previous paper, Kipf and Welling’s GCN, is mainly transductive. It assumes the graph is available during training. GraphSAGE extends this idea into an inductive setting.
| GCN | GraphSAGE |
|---|---|
| Usually transductive | Inductive |
| Uses full adjacency matrix | Samples neighbourhoods |
| Learns using fixed graph structure | Learns aggregation functions |
| Harder to apply directly to unseen nodes | Designed for unseen nodes |
| Uses normalized graph convolution | Uses sample-and-aggregate framework |
GCN learns on a graph. GraphSAGE learns how to generate embeddings from neighbourhoods.
14. Experiments
The paper tests GraphSAGE on three inductive node classification tasks.
| Dataset | Task |
|---|---|
| Citation graph | Predict paper subject categories |
| Reddit graph | Predict subreddit or community of posts |
| Protein-protein interaction graphs | Predict protein functions |
The key point is that the test nodes are unseen during training. In the protein-protein interaction experiment, the model is tested on entirely unseen graphs. This is a strong test of inductive learning.
15. Results
The paper reports that GraphSAGE performs better than strong baselines such as raw features, DeepWalk, and DeepWalk plus features.
Table 1 reports micro-F1 scores across Citation, Reddit, and PPI datasets. The supervised GraphSAGE variants perform strongly. Examples include:
| GraphSAGE Variant | Dataset | Reported Supervised F1 |
|---|---|---|
| GraphSAGE-pool | Citation | \(0.839\) |
| GraphSAGE-LSTM | \(0.954\) | |
| GraphSAGE-LSTM | PPI | \(0.612\) |
The table also shows that GraphSAGE improves over raw features by:
\[ 39\% \text{ to } 63\% \]
depending on dataset and training mode.
16. Runtime Advantage
DeepWalk becomes slow for unseen nodes because it must generate new random walks and perform optimization again. GraphSAGE is much faster at inference because it simply applies the learned aggregator functions.
The paper reports that DeepWalk can be:
\[ 100\times \text{ to } 500\times \]
slower at test time for unseen nodes.
This is a major practical advantage of GraphSAGE.
17. Why GraphSAGE is Useful
GraphSAGE is useful because it solves a real deployment problem. Many graph systems are dynamic, and new nodes appear all the time.
| Domain | New Node |
|---|---|
| Social media | New user or post |
| E-commerce | New product or customer |
| Citation database | New research paper |
| Biology | New protein or organism |
| Textile knowledge graph | New saree image or motif |
GraphSAGE can produce embeddings for these new nodes without retraining the entire graph embedding model.
18. Relation to Weisfeiler-Lehman
The paper connects GraphSAGE to the Weisfeiler-Lehman graph isomorphism test. The Weisfeiler-Lehman method repeatedly updates a node’s representation by aggregating information from neighbouring nodes.
GraphSAGE does something similar, but with trainable neural aggregators. So we can think of GraphSAGE as:
A neural, trainable version of neighbourhood aggregation.
This is similar to the appendix idea in the GCN paper, but GraphSAGE emphasizes inductive embedding generation.
19. Relevance to Saree Provenance Classification
This paper is very relevant for saree provenance classification if one builds a graph-based saree knowledge system.
Imagine a graph with nodes such as:
| Node Type | Example |
|---|---|
| Saree image | Image of a Kanjivaram saree |
| Motif | Peacock, mango, temple border |
| Region | Tamil Nadu, Varanasi, Telangana |
| Weave technique | Ikat, brocade, jamdani |
| Material | Silk, cotton, zari |
| Craft cluster | Kanjivaram, Banarasi, Gadwal |
Edges could represent:
| Edge | Meaning |
|---|---|
| saree-has-motif | A saree contains a motif |
| saree-from-region | A saree belongs to a region |
| craft-uses-technique | A craft uses a technique |
| motif-associated-with-region | A motif is linked to a region |
Now suppose a new saree image enters the system. A transductive graph method may struggle because the new node was not present during training. GraphSAGE is useful because it can generate an embedding for this new saree node using:
\[ \text{its visual features} \]
and:
\[ \text{its graph neighbourhood} \]
For example:
\[ z_{\text{new saree}} = \text{GraphSAGE} ( \text{image features}, \text{motif neighbours}, \text{region neighbours}, \text{weave neighbours} ) \]
This makes GraphSAGE attractive for a live saree classification or recommendation system where new products keep coming.
20. Difference from the Previous GCN Paper
The GCN paper gave the foundation:
How can a node learn from its neighbours?
The GraphSAGE paper asks the next practical question:
How can we do this for new nodes and large evolving graphs?
| Question | Paper |
|---|---|
| How to perform graph convolution for semi-supervised node classification? | GCN |
| How to generate embeddings for unseen nodes in large graphs? | GraphSAGE |
21. Main Limitation
GraphSAGE depends heavily on the quality of node features. If a new node has poor or missing features, the generated embedding may also be weak.
Neighbour sampling also introduces approximation. It improves scalability but may miss important neighbours. Another limitation is that the paper mainly studies fixed-size uniform neighbour sampling. The authors themselves mention that learning better non-uniform sampling strategies could be a future direction.
22. One-Sentence Summary
GraphSAGE is an inductive graph representation learning method that learns how to generate node embeddings by sampling and aggregating feature information from a node’s local neighbourhood.
In very simple words, GraphSAGE teaches the model how to create an embedding for a new node by looking at the node itself and the neighbours around it.
No comments:
Post a Comment