Graph Neural Networks:
When and Why (2026)
Not everything is a graph problem, and not every graph problem needs a GNN. This guide helps you decide when graph structure genuinely helps -- and which architecture to reach for when it does.
TL;DR
- Use GNNs when: Your data has meaningful relational structure that a flat table or sequence throws away (molecules, social graphs, knowledge graphs, meshes)
- Skip GNNs when: The graph is fully connected (just use a Transformer), the graph is trivially small, or you only care about node features
- Start with: GCN for homogeneous node classification, GAT when neighbor importance varies, GraphSAGE for large/inductive graphs, GIN for graph-level tasks, GPS for long-range dependencies
- Library: PyTorch Geometric (PyG) dominates. DGL is the alternative. Both have OGB integration
When You Need a GNN (and When You Don't)
GNNs shine when...
- Structure is informative: a molecule's 3D connectivity determines its properties, not just its atom list
- Topology varies: different samples have different graph structures (molecular graphs, scene graphs)
- Relational reasoning matters: predicting interactions, influence propagation, link formation
- Inductive generalization: you need to work on unseen graphs at inference (drug candidates, new users)
- Sparse connections: each node connects to few neighbors relative to graph size
Skip GNNs when...
- Data is tabular: XGBoost still beats GNNs on feature-rich tabular data without meaningful relations
- Graph is fully connected: if every node connects to every other, you've reinvented attention -- just use a Transformer
- Edges are artificial: constructing a KNN graph from features rarely beats an MLP on the raw features
- Sequential structure dominates: for text/time series, sequence models capture the relevant inductive bias better
- Scale is tiny: under 100 nodes, classical methods (spectral clustering, label propagation) often suffice
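Label propagation, mentioned above as a classical alternative, fits in a dozen lines. This is a minimal sketch on a toy 5-node path graph (the graph, labels, and iteration count are illustrative): each unlabeled node repeatedly takes the majority label among its already-labeled neighbors, while seed labels stay fixed.

```python
from collections import Counter

# Toy path graph 0-1-2-3-4 with two labeled seed nodes at the ends.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: "A", 4: "B"}  # nodes 1-3 start unlabeled

for _ in range(5):
    new = dict(labels)
    for n, nbrs in neighbors.items():
        if n not in labels:  # seeds never change
            votes = Counter(labels[j] for j in nbrs if j in labels)
            if votes:
                new[n] = votes.most_common(1)[0][0]
    labels = new

# Labels spread outward from the seeds along edges; no training required.
```

For a ~100-node graph this often matches a GNN's accuracy at a tiny fraction of the complexity.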
The Decision Flowchart
1. Is your data naturally a graph (molecules, networks, meshes)?
No --> Probably don't need a GNN. Try tabular/sequence models first.
Yes --> Continue.
2. Does graph topology vary across samples?
No, single fixed graph --> Consider spectral methods or simple label propagation.
Yes, or large single graph --> GNN is a strong fit.
3. Do you need to generalize to unseen graphs?
No, transductive only --> GCN or spectral approaches work fine.
Yes, inductive --> GraphSAGE, GIN, or GPS.
4. Are long-range dependencies critical?
No, local context enough --> GCN or GAT (2-3 layers).
Yes --> GPS (Graph Transformer) or deep GNN with skip connections.
Five Architectures, Simply Explained
GCN
Graph Convolutional Network (Kipf & Welling, 2017)
One-liner: Each node averages its neighbors' features (weighted by degree), then applies a linear transform.
Intuition: Think of it as a CNN where the "kernel" slides over a node's neighborhood instead of a pixel grid. The symmetric normalization (dividing by the square root of both nodes' degrees) prevents high-degree nodes from dominating.
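That normalization can be sketched without any library. Below is a toy, library-free version of one GCN aggregation step on a 3-node path graph with self-loops (learned weights and nonlinearity omitted; all names are illustrative):

```python
import math

# Toy path graph 0 - 1 - 2, with self-loops added (as GCN does).
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
features = {0: 1.0, 1: 2.0, 2: 4.0}  # one scalar feature per node
degree = {n: len(nbrs) for n, nbrs in neighbors.items()}

def gcn_aggregate(feats):
    """Symmetric normalization: sum over neighbors j of x_j / sqrt(d_i * d_j)."""
    return {
        i: sum(feats[j] / math.sqrt(degree[i] * degree[j]) for j in nbrs)
        for i, nbrs in neighbors.items()
    }

h = gcn_aggregate(features)
# Node 1 sees all three nodes, but its higher degree (3) shrinks each
# contribution -- that is what keeps hub nodes from dominating.
```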
GAT
Graph Attention Network (Velickovic et al., 2018)
One-liner: Like GCN, but learns attention weights for each edge -- some neighbors matter more than others.
Intuition: In a citation network, not all papers that cite yours are equally relevant. GAT learns a small neural network that scores each edge, then uses softmax to turn scores into weights. Multi-head attention (like Transformers) stabilizes training.
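The score-then-softmax step can be sketched with a toy scorer standing in for GAT's learned one (the scalar features and the fixed weight vector `a` are illustrative, not GAT's actual parameters):

```python
import math

# Toy graph: node 0 has three in-neighbors, one scalar feature per node.
neighbors = {0: [1, 2, 3]}
features = {0: 1.0, 1: 0.5, 2: 2.0, 3: -1.0}

def score(x_i, x_j, a=(0.4, 0.6)):
    """Stand-in for GAT's scorer: LeakyReLU(a . [x_i, x_j])."""
    s = a[0] * x_i + a[1] * x_j
    return s if s > 0 else 0.2 * s  # LeakyReLU, slope 0.2

def attention_weights(i):
    """Softmax over node i's incoming edges turns scores into weights."""
    scores = [score(features[i], features[j]) for j in neighbors[i]]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

alpha = attention_weights(0)
h0 = sum(a * features[j] for a, j in zip(alpha, neighbors[0]))
# alpha sums to 1; the highest-scoring neighbor (node 2) gets the
# largest weight, so it dominates node 0's updated embedding.
```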
GraphSAGE
Sample and Aggregate (Hamilton et al., 2017)
One-liner: Samples a fixed-size neighborhood, aggregates features, and concatenates with the node's own embedding.
Intuition: The key insight is sampling. Instead of using the full neighborhood (impossible for billion-node graphs), you sample K neighbors at each layer. This enables mini-batch training and, critically, inductive learning -- the model generalizes to nodes and graphs it has never seen.
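The sampling idea in miniature -- a library-free sketch of one SAGE step where a node with 100 neighbors only ever touches a fixed-size sample (the mean aggregator and toy features are illustrative; the real layer concatenates and applies a learned weight matrix):

```python
import random

# Toy graph: node 0 has 100 neighbors, but we only sample k of them.
neighbors = {0: list(range(1, 101))}
features = {n: float(n) for n in range(101)}

def sage_step(node, k=10, seed=0):
    """Sample k neighbors, mean-aggregate, keep the node's own feature."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    sampled = rng.sample(neighbors[node], k)
    agg = sum(features[j] for j in sampled) / k
    return features[node], agg  # real SAGE: concat then linear transform

h_self, h_agg = sage_step(0)
# Cost is O(k) per node regardless of true degree -- this bound is what
# makes mini-batch training on billion-edge graphs feasible.
```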
GIN
Graph Isomorphism Network (Xu et al., 2019)
One-liner: The most expressive message-passing GNN -- provably as powerful as the 1-WL graph isomorphism test.
Intuition: GCN and GraphSAGE use mean/max aggregation, which can't distinguish certain graph structures. GIN uses sum aggregation with an MLP, which preserves multiset information. If two graphs have different local structures, GIN will map them to different embeddings. This makes it the go-to choice for graph classification.
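The expressiveness gap is easy to demonstrate concretely: two different neighborhoods that mean and max aggregation cannot tell apart, but sum can.

```python
# Two neighborhoods a GNN might see: same values, different multiplicities.
a = [1.0, 1.0, 2.0, 2.0]   # node with four neighbors
b = [1.0, 2.0]             # node with two neighbors

mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)   # 1.5 vs 1.5 -- collide
max_a, max_b = max(a), max(b)                        # 2.0 vs 2.0 -- collide
sum_a, sum_b = sum(a), sum(b)                        # 6.0 vs 3.0 -- distinct

# Sum preserves the multiset; mean and max throw away multiplicity.
# This is exactly why GIN uses sum aggregation.
```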
GPS
General, Powerful, Scalable Graph Transformer (Rampasek et al., 2022)
One-liner: Combines local message passing (MPNN) with global self-attention in each layer, plus positional/structural encodings to inject graph topology.
Intuition: Standard GNNs suffer from over-squashing: information from distant nodes gets bottlenecked through intermediate nodes. GPS fixes this by adding a Transformer-style global attention path alongside the local GNN path. Positional encodings (Laplacian eigenvectors, random walk statistics) give the model awareness of graph structure that vanilla attention lacks.
Benchmark Results (OGB, 2026)
Numbers from the Open Graph Benchmark leaderboards. These are realistic, reproducible results -- not cherry-picked.
ogbg-molpcba
Graph classification: predict 128 bioassay labels for ~438K molecules. Metric: AP (Average Precision).
| Model | Type | Test AP | Params |
|---|---|---|---|
| GPS + virtual node | Graph Transformer | 0.3212 | ~6.2M |
| GIN + virtual node | MPNN | 0.2921 | ~3.4M |
| GCN + virtual node | MPNN | 0.2724 | ~2.0M |
| GIN (no VN) | MPNN | 0.2703 | ~1.9M |
| GCN (no VN) | MPNN | 0.2424 | ~1.5M |
Virtual nodes add a global node connected to all others -- a simple way to enable long-range message passing without full attention.
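Mechanically, a virtual node is just one extra node wired to everything. A minimal sketch on a 4-node path graph (plain edge lists, no library; PyG users can get the same effect with its `VirtualNode` transform):

```python
# A 4-node path graph as a directed edge list (both directions stored).
num_nodes = 4
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]

def add_virtual_node(num_nodes, edges):
    """Append one node connected to every existing node, both directions."""
    vn = num_nodes  # the new node's index
    new_edges = list(edges)
    for n in range(num_nodes):
        new_edges += [(n, vn), (vn, n)]
    return num_nodes + 1, new_edges

n2, e2 = add_virtual_node(num_nodes, edges)
# Any two original nodes are now at most 2 hops apart (via the virtual
# node), so two message-passing layers suffice for global information
# flow -- at the cost of one high-degree bottleneck node.
```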
ogbn-arxiv
Node classification: predict subject area of ~170K arXiv CS papers from citation graph. Metric: Accuracy.
| Model | Type | Test Accuracy | Layers |
|---|---|---|---|
| GIANT-XRT + RevGAT | GAT + LM features | 77.36% | 3 |
| GAT + label propagation | GAT + post-processing | 74.15% | 3 |
| GCN | MPNN | 71.74% | 3 |
| GraphSAGE | MPNN (inductive) | 71.49% | 3 |
| MLP (no graph) | Baseline | 55.50% | 3 |
The MLP baseline shows how much graph structure matters here: +16 points from adding edges.
ogbl-collab
Link prediction: predict future collaborations between ~235K authors. Metric: Hits@50.
| Model | Type | Test Hits@50 |
|---|---|---|
| S2GAE + SEAL | GNN + subgraph | 66.79% |
| BUDDY | GNN + hashing | 65.94% |
| GraphSAGE + edge features | MPNN | 54.63% |
| Common Neighbors | Heuristic | 44.75% |
Link prediction is where GNNs combined with structural features (subgraph patterns, node degrees) dominate simple heuristics.
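The Common Neighbors heuristic from the table is worth seeing, because it is the baseline any learned model must beat. A minimal sketch on a toy collaboration graph (names are illustrative):

```python
# Adjacency as sets: who has collaborated with whom.
adj = {
    "alice": {"bob", "carol", "dave"},
    "bob":   {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave":  {"alice", "carol"},
    "eve":   {"dave"},
}

def common_neighbors(u, v):
    """Score a candidate link by the number of shared neighbors."""
    return len(adj[u] & adj[v])

# bob and dave share {alice, carol} -> strong link candidate (score 2)
# bob and eve share nobody         -> weak link candidate (score 0)
```

Subgraph methods like SEAL generalize this: instead of hand-picking one statistic, they let a GNN learn which patterns around a candidate edge are predictive.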
Real-World Applications
Drug Discovery
Molecules are graphs. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, binding affinity, solubility) directly from structure.
Models: GIN, SchNet, DimeNet++ for 3D, GPS for long-range interactions
Scale: Screen millions of candidates in hours vs months in wet lab
Used by: Recursion, Insilico Medicine, Relay Therapeutics
Social Networks
User-user and user-item interactions form massive graphs. GNNs power recommendation, community detection, and influence prediction.
Models: PinSage (Pinterest), GraphSAGE at scale
Scale: Billions of nodes, neighbor sampling is essential
Used by: Pinterest, Twitter/X, LinkedIn, Snap
Fraud Detection
Fraudsters form rings -- accounts connected by shared devices, IPs, payment methods. GNNs detect these patterns that tabular models miss entirely.
Models: Heterogeneous GNNs (R-GCN), temporal GNNs
Key insight: Fraud is a graph problem. Individual transactions look normal; the network structure is anomalous
Used by: PayPal, Stripe, Amazon
Recommendations
User-item bipartite graphs capture collaborative filtering signals. GNNs propagate preferences through the graph, surfacing items liked by similar users.
Models: LightGCN, PinSage, NGCF
Advantage: Naturally handles cold-start through graph connectivity
Used by: Pinterest, Uber Eats, Kuaishou
Code Examples (PyTorch Geometric)
All examples use PyG (PyTorch Geometric), the most widely used GNN library. Install: pip install torch-geometric
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid
# Load Cora citation network
dataset = Planetoid(root='data/', name='Cora')
data = dataset[0]
class GCN(torch.nn.Module):
def __init__(self, in_channels, hidden, out_channels):
super().__init__()
self.conv1 = GCNConv(in_channels, hidden)
self.conv2 = GCNConv(hidden, out_channels)
def forward(self, x, edge_index):
x = self.conv1(x, edge_index)
x = F.relu(x)
x = F.dropout(x, p=0.5, training=self.training)
x = self.conv2(x, edge_index)
return F.log_softmax(x, dim=1)
model = GCN(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
# Training loop
for epoch in range(200):
model.train()
optimizer.zero_grad()
out = model(data.x, data.edge_index)
loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
loss.backward()
optimizer.step()
# Evaluate
model.eval()
pred = model(data.x, data.edge_index).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f"Test accuracy: {acc:.4f}")  # ~81.5%

from torch_geometric.nn import GATConv
class GAT(torch.nn.Module):
"""Graph Attention Network - learns WHICH neighbors matter."""
def __init__(self, in_channels, hidden, out_channels, heads=8):
super().__init__()
self.conv1 = GATConv(in_channels, hidden, heads=heads, dropout=0.6)
self.conv2 = GATConv(hidden * heads, out_channels, heads=1,
concat=False, dropout=0.6)
def forward(self, x, edge_index):
x = F.dropout(x, p=0.6, training=self.training)
x = F.elu(self.conv1(x, edge_index))
x = F.dropout(x, p=0.6, training=self.training)
x = self.conv2(x, edge_index)
return F.log_softmax(x, dim=1)
# GAT typically hits ~83.0% on Cora (vs ~81.5% for GCN)
# The attention weights are interpretable - you can visualize them

from torch_geometric.nn import SAGEConv
from torch_geometric.loader import NeighborLoader
class GraphSAGE(torch.nn.Module):
"""Inductive learning - works on unseen nodes/graphs."""
def __init__(self, in_channels, hidden, out_channels):
super().__init__()
self.conv1 = SAGEConv(in_channels, hidden)
self.conv2 = SAGEConv(hidden, out_channels)
def forward(self, x, edge_index):
x = F.relu(self.conv1(x, edge_index))
x = F.dropout(x, p=0.5, training=self.training)
x = self.conv2(x, edge_index)
return x
# Mini-batch training for large graphs (millions of nodes)
loader = NeighborLoader(
data,
num_neighbors=[25, 10], # Sample 25 1-hop, 10 2-hop neighbors
batch_size=1024,
input_nodes=data.train_mask,
)
model = GraphSAGE(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    # Only the seed nodes (the first batch_size rows) contribute to the loss
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()

from torch_geometric.nn import GINConv, global_add_pool
from torch_geometric.datasets import MoleculeNet
class GIN(torch.nn.Module):
"""Graph Isomorphism Network - maximally expressive message passing."""
def __init__(self, in_channels, hidden, out_channels, num_layers=5):
super().__init__()
self.convs = torch.nn.ModuleList()
self.bns = torch.nn.ModuleList()
for i in range(num_layers):
dim_in = in_channels if i == 0 else hidden
mlp = torch.nn.Sequential(
torch.nn.Linear(dim_in, hidden),
torch.nn.ReLU(),
torch.nn.Linear(hidden, hidden),
)
self.convs.append(GINConv(mlp))
self.bns.append(torch.nn.BatchNorm1d(hidden))
self.classifier = torch.nn.Linear(hidden, out_channels)
def forward(self, x, edge_index, batch):
for conv, bn in zip(self.convs, self.bns):
x = F.relu(bn(conv(x, edge_index)))
# Global pooling: graph-level readout
x = global_add_pool(x, batch)
return self.classifier(x)
# Molecular property prediction
dataset = MoleculeNet(root='data/', name='PCBA')
# (the OGB variant, ogbg-molpcba, loads via PygGraphPropPredDataset
#  from the separate ogb package instead)

GNNs vs Transformers: A Nuanced View
The "Transformers will replace GNNs" narrative is overly simplistic. They solve different problems, and the most powerful architectures (GPS, Graphormer) combine both.
| Dimension | Message-Passing GNNs | Graph Transformers |
|---|---|---|
| Complexity | O(\|E\|) -- linear in edges | O(\|V\|^2) -- quadratic in nodes |
| Receptive field | K-hop (K = num layers) | Global (all nodes in one layer) |
| Structure awareness | Built-in (edges define computation) | Needs positional encodings |
| Scalability | Millions of nodes with sampling | Limited to ~10K nodes (without sparse attention) |
| Over-squashing | Major issue for deep GNNs | Avoided (global attention) |
| Best for | Large sparse graphs, local patterns | Small/medium graphs, long-range deps |
The pragmatic take (2026)
1. Small molecular graphs (<100 nodes): GPS or Graphormer wins. The global attention cost is negligible and long-range interactions matter for property prediction.
2. Large social/citation graphs (100K+ nodes): GraphSAGE or GCN with sampling. Full attention is computationally infeasible.
3. Medium graphs with varying structure: GAT gives you the best interpretability/performance tradeoff.
4. Graph-level classification: GIN as the strong MPNN baseline, GPS if you need SOTA.
Concepts You'll Actually Need
Message Passing
Every GNN follows the same pattern: for each node, (1) collect messages from neighbors, (2) aggregate them (sum, mean, max, attention), (3) update the node's embedding. One round = one "layer" = one hop of information. Stack K layers to see K hops away. The key design choice is the aggregation function -- it determines expressiveness.
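The three-step pattern, with the aggregation function as a pluggable argument, looks like this (toy triangle graph and toy update rule; illustrative only):

```python
# Toy triangle graph: every node neighbors the other two.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
h = {0: 1.0, 1: 2.0, 2: 4.0}

def mp_round(h, aggregate):
    """One message-passing round: collect, aggregate, update."""
    out = {}
    for node, nbrs in neighbors.items():
        messages = [h[j] for j in nbrs]         # (1) collect
        agg = aggregate(messages)               # (2) aggregate -- the choice point
        out[node] = 0.5 * h[node] + 0.5 * agg   # (3) update (toy rule, not learned)
    return out

h_sum = mp_round(h, sum)
h_mean = mp_round(h, lambda ms: sum(ms) / len(ms))
# Swapping the aggregator changes expressiveness (see GIN above);
# stacking K rounds widens the receptive field to K hops.
```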
Over-smoothing
Stack too many layers and all node embeddings converge to the same vector. Each layer averages over neighborhoods, so after ~6 layers every node has "seen" most of the graph and embeddings become indistinguishable. Practical fix: use 2-3 layers for most tasks. For deeper GNNs, add skip connections, normalization, or use GPS.
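The collapse is easy to reproduce: repeated mean aggregation on any connected graph shrinks the spread between node embeddings every layer. A minimal sketch on a 3-node path graph (illustrative values; no learned weights, which only slow, not stop, the effect):

```python
# Path graph 0 - 1 - 2 with self-loops, plain mean aggregation per layer.
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
h = {0: 0.0, 1: 0.0, 2: 9.0}

def mean_layer(h):
    return {i: sum(h[j] for j in nbrs) / len(nbrs)
            for i, nbrs in neighbors.items()}

spread = []
for _ in range(20):
    h = mean_layer(h)
    spread.append(max(h.values()) - min(h.values()))

# The spread roughly halves every layer; after 20 layers the three
# embeddings are numerically indistinguishable -- over-smoothing.
```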
Over-squashing
Related but different from over-smoothing. Information from distant nodes must pass through bottleneck nodes, causing exponential compression. This is why GNNs struggle with long-range dependencies on tree-like graphs. Solutions: virtual nodes, graph rewiring, or Graph Transformers.
Positional & Structural Encodings
Unlike sequences, graphs have no canonical node ordering. Positional encodings (Laplacian eigenvectors, random walk probabilities) give nodes a sense of "where" they are in the graph. Structural encodings (degree, centrality, subgraph counts) capture "what role" a node plays. Both are critical for Graph Transformers and increasingly used with MPNNs.
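One concrete structural encoding -- random-walk return probabilities -- can be computed in a few lines. The sketch below (library-free, toy graphs) shows that a node in a triangle and a node on a path get different encodings even though both graphs have three nodes; this is exactly the structural signal vanilla attention cannot see:

```python
def rw_return_probs(neighbors, start, k_max):
    """Encoding = probability a uniform random walk from `start`
    is back at `start` after k = 1..k_max steps."""
    probs = {n: 0.0 for n in neighbors}
    probs[start] = 1.0
    encoding = []
    for _ in range(k_max):
        nxt = {n: 0.0 for n in neighbors}
        for n, p in probs.items():
            if p > 0:  # spread mass uniformly over neighbors
                for j in neighbors[n]:
                    nxt[j] += p / len(neighbors[n])
        probs = nxt
        encoding.append(probs[start])
    return encoding

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}

enc_tri = rw_return_probs(triangle, 0, 4)
enc_path = rw_return_probs(path, 0, 4)
# A triangle node can return via either neighbor at step 2 and again at
# step 3; a path endpoint only returns on even steps. Different roles,
# different encodings -- which is the whole point.
```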