
Graph Neural Networks:
When and Why (2026)

Not everything is a graph problem, and not every graph problem needs a GNN. This guide helps you decide when graph structure genuinely helps -- and which architecture to reach for when it does.

March 2026 | 20 min read | 5 architectures, 3 benchmarks, code included

TL;DR

  • Use GNNs when: Your data has meaningful relational structure that a flat table or sequence throws away (molecules, social graphs, knowledge graphs, meshes)
  • Skip GNNs when: The graph is fully connected (just use a Transformer), the graph is trivially small, or you only care about node features
  • Start with: GCN for homogeneous node classification, GAT when neighbor importance varies, GraphSAGE for large/inductive graphs, GIN for graph-level tasks, GPS for long-range dependencies
  • Library: PyTorch Geometric (PyG) dominates. DGL is the alternative. Both have OGB integration

When You Need a GNN (and When You Don't)

GNNs shine when...

  • Structure is informative: a molecule's 3D connectivity determines its properties, not just its atom list
  • Topology varies: different samples have different graph structures (molecular graphs, scene graphs)
  • Relational reasoning matters: predicting interactions, influence propagation, link formation
  • Inductive generalization: you need to work on unseen graphs at inference (drug candidates, new users)
  • Sparse connections: each node connects to few neighbors relative to graph size

Skip GNNs when...

  • Data is tabular: XGBoost still beats GNNs on feature-rich tabular data without meaningful relations
  • Graph is fully connected: if every node connects to every other, you've reinvented attention -- just use a Transformer
  • Edges are artificial: constructing a KNN graph from features rarely beats an MLP on the raw features
  • Sequential structure dominates: for text/time series, sequence models capture the relevant inductive bias better
  • Scale is tiny: under 100 nodes, classical methods (spectral clustering, label propagation) often suffice

The Decision Flowchart

1. Is your data naturally a graph (molecules, networks, meshes)?

No --> Probably don't need a GNN. Try tabular/sequence models first.

Yes --> Continue.

2. Does graph topology vary across samples?

No, single fixed graph --> Consider spectral methods or simple label propagation.

Yes, or large single graph --> GNN is a strong fit.

3. Do you need to generalize to unseen graphs?

No, transductive only --> GCN or spectral approaches work fine.

Yes, inductive --> GraphSAGE, GIN, or GPS.

4. Are long-range dependencies critical?

No, local context enough --> GCN or GAT (2-3 layers).

Yes --> GPS (Graph Transformer) or deep GNN with skip connections.

Five Architectures, Simply Explained

GCN

Kipf & Welling, 2017

Graph Convolutional Network

One-liner: Each node averages its neighbors' features (weighted by degree), then applies a linear transform.

Intuition: Think of it as a CNN where the "kernel" slides over a node's neighborhood instead of a pixel grid. The symmetric normalization (dividing by the square root of both nodes' degrees) prevents high-degree nodes from dominating.

+ Simple, fast, well-understood
+ Strong baseline for node classification
- All neighbors weighted equally
- Transductive (fixed graph at train time)
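
The propagation rule above can be sketched in a few lines of NumPy -- a toy illustration of the symmetric normalization, not the PyG implementation:

```python
import numpy as np

# Toy graph: 3 nodes in a path, edges 0-1 and 1-2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [2.0], [3.0]])       # one feature per node

A_hat = A + np.eye(3)                      # add self-loops
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # divide by sqrt of both endpoints' degrees

H = A_norm @ X   # one propagation step (the learned linear transform would follow)
```

Note how node 1, with the highest degree, is down-weighted by the normalization rather than dominating its neighbors.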

GAT

Velickovic et al., 2018

Graph Attention Network

One-liner: Like GCN, but learns attention weights for each edge -- some neighbors matter more than others.

Intuition: In a citation network, not all papers that cite yours are equally relevant. GAT learns a small neural network that scores each edge, then uses softmax to turn scores into weights. Multi-head attention (like Transformers) stabilizes training.

+ Adaptive, interpretable attention
+ Works well on heterogeneous graphs
- More memory than GCN (stores attention scores)
- GATv2 fixes expressiveness issues in original
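
The score-then-softmax step can be sketched in NumPy. The raw scores here are made up for illustration -- in GAT they come from a small learned network over the concatenated node and neighbor features:

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw attention scores for the 3 neighbors of one node
scores = np.array([2.0, 0.5, -1.0])
alpha = softmax(scores)        # per-edge weights; they sum to 1

neighbor_feats = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
out = alpha @ neighbor_feats   # attention-weighted aggregation
```

The highest-scoring neighbor dominates the aggregate -- unlike GCN, where weights are fixed by degree alone.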

GraphSAGE

Hamilton et al., 2017

Sample and Aggregate

One-liner: Samples a fixed-size neighborhood, aggregates features, and concatenates with the node's own embedding.

Intuition: The key insight is sampling. Instead of using the full neighborhood (impossible for billion-node graphs), you sample K neighbors at each layer. This enables mini-batch training and, critically, inductive learning -- the model generalizes to nodes and graphs it has never seen.

+ Scales to millions of nodes
+ Inductive (works on unseen graphs)
+ Mini-batch training
- Sampling introduces variance
- Aggregator choice matters (mean vs LSTM vs pool)

GIN

Xu et al., 2019

Graph Isomorphism Network

One-liner: The most expressive message-passing GNN -- provably as powerful as the 1-WL graph isomorphism test.

Intuition: GCN and GraphSAGE use mean/max aggregation, which can't distinguish certain graph structures. GIN uses sum aggregation with an MLP, which preserves multiset information. If two graphs have different local structures, GIN will map them to different embeddings. This makes it the go-to choice for graph classification.

+ Maximally expressive (within MPNN framework)
+ Best for graph-level tasks
- Still limited by 1-WL expressiveness ceiling
- Sum aggregation can be unstable on large neighborhoods
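
The multiset argument is easy to see concretely: mean aggregation collapses neighborhoods that sum keeps apart.

```python
import numpy as np

# Two different neighborhoods (multisets of 1-D neighbor features)
hood_a = np.array([1.0, 1.0])   # two neighbors, each with feature 1
hood_b = np.array([1.0])        # one neighbor with feature 1

mean_a, mean_b = hood_a.mean(), hood_b.mean()   # both 1.0 -- indistinguishable
sum_a, sum_b = hood_a.sum(), hood_b.sum()       # 2.0 vs 1.0 -- distinguishable
```

A mean-aggregating GNN assigns both nodes identical messages; GIN's sum preserves the neighborhood size and composition.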

GPS

Rampasek et al., 2022

General, Powerful, Scalable Graph Transformer

One-liner: Combines local message passing (MPNN) with global self-attention in each layer, plus positional/structural encodings to inject graph topology.

Intuition: Standard GNNs suffer from over-squashing: information from distant nodes gets bottlenecked through intermediate nodes. GPS fixes this by adding a Transformer-style global attention path alongside the local GNN path. Positional encodings (Laplacian eigenvectors, random walk statistics) give the model awareness of graph structure that vanilla attention lacks.

+ Long-range dependencies
+ SOTA on many OGB benchmarks
+ Flexible: any MPNN + any attention
- O(n^2) attention for global component
- More hyperparameters to tune
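
The two-path recipe can be sketched in plain PyTorch. This is a toy of the GPS layer pattern, not the paper's implementation: mean aggregation stands in for the MPNN block, positional encodings are omitted, and all names are ours.

```python
import torch
import torch.nn as nn

class ToyGPSLayer(nn.Module):
    """One GPS-style layer: local message passing + global attention, summed."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.local_lin = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, adj):
        # Local path: mean over neighbors (a stand-in for any MPNN conv)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        local = self.local_lin(adj @ x / deg)
        # Global path: full self-attention over all nodes at once
        global_out, _ = self.attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
        # Residual sum of both paths, then normalize
        return self.norm(x + local + global_out.squeeze(0))

x = torch.randn(5, 16)                   # 5 nodes, 16-dim features
adj = (torch.rand(5, 5) > 0.5).float()   # random dense adjacency
out = ToyGPSLayer(16)(x, adj)            # shape (5, 16)
```

PyG ships a real version of this layer as GPSConv, which wraps any MessagePassing module alongside the attention path.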

Benchmark Results (OGB, 2026)

Numbers from the Open Graph Benchmark leaderboards. These are realistic, reproducible results -- not cherry-picked.

ogbg-molpcba

Graph classification: predict 128 bioassay labels for ~438K molecules. Metric: AP (Average Precision).

Model | Type | Test AP | Params
GPS + virtual node | Graph Transformer | 0.3212 | ~6.2M
GIN + virtual node | MPNN | 0.2921 | ~3.4M
GCN + virtual node | MPNN | 0.2724 | ~2.0M
GIN (no VN) | MPNN | 0.2703 | ~1.9M
GCN (no VN) | MPNN | 0.2424 | ~1.5M

Virtual nodes add a global node connected to all others -- a simple way to enable long-range message passing without full attention.
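
The effect is easy to sketch: one round of message passing through the virtual node gives every node access to a graph-wide summary, regardless of graph diameter. A toy version (real models use learned MLPs for both steps):

```python
import numpy as np

X = np.array([[1.0], [2.0], [6.0]])   # node features

# The virtual node aggregates from ALL nodes...
virtual = X.mean(axis=0)               # graph-wide summary

# ...and broadcasts back, so every node sees global
# context in a single hop
X_out = X + virtual                    # toy update; GNNs learn this combination
```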

ogbn-arxiv

Node classification: predict subject area of ~170K arXiv CS papers from citation graph. Metric: Accuracy.

Model | Type | Test Accuracy | Layers
GIANT-XRT + RevGAT | GAT + LM features | 77.36% | 3
GAT + label propagation | GAT + post-processing | 74.15% | 3
GCN | MPNN | 71.74% | 3
GraphSAGE | MPNN (inductive) | 71.49% | 3
MLP (no graph) | Baseline | 55.50% | 3

The MLP baseline shows how much graph structure matters here: +16 points from adding edges.

ogbl-collab

Link prediction: predict future collaborations between ~235K authors. Metric: Hits@50.

Model | Type | Test Hits@50
S2GAE + SEAL | GNN + subgraph | 66.79%
BUDDY | GNN + hashing | 65.94%
GraphSAGE + edge features | MPNN | 54.63%
Common Neighbors | Heuristic | 44.75%

Link prediction is where GNNs combined with structural features (subgraph patterns, node degrees) dominate simple heuristics.

Real-World Applications


Drug Discovery

Molecules are graphs. Atoms are nodes, bonds are edges. GNNs predict molecular properties (toxicity, binding affinity, solubility) directly from structure.

Models: GIN, SchNet, DimeNet++ for 3D, GPS for long-range interactions

Scale: Screen millions of candidates in hours vs months in wet lab

Used by: Recursion, Insilico Medicine, Relay Therapeutics


Social Networks

User-user and user-item interactions form massive graphs. GNNs power recommendation, community detection, and influence prediction.

Models: PinSage (Pinterest), GraphSAGE at scale

Scale: Billions of nodes, neighbor sampling is essential

Used by: Pinterest, Twitter/X, LinkedIn, Snap


Fraud Detection

Fraudsters form rings -- accounts connected by shared devices, IPs, payment methods. GNNs detect these patterns that tabular models miss entirely.

Models: Heterogeneous GNNs (R-GCN), temporal GNNs

Key insight: Fraud is a graph problem. Individual transactions look normal; the network structure is anomalous

Used by: PayPal, Stripe, Amazon


Recommendations

User-item bipartite graphs capture collaborative filtering signals. GNNs propagate preferences through the graph, surfacing items liked by similar users.

Models: LightGCN, PinSage, NGCF

Advantage: Naturally handles cold-start through graph connectivity

Used by: Pinterest, Uber Eats, Kuaishou

Code Examples (PyTorch Geometric)

All examples use PyG (PyTorch Geometric), the most widely used GNN library. Install: pip install torch-geometric

gcn_cora.py -- Node classification with GCN
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

# Load Cora citation network
dataset = Planetoid(root='data/', name='Cora')
data = dataset[0]

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden)
        self.conv2 = GCNConv(hidden, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

model = GCN(dataset.num_features, 64, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Training loop
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate
model.eval()
pred = model(data.x, data.edge_index).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f"Test accuracy: {acc:.4f}")  # ~81.5%
gat.py -- Graph Attention Network
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    """Graph Attention Network - learns WHICH neighbors matter."""
    def __init__(self, in_channels, hidden, out_channels, heads=8):
        super().__init__()
        self.conv1 = GATConv(in_channels, hidden, heads=heads, dropout=0.6)
        self.conv2 = GATConv(hidden * heads, out_channels, heads=1,
                             concat=False, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)

# GAT typically hits ~83.0% on Cora (vs ~81.5% for GCN)
# The attention weights are interpretable - you can visualize them
graphsage_minibatch.py -- Scalable inductive learning
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv
from torch_geometric.loader import NeighborLoader

class GraphSAGE(torch.nn.Module):
    """Inductive learning - works on unseen nodes/graphs."""
    def __init__(self, in_channels, hidden, out_channels):
        super().__init__()
        self.conv1 = SAGEConv(in_channels, hidden)
        self.conv2 = SAGEConv(hidden, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        return x

# Mini-batch training for large graphs (millions of nodes)
loader = NeighborLoader(
    data,
    num_neighbors=[25, 10],  # Sample 25 1-hop, 10 2-hop neighbors
    batch_size=1024,
    input_nodes=data.train_mask,
)

for batch in loader:
    optimizer.zero_grad()
    out = model(batch.x, batch.edge_index)
    loss = F.cross_entropy(out[:batch.batch_size], batch.y[:batch.batch_size])
    loss.backward()
    optimizer.step()
gin_molecules.py -- Graph-level classification with GIN
import torch
import torch.nn.functional as F
from torch_geometric.nn import GINConv, global_add_pool

class GIN(torch.nn.Module):
    """Graph Isomorphism Network - maximally expressive message passing."""
    def __init__(self, in_channels, hidden, out_channels, num_layers=5):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        self.bns = torch.nn.ModuleList()

        for i in range(num_layers):
            dim_in = in_channels if i == 0 else hidden
            mlp = torch.nn.Sequential(
                torch.nn.Linear(dim_in, hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(hidden, hidden),
            )
            self.convs.append(GINConv(mlp))
            self.bns.append(torch.nn.BatchNorm1d(hidden))

        self.classifier = torch.nn.Linear(hidden, out_channels)

    def forward(self, x, edge_index, batch):
        for conv, bn in zip(self.convs, self.bns):
            x = F.relu(bn(conv(x, edge_index)))
        # Global pooling: graph-level readout
        x = global_add_pool(x, batch)
        return self.classifier(x)

# Molecular property prediction: ogbg-molpcba ships with the OGB package,
# not with PyG's MoleculeNet wrapper
from ogb.graphproppred import PygGraphPropPredDataset
dataset = PygGraphPropPredDataset(name='ogbg-molpcba', root='data/')

GNNs vs Transformers: A Nuanced View

The "Transformers will replace GNNs" narrative is overly simplistic. They solve different problems, and the most powerful architectures (GPS, Graphormer) combine both.

Dimension | Message-Passing GNNs | Graph Transformers
Complexity | O(|E|) -- linear in edges | O(|V|^2) -- quadratic in nodes
Receptive field | K-hop (K = num layers) | Global (all nodes in one layer)
Structure awareness | Built-in (edges define computation) | Needs positional encodings
Scalability | Millions of nodes with sampling | Limited to ~10K nodes (without sparse attention)
Over-squashing | Major issue for deep GNNs | Avoided (global attention)
Best for | Large sparse graphs, local patterns | Small/medium graphs, long-range deps

The pragmatic take (2026)

  1. Small molecular graphs (<100 nodes): GPS or Graphormer wins. The global attention cost is negligible and long-range interactions matter for property prediction.
  2. Large social/citation graphs (100K+ nodes): GraphSAGE or GCN with sampling. Full attention is computationally impossible.
  3. Medium graphs with varying structure: GAT gives you the best interpretability/performance tradeoff.
  4. Graph-level classification: GIN as the strong MPNN baseline, GPS if you need SOTA.

Concepts You'll Actually Need

Message Passing

Every GNN follows the same pattern: for each node, (1) collect messages from neighbors, (2) aggregate them (sum, mean, max, attention), (3) update the node's embedding. One round = one "layer" = one hop of information. Stack K layers to see K hops away. The key design choice is the aggregation function -- it determines expressiveness.
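
The three steps can be written framework-free. A minimal sketch with sum aggregation and a linear-plus-ReLU update:

```python
import numpy as np

def message_passing_layer(x, edge_index, W):
    """One round: collect -> aggregate (sum) -> update (linear + ReLU)."""
    agg = np.zeros_like(x)
    # (1) collect: each directed edge (src, dst) carries src's features to dst
    # (2) aggregate: here, a plain sum over incoming messages
    for src, dst in edge_index:
        agg[dst] += x[src]
    # (3) update: transform the aggregated messages
    return np.maximum(agg @ W, 0.0)

x = np.array([[1.0], [2.0], [3.0]])
edges = [(0, 1), (2, 1), (1, 0)]      # (src, dst) pairs
W = np.array([[1.0]])                 # identity weight for clarity
out = message_passing_layer(x, edges, W)
# node 1 receives x[0] + x[2] = 4.0; node 2 receives nothing
```

PyG's MessagePassing base class implements exactly this pattern, with the loop replaced by vectorized scatter operations.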

Over-smoothing

Stack too many layers and all node embeddings converge to the same vector. Each layer averages over neighborhoods, so after ~6 layers every node has "seen" most of the graph and embeddings become indistinguishable. Practical fix: use 2-3 layers for most tasks. For deeper GNNs, add skip connections, normalization, or use GPS.
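
The collapse is easy to demonstrate: repeated neighborhood averaging shrinks the spread of node embeddings toward zero.

```python
import numpy as np

# Path graph 0-1-2-3 with self-loops, row-normalized (mean aggregation)
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
A = A / A.sum(axis=1, keepdims=True)

X = np.array([[0.0], [1.0], [2.0], [9.0]])
spread = [X.std()]
for _ in range(20):          # 20 rounds of pure averaging, no transforms
    X = A @ X
    spread.append(X.std())
# spread shrinks each round: all embeddings converge to one vector
```

Real GNN layers add learned transforms and nonlinearities, which slow but do not stop this effect -- hence the 2-3 layer rule of thumb.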

Over-squashing

Related to, but distinct from, over-smoothing. Information from distant nodes must pass through bottleneck nodes, causing exponential compression. This is why GNNs struggle with long-range dependencies on tree-like graphs. Solutions: virtual nodes, graph rewiring, or Graph Transformers.

Positional & Structural Encodings

Unlike sequences, graphs have no canonical node ordering. Positional encodings (Laplacian eigenvectors, random walk probabilities) give nodes a sense of "where" they are in the graph. Structural encodings (degree, centrality, subgraph counts) capture "what role" a node plays. Both are critical for Graph Transformers and increasingly used with MPNNs.
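
Laplacian positional encodings are just the low-frequency eigenvectors of L = D - A. A NumPy sketch for a small path graph:

```python
import numpy as np

# Path graph 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
# Skip the trivial constant eigenvector (eigenvalue 0); the next k
# eigenvectors become each node's k-dim positional encoding
pe = eigvecs[:, 1:3]
```

The first nontrivial eigenvector (the Fiedler vector) varies smoothly along the path, so the two endpoints get opposite-signed encodings -- the model can tell "where" each node sits. PyG wraps this recipe as the AddLaplacianEigenvectorPE transform (with AddRandomWalkPE as the random-walk counterpart).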

Related Resources