๐Ÿค– This page was automatically generated by Claude Code.
SC 2025 · Full-graph GNN Training

Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training

A three-dimensional parallel framework that scales full-graph GNN training to billion-edge graphs โ€” no sampling, no approximations โ€” across thousands of GPUs.

Aditya K. Ranjan Siddharth Singh Cunyang Wei Abhinav Bhatele
Department of Computer Science, University of Maryland, College Park
12.5×
Speedup over prior SOTA
54.2×
Time-to-solution cut (Frontier)
2048
GPUs on Perlmutter
1.6B edges
Largest graph trained
01 / Overview

Abstract

Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU, and training GNNs on them requires techniques such as mini-batch sampling to scale. The alternative โ€” distributed full-graph training โ€” suffers from high communication overheads and load imbalance due to the irregular structure of graphs.

We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. We introduce a double-permutation scheme for load balancing and a performance model to predict the optimal 3D configuration of our implementation, Plexus. Across six graph datasets, Plexus scales to up to 2048 GPUs of Perlmutter and 1024 GPUs of Frontier, achieving unprecedented speedups of 2.3โ€“12.5× over prior state of the art and reducing time-to-solution by 5.2โ€“8.7× on Perlmutter and 7.0โ€“54.2× on Frontier.

graph neural networksfull-graph training3D tensor parallelismSpMMGPGPUsload balancing
02 / TL;DR

Key Contributions

๐ŸงŠ

3D parallel full-graph GNN

An open-source framework that adapts Agarwal's 3D parallel matrix-multiply to full-graph GNN training, distributing every matrix across a 3D GPU grid and parallelizing all the matrix multiplications involved.

๐Ÿ“Š

Performance model

A model that identifies the optimal arrangement of GPUs in the 3D virtual grid โ€” no exhaustive search of configurations needed.

โš–๏ธ

Load-balancing optimizations

A double-permutation scheme drives non-zero imbalance from 7.70 down to 1.001 (max/mean), plus blocked aggregation to cut performance variability and dense-GEMM tuning to scale on Frontier.

๐Ÿ†

Record-scale results

Scales to 1024 GPUs on Frontier and 2048 on Perlmutter โ€” the largest-scale full-graph GNN training reported to date โ€” with 2.3โ€“12.5× speedups over SOTA.

03 / The Challenge

Why Full-graph Training is Hard

Mini-batch sampling is the default in PyG and DGL โ€” but sampling introduces approximation error, neighborhood explosion, and CPUโ€“GPU transfer bottlenecks. Full-graph training avoids all of that, at the cost of three hard systems problems.

โ‘  Memory & communication

Billion-edge graphs must be distributed across GPUs, forcing large activation/gradient syncs. Training quickly becomes communication-bound.

โ‘ก Sparse SpMM

Aggregation is Sparse Matrix-Matrix Multiplication โ€” irregular memory access, low reuse, and poor GPU utilization. Adjacency matrices here are 99.79โ€“99.99% zeros.

โ‘ข Load imbalance

Uneven sparsity means some GPUs do far more work, creating stragglers that ripple through an epoch and inflate communication time too.

04 / Method

3D Tensor Parallelism for GNNs

Plexus arranges G GPUs into a 3D virtual grid (G = Gx ร— Gy ร— Gz) and shards the adjacency, feature, and weight matrices across different planes โ€” parallelizing every SpMM and GEMM in the forward and backward pass.

3D tensor parallel method overview
Figure 1. The 3D tensor-parallel GCN layer: the sparse adjacency A is sharded on the ZX-plane, input features on the XY-plane (further sharded on Z), and weights on the YX-plane. Each step pairs a matrix multiply (SpMM / SGEMM) with a collective (all-gather / all-reduce) along one grid dimension.
Matrix shard shapes
Figure 2. Shapes of the matrix shards on a single GPU for layer 0, showing the two key matrix multiplications in the forward pass.
Performance model
Figure 3. Plexus's performance model predicts communication and computation cost across 3D configurations, selecting the optimal Gxร—Gyร—Gz mapping without exhaustive testing.
Applying 3D parallelism to all layers
Figure 4. Applying the 3D algorithm across all layers of a 3-layer GCN. Plexus cycles through just three unique adjacency-matrix shards (ZX, YZ, XY planes) so each layer's output feeds the next without extra reshuffling communication.
05 / Optimizations

Making 3D Parallelism Fast

Double permutation for load balancing

Uneven sparsity creates computational stragglers. Rather than an expensive graph partitioner (which must re-partition per GPU count), Plexus applies distinct row and column permutations as a one-time preprocessing step โ€” disrupting tightly-coupled communities and driving non-zeros to a near-perfect distribution.

Permutation methodMax / Mean non-zeros (8ร—8 shards)
Original7.70
Single permutation3.24
Double permutation (this work)1.001

europe_osm dataset. Double permutation achieves near-perfect balance at the cost of storing two adjacency shards โ€” a reasonable trade-off given GCNs typically use only 2โ€“4 layers.

Blocked aggregation & dense-GEMM tuning

Blocked aggregation impact Dense matmul tuning impact
Figure 5. Left: blocking the SpMM into row-blocks with per-block all-reduce mitigates forward-pass variability on Isolate-3-8M (16/32 GPUs, Perlmutter) and cuts communication as a side effect. Right: reordering the transposed dense GEMM in the weight-gradient calc drops it from ~50 ms to negligible, enabling Plexus to scale to 1024 GCDs on Frontier (products-14M).
Also under the hood

A parallel data loader shards data into 2D files offline so each GPU loads only what it needs. For ogbn-papers100M on 64 GPUs, CPU memory dropped from 146 GB โ†’ 9 GB and load time from 139s โ†’ 7s.

06 / Datasets

Six Graphs, up to 1.6 Billion Edges

DatasetNodesEdgesNon-zerosFeaturesClasses
Reddit232,96557.3M114.8M60241
ogbn-products2.45M61.9M126.2M10047
Isolate-3-8M8.75M654.6M1.32B12832
products-14M14.25M115.4M245.0M12832
europe_osm50.91M54.1M159.0M12832
ogbn-papers100M111.06M1.62B1.73B100172

A 3-layer GCN with hidden dimension 128. Plexus is validated against a serial PyTorch Geometric baseline for correctness.

Validation against PyTorch Geometric
Figure 6. Validating Plexus against a serial PyTorch Geometric baseline (16 GPUs, Perlmutter, ogbn-products) โ€” matching loss curves confirm correctness.
07 / Results

Scaling & Speedups

vs. state-of-the-art frameworks

Scaling comparison Reddit Scaling comparison Isolate-3-8M Scaling comparison products-14M
Figure 7. Strong scaling of Plexus vs. SA, SA+GVB, and BNS-GCN on Perlmutter. Plexus is the only framework that keeps scaling โ€” 6× over BNS-GCN at 32 GPUs and 9× over SA at 128 GPUs (Reddit); 3.8× over BNS-GCN at 256 GPUs (Isolate-3-8M); 2.3× over SA at 128 GPUs (products-14M). SA/SA+GVB hit OOM on the larger graphs.
Epoch time breakdown
Figure 8. Epoch-time breakdown for BNS-GCN vs. Plexus (products-14M, 32โ€“256 GPUs, Perlmutter). BNS-GCN's fine-grained partition parallelism wins at 32 GPUs, but its all-to-all communication and growing boundary-node count cause it to fall apart by 64 GPUs โ€” where Plexus's neighbor-only ring collectives take over.

Strong scaling of Plexus to 2048 GPUs

Plexus strong scaling on Perlmutter and Frontier
Figure 9. Plexus strong scaling across all six datasets on Perlmutter (left) and Frontier (right). Graph density sets the communication-to-computation ratio: denser graphs scale better. Plexus reaches ogbn-papers100M on 2048 GPUs of Perlmutter โ€” the largest full-graph GNN training to date. Frontier shows even better trends, as MI250X SpMM times leave more room to scale.

Bottom line

Full-graph GNN training has long been considered impractical to scale, forcing reliance on lossy sampling. Plexus shows that a 3D tensor-parallel approach โ€” with double-permutation load balancing and a configuration-selecting performance model โ€” makes exact, full-graph training both practical and fast, scaling billion-edge graphs to 2048 GPUs with 2.3โ€“12.5× speedups over prior state of the art, all without a graph partitioner.

08 / Cite

BibTeX

@inproceedings{ranjan2025plexus,
  title     = {Plexus: Taming Billion-edge Graphs with 3D Parallel
               Full-graph GNN Training},
  author    = {Ranjan, Aditya K. and Singh, Siddharth and
               Wei, Cunyang and Bhatele, Abhinav},
  booktitle = {Proceedings of the International Conference for High
               Performance Computing, Networking, Storage and
               Analysis (SC '25)},
  year      = {2025},
  doi       = {10.1145/3712285.3759890}
}