This page was automatically generated by Claude Code.
Collective Communication · Distributed DL

The Big Send-off: High-Performance Collectives on GPU-based Supercomputers

PCCL: a Performant Collective Communication Library that scales all-gather, reduce-scatter, and all-reduce to thousands of GPUs for distributed deep learning.

Siddharth Singh, Keshav Pradeep, Mahua Singh, Cunyang Wei, Abhinav Bhatele
Department of Computer Science, University of Maryland, College Park
168× · Reduce-scatter vs RCCL (2048 GCDs)
33× · All-gather vs RCCL
4.9× · ZeRO-3 training speedup
2048 · GCDs evaluated on Frontier

01 / Overview

Abstract

Collective communication is becoming increasingly important in supercomputer workloads with the rise of AI jobs. Yet existing libraries (NCCL, RCCL, and Cray-MPICH) exhibit performance and scalability limitations on modern GPU supercomputers. We introduce the Performant Collective Communication Library (PCCL), targeted at distributed deep learning, with highly optimized implementations of all-gather, reduce-scatter, and all-reduce.

PCCL uses hierarchical algorithms and ML-guided adaptive dispatching to scale efficiently to thousands of GPUs. On 2048 GCDs of Frontier it achieves up to 168× (reduce-scatter), 33× (all-gather), and 10× (all-reduce) over RCCL; up to 5.7× on Perlmutter. These gains translate directly to end-to-end training: up to 4.9× over RCCL in DeepSpeed ZeRO-3, and up to 2.4× in PyTorch DDP.

collective communication · all-gather · reduce-scatter · all-reduce · hierarchical collectives · distributed deep learning

02 / TL;DR

Key Contributions

Diagnose the libraries

A systematic analysis of Cray-MPICH, NCCL, and RCCL limitations for all-gather and reduce-scatter in DL workloads on Perlmutter and Frontier, pinpointing NIC under-utilization, CPU-side reductions, and missing log-latency algorithms.

Build PCCL

Optimized hierarchical implementations of all-gather, reduce-scatter, and all-reduce that fully use system NICs and GPU compute, scaling to large messages and GPU counts.

Massive speedups

Up to 168× (reduce-scatter), 33× (all-gather), 10× (all-reduce) over RCCL on 2048 GCDs of Frontier; up to 5.7× on Perlmutter.

End-to-end validation

Multi-billion-parameter LLM training: up to 4.9× over RCCL in DeepSpeed ZeRO-3 and up to 2.4× in PyTorch DDP.

All-gather scaling of RCCL, Cray-MPICH, NCCL
Figure 1. All-gather performance for 64 and 128 MB output buffers using RCCL (Frontier), Cray-MPICH (Frontier), and NCCL (Perlmutter). The ideal scaling curve is a flat horizontal line; none of the libraries achieves it, revealing a large performance gap to close.
03 / Background

Why Collectives, Why Large Messages

Modern distributed training is dominated by three collectives: all-gather, reduce-scatter, and all-reduce. Their messages are far larger than in traditional HPC (tens to hundreds of MB, sometimes more than 1 GB), which is exactly the regime where existing libraries struggle.

Sharded Data Parallelism

Parameters and gradients are sharded across GPUs (FSDP, ZeRO-3, AxoNN). All-gather reconstructs full parameters; reduce-scatter reduces and distributes gradients.
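
The semantics of the two collectives used by sharded data parallelism can be sketched in a few lines. This is a single-process simulation with plain Python lists standing in for per-GPU buffers; the function names and the toy data are illustrative assumptions, not PCCL or framework APIs.

```python
def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(grads):
    """Sum the per-rank vectors elementwise, then give rank r the r-th slice."""
    p = len(grads)
    n = len(grads[0])
    assert n % p == 0, "sketch assumes the vector splits evenly across ranks"
    chunk = n // p
    summed = [sum(g[i] for g in grads) for i in range(n)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(p)]


# Four ranks, each owning a 2-element parameter shard.
param_shards = [[0, 1], [2, 3], [4, 5], [6, 7]]
gathered = all_gather(param_shards)

# Each rank holds gradients for all 8 parameters; rank r's entries are all r + 1.
local_grads = [[r + 1] * 8 for r in range(4)]
grad_shards = reduce_scatter(local_grads)
```

After the all-gather every rank holds the full parameter vector; after the reduce-scatter each rank holds only the summed gradients for its own shard.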

Distributed Data Parallelism

Parameters are replicated; all-reduce synchronizes gradients each iteration. A 1B-param FP32 model exchanges 4 GB of gradients per step.
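
The 4 GB figure follows from one FP32 value (4 bytes) per parameter. A minimal sketch of that arithmetic, together with a toy all-reduce on Python lists; the helper names are assumptions made for illustration.

```python
def gradient_bytes(num_params, bytes_per_element=4):
    """Size of one full gradient buffer; FP32 uses 4 bytes per element."""
    return num_params * bytes_per_element


def all_reduce(grads):
    """Every rank ends up with the elementwise sum of all ranks' vectors."""
    summed = [sum(vals) for vals in zip(*grads)]
    return [list(summed) for _ in grads]


size_gb = gradient_bytes(1_000_000_000) / 1e9  # decimal gigabytes
synced = all_reduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```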

The algorithmic toolbox

Ring is simple and bandwidth-efficient, but its latency grows linearly with process count. Recursive halving/doubling needs only log₂(p) steps, which is far better at scale.
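
The difference in step counts is easy to see in a small simulation. The sketch below counts steps for both algorithms and simulates a recursive-doubling all-gather in a single process; it assumes a power-of-two process count, and the function names are illustrative rather than taken from any library.

```python
def ring_steps(p):
    """A ring all-gather forwards one block per step and needs p - 1 steps."""
    return p - 1


def recursive_doubling_steps(p):
    """Recursive doubling doubles the data held at every step: log2(p) steps."""
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two process count"
    return p.bit_length() - 1


def recursive_doubling_all_gather(blocks):
    """At distance d, rank r exchanges everything it holds with rank r XOR d."""
    p = len(blocks)
    recursive_doubling_steps(p)  # validates that p is a power of two
    held = [{r: blocks[r]} for r in range(p)]  # per rank: {origin rank: block}
    d, steps = 1, 0
    while d < p:
        held = [{**held[r], **held[r ^ d]} for r in range(p)]
        d *= 2
        steps += 1
    return [[h[i] for i in range(p)] for h in held], steps


gathered, steps = recursive_doubling_all_gather(list("abcdefgh"))
```

With 8 ranks the simulation finishes in 3 steps, where a ring would need 7. At p = 2048 the counts are 11 and 2047.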

Message size distribution across DL frameworks
Figure 2. All-gather and reduce-scatter message sizes across FSDP, DeepSpeed ZeRO-3, and AxoNN for a range of transformer sizes: consistently tens to hundreds of MB, exceeding 1 GB for the largest models.
04 / Diagnosis

What's Wrong With Today's Libraries

We benchmarked all three libraries with best practices (NUMA-aware NIC binding, GPU Direct RDMA, no eager messaging) and found distinct, fixable bottlenecks.

Cray-MPICH wastes NICs and the GPU

Cray-MPICH vs RCCL all-gather · NIC write packets · NIC read packets
Figure 3. RCCL is ~4× faster than Cray-MPICH for bandwidth-bound all-gather (left). NIC counters reveal why: Cray-MPICH routes all writes through NIC-0 and reads through NIC-3 (middle, right), while RCCL spreads traffic evenly across all four NICs on the node.
Reduce-scatter CPU vs GPU compute
Figure 4. For reduce-scatter, Cray-MPICH (orange) performs reductions on the CPU and lags badly. A custom implementation using MPI point-to-point plus a GPU vector-add kernel (blue) is several times faster, confirming the CPU-reduction bottleneck.
Observation 1

Cray-MPICH severely underutilizes available network (NIC) and computational (GPU) resources. It routes all network traffic through a single NIC, and performs reduction operations on the CPU instead of offloading them to the GPU.
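
To make the NIC imbalance concrete, the sketch below contrasts sending everything through one NIC with a balanced assignment of local ranks to NICs. The mapping rule, the function name, and the default of 8 GCDs per node are assumptions for illustration (the source states only that a node has four NICs); a real binding would follow the node's NUMA and PCIe topology.

```python
from collections import Counter


def nic_for_local_rank(local_rank, gpus_per_node=8, nics_per_node=4):
    """Hypothetical rule: consecutive local ranks share a NIC, so every NIC serves an equal share."""
    assert gpus_per_node % nics_per_node == 0
    return local_rank // (gpus_per_node // nics_per_node)


# Number of local ranks served by each NIC under the two policies.
balanced = Counter(nic_for_local_rank(r) for r in range(8))
single_nic = Counter(0 for _ in range(8))  # the single-NIC pattern from Observation 1
```

Under the balanced rule each of the four NICs carries traffic for two local ranks, whereas the single-NIC pattern leaves three NICs idle.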

NCCL & RCCL scale poorly at large GPU counts

For all-gather and reduce-scatter, NCCL and RCCL only support the ring algorithm. Each process must send and receive (p−1) messages sequentially, so communication time grows linearly with process count, which is crippling at scale.

Observation 2

NCCL and RCCL rely solely on the ring algorithm for all-gather and reduce-scatter, leading to poor scaling in latency-bound scenarios. More efficient algorithms such as recursive doubling and halving are not supported.
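
A standard alpha-beta (latency plus bandwidth) cost model shows why the ring's latency term dominates at scale. The sketch below is an idealized model, not a measurement: `ALPHA` and `BETA` are hypothetical values chosen only for illustration, and the model ignores congestion, NIC effects, and node hierarchy, so it will not reproduce the speedups reported on this page.

```python
def ring_all_gather_time(p, n_bytes, alpha, beta):
    """p - 1 sequential steps, each moving one block of n/p bytes."""
    return (p - 1) * (alpha + (n_bytes / p) * beta)


def recursive_doubling_all_gather_time(p, n_bytes, alpha, beta):
    """log2(p) steps; step k moves 2**k blocks, so p - 1 blocks in total."""
    steps = p.bit_length() - 1  # assumes p is a power of two
    return steps * alpha + (p - 1) * (n_bytes / p) * beta


ALPHA = 5e-6      # hypothetical per-message latency in seconds
BETA = 1 / 25e9   # hypothetical seconds per byte (a 25 GB/s link)
N = 64 * 2**20    # 64 MB output buffer, one of the sizes in Figure 1

for p in (8, 64, 512, 2048):
    ring = ring_all_gather_time(p, N, ALPHA, BETA)
    rec = recursive_doubling_all_gather_time(p, N, ALPHA, BETA)
    print(f"p={p:5d}  ring={ring * 1e3:8.3f} ms  rec={rec * 1e3:8.3f} ms")
```

Both algorithms move the same number of bytes per process in this model; they differ only in the latency term, (p−1)·α for the ring versus log₂(p)·α for recursive doubling.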

05 / Design

How PCCL Fixes It

PCCL combines a two-level hierarchical algorithm, custom inter-node implementations, and an ML-guided dispatcher that picks the best backend per configuration.

Hierarchical all-gather schematic
Figure 5. PCCL's two-level hierarchical all-gather on N nodes × M GPUs. Step 1: concurrent inter-node all-gathers (mapping each GCD to its own NIC to fill all four). Step 2: intra-node all-gather. Step 3: a device-local transpose to reorder data. Reduce-scatter mirrors this; all-reduce composes the two.
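
The three steps in Figure 5 can be traced with a single-process simulation. This is a sketch of the data movement only, assuming global rank = node × M + local index; the function name and the list-based buffers are illustrative and do not reflect PCCL's actual implementation.

```python
def hierarchical_all_gather(blocks, num_nodes, gpus_per_node):
    """Simulate a two-level all-gather; blocks[g] is the block owned by global rank g."""
    N, M = num_nodes, gpus_per_node
    assert len(blocks) == N * M

    # Step 1: M concurrent inter-node all-gathers, one per local GPU index.
    step1 = {}
    for local in range(M):
        column = [blocks[node * M + local] for node in range(N)]
        for node in range(N):
            step1[(node, local)] = list(column)

    # Step 2: intra-node all-gather of the step-1 buffers.
    step2 = {}
    for node in range(N):
        node_buf = [blk for local in range(M) for blk in step1[(node, local)]]
        for local in range(M):
            step2[(node, local)] = list(node_buf)

    # Step 3: device-local transpose from (local, node) order to (node, local) order.
    out = []
    for node in range(N):
        for local in range(M):
            buf = step2[(node, local)]
            out.append([buf[l * N + n] for n in range(N) for l in range(M)])
    return out, step2


out, before_transpose = hierarchical_all_gather(list(range(6)), num_nodes=3, gpus_per_node=2)
```

With 3 nodes of 2 GPUs, every buffer after step 2 reads `[0, 2, 4, 1, 3, 5]`, which is why the transpose is needed to restore global rank order.
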
Backend A

PCCL_ring

  • Inter-node ring algorithm
  • Best for bandwidth-bound regimes (few processes, large messages)
  • Saturates peer-to-peer bandwidth
Backend B

PCCL_rec

  • Recursive doubling (AG) / halving (RS)
  • Best for latency-bound regimes (many processes, smaller messages)
  • GPU-side reduction kernels; log₂(p) steps
Recursive halving vs ring speedup heatmap
Figure 6. Speedup of recursive halving over ring for the inter-node phase of reduce-scatter. Ring wins in bandwidth-bound cells (top-left); recursive halving wins decisively in latency-bound cells (bottom-right), motivating adaptive selection.
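
For reference, a recursive-halving reduce-scatter can be simulated as follows. This is a textbook formulation under stated assumptions (power-of-two process count, vector length divisible by p), not PCCL's code; the inner addition stands in for the GPU-side reduction kernel.

```python
def recursive_halving_reduce_scatter(vectors):
    """Each step, a rank keeps the half of its active range that contains its own
    final chunk and adds in its partner's copy of that half."""
    p = len(vectors)
    n = len(vectors[0])
    assert p & (p - 1) == 0 and n % p == 0, "sketch assumes power-of-two p dividing n"
    chunk = n // p
    bufs = [list(v) for v in vectors]
    lo = [0] * p   # each rank's active range [lo, hi), in chunk units
    hi = [p] * p
    steps = 0
    d = p // 2
    while d >= 1:
        new_bufs = [list(b) for b in bufs]
        for r in range(p):
            partner = r ^ d
            mid = lo[r] + d
            if r & d == 0:
                hi[r] = mid   # keep the lower half
            else:
                lo[r] = mid   # keep the upper half
            for i in range(lo[r] * chunk, hi[r] * chunk):
                new_bufs[r][i] = bufs[r][i] + bufs[partner][i]  # stand-in for a GPU vector-add
        bufs = new_bufs
        d //= 2
        steps += 1
    return [bufs[r][r * chunk:(r + 1) * chunk] for r in range(p)], steps


inputs = [[10 * r + i for i in range(4)] for r in range(4)]
result, steps = recursive_halving_reduce_scatter(inputs)
```

The exchanged range halves at every step, so four ranks finish in two steps and rank r ends up holding the fully reduced chunk r.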

ML-guided adaptive dispatching

No single backend wins everywhere. PCCL trains a lightweight SVM per (machine, collective) on message size and GPU count to pick the best of Cray-MPICH, NCCL/RCCL, PCCL_ring, and PCCL_rec at runtime.
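
The dispatch structure can be sketched as a table of classifiers keyed by (machine, collective). Everything here is hypothetical: the class and function names, the log-scaled features, and in particular `placeholder_classifier`, a hand-written threshold rule with invented cut-offs that merely stands in for a trained SVM's prediction.

```python
import math

# "vendor_ccl" stands for NCCL or RCCL, depending on the machine.
BACKENDS = ("cray_mpich", "vendor_ccl", "pccl_ring", "pccl_rec")


def features(message_bytes, num_gpus):
    """Log-scale both inputs; an assumed preprocessing choice, not taken from the source."""
    return (math.log2(message_bytes), math.log2(num_gpus))


def placeholder_classifier(feat):
    """Stand-in for a trained SVM's predict(); the thresholds are invented for illustration."""
    log_bytes, log_gpus = feat
    if log_gpus <= 5:
        return "vendor_ccl"
    if log_gpus >= 9 and log_bytes <= 27:
        return "pccl_rec"
    return "pccl_ring"


class Dispatcher:
    """Holds one classifier per (machine, collective) pair and routes each call."""

    def __init__(self):
        self._models = {}

    def register(self, machine, collective, classifier):
        self._models[(machine, collective)] = classifier

    def select(self, machine, collective, message_bytes, num_gpus):
        backend = self._models[(machine, collective)](features(message_bytes, num_gpus))
        if backend not in BACKENDS:
            raise ValueError(f"unknown backend {backend!r}")
        return backend


dispatcher = Dispatcher()
dispatcher.register("frontier", "reduce_scatter", placeholder_classifier)
choice = dispatcher.select("frontier", "reduce_scatter", 32 * 2**20, 2048)
```

In a real system, each registered callable would wrap a model fitted on benchmark data for that machine and collective, and the selected backend name would index into the actual collective implementations.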

PCCL adaptive dispatch architecture
Figure 7. PCCL's ML-guided selection mechanism chooses the best-performing backend from the available options for each call.
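| Machine | Collective | Test size | Correct | Accuracy |
|---|---|---|---|---|
| Frontier | All-Gather | 20 | 17 | 85.0% |
| Frontier | Reduce-Scatter | 20 | 18 | 90.0% |
| Frontier | All-Reduce | 20 | 16 | 80.0% |
| Perlmutter | All-Gather | 22 | 20 | 90.9% |
| Perlmutter | Reduce-Scatter | 22 | 21 | 95.4% |
| Perlmutter | All-Reduce | 20 | 15 | 75.0% |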

SVM dispatcher accuracy on held-out test data (20% of the samples). Accuracies between 75.0% and 95.4% suggest that the dispatcher generalizes to unseen configurations.

06 / Results

Collective Performance

Across both systems, PCCL maintains near-flat scaling where the baselines degrade, and the gap widens dramatically with GPU count.

Perlmutter (NVIDIA A100): vs NCCL & Cray-MPICH

All-gather Perlmutter · Reduce-scatter Perlmutter · All-reduce Perlmutter
Figure 8. All-gather (left), reduce-scatter (middle), all-reduce (right) on Perlmutter. PCCL scales nearly perfectly, achieving 1.3–4.6× over NCCL and 8.8–15× over Cray-MPICH at 1024–2048 GPUs. NCCL and PCCL match for all-reduce (both use log-latency algorithms).
PCCL vs NCCL all-gather heatmap · PCCL vs NCCL reduce-scatter heatmap · PCCL vs NCCL all-reduce heatmap
Figure 9. PCCL speedup over NCCL across message size × process count (Perlmutter). In latency-bound cells (~1024–2048 procs, 16–32 MB), PCCL is 3–5× faster; even at 2048 procs / 128–512 MB it remains 2–3× ahead.

Frontier (AMD MI250X): vs RCCL & Cray-MPICH

All-gather Frontier · Reduce-scatter Frontier · All-reduce Frontier
Figure 10. Collectives on Frontier. The runtimes of RCCL and Cray-MPICH grow almost linearly with GCD count beyond 128 GCDs, while PCCL stays near-flat. At 2048 GCDs, PCCL all-gather is 7–24× faster than RCCL and 27–82× faster than Cray-MPICH.
PCCL vs RCCL all-gather heatmap · PCCL vs RCCL reduce-scatter heatmap · PCCL vs RCCL all-reduce heatmap
Figure 11. PCCL speedup over RCCL on Frontier. In the latency-bound regime (2048 GCDs, 16–64 MB), PCCL is >30× for all-gather and 50–100× for reduce-scatter. RCCL shows 200× higher NIC match-overflow and 120× more rendezvous PUT traffic at scale, direct evidence of its degradation.
07 / End-to-End

Real Training Speedups

Communication gains translate into faster large-model training as scale increases, exactly where it matters.

ZeRO-3 strong scaling Frontier · ZeRO-3 strong scaling Perlmutter
Figure 12. DeepSpeed ZeRO-3 strong scaling for GPT-3 7B and 13B on Frontier (left) and Perlmutter (right). On Frontier, RCCL fails to keep scaling past 512 GCDs while PCCL continues, reaching 3.3–4.9× speedups at 2048 GCDs. On Perlmutter, PCCL pulls ahead of NCCL as scale grows (1.37× at 2048 GPUs).
DDP strong scaling Frontier
Figure 13. PyTorch DDP strong scaling for GPT-3 1.3B on Frontier. RCCL leads at small scale, but PCCL surpasses it at high GCD counts: 1.8× at 1024 and 2.4× at 2048 GCDs.

Bottom line

Existing collective libraries leave enormous performance on the table at scale: Cray-MPICH wastes NICs and the GPU, while NCCL/RCCL lack log-latency algorithms for all-gather and reduce-scatter. PCCL's hierarchical design plus ML-guided dispatching delivers 6–160× collective speedups over RCCL on 2048 GCDs of Frontier, and up to 4.9× (ZeRO-3) and 2.4× (DDP) in real LLM training, paving the way for scalable deep learning on next-generation GPU supercomputers.

08 / Cite

BibTeX

@inproceedings{singh2025pccl,
  title     = {The Big Send-off: High-Performance Collectives on
               GPU-based Supercomputers},
  author    = {Singh, Siddharth and Pradeep, Keshav and Singh, Mahua
               and Wei, Cunyang and Bhatele, Abhinav},
  year      = {2025},
  note      = {University of Maryland, College Park}
}