The Big Send-off: High-Performance Collectives on GPU-based Supercomputers

01 / Overview

Abstract

Collective communication is becoming increasingly important in supercomputer workloads with the rise of AI jobs. Yet existing libraries — NCCL, RCCL, and Cray-MPICH — exhibit performance and scalability limitations on modern GPU supercomputers. We introduce the Performant Collective Communication Library (PCCL), targeted at distributed deep learning, with highly optimized implementations of all-gather, reduce-scatter, and all-reduce.

PCCL uses hierarchical algorithms and ML-guided adaptive dispatching to scale efficiently to thousands of GPUs. On 2048 GCDs of Frontier it achieves up to 168× (reduce-scatter), 33× (all-gather), and 10× (all-reduce) over RCCL; up to 5.7× on Perlmutter. These gains translate directly to end-to-end training: up to 4.9× over RCCL in DeepSpeed ZeRO-3, and up to 2.4× in PyTorch DDP.

collective communication, all-gather, reduce-scatter, all-reduce, hierarchical collectives, distributed deep learning

02 / TL;DR

Key Contributions

🔍

Diagnose the libraries

A systematic analysis of Cray-MPICH, NCCL, and RCCL limitations for all-gather and reduce-scatter in DL workloads on Perlmutter and Frontier — pinpointing NIC under-utilization, CPU-side reductions, and missing log-latency algorithms.

⚙️

Build PCCL

Optimized hierarchical implementations of all-gather, reduce-scatter, and all-reduce that fully use system NICs and GPU compute, scaling to large messages and GPU counts.

🚀

Massive speedups

Up to 168× (reduce-scatter), 33× (all-gather), 10× (all-reduce) over RCCL on 2048 GCDs of Frontier; up to 5.7× on Perlmutter.

🧠

End-to-end validation

Multi-billion-parameter LLM training: up to 4.9× over RCCL in DeepSpeed ZeRO-3 and up to 2.4× in PyTorch DDP.

All-gather scaling of RCCL, Cray-MPICH, NCCL — **Figure 1.** All-gather performance for 64 and 128 MB output buffers using RCCL (Frontier), Cray-MPICH (Frontier), and NCCL (Perlmutter). The ideal scaling curve is a flat horizontal line — none of the libraries achieve it, revealing a large performance gap to close.

03 / Background

Why Collectives, Why Large Messages

Modern distributed training is dominated by three collectives: all-gather, reduce-scatter, and all-reduce. Their messages are far larger than in traditional HPC workloads (tens to hundreds of MB, sometimes more than 1 GB), which is exactly the regime where existing libraries struggle.

Sharded Data Parallelism

Parameters and gradients are sharded across GPUs (FSDP, ZeRO-3, AxoNN). All-gather reconstructs full parameters; reduce-scatter reduces and distributes gradients.

Distributed Data Parallelism

Parameters are replicated; all-reduce synchronizes gradients each iteration. A 1B-param FP32 model exchanges 4 GB of gradients per step.
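The 4 GB figure follows directly from the parameter count and the gradient precision. A minimal sketch of that arithmetic, assuming one gradient value per parameter (the function name `gradient_bytes` is illustrative, not part of any framework):

```python
def gradient_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Bytes of gradient data all-reduced per step, one gradient per parameter."""
    return num_params * bytes_per_param


GB = 10**9  # decimal gigabyte

# 1B parameters in FP32 (4 bytes each)
print(gradient_bytes(1_000_000_000) / GB, "GB")
# The same model with 16-bit gradients (2 bytes each)
print(gradient_bytes(1_000_000_000, bytes_per_param=2) / GB, "GB")
```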

The algorithmic toolbox

Ring is simple and bandwidth-efficient, but its latency grows linearly with the process count p. Recursive halving/doubling needs only log₂(p) steps, which makes it far better at scale in latency-bound cases.
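The trade-off can be made concrete with the textbook latency-bandwidth (alpha-beta) cost model for all-gather. This is a sketch under stated assumptions, not the paper's measurements: `ALPHA` and `BETA` are illustrative values, and the model ignores congestion, NIC contention, and intra-node effects.

```python
def _log2_exact(p: int) -> int:
    """Return log2(p) for a power-of-two process count."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return p.bit_length() - 1


def ring_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Ring all-gather: p-1 steps, each moving n/p bytes."""
    return (p - 1) * alpha + (p - 1) / p * n_bytes * beta


def recursive_doubling_time(p: int, n_bytes: float, alpha: float, beta: float) -> float:
    """Recursive doubling all-gather: log2(p) steps, same total bytes per process."""
    return _log2_exact(p) * alpha + (p - 1) / p * n_bytes * beta


ALPHA = 5e-6      # ASSUMED per-message latency in seconds (illustrative)
BETA = 1 / 25e9   # ASSUMED seconds per byte, i.e. 25 GB/s (illustrative)

n = 64 * 2**20    # 64 MiB output buffer
for p in (16, 256, 2048):
    ratio = ring_time(p, n, ALPHA, BETA) / recursive_doubling_time(p, n, ALPHA, BETA)
    print(f"p={p:5d}  modeled ring / recursive-doubling time = {ratio:.2f}")
```

Both algorithms move the same number of bytes per process, so the bandwidth term is identical; only the latency term differs, (p−1)·α versus log₂(p)·α.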

Message size distribution across DL frameworks — **Figure 2.** All-gather and reduce-scatter message sizes across FSDP, DeepSpeed ZeRO-3, and AxoNN for a range of transformer sizes — consistently tens to hundreds of MB, exceeding 1 GB for the largest models.

04 / Diagnosis

What's Wrong With Today's Libraries

We benchmarked all three libraries with best practices (NUMA-aware NIC binding, GPU Direct RDMA, no eager messaging) and found distinct, fixable bottlenecks.

Cray-MPICH wastes NICs and the GPU

Cray-MPICH vs RCCL all-gather — **Figure 3.** RCCL is ~4× faster than Cray-MPICH for bandwidth-bound all-gather (left). NIC counters reveal why: Cray-MPICH routes *all* writes through NIC-0 and reads through NIC-3 (middle, right), while RCCL spreads traffic evenly across all four NICs on the node.

Reduce-scatter CPU vs GPU compute — **Figure 4.** For reduce-scatter, Cray-MPICH (orange) performs reductions on the CPU and lags badly. A custom implementation using MPI point-to-point + a GPU vector-add kernel (blue) is several times faster — confirming the CPU-reduction bottleneck.

Observation 1

Cray-MPICH severely underutilizes available network (NIC) and computational (GPU) resources. It routes all network traffic through a single NIC, and performs reduction operations on the CPU instead of offloading them to the GPU.

NCCL & RCCL scale poorly at large GPU counts

For all-gather and reduce-scatter, NCCL and RCCL only support the ring algorithm. Each process must send and receive (p−1) messages sequentially, so communication time grows linearly with process count — crippling at scale.

Observation 2

NCCL and RCCL rely solely on the ring algorithm for all-gather and reduce-scatter, leading to poor scaling in latency-bound scenarios. More efficient algorithms such as recursive doubling and halving are not supported.
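To see where the (p−1) sequential steps come from, here is a small single-process simulation of the generic ring all-gather. It is the textbook algorithm, not NCCL's or RCCL's actual code, and `ring_allgather` is a name chosen for illustration.

```python
def ring_allgather(chunks):
    """Simulate a ring all-gather; chunks[r] is rank r's contribution.

    Returns (buffers, steps), where buffers[r] is rank r's final buffer.
    """
    p = len(chunks)
    buffers = [[None] * p for _ in range(p)]
    for r in range(p):
        buffers[r][r] = chunks[r]

    steps = 0
    for s in range(p - 1):
        # In step s, rank r forwards the chunk that originated at rank (r - s) mod p
        # to its right neighbour. Collect all sends first so the step is simultaneous.
        sends = [((r + 1) % p, (r - s) % p, buffers[r][(r - s) % p]) for r in range(p)]
        for dst, idx, data in sends:
            buffers[dst][idx] = data
        steps += 1
    return buffers, steps


buffers, steps = ring_allgather(["a", "b", "c", "d"])
print(steps, buffers[0])
```

Each step depends on data received in the previous one, so the steps cannot be overlapped, and the step count grows with p.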

05 / Design

How PCCL Fixes It

PCCL combines a two-level hierarchical algorithm, custom inter-node implementations, and an ML-guided dispatcher that picks the best backend per configuration.

Hierarchical all-gather schematic — **Figure 5.** PCCL's two-level hierarchical all-gather on N nodes × M GPUs. Step 1: concurrent inter-node all-gathers (mapping each GCD to its own NIC to fill all four). Step 2: intra-node all-gather. Step 3: a device-local transpose to reorder data. Reduce-scatter mirrors this; all-reduce composes the two.
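The three steps in Figure 5 can be sketched as a single-process simulation. This is an illustration of the data movement only, assuming global rank = node × M + local index; the function name and layout are assumptions, not PCCL's API.

```python
def hierarchical_allgather(chunks, num_nodes, gpus_per_node):
    """Simulate a two-level all-gather; chunks[g] belongs to global rank g."""
    N, M = num_nodes, gpus_per_node
    assert len(chunks) == N * M

    # Step 1: for each local index l, the GPUs (0, l) ... (N-1, l) all-gather
    # across nodes. The M groups are independent and can run concurrently.
    inter = {}
    for l in range(M):
        gathered = [chunks[n * M + l] for n in range(N)]
        for n in range(N):
            inter[(n, l)] = list(gathered)

    results = []
    for n in range(N):
        for _ in range(M):
            # Step 2: intra-node all-gather concatenates the M partial results,
            # giving an M x N layout indexed [local][node].
            staged = [inter[(n, j)] for j in range(M)]
            # Step 3: device-local transpose to [node][local], i.e. global rank order.
            ordered = [staged[j][i] for i in range(N) for j in range(M)]
            results.append(ordered)
    return results


out = hierarchical_allgather(list(range(8)), num_nodes=2, gpus_per_node=4)
print(out[0])
```

Without step 3 the buffer would hold the chunks in [local][node] order (0, 4, 1, 5, ... in this example), which is why the transpose is needed.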

Backend A

PCCL_ring

Inter-node ring algorithm
Best for bandwidth-bound regimes (few processes, large messages)
Saturates peer-to-peer bandwidth

Backend B

PCCL_rec

Recursive doubling (AG) / halving (RS)
Best for latency-bound regimes (many processes, smaller messages)
GPU-side reduction kernels; log₂(p) steps
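For comparison with the ring, the following simulates the generic recursive doubling all-gather for a power-of-two process count. It is the textbook algorithm, shown to make the log₂(p) step count visible; it is not PCCL_rec's actual implementation.

```python
def recursive_doubling_allgather(chunks):
    """Simulate recursive doubling all-gather; chunks[r] is rank r's contribution.

    Returns (buffers, steps). Requires a power-of-two number of ranks.
    """
    p = len(chunks)
    assert p > 0 and p & (p - 1) == 0, "process count must be a power of two"

    buffers = [{r: chunks[r]} for r in range(p)]
    steps = 0
    dist = 1
    while dist < p:
        # Snapshot so every exchange in this step uses pre-step data.
        snapshot = [dict(b) for b in buffers]
        for r in range(p):
            partner = r ^ dist  # partner differs from r in exactly one bit
            buffers[r].update(snapshot[partner])
        dist *= 2  # the amount of data each rank holds doubles every step
        steps += 1
    return [[b[i] for i in range(p)] for b in buffers], steps


buffers, steps = recursive_doubling_allgather(list(range(8)))
print(steps, buffers[5])
```

Eight ranks finish in three exchange steps, where a ring would need seven.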

Recursive halving vs ring speedup heatmap — **Figure 6.** Speedup of recursive halving over ring for the inter-node phase of reduce-scatter. Ring wins in bandwidth-bound cells (top-left); recursive halving wins decisively in latency-bound cells (bottom-right) — motivating adaptive selection.

ML-guided adaptive dispatching

No single backend wins everywhere. PCCL trains a lightweight SVM per (machine, collective) on message size and GPU count to pick the best of Cray-MPICH, NCCL/RCCL, PCCL_ring, and PCCL_rec at runtime.
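As a rough sketch of what runtime dispatch could look like, the snippet below applies a linear decision function to the two features named above. Everything specific here is an assumption: the linear kernel, the log₂ feature scaling, the restriction to two backends, and especially `WEIGHTS` and `BIAS`, which are hand-picked placeholders and not values learned from the paper's data.

```python
import math

# HYPOTHETICAL parameters for illustration only. A real dispatcher would learn
# these offline from benchmark timings for each (machine, collective) pair.
WEIGHTS = (1.0, -1.0)  # applied to (log2 GPU count, log2 message size in MB)
BIAS = -2.0


def features(message_mb: float, num_gpus: int):
    """Log-scale both inputs so powers of two are evenly spaced."""
    return (math.log2(num_gpus), math.log2(message_mb))


def choose_backend(message_mb: float, num_gpus: int) -> str:
    """Binary linear decision: a positive score selects the log-latency backend."""
    x = features(message_mb, num_gpus)
    score = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return "PCCL_rec" if score > 0 else "PCCL_ring"


print(choose_backend(message_mb=16, num_gpus=2048))    # many processes, smaller message
print(choose_backend(message_mb=1024, num_gpus=16))    # few processes, large message
```

The placeholder boundary only mirrors the qualitative trend in Figure 6; choosing among all four backends would need a multi-class classifier.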

PCCL adaptive dispatch architecture — **Figure 7.** PCCL's ML-guided selection mechanism chooses the best-performing backend from the available options for each call.

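| Machine | Collective | Test size | Correct | Accuracy |
|---|---|---|---|---|
| Frontier | All-Gather | 20 | 17 | 85.0% |
| Frontier | Reduce-Scatter | 20 | 18 | 90.0% |
| Frontier | All-Reduce | 20 | 16 | 80.0% |
| Perlmutter | All-Gather | 22 | 20 | 90.9% |
| Perlmutter | Reduce-Scatter | 22 | 21 | 95.4% |
| Perlmutter | All-Reduce | 20 | 15 | 75.0% |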

SVM dispatcher accuracy on held-out test data (a 20% split). Accuracy between 75.0% and 95.4% indicates that the dispatcher generalizes to unseen configurations.

06 / Results

Collective Performance

Across both systems, PCCL maintains near-flat scaling where the baselines degrade — the gap widens dramatically with GPU count.

Perlmutter (NVIDIA A100) — vs NCCL & Cray-MPICH

All-gather Perlmutter — **Figure 8.** All-gather (left), reduce-scatter (middle), all-reduce (right) on Perlmutter. PCCL scales nearly perfectly, achieving 1.3–4.6× over NCCL and 8.8–15× over Cray-MPICH at 1024–2048 GPUs. NCCL and PCCL match for all-reduce (both use log-latency algorithms).

PCCL vs NCCL all-gather heatmap — **Figure 9.** PCCL speedup over NCCL across message size × process count (Perlmutter). In latency-bound cells (~1024–2048 procs, 16–32 MB), PCCL is 3–5× faster; even at 2048 procs / 128–512 MB it remains 2–3× ahead.

Frontier (AMD MI250X) — vs RCCL & Cray-MPICH

**Figure 10.** Collectives on Frontier (all-gather and reduce-scatter panels). Beyond 128 GCDs, the runtimes of RCCL and Cray-MPICH grow almost linearly with GCD count, while PCCL stays near-flat. At 2048 GCDs, PCCL all-gather is 7–24× faster than RCCL and 27–82× faster than Cray-MPICH.

PCCL vs RCCL all-gather heatmap — **Figure 11.** PCCL speedup over RCCL on Frontier. In the latency-bound regime (2048 GCDs, 16–64 MB), PCCL is >30× for all-gather and 50–100× for reduce-scatter. RCCL shows 200× higher NIC match-overflow and 120× more rendezvous PUT traffic at scale — direct evidence of its degradation.

07 / End-to-End

Real Training Speedups

Communication gains translate into faster large-model training as scale increases — exactly where it matters.

ZeRO-3 strong scaling Frontier — **Figure 12.** DeepSpeed ZeRO-3 strong scaling for GPT-3 7B and 13B on Frontier (left) and Perlmutter (right). On Frontier, RCCL fails to keep scaling past 512 GCDs while PCCL continues — reaching 3.3–4.9× speedups at 2048 GCDs. On Perlmutter, PCCL pulls ahead of NCCL as scale grows (1.37× at 2048 GPUs).

DDP strong scaling Frontier — **Figure 13.** PyTorch DDP strong scaling for GPT-3 1.3B on Frontier. RCCL leads at small scale, but PCCL surpasses it at high GCD counts — 1.8× at 1024 and 2.4× at 2048 GCDs.

Bottom line

Existing collective libraries leave enormous performance on the table at scale: Cray-MPICH wastes NICs and the GPU, while NCCL/RCCL lack log-latency algorithms for all-gather and reduce-scatter. PCCL's hierarchical design plus ML-guided dispatching delivers 6–160× collective speedups over RCCL on 2048 GCDs of Frontier, and up to 4.9× (ZeRO-3) and 2.4× (DDP) in real LLM training — paving the way for scalable deep learning on next-generation GPU supercomputers.

08 / Cite

BibTeX

@inproceedings{singh2025pccl,
  title     = {The Big Send-off: High-Performance Collectives on
               GPU-based Supercomputers},
  author    = {Singh, Siddharth and Pradeep, Keshav and Singh, Mahua
               and Wei, Cunyang and Bhatele, Abhinav},
  year      = {2025},
  note      = {University of Maryland, College Park}
}