SABRE: Skew-aware Adaptive All-to-allv Algorithms for Dynamic Deep Learning Workloads

01 / Overview

Abstract

All-to-allv is a commonly used collective and a significant performance bottleneck in HPC and distributed deep learning. DL workloads in particular exhibit highly skewed data-size distributions, dynamically changing communication, and large messages — all of which make all-to-allv hard to optimize.

We present SABRE, a Skew-aware All-to-allv library for Balancing irRegular communication on GPU-based clusters. SABRE performs well under both highly-skewed and lightly-skewed patterns, selecting different optimization strategies in each regime. Through imbalance detection, adaptive algorithm selection, careful backend selection, and memory-management optimizations, it achieves up to 2.4× speedup over Cray MPICH and NCCL in microbenchmarks, and improves Mixture-of-Experts training time by up to 1.8× over the default PyTorch implementation.

collective communication · all-to-allv · mixture of experts · distributed deep learning · load balancing

02 / TL;DR

Key Contributions

📐

Provably optimal balancing

We derive a tight lower bound for highly-skewed all-to-allv — achieved when each node spreads send/receive traffic evenly across its NICs — and a practical approximation via intra-node gathering, inter-node grouping, and redistribution.

🔀

Pipelined 2D algorithm

For lightly-skewed traffic, a pipelined two-dimensional all-to-allv overlaps intra-node and inter-node communication to maximize bandwidth while controlling network congestion.

🎯

Runtime skew detection

A maximum-to-mean (MTM) ratio classifies every all-to-allv invocation as highly- or lightly-skewed and dispatches it to the right algorithm — adapting to MoE routing that changes each iteration.

⚡

Real, drop-in speedups

Up to 2.4× over Cray MPICH and NCCL in microbenchmarks and 1.28–1.79× end-to-end in Megatron-LM MoE training — via a Python API that replaces dist.all_to_all_single.

03 / Motivation

Why Existing All-to-allv Falls Short

GPU supercomputers have far higher intra-node bandwidth than inter-node bandwidth, and MoE traffic is skewed and dynamic. Static and scheduling-based algorithms can't handle both at once.

Static algorithms

Fan-out, spread-out, and Bruck variants assume uniform bandwidth and fixed schedules. They can't exploit fast intra-node links, and the busiest rank becomes a long-tail bottleneck. Bruck's extra traffic is counterproductive for MB–GB messages.

Scheduling algorithms

TACCL and FAST compute near-optimal schedules, but synthesizing a schedule takes anywhere from seconds to over an hour (TACCL: more than 1 hour for 64 GPUs). A single transfer finishes in milliseconds and MoE routing shifts every iteration, so there is no schedule to reuse.

All-to-all algorithm comparison on Perlmutter — **Figure 1.** Performance of all-to-all algorithms scaling 16→256 GPUs with a 128 MB message (balanced case). Even with no skew, traditional static algorithms hit fundamental limits on modern GPU systems; NCCL-based fan-out is the strongest baseline thanks to topology-awareness and multichannel parallelism.

04 / Design

One Adaptive Library, Two Regimes

SABRE measures imbalance with a single scalar — the maximum-to-mean (MTM) ratio — then dispatches each call to the algorithm best suited to its skew.

SABRE library components — **Figure 2.** SABRE computes the degree of imbalance from the communication matrix, then the algorithm selector routes the call to the highly-skewed or lightly-skewed algorithm.

Skewness metric

For each rank i, let s_i and r_i be the inter-node data it sends and receives. MTM = max(Send-MTM, Recv-MTM), where Send-MTM = max_i s_i / mean_i s_i and Recv-MTM is defined the same way over r_i, i.e. the peak per-process volume normalized by the per-process mean. A higher MTM means more severe skew, and the ratio is cheap enough to compute on every invocation.
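As a rough illustration of the metric, the sketch below computes MTM from a P×P byte matrix. The function name `mtm_ratio`, the contiguous rank-to-node mapping, and the choice to return 1.0 when there is no inter-node traffic are assumptions made for this example, not details taken from SABRE.

```python
from typing import Sequence


def mtm_ratio(comm: Sequence[Sequence[int]], gpus_per_node: int) -> float:
    """Maximum-to-mean ratio of inter-node traffic.

    comm[i][j] is the number of bytes rank i sends to rank j.
    Assumption: ranks are placed contiguously, so rank r is on node
    r // gpus_per_node.
    """
    p = len(comm)
    send = [0] * p  # s_i: inter-node bytes sent by rank i
    recv = [0] * p  # r_i: inter-node bytes received by rank i
    for i in range(p):
        for j in range(p):
            if i // gpus_per_node != j // gpus_per_node:
                send[i] += comm[i][j]
                recv[j] += comm[i][j]

    def max_to_mean(volumes):
        mean = sum(volumes) / len(volumes)
        # Assumption: no inter-node traffic at all is treated as balanced.
        return max(volumes) / mean if mean > 0 else 1.0

    return max(max_to_mean(send), max_to_mean(recv))


# 4 ranks, 2 per node. Every pair exchanges 10 bytes.
balanced = [
    [0, 10, 10, 10],
    [10, 0, 10, 10],
    [10, 10, 0, 10],
    [10, 10, 10, 0],
]
# Rank 0 sends 100 bytes to each rank on the other node.
skewed = [
    [0, 5, 100, 100],
    [5, 0, 10, 10],
    [10, 10, 0, 5],
    [10, 10, 5, 0],
]

print(mtm_ratio(balanced, 2))          # 1.0
print(round(mtm_ratio(skewed, 2), 2))  # 3.08
```

In the skewed matrix, rank 0 sends 200 of the 260 inter-node bytes, so Send-MTM = 200 / 65 dominates Recv-MTM = 110 / 65.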

High MTM

Highly-skewed

A few NICs carry most traffic; the rest idle
Bottleneck = the most-overloaded NIC (long tail)
Fix: balance each node's send/receive evenly across its NICs

Low MTM

Lightly-skewed

Most processes send similar volumes
Bottleneck = inter-node congestion from many concurrent flows
Fix: overlap intra-/inter-node phases, group transfers
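The two regimes above amount to a threshold test on MTM. The sketch below shows that dispatch step only. The cutoff value `MTM_THRESHOLD = 2.0` is a placeholder chosen for the example (the text here does not state the threshold SABRE uses), and `Regime` and `classify` are hypothetical names.

```python
from enum import Enum


class Regime(Enum):
    HIGHLY_SKEWED = "highly-skewed"
    LIGHTLY_SKEWED = "lightly-skewed"


# Placeholder cutoff for illustration only; not a value from SABRE.
MTM_THRESHOLD = 2.0


def classify(mtm: float, threshold: float = MTM_THRESHOLD) -> Regime:
    """Map an MTM ratio to the regime whose algorithm should handle the call."""
    if mtm >= threshold:
        # A few NICs carry most of the traffic: balance load across NICs.
        return Regime.HIGHLY_SKEWED
    # Volumes are similar: overlap intra-/inter-node phases, group transfers.
    return Regime.LIGHTLY_SKEWED


print(classify(3.08).value)  # highly-skewed
print(classify(1.05).value)  # lightly-skewed
```

Because the decision depends on a single scalar, it can be repeated on every invocation as the routing pattern changes.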

05 / Algorithms

The Two Algorithms

Highly-skewed: provably optimal NIC balancing

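We prove that completion time is bounded below by T_LB = max(max_u S_u/(mC), max_w R_w/(mC)), the time the heaviest node needs to drain all its data through its m NICs. Here S_u and R_w are the total inter-node volumes that node u sends and node w receives, and C is the bandwidth of a single NIC. Spreading each node's traffic evenly across its NICs achieves this bound exactly (T★ = T_LB), which makes equal spreading globally optimal rather than just a heuristic.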
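To make the bound concrete, the sketch below evaluates T_LB for a small node-level traffic matrix. It assumes S_u and R_w are the total inter-node volumes sent by node u and received by node w, and that C is the bandwidth of one NIC. The function name and the numbers (3 nodes, 4 NICs, 25 GB/s per NIC) are illustrative, not measurements.

```python
from typing import Sequence


def lower_bound_time(
    node_comm: Sequence[Sequence[float]],
    nics_per_node: int,
    nic_bandwidth: float,
) -> float:
    """T_LB = max(max_u S_u, max_w R_w) / (m * C).

    node_comm[u][w] is the volume node u sends to node w; the diagonal
    (intra-node traffic) is ignored because it does not cross a NIC.
    """
    n = len(node_comm)
    sent = [sum(node_comm[u][w] for w in range(n) if w != u) for u in range(n)]
    received = [sum(node_comm[u][w] for u in range(n) if u != w) for w in range(n)]
    return max(max(sent), max(received)) / (nics_per_node * nic_bandwidth)


# Volumes in GB; node 0 is the heavy sender (S_0 = 12 GB).
node_comm = [
    [0, 8, 4],
    [2, 0, 2],
    [2, 2, 0],
]

print(lower_bound_time(node_comm, nics_per_node=4, nic_bandwidth=25.0))  # 0.12
```

With these numbers the send side dominates (S_0 = 12 GB against a largest receive volume of 10 GB). If node 0 pushed all 12 GB through one NIC instead of four, the transfer would take 12 / 25 = 0.48 s, four times the bound.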

Highly-skewed 2D all-to-allv algorithm — **Figure 3.** The three-phase highly-skewed algorithm: (1) intra-node gathering + greedy load balancing with ID-matching and dynamic block splitting, (2) grouped, batched inter-node exchange (an N×N node-level pattern rather than dense P×P), (3) intra-node distribution and final assembly. Sending and receiving sides run the same splitting plan, avoiding any metadata exchange.
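The block-splitting idea in phase (1) can be illustrated with a much simpler stand-in: fill each NIC up to an equal quota and cut a block wherever it crosses a quota boundary. This is not SABRE's greedy ID-matching scheme, only a minimal sketch of even spreading. `spread_evenly` and its tuple layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def spread_evenly(
    blocks: Sequence[Tuple[int, int]], num_nics: int
) -> List[List[Tuple[int, int, int]]]:
    """Split a node's outgoing blocks so every NIC carries an equal share.

    blocks: (dest_node, nbytes) pairs gathered from all GPUs on the node.
    Returns, per NIC, a list of (dest_node, offset, nbytes) pieces.
    """
    total = sum(size for _, size in blocks)
    base, extra = divmod(total, num_nics)
    # The first `extra` NICs take one additional byte when total % num_nics != 0.
    quotas = [base + (1 if k < extra else 0) for k in range(num_nics)]

    plan: List[List[Tuple[int, int, int]]] = [[] for _ in range(num_nics)]
    nic = 0
    for dest, size in blocks:
        offset = 0
        while size > 0:
            while quotas[nic] == 0:  # current NIC is full, move to the next
                nic += 1
            take = min(size, quotas[nic])
            plan[nic].append((dest, offset, take))
            quotas[nic] -= take
            offset += take
            size -= take
    return plan


# One large block to node 1 plus two small ones, spread over 4 NICs.
plan = spread_evenly([(1, 100), (2, 20), (3, 8)], num_nics=4)
for k, pieces in enumerate(plan):
    print(k, pieces, sum(n for _, _, n in pieces))
```

Each NIC ends up with 32 of the 128 bytes, and the 100-byte block is cut into four pieces. The function is deterministic, so a sender and a receiver that both know the communication matrix can derive the same plan independently, which is the property the caption relies on to avoid a metadata exchange.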

Lightly-skewed: pipelined 2D overlap

When traffic is balanced, the enemy is congestion. SABRE partitions the system into a 2D mesh and overlaps intra-node forwarding with inter-node exchange, using abundant intra-node bandwidth to relieve pressure on the network fabric while grouping inter-node messages.

Lightly-skewed 2D pipelined all-to-allv algorithm — **Figure 4.** The lightly-skewed pipelined 2D all-to-allv. Overlapping intra-node relay with inter-node exchange keeps fast links busy and cuts contention on inter-node links — the overlap is what turns 2D grouping into consistent speedups.
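The grouping effect of a 2D decomposition can be seen in a small, sequential simulation. The routing below is the generic hierarchical scheme (relay through the node-mate that shares the destination's local index, then send one grouped message per remote node). It is an assumption for illustration, since SABRE's exact routing is not spelled out here, and it does not model the pipelined overlap.

```python
from typing import Dict, List, Tuple


def two_phase_all_to_allv(
    data: List[List[str]], num_nodes: int, gpus_per_node: int
) -> Tuple[List[Dict[int, str]], int]:
    """Simulate a 2D all-to-allv; data[src][dst] is the payload src sends to dst.

    Returns (received, inter_node_messages) where received[dst][src] is the
    payload that arrived at dst from src.
    """
    p = num_nodes * gpus_per_node

    # Phase 1 (intra-node): rank (n, l) hands each block bound for local
    # index l2 on any node to its node-mate (n, l2).
    staged: List[Dict[Tuple[int, int], str]] = [dict() for _ in range(p)]
    for src in range(p):
        node = src // gpus_per_node
        for dst in range(p):
            relay = node * gpus_per_node + dst % gpus_per_node
            staged[relay][(src, dst)] = data[src][dst]

    # Phase 2 (inter-node): each relay sends one grouped message per
    # destination rank that shares its local index.
    received: List[Dict[int, str]] = [dict() for _ in range(p)]
    inter_node_messages = 0
    for relay in range(p):
        groups: Dict[int, List[Tuple[int, str]]] = {}
        for (src, dst), payload in staged[relay].items():
            groups.setdefault(dst, []).append((src, payload))
        for dst, items in groups.items():
            if dst // gpus_per_node != relay // gpus_per_node:
                inter_node_messages += 1
            for src, payload in items:
                received[dst][src] = payload
    return received, inter_node_messages


N, G = 2, 2
P = N * G
data = [[f"{s}->{d}" for d in range(P)] for s in range(P)]
received, msgs = two_phase_all_to_allv(data, N, G)
print(msgs)            # 4
print(received[3][0])  # 0->3
```

In this scheme each rank sends N − 1 inter-node messages instead of P − G, so the 2-node, 2-GPU example needs 4 inter-node messages where a direct exchange needs 8. A pipelined version would split the payloads into chunks and start phase 2 on one chunk while phase 1 is still relaying the next.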

06 / Results

Performance on Perlmutter

Evaluated on Perlmutter (4× A100 + 4× Cassini NICs per node) using realistic communication matrices extracted from MoE training jobs.

Highly-skewed all-to-allv

Highly-skewed results 128MB — **Figure 5.** Completion time under highly-skewed patterns (128 MB, 256 MB) scaling 16→256 GPUs. Cray MPICH and NCCL track each other closely — both limited by stragglers. SABRE removes the stragglers, achieving 1.8×, 1.9×, 2.3×, 2.4×, and 2.4× speedup at 16/32/64/128/256 GPUs.

Lightly-skewed all-to-allv

Lightly-skewed results 128MB — **Figure 6.** Lightly-skewed patterns (128 MB, 256 MB). SABRE runs 10–30% faster than both baselines. Even SABRE w/o overlap (2D grouping only) beats NCCL and Cray MPICH — but the pipelined overlap is what delivers consistent gains.

Speedup heatmap highly-skewed — **Figure 7.** SABRE speedup over NCCL across message size × GPU count for highly-skewed (left) and lightly-skewed (right) regimes. Strong, consistent speedups across the board; the only soft spot is very high GPU counts with medium per-process totals, where per-peer blocks shrink below the pipeline's sweet spot.

End-to-end MoE training (Megatron-LM)

End-to-end MoE training speedup — **Figure 8.** MoE training time vs. PyTorch's default all_to_all_single, with expert parallelism, one expert per GPU, top-k=2 routing. SABRE recomputes MTM and re-selects the algorithm every iteration (overhead included), delivering 1.28×, 1.79×, 1.40× at 16/32/64 GPUs and 1.69×, 1.71× at 128/256 GPUs.

Bottom line

All-to-allv bottlenecks in MoE training come from two distinct sources — NIC-level load imbalance under skew, and inter-node congestion when balanced. SABRE detects which regime it's in via the MTM ratio and applies a provably-optimal balancing scheme or a pipelined 2D overlap accordingly, achieving up to 2.4× microbenchmark and 1.8× end-to-end speedups on Perlmutter as a drop-in replacement for PyTorch all-to-allv.

07 / Cite

BibTeX

@inproceedings{wei2026sabre,
  title     = {Skew-aware Adaptive All-to-allv Algorithms for
               Dynamic Deep Learning Workloads},
  author    = {Wei, Cunyang and Bhatele, Abhinav},
  booktitle = {Proceedings of the 2026 International Conference on
               Supercomputing (ICS '26)},
  year      = {2026},
  doi       = {10.1145/3797905.3800541}
}