This page was automatically generated by Claude Code.
ICS 2026 · All-to-allv for MoE

Skew-aware Adaptive All-to-allv Algorithms for Dynamic Deep Learning Workloads

SABRE: a Skew-aware All-to-allv library for Balancing irREgular communication on GPU clusters, built for the dynamic, skewed traffic of Mixture-of-Experts training.

Cunyang Wei, Abhinav Bhatele
Department of Computer Science, University of Maryland, College Park
2.4×
Microbenchmark speedup vs MPICH/NCCL
1.8×
End-to-end MoE training speedup
256
GPUs evaluated on Perlmutter
2
Skew regimes, one adaptive library
01 / Overview

Abstract

All-to-allv is a commonly used collective and a significant performance bottleneck in HPC and distributed deep learning. DL workloads in particular exhibit highly skewed data-size distributions, dynamically changing communication, and large messages, all of which make all-to-allv hard to optimize.

We present SABRE, a Skew-aware All-to-allv library for Balancing irREgular communication on GPU-based clusters. SABRE performs well under both highly-skewed and lightly-skewed patterns, selecting different optimization strategies in each regime. Through imbalance detection, adaptive algorithm selection, careful backend selection, and memory-management optimizations, it achieves up to 2.4× speedup over Cray MPICH and NCCL in microbenchmarks, and improves Mixture-of-Experts training time by up to 1.8× over the default PyTorch implementation.

collective communication · all-to-allv · mixture of experts · distributed deep learning · load balancing
02 / TL;DR

Key Contributions

Provably optimal balancing

We derive a tight lower bound for highly-skewed all-to-allv, achieved when each node spreads send/receive traffic evenly across its NICs, and a practical approximation via intra-node gathering, inter-node grouping, and redistribution.

Pipelined 2D algorithm

For lightly-skewed traffic, a pipelined two-dimensional all-to-allv overlaps intra-node and inter-node communication to maximize bandwidth while controlling network congestion.

Runtime skew detection

A maximum-to-mean (MTM) ratio classifies every all-to-allv invocation as highly- or lightly-skewed and dispatches it to the right algorithm, adapting to MoE routing that changes each iteration.

Real, drop-in speedups

Up to 2.4× over Cray MPICH and NCCL in microbenchmarks and 1.28–1.79× end-to-end in Megatron-LM MoE training, via a Python API that replaces dist.all_to_all_single.

03 / Motivation

Why Existing All-to-allv Falls Short

GPU supercomputers have far higher intra-node bandwidth than inter-node bandwidth, and MoE traffic is skewed and dynamic. Static and scheduling-based algorithms can't handle both at once.

Static algorithms

Fan-out, spread-out, and Bruck variants assume uniform bandwidth and fixed schedules. They can't exploit fast intra-node links, and the busiest rank becomes a long-tail bottleneck. Bruck's extra traffic is counterproductive for MB-to-GB messages.

Scheduling algorithms

TACCL/FAST compute near-optimal schedules but cost seconds to over an hour (TACCL: >1h for 64 GPUs). A single transfer finishes in milliseconds, and MoE routing shifts every iteration, so there's no schedule to reuse.

All-to-all algorithm comparison on Perlmutter
Figure 1. Performance of all-to-all algorithms scaling 16→256 GPUs with a 128 MB message (balanced case). Even with no skew, traditional static algorithms hit fundamental limits on modern GPU systems; NCCL-based fan-out is the strongest baseline thanks to topology-awareness and multichannel parallelism.
04 / Design

One Adaptive Library, Two Regimes

SABRE measures imbalance with a single scalar, the maximum-to-mean (MTM) ratio, then dispatches each call to the algorithm best suited to its skew.

SABRE library components
Figure 2. SABRE computes the degree of imbalance from the communication matrix, then the algorithm selector routes the call to the highly-skewed or lightly-skewed algorithm.
Skewness metric

For each rank $i$, let $s_i$ and $r_i$ be the inter-node data it sends and receives. MTM = max(Send-MTM, Recv-MTM), where each is the peak volume normalized by the per-process mean. A higher MTM means more severe skew, and it's cheap enough to compute on every invocation.
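
As a rough illustration of the metric described above, the following Python sketch computes the MTM ratio from a rank-by-rank communication matrix. The function name `mtm_ratio`, the matrix layout, the `gpus_per_node` rank-to-node mapping, and the choice to return 1.0 when a call has no inter-node traffic are assumptions made for this example; they are not SABRE's actual API.

```python
def mtm_ratio(matrix, gpus_per_node):
    """Maximum-to-mean ratio of inter-node traffic (illustrative sketch).

    matrix[i][j] is the number of bytes rank i sends to rank j.
    Assumption: ranks i and j share a node when
    i // gpus_per_node == j // gpus_per_node.
    """
    n = len(matrix)
    node = [rank // gpus_per_node for rank in range(n)]
    sent = [0] * n  # s_i: inter-node bytes sent by rank i
    recv = [0] * n  # r_i: inter-node bytes received by rank i
    for i in range(n):
        for j in range(n):
            if node[i] != node[j]:  # intra-node traffic is ignored
                sent[i] += matrix[i][j]
                recv[j] += matrix[i][j]
    total = sum(sent)
    if total == 0:
        return 1.0  # assumed convention: no inter-node traffic, no skew
    mean = total / n  # per-process mean (same for sends and receives)
    return max(max(sent) / mean, max(recv) / mean)


# Two nodes with two GPUs each.
balanced = [
    [0, 5, 10, 10],
    [5, 0, 10, 10],
    [10, 10, 0, 5],
    [10, 10, 5, 0],
]
skewed = [
    [0, 0, 60, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 20, 0],
]
print(mtm_ratio(balanced, 2))  # 1.0
print(mtm_ratio(skewed, 2))    # 4.0
```

In the skewed matrix the receive side dominates: rank 2 receives all 80 units against a per-process mean of 20, so Recv-MTM (4.0) exceeds Send-MTM (3.0).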

High MTM

Highly-skewed

  • A few NICs carry most traffic; the rest idle
  • Bottleneck = the most-overloaded NIC (long tail)
  • Fix: balance each node's send/receive evenly across its NICs
Low MTM

Lightly-skewed

  • Most processes send similar volumes
  • Bottleneck = inter-node congestion from many concurrent flows
  • Fix: overlap intra-/inter-node phases, group transfers
05 / Algorithms

The Two Algorithms

Highly-skewed: provably optimal NIC balancing

We prove that completion time is bounded below by $T_{LB} = \max\left(\max_u \frac{S_u}{mC},\ \max_w \frac{R_w}{mC}\right)$, the time the heaviest node needs to drain all its data through its $m$ NICs. Spreading each node's traffic evenly across its NICs achieves this bound exactly ($T^\star = T_{LB}$), making equal-spreading globally optimal, not just a heuristic.
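
To make the bound concrete, here is a small numerical sketch. It assumes that S_u and R_w are the per-node send and receive totals and that C is the bandwidth of a single NIC, which the page does not spell out; the function names and the numbers (4 NICs at 20 GB/s, 80 GB on one rank) are purely illustrative.

```python
def lower_bound_time(node_send, node_recv, nics_per_node, nic_bandwidth):
    """T_LB = max(max_u S_u, max_w R_w) / (m * C)."""
    heaviest = max(max(node_send), max(node_recv))
    return heaviest / (nics_per_node * nic_bandwidth)


def single_nic_time(rank_send, nic_bandwidth):
    """Time if every rank pushes only its own data through its own NIC."""
    return max(rank_send) / nic_bandwidth


m, C = 4, 20.0                      # assumed: 4 NICs per node, 20 GB/s each
rank_send = [80.0, 0.0, 0.0, 0.0]   # GB per rank on the heaviest node
node_send = [80.0, 40.0]            # S_u for two nodes (GB)
node_recv = [40.0, 80.0]            # R_w for two nodes (GB)

t_lb = lower_bound_time(node_send, node_recv, m, C)
t_unbalanced = single_nic_time(rank_send, C)
t_spread = (sum(rank_send) / m) / C  # equal share per NIC

print(t_lb, t_unbalanced, t_spread)  # 1.0 4.0 1.0
```

With one rank holding all of the node's outgoing data, a single NIC takes four times as long as the bound, while an even split across the four NICs meets it.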

Highly-skewed 2D all-to-allv algorithm
Figure 3. The three-phase highly-skewed algorithm: (1) intra-node gathering + greedy load balancing with ID-matching and dynamic block splitting, (2) grouped, batched inter-node exchange (an N×N node-level pattern rather than dense P×P), (3) intra-node distribution and final assembly. Sending and receiving sides run the same splitting plan, avoiding any metadata exchange.
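
The block-splitting idea in phase (1) can be sketched as follows. This is a deliberately simplified, deterministic fill-to-target split, not the greedy ID-matching scheme from the paper; the function name `split_blocks_across_nics` and the `(block_id, offset, length)` plan format are hypothetical. Because the plan depends only on the block sizes, a sender and a receiver that both know those sizes would compute the same plan independently.

```python
def split_blocks_across_nics(blocks, num_nics):
    """Assign byte ranges of blocks to NICs so loads differ by at most 1.

    blocks: list of (block_id, size_in_bytes), in a fixed agreed order.
    Returns one list per NIC of (block_id, offset, length) pieces.
    """
    total = sum(size for _, size in blocks)
    base, extra = divmod(total, num_nics)
    # The first `extra` NICs carry one byte more than the rest.
    capacity = [base + (1 if k < extra else 0) for k in range(num_nics)]
    plan = [[] for _ in range(num_nics)]
    nic, room = 0, capacity[0]
    for block_id, size in blocks:
        offset = 0
        while offset < size:
            while room == 0:          # current NIC is full, move on
                nic += 1
                room = capacity[nic]
            take = min(room, size - offset)
            plan[nic].append((block_id, offset, take))
            offset += take
            room -= take
    return plan


blocks = [("a", 100), ("b", 20), ("c", 0), ("d", 40)]
plan = split_blocks_across_nics(blocks, 4)
loads = [sum(length for _, _, length in pieces) for pieces in plan]
print(loads)  # [40, 40, 40, 40]
```

Here the 100-byte block "a" is split across three NICs, so no single NIC carries more than a quarter of the node's 160 bytes.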

Lightly-skewed: pipelined 2D overlap

When traffic is balanced, the enemy is congestion. SABRE partitions the system into a 2D mesh and overlaps intra-node forwarding with inter-node exchange, using abundant intra-node bandwidth to relieve pressure on the network fabric while grouping inter-node messages.

Lightly-skewed 2D pipelined all-to-allv algorithm
Figure 4. The lightly-skewed pipelined 2D all-to-allv. Overlapping intra-node relay with inter-node exchange keeps fast links busy and cuts contention on inter-node links; the overlap is what turns 2D grouping into consistent speedups.
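
Why overlap helps can be seen in a toy cost model. The sketch below treats the exchange as a two-stage pipeline over equal-sized chunks, with a fixed intra-node time and a fixed inter-node time per chunk. The chunking, the uniform stage times, and the absence of contention are all simplifying assumptions of this example, not properties of SABRE's implementation.

```python
def sequential_time(num_chunks, intra, inter):
    """Each chunk finishes both stages before the next chunk starts."""
    return num_chunks * (intra + inter)


def pipelined_time(num_chunks, intra, inter):
    """Two-stage pipeline: one stage of chunk k+1 overlaps the other of chunk k."""
    if num_chunks == 0:
        return 0.0
    # The first chunk pays both stages; afterwards the slower stage sets the pace.
    return intra + inter + (num_chunks - 1) * max(intra, inter)


chunks, intra, inter = 4, 1.0, 3.0   # illustrative time units
print(sequential_time(chunks, intra, inter))  # 16.0
print(pipelined_time(chunks, intra, inter))   # 13.0
```

In this model the faster intra-node stage is almost entirely hidden behind the slower inter-node stage, which is the effect the pipelined 2D algorithm aims for.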
06 / Results

Performance on Perlmutter

Evaluated on Perlmutter (4× A100 + 4× Cassini NICs per node) using realistic communication matrices extracted from MoE training jobs.

Highly-skewed all-to-allv

Highly-skewed results 128MB Highly-skewed results 256MB
Figure 5. Completion time under highly-skewed patterns (128 MB, 256 MB) scaling 16→256 GPUs. Cray MPICH and NCCL track each other closely, both limited by stragglers. SABRE removes the stragglers, achieving 1.8×, 1.9×, 2.3×, 2.4×, and 2.4× speedup at 16/32/64/128/256 GPUs.

Lightly-skewed all-to-allv

Lightly-skewed results 128MB Lightly-skewed results 256MB
Figure 6. Lightly-skewed patterns (128 MB, 256 MB). SABRE runs 10–30% faster than both baselines. Even SABRE w/o overlap (2D grouping only) beats NCCL and Cray MPICH, but the pipelined overlap is what delivers consistent gains.
Speedup heatmap highly-skewed Speedup heatmap lightly-skewed
Figure 7. SABRE speedup over NCCL across message size × GPU count for highly-skewed (left) and lightly-skewed (right) regimes. Strong, consistent speedups across the board; the only soft spot is very high GPU counts with medium per-process totals, where per-peer blocks shrink below the pipeline's sweet spot.

End-to-end MoE training (Megatron-LM)

End-to-end MoE training speedup
Figure 8. MoE training time vs. PyTorch's default all_to_all_single, with expert parallelism, one expert per GPU, top-k=2 routing. SABRE recomputes MTM and re-selects the algorithm every iteration (overhead included), delivering 1.28×, 1.79×, 1.40× at 16/32/64 GPUs and 1.69×, 1.71× at 128/256 GPUs.
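
To show where the skew in this setup comes from, the sketch below derives one rank's per-destination send counts from top-2 routing decisions, assuming one expert per GPU so that expert `e` lives on rank `e`. The helper name `input_split_sizes` and the routing data are made up for illustration. Counts of this kind, in tokens, are what a caller would pass as the `input_split_sizes` argument of `torch.distributed.all_to_all_single` after sorting tokens by destination.

```python
from collections import Counter


def input_split_sizes(topk_experts, num_experts):
    """Number of token copies this rank sends to each expert (and its GPU).

    topk_experts: one tuple of chosen expert ids per local token.
    """
    counts = Counter(e for experts in topk_experts for e in experts)
    return [counts.get(e, 0) for e in range(num_experts)]


# Six local tokens, top-2 routing over four experts; expert 0 is "hot".
routing = [(0, 1), (0, 2), (0, 1), (0, 3), (0, 1), (2, 0)]
splits = input_split_sizes(routing, 4)
print(splits)  # [6, 3, 2, 1]
```

Because the router's choices change from batch to batch, these counts change every iteration, which is why the regime has to be re-evaluated on each call.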

Bottom line

All-to-allv bottlenecks in MoE training come from two distinct sources: NIC-level load imbalance under skew, and inter-node congestion when balanced. SABRE detects which regime it's in via the MTM ratio and applies a provably-optimal balancing scheme or a pipelined 2D overlap accordingly, achieving up to 2.4× microbenchmark and 1.8× end-to-end speedups on Perlmutter as a drop-in replacement for PyTorch all-to-allv.

07 / Cite

BibTeX

@inproceedings{wei2026sabre,
  title     = {Skew-aware Adaptive All-to-allv Algorithms for
               Dynamic Deep Learning Workloads},
  author    = {Wei, Cunyang and Bhatele, Abhinav},
  booktitle = {Proceedings of the 2026 International Conference on
               Supercomputing (ICS '26)},
  year      = {2026},
  doi       = {10.1145/3797905.3800541}
}