🤖 This page was automatically generated by Claude Code.
IPDPS 2026 · Longitudinal Study

The Case of the Elusive Application Performance on Production GPU Supercomputers

Characterizing and predicting run-to-run performance variability of HPC and AI workloads on Perlmutter (NERSC) and Frontier (OLCF).

Cunyang Wei Keshav Pradeep Abhinav Bhatele
Department of Computer Science, University of Maryland, College Park
  • 3.2× peak observed slowdown (Frontier)
  • 761 application runs collected
  • 4 months of measurements
  • 2 exascale-class systems
01 / Overview

Abstract

Modern HPC facilities increasingly rely on GPU-accelerated clusters to drive both scientific computing and AI workloads. Performance variability is a critical issue in these systems, undermining efficiency and performance reproducibility. While prior studies have extensively analyzed variability in CPU-centric supercomputers, similar large-scale investigations on GPU clusters are lacking.

To address this gap, we set up a longitudinal experiment on two state-of-the-art GPU-based supercomputers: NERSC's Perlmutter and ORNL's Frontier. We benchmark several representative HPC and AI applications and collect detailed performance data including network counters, profiling output, and job scheduler logs. We analyze this data to identify the impact of compute performance variations, allocated node topology, and network conditions on the overall runtime variability. We also use a machine-learning-based approach to identify potential correlations between these factors and to forecast performance variability, providing actionable insights for both system administrators and users.

performance variability · GPGPUs · dragonfly network · AI workloads · XGBoost forecasting
02 / TL;DR

Key Contributions

📈

First longitudinal GPU study

A four-month study on two flagship GPU supercomputers, yielding a comprehensive dataset of performance measurements across diverse applications and system states. It is the first to quantify variability on GPU-accelerated HPC at this scale and duration.

🔬

Root-cause analysis

An in-depth analysis of hardware differences and the impacts of concurrent jobs, job placement, and network conditions, isolating compute variability from communication variability.

🤖

ML-based forecasting

XGBoost models that identify the critical metrics driving variability and predict run-to-run performance across both HPC and AI workloads, accurate even with only a handful of training samples.

🛠️

Actionable insights

Concrete recommendations for system administrators and users to predict and mitigate performance variations in GPU-accelerated environments.

[Panels: Performance variability over time on Perlmutter · Performance variability over time on Frontier]
Figure 1. Variability in performance of four HPC and AI applications relative to their best observed execution times over four months in 2024–2025 (64-node jobs on Perlmutter, top, and Frontier, bottom). We observed up to 1.4× (nanoGPT) and 1.3× (AMG2023) variability on Perlmutter, and up to 2.6× (DeepCAM) and 1.8× (MILC) on Frontier, with outliers reaching 3.2×.
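
The "relative to best observed" normalization used in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `relative_slowdowns` and the runtimes are made up, and the only assumption is that each run is divided by the fastest run of the same application.

```python
from collections import defaultdict


def relative_slowdowns(runs):
    """Return (app, runtime / best runtime of that app) for every run.

    `runs` is a list of (application name, runtime in seconds) pairs.
    """
    best = defaultdict(lambda: float("inf"))
    for app, runtime in runs:
        best[app] = min(best[app], runtime)
    return [(app, runtime / best[app]) for app, runtime in runs]


# Hypothetical runtimes in seconds, for illustration only.
runs = [("MILC", 100.0), ("MILC", 180.0), ("DeepCAM", 50.0), ("DeepCAM", 130.0)]
slowdowns = relative_slowdowns(runs)
for app, factor in slowdowns:
    print(f"{app}: {factor:.2f}x of best")
```

A value of 1.0 marks the best run of an application; the figure's "up to 2.6×" style numbers correspond to the largest such ratio per application.
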
03 / Experimental Setup

Systems & Applications

Both machines are HPE Cray EX systems built on a three-hop dragonfly topology with the HPE Slingshot-11 interconnect. We probe each system with four applications spanning traditional MPI-based HPC codes and modern NCCL/RCCL-based distributed AI training.

NERSC

Perlmutter

  • 1,792 GPU nodes
  • 4× NVIDIA A100 per node (40 / 80 GB HBM2)
  • 64-core AMD EPYC 7763 Milan CPU
  • 4 Cassini NICs · 100 GB/s injection / node
  • Slingshot-11, 3-hop dragonfly
OLCF

Frontier

  • 9,408 nodes · first exascale system
  • 4× AMD MI250X (8 GCDs, 64 GB HBM2e each)
  • 64-core AMD EPYC Trento CPU
  • 4 Slingshot NICs · 100 GB/s / node
  • Slingshot-11, 3-hop dragonfly

Application workloads & inputs (64 nodes)

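| Application | Type | Machine | Key input parameters | Jobs |
|---|---|---|---|---|
| AMG2023 (algebraic multigrid) | MPI / HPC | Perlmutter | -P 4 8 8 -n 128 64 64 -problem 1 | 104 |
| AMG2023 (algebraic multigrid) | MPI / HPC | Frontier | -P 8 8 8 -n 128 64 64 -problem 1 | 168 |
| MILC (lattice QCD) | MPI / HPC | Perlmutter | nx 40 ny 160 nz 320 nt 320 | 78 |
| MILC (lattice QCD) | MPI / HPC | Frontier | nx 80 ny 160 nz 320 nt 320 | 38 |
| DeepCAM (climate CNN) | NCCL/RCCL / AI | Perlmutter | max_epochs 8, batch 2, dali-es-gpu | 67 |
| DeepCAM (climate CNN) | NCCL/RCCL / AI | Frontier | max_epochs 4, batch 2, hdf5 | 109 |
| nanoGPT (20B GPT via AxoNN) | NCCL/RCCL / AI | Perlmutter | gpt2, 20B, batch 8, block 512, ga 256 | 94 |
| nanoGPT (20B GPT via AxoNN) | NCCL/RCCL / AI | Frontier | gpt2, 20B, batch 8, block 512, ga 512 | 103 |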

For each run we capture application runtime, MPI (mpiP) or PyTorch profiles, low-level Cassini NIC counters, and Slurm job logs. Two micro-benchmarks, an FP16 GEMM and an Allreduce, run before each application to probe raw compute and communication performance.

04 / Where variability comes from

Application & GPU Variability

Breaking down execution time reveals a consistent story: compute is stable, but collective communication, especially Allreduce, drives variability.

[Panels: AMG2023 breakdown (Perlmutter) · MILC breakdown (Frontier)]
Figure 2. Execution-time breakdowns for AMG2023 (Perlmutter) and MILC (Frontier). On average ~74–84% of AMG2023's runtime is MPI; the slowest AMG2023 communication phase is ~40% longer than the fastest. MILC's slowest communication is up to 50% (Perlmutter) and 122% (Frontier) higher than the fastest.
[Panels: DeepCAM breakdown (Frontier) · nanoGPT breakdown (Perlmutter)]
Figure 3. Breakdowns for DeepCAM (Frontier) and nanoGPT (Perlmutter). DeepCAM's Allreduce can take up to 4× (Perlmutter) or 24× (Frontier) longer in the slowest runs. nanoGPT's Allreduce is up to 3× slower on Perlmutter, but remarkably consistent on Frontier.
Takeaway 1

Performance variability arises primarily from slowdowns in collective communication, in particular the Allreduce, Test, and Waitall routines. Some routines exhibit long-tail effects.

Is it the GPUs? Isolating compute variability

We ran standalone FP16 GEMM benchmarks at three granularities (individual GPU, individual node, and system-wide) to separate compute variability from communication variability.

[Panels: GEMM per GPU (Perlmutter) · GEMM per node (Perlmutter) · GEMM system-wide (Perlmutter)]
Figure 4. Relative GEMM performance on Perlmutter at GPU (left), node (middle) and system (right) granularity. A single GPU is stable over time (<2.5% window), yet intra-node differences reach 10–17% and system-wide variability reaches up to 28%, even among identical A100 models. A100 80 GB GPUs are both more consistent and ~7% faster than the 40 GB parts.
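
To make the three granularities concrete, here is a small sketch that computes a spread at each level from repeated GEMM throughput samples. The spread definition (gap between fastest and slowest, as a fraction of the fastest) is an assumption chosen for illustration; the page does not state the exact metric behind the percentages in Figure 4. The node names and throughput values are invented.

```python
from statistics import mean


def spread(values):
    """Gap between the fastest and slowest value, as a fraction of the fastest."""
    return (max(values) - min(values)) / max(values)


def gemm_spreads(samples):
    """Compute spreads from {(node, gpu): [throughput samples over time]}.

    Returns (per-GPU spread over time, per-node spread across its GPUs,
    system-wide spread across all GPUs), using each GPU's mean throughput
    for the node-level and system-level comparisons.
    """
    per_gpu = {key: spread(vals) for key, vals in samples.items()}
    gpu_means = {key: mean(vals) for key, vals in samples.items()}
    by_node = {}
    for (node, _gpu), value in gpu_means.items():
        by_node.setdefault(node, []).append(value)
    per_node = {node: spread(vals) for node, vals in by_node.items()}
    system = spread(list(gpu_means.values()))
    return per_gpu, per_node, system


# Hypothetical throughput samples (arbitrary units), for illustration only.
samples = {
    ("n0", 0): [100.0, 99.0],
    ("n0", 1): [90.0, 90.0],
    ("n1", 0): [80.0, 80.0],
    ("n1", 1): [100.0, 100.0],
}
per_gpu, per_node, system = gemm_spreads(samples)
print(per_gpu, per_node, system)
```

In this toy data each GPU barely moves over time while GPUs differ noticeably from one another, which is the qualitative pattern the figure describes.
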
[Panels: Slow GPUs vs. MILC runtime (Perlmutter) · Slow GPUs vs. MILC runtime (Frontier)]
Figure 5. Number of "slowest 1%" GPUs in an allocation vs. MILC runtime. Spearman correlation is just 0.07 (Perlmutter) and 0.08 (Frontier): slow GPUs do not explain application-level variability.
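
For readers unfamiliar with the statistic, Spearman correlation is the Pearson correlation of the ranks of two variables. The sketch below implements it with the standard library only, using average ranks for ties. It is a generic implementation, not the authors' code, and the example data is invented. It raises `ZeroDivisionError` if either input is constant.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical data: slow-GPU count per allocation and runtime in seconds.
slow_gpu_counts = [0, 1, 2, 3]
runtimes = [10.0, 20.0, 30.0, 40.0]
rho = spearman(slow_gpu_counts, runtimes)
print(rho)
```

Values near 0, such as the 0.07 and 0.08 reported in Figure 5, mean the ordering of runs by slow-GPU count says almost nothing about their ordering by runtime.
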
Takeaway 2

While single-GPU performance is relatively stable over time (especially on Perlmutter), there is notable variability across GPUs. GEMM variability is higher on Frontier but with fewer extreme outliers. Crucially, GPU-level variability does not correlate with application-level variability.

05 / Job Placement & Neighbors

Topology and Concurrent Jobs

If communication drives variability, does it matter where a job lands and who it shares the network with? We find the dragonfly group count is irrelevant, but a few noisy neighbors are not.

[Panels: Dragonfly groups vs. nanoGPT runtime (Perlmutter) · Dragonfly groups vs. DeepCAM runtime (Frontier)]
Figure 6. Number of dragonfly groups a job spans vs. runtime (nanoGPT/Perlmutter, DeepCAM/Frontier). Spearman correlations are 0.33 and 0.08: spanning 1 group or 64 groups makes no meaningful difference, a testament to robust UGAL adaptive routing.
Takeaway 3

Dragonfly topology implementations on both Perlmutter and Frontier maintain high performance and scalability, even when computational tasks are allocated to a large number of dragonfly groups.

Not total load, but specific noisy neighbors

Aggregate system-wide node usage barely correlates with runtime (Spearman 0.03 / 0.39). But when we isolate "Top Users" running communication-intensive jobs, a clear threshold effect emerges.

[Panels: Total concurrent nodes vs. runtime (Perlmutter) · Total concurrent nodes vs. runtime (Frontier)]
Figure 7. Total nodes used by all relevant concurrent jobs vs. runtime: weak or no correlation. Aggregate utilization alone fails to explain variability.
[Panels: Top User nodes vs. AMG2023 runtime (Perlmutter) · Top User nodes vs. AMG2023 runtime (Frontier)]
Figure 8. Nodes held by Top Users vs. AMG2023 runtime. Spearman jumps to 0.55 / 0.60. On Perlmutter, AMG2023 slows ≥7% once Top Users hold >300 nodes; ~15% slowdowns coincided with large vasp_gam allocations from one user.
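
One simple way to look for a threshold effect like the one in Figure 8 is to split runs at a node count and compare typical runtimes on either side. The sketch below does this with medians. The choice of medians, the function name, and the sample data are assumptions for illustration; the page does not say how the ≥7% figure was derived.

```python
from statistics import median


def threshold_slowdown(top_user_nodes, runtimes, threshold=300):
    """Fractional slowdown of runs above `threshold` relative to runs at or below it.

    Compares the median runtime of the two groups; returns None if either
    group is empty.
    """
    below = [t for n, t in zip(top_user_nodes, runtimes) if n <= threshold]
    above = [t for n, t in zip(top_user_nodes, runtimes) if n > threshold]
    if not below or not above:
        return None
    return median(above) / median(below) - 1.0


# Hypothetical runs: nodes held by Top Users and AMG2023 runtime in seconds.
top_user_nodes = [50, 120, 280, 320, 400, 650]
runtimes = [100.0, 101.0, 99.0, 108.0, 107.0, 115.0]
slowdown = threshold_slowdown(top_user_nodes, runtimes)
print(f"{slowdown:.1%} slower above the threshold")
```
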
Takeaway 4

Overall system utilization alone does not explain the observed performance degradation; a few specific neighbors with high communication intensity can cause most of the performance variability.

06 / Statistical & ML Analysis

Forecasting Performance with XGBoost

Static correlation of NIC counters tells only part of the story. We train XGBoost regression models on placement, GEMM, Allreduce, and Cassini NIC-counter features to predict runtime and identify what matters most.
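
As a rough illustration of how raw NIC counters might become model features, the sketch below collapses per-NIC readings into one mean and one max feature per counter, mirroring feature names such as `hni_rx_paused_0_mean` that appear on this page. Aggregating across a job's NICs is an assumption, as are the input layout, the counter name `hni_tx_paused_0`, and the values. Model fitting itself is not shown.

```python
from statistics import mean


def nic_features(per_nic_counters):
    """Collapse per-NIC counter readings into <counter>_mean / <counter>_max features.

    `per_nic_counters` is a list with one {counter name: value} dict per NIC
    used by the job. Counters missing on a NIC are skipped for that NIC.
    """
    names = sorted({name for nic in per_nic_counters for name in nic})
    features = {}
    for name in names:
        values = [nic[name] for nic in per_nic_counters if name in nic]
        features[f"{name}_mean"] = mean(values)
        features[f"{name}_max"] = max(values)
    return features


# Hypothetical readings from two NICs, for illustration only.
readings = [
    {"hni_rx_paused_0": 10, "hni_tx_paused_0": 0},
    {"hni_rx_paused_0": 30, "hni_tx_paused_0": 4},
]
features = nic_features(readings)
print(features)
```

The resulting flat dictionary can be joined with placement and micro-benchmark features to form one row per run for a regression model.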

[Panel: NIC counter correlation heatmap (Perlmutter)]
Figure 9. Correlation between application runtime and mean/max NIC counters (Perlmutter). Retry/timeout counters (rh:sct/spt_timeouts) and backpressure counters (hni_rx/tx_paused) show positive correlation with runtime, but counters in isolation don't capture the full complexity.
Takeaway 5

While NIC counters show some correlation with performance variability, they do not fully capture the complexity involved. More sophisticated models are needed to identify the relationships affecting performance.

[Panels: MAPE and Direction Accuracy · Predicted vs. actual (Perlmutter) · Predicted vs. actual (Frontier)]
Figure 10. XGBoost predictions across incremental feature sets. Adding NIC counters sharply lowers MAPE, especially for highly variable DeepCAM, and pushes Direction Accuracy near 1.0. Predicted vs. actual runtimes (right) track closely on both systems. Even with only 7 MILC training runs on Frontier, predictions remain strong, evidence that the model generalizes to the system, not just one app.
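
The two metrics in Figure 10 can be sketched as follows. MAPE is the standard mean absolute percentage error. The page does not define Direction Accuracy, so the version here is an assumed reading: the fraction of runs where the prediction and the actual runtime fall on the same side of a reference runtime. All numbers are invented.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def direction_accuracy(actual, predicted, reference):
    """Fraction of runs where prediction and truth agree on 'slower than reference'.

    This definition is an assumption made for illustration.
    """
    hits = sum((a > reference) == (p > reference) for a, p in zip(actual, predicted))
    return hits / len(actual)


# Hypothetical runtimes in seconds, for illustration only.
actual = [100.0, 120.0, 90.0, 150.0]
predicted = [110.0, 114.0, 90.0, 135.0]
error = mape(actual, predicted)
agreement = direction_accuracy(actual, predicted, reference=100.0)
print(f"MAPE = {error:.2%}, direction accuracy = {agreement:.2f}")
```
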
[Panel: Feature importances]
Figure 11. XGBoost feature importances. On Perlmutter, hni_rx_paused_0_mean and allreduce_2GB dominate (NIC and system congestion). On Frontier, message-matching, ATU cache-hit, and blocked non-posted-path counters dominate: uneven data movement drives variability.
Takeaway 6

Traffic saturation causes Perlmutter's network processing to stall, while Frontier's blocking reads and cache hits expose bottlenecks in local data movement. Network-driven behavior dictates performance variability on both machines.

07 / What to do about it

Insights & Mitigations

🛡️ For System Administrators

  • Several NIC counters are strongly tied to variability. Periodically apply predictive methods to proactively warn performance-sensitive users of degradation.
  • Systems like Perlmutter already collect LDMS network-counter telemetry in real time, which is enough to train a universal degradation predictor.
  • Monitor and cap concurrent communication-heavy jobs, or isolate them to a dedicated dragonfly group to keep the system healthy.

👩‍💻 For Users

  • Variability can be predicted with only a small set of your own profiling data.
  • At job start, predict expected variability from the allocated nodes' state.
  • If significant degradation is forecast, cancel early and resubmit, saving node hours.
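
The user-side steps above could be turned into a simple decision rule like the sketch below. Everything here is hypothetical: the function names, the 15% tolerance, and the idea of comparing a forecast against a typical runtime are illustrative choices, and the predicted runtime would come from whatever model the user has trained.

```python
def forecast_slowdown(predicted_runtime, typical_runtime):
    """Fractional slowdown of the forecast relative to a typical run."""
    return predicted_runtime / typical_runtime - 1.0


def should_resubmit(predicted_runtime, typical_runtime, tolerance=0.15):
    """Return True when the forecast slowdown exceeds the tolerated fraction."""
    return forecast_slowdown(predicted_runtime, typical_runtime) > tolerance


def node_hours_at_risk(nodes, predicted_runtime, typical_runtime):
    """Extra node hours the job would consume if it ran as slowly as forecast."""
    extra_seconds = max(0.0, predicted_runtime - typical_runtime)
    return nodes * extra_seconds / 3600.0


# Hypothetical 64-node job: typically 3600 s, forecast at 4680 s.
decision = should_resubmit(4680.0, 3600.0)
at_risk = node_hours_at_risk(64, 4680.0, 3600.0)
print(decision, at_risk)
```

In practice the tolerance would need to account for the expected queue wait after resubmission, which this sketch ignores.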

Bottom line

Across both HPC and AI workloads on Perlmutter and Frontier, network performance, rather than GPU compute or job placement, is the dominant driver of run-to-run variability. Inherent GPU differences and dragonfly group count have little effect at scale; instead, network contention from a few communication-intensive neighbors, captured by NIC counters, governs performance. An ML model trained on these signals predicts runtime accurately, even for applications with only a few training samples.

08 / Cite

BibTeX

@inproceedings{wei2026elusive,
  title     = {The Case of the Elusive Application Performance on
               Production GPU Supercomputers},
  author    = {Wei, Cunyang and Pradeep, Keshav and Bhatele, Abhinav},
  booktitle = {Proceedings of the IEEE International Parallel and
               Distributed Processing Symposium (IPDPS)},
  year      = {2026},
  note      = {University of Maryland, College Park}
}

Acknowledgment. Supported by NSF Grant No. 2047120. Used NERSC (DOE Office of Science, DE-AC02-05CH11231; awards DDR-ERCAP0034262, ALCC-ERCAP0034775) and the Oak Ridge Leadership Computing Facility (DOE, DE-AC05-00OR22725).