The Case of the Elusive Application Performance on Production GPU Supercomputers

01 / Overview

Abstract

Modern HPC facilities increasingly rely on GPU-accelerated clusters to drive both scientific computing and AI workloads. Performance variability is a critical issue in these systems, undermining efficiency and performance reproducibility. While prior studies have extensively analyzed variability in CPU-centric supercomputers, similar large-scale investigations on GPU clusters are lacking.

To address this gap, we set up a longitudinal experiment on two state-of-the-art GPU-based supercomputers: NERSC's Perlmutter and ORNL's Frontier. We benchmark several representative HPC and AI applications and collect detailed performance data including network counters, profiling output, and job scheduler logs. We analyze this data to identify the impact of compute performance variations, allocated node topology, and network conditions on the overall runtime variability. We also use a machine-learning-based approach to identify potential correlations between these factors, and to forecast performance variability — providing actionable insights for both system administrators and users.

performance variability · GPGPUs · dragonfly network · AI workloads · XGBoost forecasting

02 / TL;DR

Key Contributions

📈 First longitudinal GPU study

A four-month study on two flagship GPU supercomputers, yielding a comprehensive dataset of performance measurements across diverse applications and system states — the first to quantify variability on GPU-accelerated HPC at this scale and duration.

🔬 Root-cause analysis

An in-depth analysis of hardware differences and the impacts of concurrent jobs, job placement, and network conditions — isolating compute variability from communication variability.

🤖 ML-based forecasting

XGBoost models that identify the critical metrics driving variability and predict run-to-run performance across both HPC and AI workloads, accurate even with only a handful of training samples.

🛠️ Actionable insights

Concrete recommendations for system administrators and users to predict and mitigate performance variations in GPU-accelerated environments.

**Figure 1.** Performance variability over time on Perlmutter (top) and Frontier (bottom): performance of four HPC and AI applications relative to their best observed execution times over four months in 2024–2025 (64-node jobs). We observed up to **1.4×** (nanoGPT) and **1.3×** (AMG2023) variability on Perlmutter, and up to **2.6×** (DeepCAM) and **1.8×** (MILC) on Frontier, with outliers reaching **3.2×**.
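
The relative-to-best metric used in Figure 1 can be recomputed from any list of job runtimes. The snippet below is a minimal sketch rather than the authors' analysis code; the function name `relative_slowdown` and the sample runtimes are invented for illustration.

```python
def relative_slowdown(runtimes):
    """Divide every runtime by the fastest one, so the best run maps to 1.0."""
    if not runtimes:
        raise ValueError("need at least one runtime")
    best = min(runtimes)
    if best <= 0:
        raise ValueError("runtimes must be positive")
    return [t / best for t in runtimes]


# Made-up runtimes in seconds, for illustration only (not data from the study).
example_runtimes = [100.0, 104.0, 125.0, 140.0]
slowdowns = relative_slowdown(example_runtimes)
print(f"worst run is {max(slowdowns):.2f}x the best")
```

Normalizing by the best observed run, rather than by the mean, keeps the metric comparable across applications whose absolute runtimes differ widely.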

03 / Experimental Setup

Systems & Applications

Both machines are HPE Cray EX systems built on a three-hop dragonfly topology with the HPE Slingshot-11 interconnect. We probe each system with four applications spanning traditional MPI-based HPC codes and modern NCCL/RCCL-based distributed AI training.

NERSC

Perlmutter

1,792 GPU nodes
4× NVIDIA A100 per node (40 / 80 GB HBM2)
64-core AMD EPYC 7763 Milan CPU
4 Cassini NICs · 100 GB/s injection / node
Slingshot-11, 3-hop dragonfly

OLCF

Frontier

9,408 nodes · first exascale system
4× AMD MI250X (8 GCDs, 64 GB HBM2e each)
64-core AMD EPYC Trento CPU
4 Slingshot NICs · 100 GB/s / node
Slingshot-11, 3-hop dragonfly

Application workloads & inputs (64 nodes)

| Application | Type | Machine | Key input parameters | Jobs |
|---|---|---|---|---|
| AMG2023 (algebraic multigrid) | MPI / HPC | Perlmutter | -P 4 8 8 -n 128 64 64 -problem 1 | 104 |
| AMG2023 (algebraic multigrid) | MPI / HPC | Frontier | -P 8 8 8 -n 128 64 64 -problem 1 | 168 |
| MILC (lattice QCD) | MPI / HPC | Perlmutter | nx 40 ny 160 nz 320 nt 320 | 78 |
| MILC (lattice QCD) | MPI / HPC | Frontier | nx 80 ny 160 nz 320 nt 320 | 38 |
| DeepCAM (climate CNN) | NCCL/RCCL / AI | Perlmutter | max_epochs 8, batch 2, dali-es-gpu | 67 |
| DeepCAM (climate CNN) | NCCL/RCCL / AI | Frontier | max_epochs 4, batch 2, hdf5 | 109 |
| nanoGPT (20B GPT via AxoNN) | NCCL/RCCL / AI | Perlmutter | gpt2, 20B, batch 8, block 512, ga 256 | 94 |
| nanoGPT (20B GPT via AxoNN) | NCCL/RCCL / AI | Frontier | gpt2, 20B, batch 8, block 512, ga 512 | 103 |

For each run we capture application runtime, MPI (mpiP) or PyTorch profiles, low-level Cassini NIC counters, and Slurm job logs. Two micro-benchmarks — an FP16 GEMM and an Allreduce — run before each application to probe raw compute and communication performance.

04 / Where variability comes from

Application & GPU Variability

Breaking down execution time reveals a consistent story: compute is stable, but collective communication — especially Allreduce — drives variability.

**Figure 2.** Execution-time breakdowns for AMG2023 (Perlmutter) and MILC (Frontier). On average, ~74–84% of AMG2023's runtime is spent in MPI; its slowest communication phase is ~40% longer than its fastest. MILC's slowest communication phase is up to 50% (Perlmutter) and 122% (Frontier) longer than its fastest.

**Figure 3.** Execution-time breakdowns for DeepCAM (Frontier) and nanoGPT (Perlmutter). DeepCAM's Allreduce can take up to 4× (Perlmutter) or 24× (Frontier) longer in the slowest runs. nanoGPT's Allreduce is up to 3× slower on Perlmutter but remarkably consistent on Frontier.

Takeaway 1

Performance variability arises primarily from slowdowns in collective communication, in particular the Allreduce, Test, and Waitall routines. Some routines exhibit long-tail effects.

Is it the GPUs? Isolating compute variability

We ran standalone FP16 GEMM benchmarks at three granularities — individual GPU, individual node, and system-wide — to separate compute variability from communication variability.

**Figure 4.** Relative GEMM performance on Perlmutter at GPU (left), node (middle), and system (right) granularity. A single GPU is stable over time (within a <2.5% window), yet intra-node differences reach 10–17% and system-wide variability reaches up to 28%, even among identical A100 models. The A100 80 GB GPUs are both more consistent and ~7% faster than the 40 GB parts.
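
To make the three granularities concrete, the sketch below groups GEMM throughput samples by GPU, by node, and across the whole system. It is an illustration under stated assumptions: the page does not say how the percentages are computed, so the `spread` definition of (max - min) / max, the `(node_id, gpu_id, tflops)` sample layout, and all function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def spread(values):
    """Relative gap between best and worst value: (max - min) / max (assumed metric)."""
    hi, lo = max(values), min(values)
    if hi <= 0:
        raise ValueError("throughput values must be positive")
    return (hi - lo) / hi


def gemm_variability(samples):
    """samples: iterable of (node_id, gpu_id, tflops), one tuple per GEMM run."""
    per_gpu = defaultdict(list)
    for node, gpu, tflops in samples:
        per_gpu[(node, gpu)].append(tflops)

    # GPU granularity: how much a single GPU moves across its repeated runs.
    gpu_spread = {key: spread(vals) for key, vals in per_gpu.items()}

    # Node granularity: gap between the mean throughput of GPUs sharing a node.
    per_node = defaultdict(list)
    for (node, _gpu), vals in per_gpu.items():
        per_node[node].append(mean(vals))
    node_spread = {node: spread(means) for node, means in per_node.items()}

    # System granularity: gap between the mean throughput of all GPUs.
    system_spread = spread([mean(vals) for vals in per_gpu.values()])
    return gpu_spread, node_spread, system_spread
```

Averaging each GPU's runs before comparing GPUs keeps temporal noise from being counted as hardware-to-hardware difference.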

**Figure 5.** Number of "slowest 1%" GPUs in an allocation vs. MILC runtime on Perlmutter and Frontier. The Spearman correlation is just 0.07 (Perlmutter) and 0.08 (Frontier): slow GPUs do *not* explain application-level variability.
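
For readers who want to repeat this kind of check on their own jobs, the sketch below computes a Spearman rank correlation using only the standard library. It is a generic textbook formulation (Pearson correlation of average ranks), not code from the study, and the helper names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

A call such as `spearman(slow_gpu_counts, runtimes)`, with one entry per job in each list, returns a value in [-1, 1]; values near zero, like those reported in Figure 5, indicate no monotonic relationship.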

Takeaway 2

While single-GPU performance is relatively stable over time (especially on Perlmutter), there is notable variability across GPUs. GEMM variability is higher on Frontier but with fewer extreme outliers. Crucially, GPU-level variability does not correlate with application-level variability.

05 / Job Placement & Neighbors

Topology and Concurrent Jobs

If communication drives variability, does where a job lands — and who it shares the network with — matter? We find the dragonfly group count is irrelevant, but a few noisy neighbors are not.

**Figure 6.** Number of dragonfly groups a job spans vs. runtime (nanoGPT on Perlmutter, DeepCAM on Frontier). With Spearman correlations of 0.33 and 0.08, spanning 1 group or 64 groups makes no meaningful difference, a testament to robust UGAL adaptive routing.

Takeaway 3

The dragonfly implementations on both Perlmutter and Frontier maintain high performance and scalability, even when a job's nodes are spread across a large number of dragonfly groups.

Not total load — specific noisy neighbors

Aggregate system-wide node usage barely correlates with runtime (Spearman 0.03 / 0.39). But when we isolate "Top Users" running communication-intensive jobs, a clear threshold effect emerges.

**Figure 7.** Total nodes used by all *relevant* concurrent jobs vs. runtime on Perlmutter and Frontier. The correlation is weak or absent; aggregate utilization alone fails to explain variability.

**Figure 8.** Nodes held by **Top Users** vs. AMG2023 runtime on Perlmutter and Frontier. The Spearman correlation jumps to 0.55 / 0.60. On Perlmutter, AMG2023 slows by ≥7% once Top Users hold >300 nodes; ~15% slowdowns coincided with large vasp_gam allocations from one user.
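
The two quantities compared in Figures 7 and 8 can be derived from scheduler logs by summing the nodes of jobs that overlap a run's time window, either for everyone or for a chosen set of users. The sketch below is an assumption-laden illustration: the `Job` record layout and function name are hypothetical, and because the page does not say how "relevant" jobs or "Top Users" were selected, the user set is simply passed in by the caller.

```python
from collections import namedtuple

# Hypothetical record for one scheduler log entry; times are seconds since epoch.
Job = namedtuple("Job", ["user", "nodes", "start", "end"])


def concurrent_nodes(jobs, start, end, users=None):
    """Sum the nodes of jobs overlapping [start, end).

    If `users` is given, only jobs submitted by those users are counted.
    The caller is expected to exclude the measured job itself from `jobs`.
    """
    total = 0
    for job in jobs:
        overlaps = job.start < end and job.end > start
        if overlaps and (users is None or job.user in users):
            total += job.nodes
    return total


# Made-up log entries, for illustration only.
log = [
    Job("alice", 256, 0, 100),
    Job("bob", 128, 50, 150),
    Job("carol", 512, 200, 300),
]
all_nodes = concurrent_nodes(log, 60, 120)                 # every overlapping job
top_nodes = concurrent_nodes(log, 60, 120, users={"alice"})  # selected users only
```

Computing both numbers per run makes it possible to correlate each against runtime separately, which is the comparison the two figures draw.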

Takeaway 4

Overall system utilization alone does not explain the observed performance degradation; a few specific neighbors with high communication intensity can cause most of the performance variability.

06 / Statistical & ML Analysis

Forecasting Performance with XGBoost

Static correlation of NIC counters tells only part of the story. We train XGBoost regression models on placement, GEMM, Allreduce, and Cassini NIC-counter features to predict runtime and identify what matters most.

**Figure 9.** Correlation heatmap between application runtime and mean/max NIC counters (Perlmutter). Retry/timeout counters (rh:sct/spt_timeouts) and backpressure counters (hni_rx/tx_paused) correlate positively with runtime, but counters in isolation don't capture the full complexity.
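
One plausible way to obtain the mean/max counter features mentioned here is to aggregate each counter over all NICs in a job's allocation. The sketch below assumes that layout; the `<counter>_mean` naming mirrors feature names such as hni_rx_paused_0_mean that appear on this page, while the input dictionary format, the function name, and the sample values are hypothetical.

```python
from statistics import mean


def counter_features(per_nic_readings):
    """Collapse per-NIC counter readings into one mean and one max per counter.

    per_nic_readings: {counter_name: [reading for each NIC in the allocation]}
    Returns a flat dict such as {"hni_rx_paused_0_mean": ..., "hni_rx_paused_0_max": ...}.
    """
    features = {}
    for name, values in per_nic_readings.items():
        if not values:
            continue  # skip counters with no readings instead of guessing a value
        features[f"{name}_mean"] = mean(values)
        features[f"{name}_max"] = max(values)
    return features


# Made-up readings for three NICs, for illustration only.
example = counter_features({"hni_rx_paused_0": [0, 4, 8]})
```

Keeping both statistics lets a downstream model distinguish allocation-wide congestion (high mean) from a single hot NIC (high max with a low mean).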

Takeaway 5

While NIC counters show some correlation with performance variability, they do not fully capture the complexity involved. More sophisticated models are needed to identify the relationships affecting performance.

**Figure 10.** XGBoost predictions across incremental feature sets, scored by MAPE and Direction Accuracy. Adding NIC counters sharply lowers MAPE, especially for the highly variable DeepCAM, and pushes Direction Accuracy near 1.0. Predicted vs. actual runtimes (right) track closely on both systems. Even with only 7 MILC training runs on Frontier, predictions remain strong, evidence that the model generalizes to the *system*, not just one app.
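
The two scores in Figure 10 can be sketched as follows. MAPE is the standard mean absolute percentage error. Direction Accuracy is not defined on this page, so the version below is an assumption: it counts the fraction of runs where the prediction falls on the same side of a caller-supplied baseline runtime as the measurement does. Function names and the sample numbers are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be non-zero")
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


def direction_accuracy(actual, predicted, baseline):
    """Fraction of runs where prediction and measurement agree on being
    slower than `baseline` or not (assumed definition, see lead-in)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum((a > baseline) == (p > baseline) for a, p in zip(actual, predicted))
    return hits / len(actual)


# Made-up runtimes in seconds, for illustration only.
measured = [100.0, 120.0, 150.0]
forecast = [110.0, 114.0, 150.0]
error = mape(measured, forecast)
agreement = direction_accuracy(measured, forecast, baseline=115.0)
```

The two scores answer different questions: MAPE measures how far off the predicted runtime is, while a direction-style score measures whether the model correctly flags a run as degraded, which is what a cancel-or-continue decision needs.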

**Figure 11.** XGBoost feature importances. On Perlmutter, hni_rx_paused_0_mean and allreduce_2GB dominate (NIC and system congestion). On Frontier, message-matching, ATU cache-hit, and blocked non-posted-path counters dominate, indicating that uneven data movement drives variability.

Takeaway 6

Traffic saturation causes Perlmutter's network processing to stall, while Frontier's blocking reads and cache hits expose bottlenecks in local data movement. Network-driven behavior dictates performance variability on both machines.

07 / What to do about it

Insights & Mitigations

🛡️ For System Administrators

Several NIC counters are strongly tied to variability. Periodically apply predictive methods to proactively warn performance-sensitive users of degradation.
Systems like Perlmutter already collect LDMS network-counter telemetry in real time — enough to train a universal degradation predictor.
Monitor and cap concurrent communication-heavy jobs, or isolate them to a dedicated dragonfly group to keep the system healthy.

👩‍💻 For Users

Variability can be predicted with only a small set of your own profiling data.
At job start, predict expected variability from the allocated nodes' state.
If significant degradation is forecast, cancel early and resubmit — saving node hours.
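
The cancel-and-resubmit advice above can be expressed as a simple decision rule. The sketch below is one possible formulation, not a procedure from the study: the 10% tolerance, the function names, and the assumption that a resubmitted job would run at its typical speed are all hypothetical.

```python
def should_resubmit(predicted_s, typical_s, elapsed_s=0.0, tolerance=0.10):
    """Return True when cancelling now is expected to pay off.

    predicted_s: forecast runtime of the current job, in seconds
    typical_s:   runtime of a normal, undegraded run, in seconds
    elapsed_s:   time already spent in the current job, lost on cancel
    tolerance:   fractional slowdown the user is willing to accept (assumed 10%)
    """
    excess = predicted_s - typical_s
    return excess > tolerance * typical_s and excess > elapsed_s


def node_hours_saved(nodes, predicted_s, typical_s, elapsed_s=0.0):
    """Expected node-hours saved by cancelling, assuming the rerun is typical."""
    return nodes * (predicted_s - typical_s - elapsed_s) / 3600.0


# Made-up example: a 64-node job forecast at 1300 s against a typical 1000 s,
# checked 60 s after launch.
if should_resubmit(1300.0, 1000.0, elapsed_s=60.0):
    saved = node_hours_saved(64, 1300.0, 1000.0, elapsed_s=60.0)
```

Comparing the predicted excess against the time already spent guards against cancelling late in a run, when the sunk cost outweighs the forecast slowdown.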

Bottom line

Across both HPC and AI workloads on Perlmutter and Frontier, network performance — not GPU compute or job placement — is the dominant driver of run-to-run variability. Inherent GPU differences and dragonfly group count have little effect at scale; instead, network contention from a few communication-intensive neighbors, captured by NIC counters, governs performance. An ML model trained on these signals predicts runtime accurately, even for applications with only a few training samples.

08 / Cite

BibTeX

@inproceedings{wei2026elusive,
  title     = {The Case of the Elusive Application Performance on
               Production GPU Supercomputers},
  author    = {Wei, Cunyang and Pradeep, Keshav and Bhatele, Abhinav},
  booktitle = {Proceedings of the IEEE International Parallel and
               Distributed Processing Symposium (IPDPS)},
  year      = {2026},
  note      = {University of Maryland, College Park}
}

Acknowledgment. This work was supported by NSF Grant No. 2047120. It used resources of NERSC (DOE Office of Science, DE-AC02-05CH11231; awards DDR-ERCAP0034262 and ALCC-ERCAP0034775) and of the Oak Ridge Leadership Computing Facility (DOE, DE-AC05-00OR22725).