performance variability

Performance variability increasingly affects modern HPC systems, slowing both scientific simulations and AI training. Because tightly coupled jobs synchronize across nodes, even a minor delay on a single node can slow the entire job. Heterogeneous hardware, software jitter, and especially network contention exacerbate the problem, leading to inefficient resource usage and higher operational costs.

Presented the first systematic investigation of network-induced performance variability on modern GPU clusters, revealing that network delays are the dominant factor affecting overall system performance.
Conducted a longitudinal study on production systems such as Perlmutter and Frontier, collecting extensive real-world data across both traditional MPI applications and distributed deep-learning workloads.
Derived actionable strategies for mitigating network bottlenecks, advancing the efficiency of HPC and AI systems.

Published at IPDPS 2026 [PDF] · Best Poster Award Nominee at SC 2025 [Poster]
