performance variability

Sep 2024 – Present

Performance variability is a growing problem on modern HPC systems, affecting both scientific simulations and AI training: even a minor delay on a single node can slow down an entire job. Heterogeneous hardware, software jitter, and especially network contention make the problem worse, leading to inefficient resource usage and higher operational costs.

  • Presented the first systematic investigation of network-induced performance variability on modern GPU clusters, revealing that network delays are the dominant factor affecting overall system performance.
  • Conducted a longitudinal study on production systems such as Perlmutter and Frontier, collecting extensive real-world data across both traditional MPI applications and distributed deep-learning workloads.
  • Derived actionable strategies for mitigating network bottlenecks, advancing the efficiency of HPC and AI systems.

Published at IPDPS 2026 (PDF) · Best Poster Award Nominee at SC 2025 (Poster)

Website