IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs

01 / Overview

Abstract

Matrix multiplication is fundamental to linear algebra and scientific computing. Although mainstream BLAS libraries are highly tuned for large dense GEMM, they perform poorly on irregular inputs. This paper proposes an input-aware tuning framework that accounts for both the application scenario and the computer architecture to deliver high-performance irregular matrix multiplication on ARMv8 and X86 CPUs.

IrGEMM has two stages. The install-time stage uses a computational template to generate high-performance kernels for both general and SIMD-friendly data layouts. The run-time stage applies a tiling algorithm suited to irregular GEMM to select optimal kernels and link them into an execution plan, with load-balanced multi-threading to exploit modern multi-core CPUs. Experiments show significant improvements for irregular GEMM on both ARMv8 and X86 over other mainstream BLAS libraries.

batch GEMM · compact GEMM · TSMM · code generation · dynamic programming · ARMv8 & X86

02 / TL;DR

Key Contributions

🧭

Scenario- & arch-aware framework

A performance-tuning framework that considers both the target application scenario and CPU architecture, with load-balanced multi-threaded scheduling — demonstrated across all three irregular GEMM types.

🏗️

Template-based code generation

The first comprehensive code-generation method for irregular matrix multiplication: computational templates plus instruction-mapping rules emit highly optimized assembly kernels for both ARMv8 and X86.

📐

Input-aware tiling

A dynamic-programming tiling algorithm that minimizes memory access and avoids tiny blocks, selecting optimal kernel combinations for consistent performance at any matrix scale.

🚀

The IrGEMM library

A high-performance library for ARMv8 and Intel Cascade Lake that surpasses MKL, ARMPL, BLIS, LIBXSMM, and OpenBLAS across Batch GEMM, Compact GEMM, and TSMM.

This journal paper unifies and extends the authors' conference works (IAAT, IATF/ICPP, AutoTSMM, LBBGEMM/HPCC): a DP-based tiling algorithm, generalization to all three irregular types on both ARMv8 and X86 AVX512, redesigned kernel templates & code-mapping for Cascade Lake, and a deeper experimental analysis.

03 / The Problem

Three Faces of Irregular GEMM

Many applications — metabolic networks, PDE simulations, finite-element tensor contractions, image processing, Transformer inference — don't fit the large-dense-GEMM mold. IrGEMM unifies three irregular patterns under one framework.

Batch GEMM

Many groups of small matrices (∛(MNK) ≤ 80); matrices share properties within a group but differ across groups. Needs load-balanced multi-thread scheduling.

Compact GEMM

Many same-size small matrices in a SIMD-friendly data layout, so one vector instruction processes several matrices at once — filling the SIMD register width.

TSMM

Tall-and-skinny matrix multiply: one dimension far smaller than the others (short-fat A or tall-skinny B). Needs cache-blocking that adapts to the skinny shape, plus matrix reuse.
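The SIMD-friendly layout described under Compact GEMM can be illustrated with a small sketch. It is not IrGEMM's packing routine: the function names `pack_compact` and `compact_gemm`, the row-major order inside each pack, and the zero-padding of a short final pack are all assumptions made for illustration. The point is only that element (i, j) of several matrices sits contiguously, so one lane-wise multiply-add advances all of them at once.

```python
def pack_compact(matrices, lanes):
    """Interleave same-shape matrices so element (i, j) of `lanes` matrices
    is contiguous. Hypothetical layout: row-major within each pack."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    packed = []
    for start in range(0, len(matrices), lanes):
        group = matrices[start:start + lanes]
        for i in range(rows):
            for j in range(cols):
                vals = [mat[i][j] for mat in group]
                vals += [0.0] * (lanes - len(group))  # zero-pad a short last pack
                packed.extend(vals)
    return packed


def compact_gemm(a, b, m, n, k, lanes):
    """C = A @ B for every matrix in every pack.
    Each inner step is a lane-wise multiply-add, standing in for one SIMD FMA."""
    packs = len(a) // (m * k * lanes)
    c = [0.0] * (packs * m * n * lanes)
    for p in range(packs):
        a0, b0, c0 = p * m * k * lanes, p * k * n * lanes, p * m * n * lanes
        for i in range(m):
            for j in range(n):
                acc = [0.0] * lanes  # one accumulator "vector register"
                for l in range(k):
                    ai = a0 + (i * k + l) * lanes
                    bi = b0 + (l * n + j) * lanes
                    av, bv = a[ai:ai + lanes], b[bi:bi + lanes]
                    acc = [x + y * z for x, y, z in zip(acc, av, bv)]
                ci = c0 + (i * n + j) * lanes
                c[ci:ci + lanes] = acc
    return c
```

With two 2×2 matrices and `lanes=2`, `pack_compact` places the two (0, 0) elements first, then the two (0, 1) elements, and so on, which is the ordering a vector load would consume directly.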

Why classic GEMM fails here

L2-based tiling is moot when matrices fit in cache; packing overhead dominates at small sizes; a few "main kernels" can't cover the many boundary sizes that irregular GEMM hits; and there's no input-aware planner. Each type adds its own twist — load imbalance (Batch), SIMD underuse (Compact), wasted cache & coupled pack/compute (TSMM).

04 / Install-time

Template-based Code Generation

IrGEMM extracts the typical compute patterns of each irregular type into templates, generates "ping-pong" kernels from them, then maps to architecture-specific assembly — so porting to a new CPU only means swapping the multiply/load instructions.

IrGEMM framework overview — **Figure 1.** The IrGEMM framework. Install-time: a computing-template designer, kernel generator, and kernel optimizer turn templates into high-performance assembly for general and SIMD-friendly layouts. Run-time: an input-aware tiling designer, cache/multi-thread optimizers, and an execution-plan generator.

SIMD instruction mapping — **Figure 2.** The SIMD-friendly data layout (left) packs the same element of several matrices contiguously to fill the vector register; instruction-mapping rules (right) convert a kernel template into architecture-specific assembly (e.g. AVX512), so the same template targets both ARMv8 and X86.
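To make the idea of instruction-mapping rules concrete, here is a minimal sketch of a template lowered to two instruction sets. The template format, the `RULES` table, the register and pointer assignments, and the `emit` function are hypothetical and are not taken from the paper; only the mnemonics themselves (`ldr`/`fmla`/`str` on ARMv8 NEON, `vmovups`/`vfmadd231ps` on AVX512) are real instructions. The emitted lines are a fragment: a real kernel would also initialize the accumulator and loop over K.

```python
# Hypothetical ISA-independent template: (op, register index, operand...).
TEMPLATE = [
    ("load", 0, "A", 0),   # vector register 0 <- A[0]
    ("load", 1, "B", 0),   # vector register 1 <- B[0]
    ("fma", 2, 0, 1),      # register 2 += register 0 * register 1
    ("store", 2, "C", 0),  # C[0] <- register 2
]

# Illustrative mapping rules; pointer registers are arbitrary assumptions.
RULES = {
    "armv8": {
        "load": "ldr q{r}, [{p}, #{o}]",
        "fma": "fmla v{d}.4s, v{a}.4s, v{b}.4s",
        "store": "str q{r}, [{p}, #{o}]",
        "ptr": {"A": "x0", "B": "x1", "C": "x2"},
        "vec_bytes": 16,  # 128-bit NEON register
    },
    "avx512": {
        "load": "vmovups zmm{r}, [{p} + {o}]",
        "fma": "vfmadd231ps zmm{d}, zmm{a}, zmm{b}",
        "store": "vmovups [{p} + {o}], zmm{r}",
        "ptr": {"A": "rdi", "B": "rsi", "C": "rdx"},
        "vec_bytes": 64,  # 512-bit ZMM register
    },
}


def emit(template, isa):
    """Lower each template op to one assembly line for the chosen ISA."""
    rules = RULES[isa]
    lines = []
    for op in template:
        kind = op[0]
        if kind == "fma":
            _, d, a, b = op
            lines.append(rules["fma"].format(d=d, a=a, b=b))
        else:
            _, r, mat, idx = op
            offset = idx * rules["vec_bytes"]  # vector index -> byte offset
            lines.append(rules[kind].format(r=r, p=rules["ptr"][mat], o=offset))
    return lines
```

Calling `emit(TEMPLATE, "armv8")` and `emit(TEMPLATE, "avx512")` yields two listings from the same template, which is the property the framework relies on when it targets a new CPU by changing only the mapping table.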

Kernel optimization — **Figure 3.** The kernel optimizer reorders instructions and interleaves loads between compute ops (the "ping-pong" schedule) to eliminate pipeline bubbles and hide memory latency.

05 / Run-time

Input-aware Tiling & Scheduling

A dynamic-programming tiling algorithm picks the kernel combination that minimizes memory access and avoids tiny blocks; for Batch GEMM, a load-balanced scheduler maps task groups to threads.

Irregular GEMM tiling — **Figure 4.** Traditional tiling (left) generates many tiny edge blocks; IrGEMM's input-aware tiling (right) avoids tiny fragments, keeping SIMD registers full and memory access hidden.
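The contrast between traditional and input-aware tiling can be sketched in one dimension. The cost model below is an assumption, not the paper's: each tile of size `s` is charged `s + nr` loads per K step (its own rows plus a shared `nr`-wide strip), and tiles smaller than `tiny` pay an extra `tiny_penalty`. The names `greedy_tiles` and `plan_tiles` and all default parameters are hypothetical. The dynamic program then picks the kernel sizes that cover the edge at minimum total cost.

```python
def greedy_tiles(length, kernel_sizes):
    """Traditional approach: take the largest kernel repeatedly, leftovers last."""
    tiles = []
    for s in sorted(kernel_sizes, reverse=True):
        while length >= s:
            tiles.append(s)
            length -= s
    return tiles


def plan_tiles(length, kernel_sizes, nr=4, tiny=2, tiny_penalty=8):
    """DP over the edge length: best[n] is the cheapest way to cover n rows.
    Cost per tile is hypothetical: (s + nr) loads, plus a penalty if s < tiny."""
    inf = float("inf")
    best = [0] + [inf] * length
    choice = [0] * (length + 1)
    for n in range(1, length + 1):
        for s in kernel_sizes:
            if s > n:
                continue
            cost = best[n - s] + (s + nr) + (tiny_penalty if s < tiny else 0)
            if cost < best[n]:
                best[n], choice[n] = cost, s
    if best[length] == inf:
        raise ValueError("edge cannot be covered by the given kernel sizes")
    tiles, n = [], length
    while n > 0:  # walk the recorded choices back to recover the plan
        tiles.append(choice[n])
        n -= choice[n]
    return tiles
```

For an edge of 9 rows and kernel sizes 1, 2, 3, 4, 5, 8, the greedy split is an 8-row tile followed by a 1-row fragment, while the dynamic program under this cost model selects one 4-row and one 5-row tile.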

Batch GEMM multi-thread scheduling — **Figure 5.** Batch-GEMM scheduling: large matrix groups are split into L1-cache-sized task groups (command queues), then dynamically mapped to threads via an atomic counter for load balance.
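A minimal sketch of this scheduling pattern follows, under stated assumptions: `make_task_groups` and `run_batch` are hypothetical names, the per-matrix footprints and the L1 budget are supplied by the caller, and a lock-protected integer stands in for the hardware atomic counter because the Python standard library has no atomic fetch-and-add. Each thread repeatedly claims the next unprocessed task group, so faster threads naturally take more work.

```python
import threading


def make_task_groups(footprints, l1_bytes):
    """Pack consecutive matrices into groups whose total footprint fits the
    given L1 budget. `footprints` holds hypothetical byte sizes per matrix."""
    groups, current, used = [], [], 0
    for idx, size in enumerate(footprints):
        if current and used + size > l1_bytes:
            groups.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        groups.append(current)
    return groups


def run_batch(task_groups, num_threads, kernel):
    """Dynamically map task groups to threads through a shared counter."""
    next_index = 0
    lock = threading.Lock()  # stand-in for an atomic fetch-and-add
    done_by = [[] for _ in range(num_threads)]

    def worker(tid):
        nonlocal next_index
        while True:
            with lock:  # claim the next task group
                i = next_index
                next_index += 1
            if i >= len(task_groups):
                return
            kernel(task_groups[i])
            done_by[tid].append(i)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done_by  # which thread processed which task group
```

The returned lists show the assignment that actually happened at run time; it varies between runs, but every task group is processed exactly once.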

TSMM cache blocking — **Figure 6.** TSMM needs cache-level blocking (unlike Batch/Compact, which fit in L2). IrGEMM blocks separately for short-fat A vs. tall-skinny B, and pre-packs reused large matrices (e.g. Transformer weights).
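As a rough illustration of shape-aware blocking for the tall-skinny B case, the sketch below sizes a K-block so that the B panel stays within a share of the cache and is reused across every row of A. The functions `choose_kc` and `tsmm_tall_skinny_b`, the 50% cache share, and the 4-byte element size are assumptions for illustration; the paper's actual blocking parameters and packing scheme are not reproduced here.

```python
def choose_kc(n, cache_bytes, elem_bytes=4, fraction=0.5):
    """Pick a K-block size so a kc x n panel of B fits in a share of the cache.
    The share (`fraction`) is a hypothetical tuning knob."""
    kc = int(cache_bytes * fraction) // (n * elem_bytes)
    return max(1, kc)


def tsmm_tall_skinny_b(a, b, kc):
    """C = A @ B where B (k x n) is tall and skinny.
    B is processed in kc-row panels; each panel is reused for all rows of A."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, kc):
        b_panel = b[k0:k0 + kc]  # the block meant to stay cache-resident
        for i in range(m):
            a_row, c_row = a[i], c[i]
            for l, b_row in enumerate(b_panel):
                a_il = a_row[k0 + l]
                for j in range(n):
                    c_row[j] += a_il * b_row[j]
    return c
```

For example, with `n = 4` columns, 4-byte elements, and a 32 KiB cache, `choose_kc` returns 1024, so B would be consumed in panels of 1024 rows.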

06 / Results

Performance on ARMv8 & X86

Evaluated on Kunpeng 920 (ARMv8.2, 48 cores) and Intel Xeon Gold 6240 (Cascade Lake, 18 cores) against MKL, ARMPL, BLIS, LIBXSMM, and OpenBLAS.

Batch GEMM — multi-threaded

ARM batch SGEMM and ARM batch DGEMM: **Figure 7.** Multi-threaded Batch GEMM on ARMv8 vs. ARMPL, BLIS, LIBXSMM. IrGEMM keeps scaling where the baselines stall past 16 threads, reaching up to 4.5×/8.7×/8.4×/9.3×/13.1×/14.6× over ARMPL at 1/4/8/16/32/48 threads.

X86 batch SGEMM and X86 batch DGEMM: **Figure 8.** Multi-threaded Batch GEMM on X86 (Cascade Lake) vs. Intel MKL, BLIS, LIBXSMM. IrGEMM beats MKL on all non-SGEMM types, by up to 3.3×/2.9×/3.2×/4.4× at 1/4/8/16 threads.

Compact GEMM (NN mode)

ARM compact SGEMM and ARM compact DGEMM: **Figure 9.** Compact GEMM on ARMv8 vs. ARMPL/BLIS/LIBXSMM. For DGEMM, IrGEMM is up to 3.4×/3.2×/2.4× faster respectively; the SIMD-friendly layout dominates at small sizes.

X86 compact SGEMM and X86 compact DGEMM: **Figure 10.** Compact GEMM on X86 vs. MKL compact, BLIS, LIBXSMM. For DGEMM, IrGEMM is up to 1.4×/7.6×/4.8× over MKL/BLIS/LIBXSMM; it matches MKL's compact interface (both use a SIMD-friendly layout) and beats the rest decisively.

TSMM (multi-threaded)

ARM TSMM short-fat A and ARM TSMM tall-skinny B: **Figure 11.** Multi-threaded TSMM for short-fat A and tall-skinny B on ARM (48 threads) and X86 (16 threads). On ARM, DTSMM reaches up to 9.6×/9.5×/4.5× over ARMPL/BLIS/OpenBLAS; the shape-aware cache blocking especially helps where BLIS leaves cache underused.

Bottom line

Irregular GEMM is limited by both the architecture and its diverse application scenarios. IrGEMM unifies Batch GEMM, Compact GEMM, and TSMM under one input-aware framework — template-based code generation for ARMv8 and X86, a DP-based input-aware tiling algorithm, and load-balanced multi-threading — surpassing MKL, ARMPL, BLIS, LIBXSMM, and OpenBLAS across all three irregular types on both architectures.

07 / Cite

BibTeX

@article{wei_irgemm,
  title   = {IrGEMM: An Input-Aware Tuning Framework for Irregular
             GEMM on ARM and X86 CPUs},
  author  = {Wei, Cunyang and Jia, Haipeng and Zhang, Yunquan and
             Yao, Jianyu and Li, Chendi and Cao, Wenxuan},
  journal = {IEEE Transactions on Parallel and Distributed Systems
             (TPDS)},
  note    = {ICT, Chinese Academy of Sciences}
}