This page was automatically generated by Claude Code.
IEEE TPDS · Irregular GEMM

IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs

A unified, input-aware framework that generates high-performance kernels for Batch GEMM, Compact GEMM, and TSMM across ARMv8 and X86, adapting to both the application scenario and the CPU architecture.

Cunyang Wei, Haipeng Jia, Yunquan Zhang, Jianyu Yao, Chendi Li, Wenxuan Cao
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences
14.6× · 48-thread batch vs ARMPL
7.6× · Compact DGEMM vs BLIS (X86)
3 · Irregular GEMM types unified
2 · Architectures: ARMv8 & X86
01 / Overview

Abstract

Matrix multiplication is fundamental to linear algebra and scientific computing. Although mainstream BLAS libraries are highly tuned for large dense GEMM, they perform poorly on irregular inputs. This paper proposes an input-aware tuning framework that accounts for both the application scenario and the computer architecture to deliver high-performance irregular matrix multiplication on ARMv8 and X86 CPUs.

IrGEMM has two stages. The install-time stage uses a computational template to generate high-performance kernels for both general and SIMD-friendly data layouts. The run-time stage applies a tiling algorithm suited to irregular GEMM to select optimal kernels and link them into an execution plan, with load-balanced multi-threading to exploit modern multi-core CPUs. Experiments show significant improvements for irregular GEMM on both ARMv8 and X86 over other mainstream BLAS libraries.

batch GEMM · compact GEMM · TSMM · code generation · dynamic programming · ARMv8 & X86
02 / TL;DR

Key Contributions

Scenario- & arch-aware framework

A performance-tuning framework that considers both the target application scenario and CPU architecture, with load-balanced multi-threaded scheduling, demonstrated across all three irregular GEMM types.

Template-based code generation

The first comprehensive code-generation method for irregular matrix multiplication: computational templates plus instruction-mapping rules emit highly optimized assembly kernels for both ARMv8 and X86.

Input-aware tiling

A dynamic-programming tiling algorithm that minimizes memory access and avoids tiny blocks, selecting optimal kernel combinations for consistent performance at any matrix scale.

The IrGEMM library

A high-performance library for ARMv8 and Intel Cascade Lake that surpasses MKL, ARMPL, BLIS, LIBXSMM, and OpenBLAS across Batch GEMM, Compact GEMM, and TSMM.

This journal paper unifies and extends the authors' conference works (IAAT, IATF/ICPP, AutoTSMM, LBBGEMM/HPCC): a DP-based tiling algorithm, generalization to all three irregular types on both ARMv8 and X86 AVX512, redesigned kernel templates & code-mapping for Cascade Lake, and a deeper experimental analysis.

03 / The Problem

Three Faces of Irregular GEMM

Many applications (metabolic networks, PDE simulations, finite-element tensor contractions, image processing, Transformer inference) don't fit the large-dense-GEMM mold. IrGEMM unifies three irregular patterns under one framework.

Batch GEMM

Many groups of small matrices (∛(MNK) ≤ 80); matrices share properties within a group but differ across groups. Needs load-balanced multi-thread scheduling.

Compact GEMM

Many same-size small matrices in a SIMD-friendly data layout, so one vector instruction processes several matrices at once, filling the SIMD register width.

TSMM

Tall-and-skinny matrix multiply: one dimension far smaller than the others (short-fat A or tall-skinny B). Needs cache-blocking that adapts to the skinny shape, plus matrix reuse.
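
The Batch GEMM size criterion above, ∛(MNK) ≤ 80, can be checked with integer arithmetic alone. The sketch below is illustrative only: the names `is_small_gemm` and `group_by_shape` are hypothetical, and grouping by the (M, N, K) shape is an assumption about what "share properties within a group" means.

```python
from collections import defaultdict

def is_small_gemm(m: int, n: int, k: int, limit: int = 80) -> bool:
    """True when cbrt(M*N*K) <= limit, tested as M*N*K <= limit**3."""
    # Comparing integer products avoids floating-point cube-root rounding.
    return m * n * k <= limit ** 3

def group_by_shape(problems):
    """Group (M, N, K) problems so each group holds identical shapes.

    Assumption: a 'group' here is simply the set of problems with one shape.
    Returns a dict mapping shape -> list of indices into `problems`.
    """
    groups = defaultdict(list)
    for index, shape in enumerate(problems):
        groups[shape].append(index)
    return dict(groups)
```

For example, `group_by_shape([(8, 8, 8), (4, 4, 4), (8, 8, 8)])` places the first and third problems in one group, which a scheduler could then split into tasks.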

Why classic GEMM fails here

L2-based tiling is moot when matrices fit in cache; packing overhead dominates at small sizes; a few "main kernels" can't cover the many boundary sizes that irregular GEMM hits; and there's no input-aware planner. Each type adds its own twist: load imbalance (Batch), SIMD underuse (Compact), wasted cache & coupled pack/compute (TSMM).

04 / Install-time

Template-based Code Generation

IrGEMM extracts the typical compute patterns of each irregular type into templates, generates "ping-pong" kernels from them, then maps them to architecture-specific assembly, so porting to a new CPU only means swapping the multiply/load instructions.
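
The idea of one template plus per-architecture mapping rules can be sketched as a small text generator. This is not IrGEMM's generator: the template encoding, the `RULES` table, the base registers, and the function `emit` are all hypothetical. The mnemonics (`ldr`/`fmla` for ARMv8 NEON, `vmovups`/`vfmadd231ps` for AVX-512) are real instructions, used here only for single-precision illustration.

```python
# Abstract template: ("load", dst_reg, matrix, vector_index) or ("fma", acc, a, b).
TEMPLATE = [
    ("load", 0, "A", 0),
    ("load", 1, "B", 0),
    ("fma", 2, 0, 1),   # acc2 += reg0 * reg1
]

# Hypothetical mapping rules: only the load and multiply-add strings differ.
RULES = {
    "armv8": {
        "load": "ldr q{dst}, [{base}, #{off}]",
        "fma": "fmla v{dst}.4s, v{a}.4s, v{b}.4s",
    },
    "avx512": {
        "load": "vmovups zmm{dst}, [{base} + {off}]",
        "fma": "vfmadd231ps zmm{dst}, zmm{a}, zmm{b}",
    },
}
# Assumed base-pointer registers and vector widths in bytes.
BASE = {"armv8": {"A": "x0", "B": "x1"}, "avx512": {"A": "rdi", "B": "rsi"}}
VEC_BYTES = {"armv8": 16, "avx512": 64}

def emit(template, arch):
    """Translate abstract template ops into assembly text for `arch`."""
    rules = RULES[arch]
    lines = []
    for op in template:
        if op[0] == "load":
            _, dst, matrix, index = op
            offset = index * VEC_BYTES[arch]
            lines.append(rules["load"].format(dst=dst, base=BASE[arch][matrix], off=offset))
        elif op[0] == "fma":
            _, dst, a, b = op
            lines.append(rules["fma"].format(dst=dst, a=a, b=b))
        else:
            raise ValueError(f"unknown template op: {op[0]}")
    return lines
```

Calling `emit(TEMPLATE, "armv8")` and `emit(TEMPLATE, "avx512")` yields two instruction sequences from the same template, which is the porting property the paragraph above describes.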

IrGEMM framework overview
Figure 1. The IrGEMM framework. Install-time: a computing-template designer, kernel generator, and kernel optimizer turn templates into high-performance assembly for general and SIMD-friendly layouts. Run-time: an input-aware tiling designer, cache/multi-thread optimizers, and an execution-plan generator.
SIMD-friendly data layout · SIMD instruction mapping
Figure 2. The SIMD-friendly data layout (left) packs the same element of several matrices contiguously to fill the vector register; instruction-mapping rules (right) convert a kernel template into architecture-specific assembly (e.g. AVX512), so the same template targets both ARMv8 and X86.
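
The layout in Figure 2 (left) can be illustrated with a small packing routine. This is a minimal sketch under assumptions: `pack_compact` is a hypothetical name, matrices are row-major lists of lists with identical shape, `lanes` stands for the number of elements per vector register, and incomplete groups are zero-padded.

```python
def pack_compact(mats, lanes):
    """Interleave same-shape matrices so element (i, j) of `lanes`
    consecutive matrices is contiguous in the output."""
    rows, cols = len(mats[0]), len(mats[0][0])
    packed = []
    for start in range(0, len(mats), lanes):
        group = mats[start:start + lanes]
        for i in range(rows):
            for j in range(cols):
                for lane in range(lanes):
                    # Zero-pad when the last group has fewer than `lanes` matrices.
                    packed.append(group[lane][i][j] if lane < len(group) else 0.0)
    return packed
```

With two 2×2 matrices and `lanes=2`, element (0, 0) of both matrices comes first, then element (0, 1) of both, and so on, so a single vector load picks up the same element from every matrix in the group.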
Kernel optimization
Figure 3. The kernel optimizer reorders instructions and interleaves loads between compute ops (the "ping-pong" schedule) to eliminate pipeline bubbles and hide memory latency.
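
The ordering behind a "ping-pong" schedule can be shown with two alternating operand buffers. The sketch below only illustrates instruction order: Python does not overlap the load with the multiply-add, whereas in an assembly kernel the load for step k+1 would be in flight while step k computes. The name `pingpong_dot` is hypothetical.

```python
def pingpong_dot(a, b):
    """Dot product with loads for step k+1 issued before the compute of step k."""
    if not a:
        return 0.0
    buffers = [None, None]
    buffers[0] = (a[0], b[0])          # prologue: fill the "ping" buffer
    acc = 0.0
    for k in range(len(a)):
        cur, nxt = k % 2, 1 - (k % 2)
        if k + 1 < len(a):
            # Load the next operands into the idle buffer first...
            buffers[nxt] = (a[k + 1], b[k + 1])
        # ...then compute on the buffer that was filled one step earlier.
        x, y = buffers[cur]
        acc += x * y
    return acc
```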
05 / Run-time

Input-aware Tiling & Scheduling

A dynamic-programming tiling algorithm picks the kernel combination that minimizes memory access and avoids tiny blocks; for Batch GEMM, a load-balanced scheduler maps task groups to threads.
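
A one-dimensional version of such a dynamic program is sketched below. It is not the paper's algorithm: the function `plan_tiles` and the per-kernel cost table are hypothetical, and the costs stand in for whatever memory-access model the real planner uses. Tiny kernels are given a disproportionately high cost so the plan avoids them.

```python
def plan_tiles(m, tile_cost):
    """Split a dimension of length m into available kernel sizes at minimum
    total cost. `tile_cost` maps kernel size -> assumed cost of one tile."""
    inf = float("inf")
    best = [0.0] + [inf] * m      # best[i]: cheapest cost to cover length i
    choice = [0] * (m + 1)        # choice[i]: last tile size used for length i
    for i in range(1, m + 1):
        for size, cost in tile_cost.items():
            if size <= i and best[i - size] + cost < best[i]:
                best[i] = best[i - size] + cost
                choice[i] = size
    if best[m] == inf:
        raise ValueError("dimension cannot be covered by the given kernels")
    tiles, i = [], m
    while i > 0:                  # walk back through the recorded choices
        tiles.append(choice[i])
        i -= choice[i]
    return sorted(tiles, reverse=True), best[m]

# Hypothetical costs: the size-1 kernel is nearly as expensive as size 4.
costs = {8: 8.0, 5: 5.5, 4: 4.5, 1: 4.0}
tiles, total = plan_tiles(9, costs)
```

With these assumed costs, a greedy split of 9 into 8 + 1 costs 12.0, while the dynamic program selects 5 + 4 at a cost of 10.0, avoiding the tiny edge block.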

Traditional tiling · Irregular GEMM tiling
Figure 4. Traditional tiling (left) generates many tiny edge blocks; IrGEMM's input-aware tiling (right) avoids tiny fragments, keeping SIMD registers full and memory access hidden.
Batch GEMM multi-thread scheduling
Figure 5. Batch-GEMM scheduling: large matrix groups are split into L1-cache-sized task groups (command queues), then dynamically mapped to threads via an atomic counter for load balance.
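
The dynamic mapping in Figure 5 can be sketched with worker threads that claim task groups from a shared counter. This is an illustration under assumptions: `run_task_groups` is a hypothetical name, each task group is modeled as a callable, and a `threading.Lock` stands in for the atomic fetch-and-add a native implementation would use.

```python
import threading

def run_task_groups(task_groups, num_threads):
    """Run each callable in `task_groups` once; threads claim indices
    dynamically so faster threads pick up more work."""
    results = [None] * len(task_groups)
    next_index = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_index
        while True:
            with lock:                      # emulates an atomic fetch-and-add
                index = next_index
                next_index += 1
            if index >= len(task_groups):
                return                      # queue drained
            results[index] = task_groups[index]()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because indices are claimed on demand rather than assigned up front, a thread that finishes a cheap task group immediately takes the next one, which is the load-balancing effect the caption describes.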
TSMM cache blocking
Figure 6. TSMM needs cache-level blocking (unlike Batch/Compact, which fit in L2). IrGEMM blocks separately for short-fat A vs. tall-skinny B, and pre-packs reused large matrices (e.g. Transformer weights).
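
Pre-packing a reused matrix and blocking along its long dimension can be sketched as follows. The names `prepack` and `tsmm`, the panel format, and the block size `kb` are hypothetical; a real implementation would derive the block size from the cache capacity and run an assembly kernel per panel.

```python
def prepack(b, kb):
    """Pack a K x N matrix once into K-blocks of at most `kb` rows.
    Each panel is stored column by column so the inner loop reads
    contiguous data."""
    k_total, n = len(b), len(b[0])
    panels = []
    for k0 in range(0, k_total, kb):
        k1 = min(k0 + kb, k_total)
        panels.append((k0, [[b[k][j] for k in range(k0, k1)] for j in range(n)]))
    return panels

def tsmm(a, panels, n):
    """Compute C = A * B from an M x K matrix `a` and pre-packed panels of B."""
    c = [[0.0] * n for _ in range(len(a))]
    for k0, panel in panels:            # one cache-sized block of B at a time
        for i, row in enumerate(a):
            for j in range(n):
                column = panel[j]
                c[i][j] += sum(row[k0 + t] * column[t] for t in range(len(column)))
    return c
```

The packed panels can be built once and passed to `tsmm` for every new `a`, which mirrors the reuse case mentioned in the caption (for example, fixed weights applied to many inputs).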
06 / Results

Performance on ARMv8 & X86

Evaluated on Kunpeng 920 (ARMv8.2, 48 cores) and Intel Xeon Gold 6240 (Cascade Lake, 18 cores) against MKL, ARMPL, BLIS, LIBXSMM, and OpenBLAS.

Batch GEMM (multi-threaded)

ARM batch SGEMM · ARM batch DGEMM · ARM batch CGEMM · ARM batch ZGEMM
Figure 7. Multi-threaded Batch GEMM on ARMv8 vs. ARMPL, BLIS, LIBXSMM. IrGEMM scales where the baselines stall past 16 threads, reaching up to 4.5×/8.7×/8.4×/9.3×/13.1×/14.6× over ARMPL at 1/4/8/16/32/48 threads.
X86 batch SGEMM · X86 batch DGEMM · X86 batch CGEMM · X86 batch ZGEMM
Figure 8. Multi-threaded Batch GEMM on X86 (Cascade Lake) vs. Intel MKL, BLIS, LIBXSMM. IrGEMM beats MKL on all non-SGEMM types, by up to 3.3×/2.9×/3.2×/4.4× at 1/4/8/16 threads.

Compact GEMM (NN mode)

ARM compact SGEMM · ARM compact DGEMM · ARM compact CGEMM · ARM compact ZGEMM
Figure 9. Compact GEMM on ARMv8 vs. ARMPL/BLIS/LIBXSMM. For DGEMM, IrGEMM is up to 3.4×/3.2×/2.4× faster, respectively; the SIMD-friendly layout dominates at small sizes.
X86 compact SGEMM · X86 compact DGEMM · X86 compact CGEMM · X86 compact ZGEMM
Figure 10. Compact GEMM on X86 vs. MKL compact, BLIS, LIBXSMM. For DGEMM, IrGEMM is up to 1.4×/7.6×/4.8× over MKL/BLIS/LIBXSMM; it matches MKL's compact interface (both use a SIMD-friendly layout) and beats the rest decisively.

TSMM (multi-threaded)

ARM TSMM short-fat A · ARM TSMM tall-skinny B · X86 TSMM short-fat A · X86 TSMM tall-skinny B
Figure 11. Multi-threaded TSMM for short-fat A and tall-skinny B on ARM (48 threads) and X86 (16 threads). On ARM, DTSMM reaches up to 9.6×/9.5×/4.5× over ARMPL/BLIS/OpenBLAS; the shape-aware cache blocking especially helps where BLIS leaves cache underused.

Bottom line

Irregular GEMM is limited by both the architecture and its diverse application scenarios. IrGEMM unifies Batch GEMM, Compact GEMM, and TSMM under one input-aware framework (template-based code generation for ARMv8 and X86, a DP-based input-aware tiling algorithm, and load-balanced multi-threading), surpassing MKL, ARMPL, BLIS, LIBXSMM, and OpenBLAS across all three irregular types on both architectures.

07 / Cite

BibTeX

@article{wei_irgemm,
  title   = {IrGEMM: An Input-Aware Tuning Framework for Irregular
             GEMM on ARM and X86 CPUs},
  author  = {Wei, Cunyang and Jia, Haipeng and Zhang, Yunquan and
             Yao, Jianyu and Li, Chendi and Cao, Wenxuan},
  journal = {IEEE Transactions on Parallel and Distributed Systems
             (TPDS)},
  note    = {ICT, Chinese Academy of Sciences}
}