🤖 This page was automatically generated by Claude Code.
ICPP 2022 · Compact Batched BLAS

IATF: An Input-Aware Tuning Framework for Compact BLAS Based on ARMv8 CPUs

A two-stage framework that generates and auto-tunes high-performance compact batched GEMM and TRSM kernels for large groups of fixed-size small matrices on ARMv8.

Cunyang Wei · Haipeng Jia* · Yunquan Zhang · Liusha Xu · Ji Qi
SKL of Processors, ICT, Chinese Academy of Sciences · UCAS · Huawei Technologies
28× STRSM vs OpenBLAS · 21× SGEMM vs OpenBLAS · up to 8× vs ARMPL batched GEMM · 4 data types, all transpose modes
01 / Overview

Abstract

Mainstream BLAS libraries deliver high performance on large-scale GEMM and TRSM, but they fall short on batch operations over large groups of fixed-size small matrices, a pattern widely used across scientific computing. We propose IATF, an input-aware tuning framework that delivers near-optimal performance for large groups of fixed-size small GEMM and TRSM operations on the ARMv8 architecture.

IATF has two stages. At install time, using a SIMD-friendly data layout, it builds computing-kernel templates for GEMM and TRSM, derives optimal kernel sizes to raise the computational-instruction ratio, and applies kernel optimizations plus an optimized data-packing strategy to cut memory-access overhead. At run time, it generates an efficient execution plan from the input matrix properties. Experiments show significant improvements in both GEMM and TRSM over other mainstream BLAS libraries.

compact batched BLAS · auto-tune · code generation · GEMM · TRSM · ARMv8 / Kunpeng 920
02 / TL;DR

Key Contributions

🧩

Input-aware tuning framework

A high-performance framework for large groups of small-matrix operations that selects an optimal execution plan based on input matrix properties — size, transpose mode, side, triangle, and unit-diagonal.

⚙️

Kernel & packing design

A SIMD-friendly data layout plus template-generated kernels and data-packing methods for compact GEMM and compact TRSM, dramatically improving performance and reducing edge-processing waste.

🚀

IATF library on ARMv8

A complete library for fixed-size small GEMM and TRSM on the Kunpeng 920, competitive even with Intel MKL's compact BLAS (measured as % of peak performance).

[Image: Overview of the input-aware tuning framework]
Figure 1. Overview of IATF: an install-time stage (packing-kernel designer, computing-kernel designer, kernel optimizer) generates highly-optimized kernels; a run-time stage (batch counter, pack selector, execution-plan generator) assembles them into an optimal plan based on the input.
03 / Why small matrices are hard

The Problem with Traditional Methods

Applications from PDE simulations to high-order CFD and ML process huge groups of tiny matrices. Four properties break classic large-GEMM optimizations:

① SIMD underuse

A very small matrix can't fill the width of a SIMD register under traditional layouts.

② Edge overhead

Boundary processing is a large fraction of work for small matrices, not a negligible tail.

③ Tiling is moot

A small matrix fits entirely in L1 cache, so classic multi-level tiling buys nothing.

④ No input-aware tuning

No existing framework generates high-performance plans across the many small sizes that occur in practice.

[Images: SIMD-friendly data layout · TRSM tiling method]
Figure 2. Left: the SIMD-friendly data layout places the same element of P consecutive matrices contiguously (P=4 for FP32 on Kunpeng 920's 128-bit SIMD), so one vector instruction processes P matrices. Right: the block decomposition of TRSM into triangular solves and rectangular (GEMM-like) blocks.
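The layout in Figure 2 can be illustrated with a small pure-Python sketch. It is not the IATF implementation: the function names `to_compact` and `compact_index`, the column-major input, and the requirement that the batch size be a multiple of P are all assumptions made for illustration.

```python
def to_compact(mats, rows, cols, P):
    """Interleave a batch of column-major matrices into a compact layout.

    mats: list of flat lists, each of length rows*cols (column-major).
    Returns one flat list in which element (i, j) of P consecutive
    matrices is stored contiguously, so one P-lane vector load would
    fetch the same element of P matrices.
    """
    if len(mats) % P != 0:
        raise ValueError("this sketch assumes the batch is a multiple of P")
    out = []
    for g in range(0, len(mats), P):        # one group of P matrices
        for j in range(cols):
            for i in range(rows):
                for lane in range(P):       # one SIMD lane per matrix
                    out.append(mats[g + lane][j * rows + i])
    return out


def compact_index(b, i, j, rows, cols, P):
    """Position of element (i, j) of matrix b inside the compact buffer."""
    group, lane = divmod(b, P)
    return ((group * cols + j) * rows + i) * P + lane


# Eight 2x2 matrices; entry (i, j) of matrix b is encoded as 100*b + 10*i + j.
batch = [[100 * b + 10 * i + j for j in range(2) for i in range(2)]
         for b in range(8)]
compact = to_compact(batch, rows=2, cols=2, P=4)
print(compact[:4])  # [0, 100, 200, 300]: element (0, 0) of matrices 0..3
```

With P=4 (the FP32 case quoted in the caption), the first four entries of the buffer are element (0, 0) of four different matrices, which is what lets a single vector instruction advance four independent matrix operations.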
04 / Approach

SIMD-friendly Layout & Smaller Kernels

Because P matrices fill one SIMD register, IATF can use a much smaller register-level kernel than traditional tiling — which in turn slashes the number of awkward edge blocks.

[Images: Traditional tiling of 15×15 DGEMM · Compact tiling of 15×15 DGEMM]
Figure 3. Tiling a 15×15 DGEMM. Traditional tiling (left) leaves many small edge blocks whose kernels can't fill the SIMD registers — the edge cost can exceed the main kernel. The compact layout (right) uses a small 4×4 kernel that processes 4×4 blocks of 4 matrices at once, minimizing edge cases.
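As a rough illustration of the compact side of Figure 3, the sketch below enumerates the blocks produced when a 15×15 output is tiled with a 4×4 kernel. The helper names `tile_extents` and `tiling_summary` are hypothetical, and the traditional side is not modeled because its kernel shape is not given here.

```python
def tile_extents(n, k):
    """Split a dimension of length n into chunks of k plus one remainder."""
    full, rem = divmod(n, k)
    return [k] * full + ([rem] if rem else [])


def tiling_summary(m, n, mc, nc):
    """Count full and edge tiles when an m x n output uses an mc x nc kernel."""
    tiles = [(r, c) for r in tile_extents(m, mc) for c in tile_extents(n, nc)]
    full = sum(1 for t in tiles if t == (mc, nc))
    return {
        "tiles": len(tiles),
        "full": full,
        "edge": len(tiles) - full,
        "kernel_shapes": sorted(set(tiles)),  # distinct kernels needed
    }


summary = tiling_summary(15, 15, 4, 4)
print(summary["full"], summary["edge"])  # 9 7
print(summary["kernel_shapes"])          # [(3, 3), (3, 4), (4, 3), (4, 4)]
```

Only four distinct kernel shapes are needed for this size, and in the compact layout each of them still operates on full vectors, because the vector dimension runs across matrices rather than along a matrix row or column.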
05 / Install-time stage

Kernel Templates, Sizing & Packing

Maximizing the compute-to-memory-access ratio

Kernels are built from 6 reusable templates (I, M1, M2, E, SAVE, SUB) that implement a "ping-pong" schedule: the next iteration's data is loaded during the current compute step to avoid pipeline bubbles. Maximizing the compute-to-memory-access ratio (CMAR) under the 32-register budget (2·mc + 2·nc + mc·nc ≤ 32, where mc×nc is the kernel size) yields an optimal 4×4 kernel for real GEMM and 3×2 for complex GEMM. IATF then generates kernels for every edge size.
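The kernel-size choice can be reproduced with a short brute-force search. This is a sketch under one stated assumption: CMAR is modeled as mc·nc multiply-accumulates per mc + nc loaded vectors in each inner-loop step, a definition that is not spelled out above. The register constraint is the one quoted in the text.

```python
from fractions import Fraction


def registers_needed(mc, nc):
    """Register count from the text: double-buffered A and B plus the C tile."""
    return 2 * mc + 2 * nc + mc * nc


def cmar(mc, nc):
    """Assumed CMAR model: mc*nc multiply-accumulates per (mc + nc) loads."""
    return Fraction(mc * nc, mc + nc)


def best_kernels(budget=32):
    """Return every (mc, nc) that maximizes CMAR within the register budget."""
    candidates = [(mc, nc)
                  for mc in range(1, budget + 1)
                  for nc in range(1, budget + 1)
                  if registers_needed(mc, nc) <= budget]
    top = max(cmar(mc, nc) for mc, nc in candidates)
    return sorted(s for s in candidates if cmar(*s) == top), top


shapes, ratio = best_kernels(32)
print(shapes, ratio)  # [(4, 4)] 2
```

Under this model the 4×4 kernel is the unique maximizer: it uses exactly 32 registers and performs 16 multiply-accumulates per 8 loads. If one further assumes that each complex element occupies two vector registers (real and imaginary parts), so the budget halves to 16, the same search returns 2×3 and 3×2, consistent with the 3×2 complex kernel reported above. That reading is an assumption of this sketch.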

Kernel optimization & data packing

[Image: Kernel optimizer instruction scheduling]
Figure 4. The kernel optimizer reorders instructions to widen the gap between dependent ops and interleaves loads between compute instructions to hide load latency; matrix C is prefetched with ARM's PRFM at kernel entry.
[Images: N-shape packing · Z-shape packing · Triangular packing]
Figure 5. Data-packing strategies — N-shape, Z-shape, and triangular (for TRSM). Packing makes kernel memory access contiguous; TRSM stores diagonal elements as reciprocals to avoid costly division in the kernel. A no-packing path is chosen whenever data is already sequentially accessible, saving overhead on the smallest sizes.
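The reciprocal-diagonal idea from Figure 5 can be shown for a single lower-triangular system. This is a scalar sketch, not the IATF packing kernel: the row-by-row packing order and the names `pack_lower_with_reciprocals` and `forward_substitute` are assumptions, and a compact kernel would apply the same arithmetic to P matrices per vector lane.

```python
def pack_lower_with_reciprocals(L):
    """Pack the lower triangle row by row, storing 1/L[i][i] on the diagonal."""
    packed = []
    for i, row in enumerate(L):
        packed.extend(row[:i])        # strictly-lower entries of row i
        packed.append(1.0 / row[i])   # the division happens once, while packing
    return packed


def forward_substitute(packed, b):
    """Solve L x = b from the packed triangle using only multiply/subtract."""
    x = []
    pos = 0                           # start of row i inside the packed buffer
    for i in range(len(b)):
        acc = b[i]
        for k in range(i):
            acc -= packed[pos + k] * x[k]
        x.append(acc * packed[pos + i])   # multiply by the stored reciprocal
        pos += i + 1
    return x


L = [[2.0, 0.0, 0.0],
     [1.0, 4.0, 0.0],
     [3.0, 5.0, 8.0]]
b = [2.0, 9.0, 37.0]                  # equals L applied to [1, 2, 3]
print(forward_substitute(pack_lower_with_reciprocals(L), b))  # [1.0, 2.0, 3.0]
```

The solve loop contains no division at all. Each reciprocal is computed once during packing and reused for every right-hand side.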
06 / Run-time stage

Building the Execution Plan

Step 1

Batch Counter

  • Picks how many matrices to batch per operation
  • Keeps the working set within L1 cache
  • Reserves space for matrix C (GEMM) / triangle (TRSM)
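A batch counter along these lines might look like the sketch below. The sizing rule is an assumption, not the rule IATF uses: the working set is taken as A, B, and C for GEMM, only a fraction of L1 is treated as usable, and the 64 KiB default is a placeholder to replace with the target core's actual L1 data-cache size. The function name `matrices_per_batch` is hypothetical.

```python
def matrices_per_batch(m, n, k, elem_bytes, P,
                       l1_bytes=64 * 1024, usable=0.75):
    """Pick how many matrices to process together so they stay in L1.

    m, n, k:     GEMM dimensions (C is m x n, A is m x k, B is k x n).
    elem_bytes:  bytes per element (4 for FP32, 8 for FP64, ...).
    P:           matrices interleaved per SIMD vector in the compact layout.
    l1_bytes:    assumed L1 data-cache capacity (placeholder value).
    usable:      assumed fraction of L1 the working set may occupy.
    """
    per_matrix = (m * k + k * n + m * n) * elem_bytes   # A + B + C
    fit = int(l1_bytes * usable) // per_matrix
    # Round down to whole SIMD groups, but always process at least one group.
    return max(P, (fit // P) * P)


print(matrices_per_batch(8, 8, 8, elem_bytes=4, P=4))  # 64
```

Rounding down to a multiple of P keeps every vector lane occupied. When even a single group exceeds the budget, the sketch still returns P, since a partial group cannot be issued.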
Steps 2 & 3

Pack Selector + Plan Generator

  • Chooses the optimal packing kernel — or no-packing
  • Selects the best-matching computing kernel per size
  • Links everything into a high-performance command queue

The run-time plan is generated once per batch, so its overhead is negligible when amortized across a large group of matrices.
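The pack-selection and plan-generation steps can be pictured as building a list of commands once and replaying it for every batch. The command names, the `build_plan` and `execute` helpers, and the no-op handlers below are all hypothetical. The sketch shows only the shape of such a command queue, not IATF's actual plan format.

```python
def build_plan(m, n, mc=4, nc=4, already_contiguous=False):
    """Build a command queue for an m x n compact GEMM output."""
    plan = []
    if not already_contiguous:            # pack selector: skip packing if possible
        plan.append(("pack", "A"))
        plan.append(("pack", "B"))
    for i0 in range(0, m, mc):
        for j0 in range(0, n, nc):
            # Best-matching kernel: full size inside, smaller shape on edges.
            shape = (min(mc, m - i0), min(nc, n - j0))
            plan.append(("gemm_kernel", shape, (i0, j0)))
    return plan


def execute(plan, handlers, num_batches):
    """Replay the same plan for every batch; returns the number of calls made."""
    calls = 0
    for _ in range(num_batches):
        for name, *args in plan:
            handlers[name](*args)
            calls += 1
    return calls


plan = build_plan(6, 6)
for command in plan:
    print(command)

# No-op handlers stand in for the generated packing and computing kernels.
noop = {"pack": lambda *a: None, "gemm_kernel": lambda *a: None}
print(execute(plan, noop, num_batches=3))  # 18
```

Because the plan depends only on the input properties and not on the data, it is built once and reused across batches, which is why its cost amortizes over a large group of matrices.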

07 / Results

Performance on Kunpeng 920

Evaluated on a Kunpeng 920 (ARMv8.2) against OpenBLAS, ARMPL (including its batched GEMM), and LIBXSMM, using square matrices of size 1–33 and a batch size of 16384. Intel MKL's compact BLAS on a Xeon Gold 6240 is included as a %-of-peak reference.

Compact GEMM

[Images: SGEMM result · DGEMM result · CGEMM result · ZGEMM result]
Figure 6. Compact GEMM (NN mode) vs. ARMPL, LIBXSMM, and OpenBLAS for S/D/C/Z. IATF reaches up to 21×, 7×, 12×, 6× over looped OpenBLAS, and up to 8×, 4×, 8×, 5× over ARMPL batched GEMM (S/D/C/Z respectively).
[Images: SGEMM transpose modes · DGEMM transpose modes · CGEMM transpose modes · ZGEMM transpose modes]
Figure 7. Compact GEMM across NN, NT, TN, and TT modes — stable, strong performance in every transpose mode and data type.

Compact TRSM

[Images: STRSM result · DTRSM result · CTRSM result · ZTRSM result]
Figure 8. Compact TRSM (LNLN mode) vs. ARMPL and OpenBLAS. IATF is up to 28×, 12×, 10×, 5× faster than looped OpenBLAS and up to 7×, 5×, 4×, 3× over ARMPL (S/D/C/Z).
[Images: STRSM modes · DTRSM modes · CTRSM modes · ZTRSM modes]
Figure 9. Compact TRSM across the LNLN, LNUN, LTLN, and LTUN modes — consistently high performance versus OpenBLAS and ARMPL.

Reference vs. Intel MKL (% of peak)

[Images: GEMM vs MKL · ZGEMM vs MKL · TRSM vs MKL · ZTRSM vs MKL]
Figure 10. As a percentage of processor peak, IATF on Kunpeng 920 is competitive with Intel MKL's compact interface on a Xeon Gold 6240 — with clear advantages on double-precision real and complex GEMM/TRSM. (Kunpeng 920 issues only one memory + one compute op per cycle, which tempers the single-precision advantage.)

Bottom line

Traditional large-GEMM optimizations leave most of the performance on the table for batches of fixed-size small matrices. IATF's SIMD-friendly layout, template-generated edge-aware kernels, smart packing, and input-aware run-time planning deliver up to 21× (GEMM) and 28× (TRSM) over OpenBLAS on ARMv8 — and hold their own against Intel MKL's compact BLAS as a fraction of peak.

08 / Cite

BibTeX

@inproceedings{wei2022iatf,
  title     = {IATF: An Input-Aware Tuning Framework for Compact
               BLAS Based on ARMv8 CPUs},
  author    = {Wei, Cunyang and Jia, Haipeng and Zhang, Yunquan
               and Xu, Liusha and Qi, Ji},
  booktitle = {Proceedings of the 51st International Conference on
               Parallel Processing (ICPP '22)},
  year      = {2022},
  doi       = {10.1145/3545008.3545032}
}