IATF: An Input-Aware Tuning Framework for Compact BLAS Based on ARMv8 CPUs

01 / Overview

Abstract

Mainstream BLAS libraries deliver high performance on large-scale GEMM and TRSM, but they fall short on batched operations over large groups of fixed-size small matrices, a pattern common in scientific computing. We propose IATF, an input-aware tuning framework that delivers near-optimal performance for large groups of fixed-size small GEMM and TRSM operations on the ARMv8 architecture.

IATF has two stages. At install time, using a SIMD-friendly data layout, it builds computing-kernel templates for GEMM and TRSM, derives optimal kernel sizes to raise the computational-instruction ratio, and applies kernel optimizations plus an optimized data-packing strategy to cut memory-access overhead. At run time, it generates an efficient execution plan from the input matrix properties. Experiments show significant improvements in both GEMM and TRSM over other mainstream BLAS libraries.

compact batched BLAS · auto-tune · code generation · GEMM · TRSM · ARMv8 / Kunpeng 920

02 / TL;DR

Key Contributions

🧩

Input-aware tuning framework

A high-performance framework for large groups of small-matrix operations that selects an optimal execution plan based on input matrix properties — size, transpose mode, side, triangle, and unit-diagonal.

⚙️

Kernel & packing design

A SIMD-friendly data layout plus template-generated kernels and data-packing methods for compact GEMM and compact TRSM, dramatically improving performance and reducing edge-processing waste.

🚀

IATF library on ARMv8

A complete library for fixed-size small GEMM and TRSM on the Kunpeng 920, competitive even with Intel MKL's compact BLAS (measured as % of peak performance).

**Figure 1.** Overview of IATF: an install-time stage (packing-kernel designer, computing-kernel designer, kernel optimizer) generates highly optimized kernels; a run-time stage (batch counter, pack selector, execution-plan generator) assembles them into an optimal plan based on the input.

03 / Why small matrices are hard

The Problem with Traditional Methods

Applications ranging from PDE simulations to high-order CFD and machine learning process huge groups of tiny matrices. Four properties break classic large-GEMM optimizations:

① SIMD underuse

A very small matrix can't fill the width of a SIMD register under traditional layouts.

② Edge overhead

Boundary processing is a large fraction of work for small matrices, not a negligible tail.

③ Tiling is moot

A small matrix fits entirely in L1 cache, so classic multi-level tiling buys nothing.

④ No input-aware tuning

No existing framework generates high-performance plans across the many small matrix sizes that occur in practice.

**Figure 2.** Left: the SIMD-friendly data layout places the same element of P consecutive matrices contiguously (P=4 for FP32 on Kunpeng 920's 128-bit SIMD), so one vector instruction processes P matrices. Right: the TRSM tiling method, a block decomposition of TRSM into triangular solves and rectangular (GEMM-like) blocks.

04 / Approach

SIMD-friendly Layout & Smaller Kernels

Because P matrices fill one SIMD register, IATF can use a much smaller register-level kernel than traditional tiling — which in turn slashes the number of awkward edge blocks.
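The layout idea can be sketched in a few lines of Python. This is a minimal illustration, not IATF's implementation: the function names `to_compact` and `compact_gemm` are hypothetical, the column-major index convention is an assumption, and the inner loop over `p` stands in for a single SIMD instruction operating on P matrices.

```python
from typing import List

Matrix = List[List[float]]  # row-major nested lists, for illustration only


def to_compact(mats: List[Matrix], rows: int, cols: int) -> List[float]:
    """Interleave P same-sized matrices so element (i, j) of all P is contiguous.

    Assumed convention: column-major over (i, j), matrix index p varies fastest.
    """
    P = len(mats)
    buf = [0.0] * (rows * cols * P)
    for j in range(cols):
        for i in range(rows):
            base = (j * rows + i) * P
            for p in range(P):
                buf[base + p] = mats[p][i][j]
    return buf


def compact_gemm(a: List[float], b: List[float],
                 m: int, n: int, k: int, P: int) -> List[float]:
    """Compute C = A @ B for P matrices at once on the interleaved layout."""
    c = [0.0] * (m * n * P)
    for j in range(n):
        for i in range(m):
            acc = [0.0] * P  # plays the role of one SIMD accumulator register
            for l in range(k):
                a_off = (l * m + i) * P
                b_off = (j * k + l) * P
                # One "vector" multiply-add: the same scalar op on P matrices.
                for p in range(P):
                    acc[p] += a[a_off + p] * b[b_off + p]
            c_off = (j * m + i) * P
            c[c_off:c_off + P] = acc
    return c
```

Because every lane of the accumulator belongs to a different matrix, the register is always full regardless of how small each matrix is, which is why the register-level kernel can shrink.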

**Figure 3.** Tiling a 15×15 DGEMM. Traditional tiling (left) leaves many small edge blocks whose kernels can't fill the SIMD registers, so the edge cost can exceed that of the main kernel. The compact layout (right) uses a small 4×4 kernel that processes 4×4 blocks of 4 matrices at once, minimizing edge cases.

05 / Install-time stage

Kernel Templates, Sizing & Packing

Maximizing the compute-to-memory-access ratio

Kernels are built from six reusable templates (I, M1, M2, E, SAVE, SUB) that implement a "ping-pong" schedule, loading the next iteration's data during the current computation to avoid pipeline bubbles. Maximizing the compute-to-memory-access ratio (CMAR) under the 32-register budget (2m_c + 2n_c + m_c·n_c ≤ 32, where m_c × n_c is the kernel size) yields an optimal 4×4 kernel for real GEMM and a 3×2 kernel for complex GEMM. IATF then generates kernels for every edge size.
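The kernel-size result can be reproduced with a brute-force search. The sketch below rests on two assumptions that the text above does not spell out: CMAR is modeled as multiply-adds per loaded element, m_c·n_c / (m_c + n_c), and a complex element is assumed to occupy two registers, which doubles the left-hand side of the budget. The function name `best_kernel` and its parameters are hypothetical.

```python
from fractions import Fraction


def best_kernel(registers: int = 32, regs_per_element: int = 1):
    """Return the (m_c, n_c) that maximizes the assumed CMAR within the budget."""
    best = None
    for m in range(1, registers + 1):
        for n in range(1, registers + 1):
            # Budget from the text: 2*m_c + 2*n_c + m_c*n_c registers,
            # scaled by the assumed register cost of one element.
            used = regs_per_element * (2 * m + 2 * n + m * n)
            if used > registers:
                continue
            cmar = Fraction(m * n, m + n)  # assumed model: FMAs per load
            key = (cmar, m)  # tie-break (a convention here): prefer larger m_c
            if best is None or key > best[0]:
                best = (key, (m, n))
    return best[1]
```

Under these assumptions, `best_kernel(32, 1)` returns `(4, 4)` and `best_kernel(32, 2)` returns `(3, 2)`, matching the sizes quoted above. Exact fractions are used so that ties between transposed shapes are resolved by the tie-break rather than by floating-point noise.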

Kernel optimization & data packing

**Figure 4.** The kernel optimizer reorders instructions to widen the gap between dependent operations and interleaves loads between compute instructions to hide load latency; matrix C is prefetched with ARM's PRFM at kernel entry.

**Figure 5.** Data-packing strategies: N-shape, Z-shape, and triangular (for TRSM). Packing makes kernel memory access contiguous; TRSM stores diagonal elements as reciprocals to avoid costly division in the kernel. A **no-packing** path is chosen whenever data is already sequentially accessible, saving overhead on the smallest sizes.
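The reciprocal trick for TRSM can be illustrated with a scalar forward substitution. This is only a sketch: the row-by-row packing order and the names `pack_lower_with_reciprocals` and `solve_lower_packed` are assumptions for illustration, not the packing order IATF uses.

```python
def pack_lower_with_reciprocals(L):
    """Pack the lower triangle row by row, replacing each diagonal entry by 1/d."""
    packed = []
    for i, row in enumerate(L):
        packed.extend(row[:i])       # strictly-lower entries of row i
        packed.append(1.0 / row[i])  # the one division, paid once at pack time
    return packed


def solve_lower_packed(packed, b):
    """Forward substitution for L x = b using only multiplies and subtracts."""
    n = len(b)
    x = [0.0] * n
    pos = 0  # start of row i inside the packed buffer
    for i in range(n):
        s = b[i]
        for j in range(i):
            s -= packed[pos + j] * x[j]
        x[i] = s * packed[pos + i]  # multiply by the stored reciprocal
        pos += i + 1
    return x
```

The division happens once during packing, so the solve loop, which runs for every right-hand side, contains only multiplications and subtractions and reads the packed buffer strictly sequentially.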

06 / Run-time stage

Building the Execution Plan

Step 1

Batch Counter

Picks how many matrices to batch per operation
Keeps the working set within L1 cache
Reserves space for matrix C (GEMM) / triangle (TRSM)
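One way to picture the batch counter is as a capacity calculation. The sketch below is an assumption-laden illustration, not IATF's actual heuristic: it takes the L1 data cache size as a parameter (defaulting to 64 KiB, an assumed value for a Kunpeng 920 core), counts the full A, B, and C footprints of each GEMM, and rounds down to a multiple of the SIMD group size P. The name `gemm_batch_count` is hypothetical.

```python
def gemm_batch_count(m: int, n: int, k: int, elem_bytes: int,
                     simd_bytes: int = 16, l1_bytes: int = 64 * 1024) -> int:
    """Estimate how many small GEMMs fit in L1 at once, in multiples of P."""
    P = simd_bytes // elem_bytes  # matrices per SIMD register
    # Footprint of one problem: A (m x k), B (k x n), and C (m x n).
    bytes_per_problem = (m * k + k * n + m * n) * elem_bytes
    fit = l1_bytes // bytes_per_problem
    # Round down to whole SIMD groups, but always process at least one group.
    return max(P, (fit // P) * P)
```

For example, with these defaults `gemm_batch_count(8, 8, 8, 8)` gives 42 double-precision problems per batch, and `gemm_batch_count(4, 4, 4, 4)` gives 340 single-precision problems.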

Step 2 & 3

Pack Selector + Plan Generator

Chooses the optimal packing kernel — or no-packing
Selects the best-matching computing kernel per size
Links everything into a high-performance command queue
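The plan-generation step can be sketched as tiling the output matrix and emitting one kernel call per block. The greedy split (full blocks first, one remainder block last), the kernel naming scheme, and the function names `split_dim` and `build_plan` are all hypothetical choices for this illustration; the actual generator may split edges differently.

```python
def split_dim(size: int, kernel: int):
    """Greedy split: full kernel-sized blocks, then one remainder block."""
    full, rem = divmod(size, kernel)
    return [kernel] * full + ([rem] if rem else [])


def build_plan(m: int, n: int, mc: int = 4, nc: int = 4):
    """Return a command queue of (kernel_name, row_offset, col_offset) entries."""
    plan = []
    row = 0
    for bm in split_dim(m, mc):
        col = 0
        for bn in split_dim(n, nc):
            # Pick the pre-generated kernel whose size matches this block.
            plan.append((f"gemm_kernel_{bm}x{bn}", row, col))
            col += bn
        row += bm
    return plan
```

For a 15×15 problem with a 4×4 main kernel, this yields 16 queue entries: nine 4×4 calls, three 4×3, three 3×4, and one 3×3. Since the queue depends only on the matrix shape, it can be built once and replayed for every group in the batch.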

The run-time plan is generated once per batch, so its overhead is negligible when amortized across a large group of matrices.

07 / Results

Performance on Kunpeng 920

Evaluated on a Kunpeng 920 (ARMv8.2) against OpenBLAS, ARMPL (including its batched GEMM), and LIBXSMM, using square matrices of sizes 1–33 and a batch size of 16384. Intel MKL's compact BLAS on a Xeon Gold 6240 is included as a percent-of-peak reference.

Compact GEMM

**Figure 6.** Panels: SGEMM, DGEMM. Compact GEMM (NN mode) vs. ARMPL, LIBXSMM, and OpenBLAS for S/D/C/Z. IATF reaches up to **21×, 7×, 12×, 6×** over looped OpenBLAS, and up to **8×, 4×, 8×, 5×** over ARMPL batched GEMM (S/D/C/Z respectively).

**Figure 7.** Panels: SGEMM, DGEMM. Compact GEMM across NN, NT, TN, and TT modes: stable, strong performance in every transpose mode and data type.

Compact TRSM

**Figure 8.** Panels: STRSM, DTRSM. Compact TRSM (LNLN mode) vs. ARMPL and OpenBLAS. IATF is up to **28×, 12×, 10×, 5×** faster than looped OpenBLAS and up to **7×, 5×, 4×, 3×** over ARMPL (S/D/C/Z).

**Figure 9.** Panels: STRSM, DTRSM. Compact TRSM across the LNLN, LNUN, LTLN, and LTUN modes: consistently high performance versus OpenBLAS and ARMPL.

Reference vs. Intel MKL (% of peak)

**Figure 10.** Panels: GEMM vs MKL, ZGEMM vs MKL. As a percentage of processor peak, IATF on Kunpeng 920 is competitive with Intel MKL's compact interface on a Xeon Gold 6240, with clear advantages on double-precision real and complex GEMM/TRSM. (Kunpeng 920 issues only one memory + one compute op per cycle, which tempers the single-precision advantage.)

Bottom line

Traditional large-GEMM optimizations leave most of the performance on the table for batches of fixed-size small matrices. IATF's SIMD-friendly layout, template-generated edge-aware kernels, smart packing, and input-aware run-time planning deliver up to 21× (GEMM) and 28× (TRSM) over OpenBLAS on ARMv8 — and hold their own against Intel MKL's compact BLAS as a fraction of peak.

08 / Cite

BibTeX

@inproceedings{wei2022iatf,
  title     = {IATF: An Input-Aware Tuning Framework for Compact
               BLAS Based on ARMv8 CPUs},
  author    = {Wei, Cunyang and Jia, Haipeng and Zhang, Yunquan
               and Xu, Liusha and Qi, Ji},
  booktitle = {Proceedings of the 51st International Conference on
               Parallel Processing (ICPP '22)},
  year      = {2022},
  doi       = {10.1145/3545008.3545032}
}