Irregular BLAS on CPUs
Oct 2021 – Apr 2023
A line of work on high-performance irregular and batched BLAS for ARM and x86 CPUs, spanning several frameworks.
IrGEMM — Irregular GEMM on ARM and x86 CPUs · TPDS 2024 · PDF
- Generated hundreds of highly optimized assembly kernels for diverse irregular GEMM types based on computing templates, instruction-mapping rules between templates and assembly code, and pipeline optimization strategies.
- Abstracted GEMM tiling into a boxing problem solved with dynamic programming to minimize memory access and maximize the compute-to-memory-access ratio.
- Built a load-balanced multithreaded scheduling framework for batch matrix multiplication on ARMv8 and Intel Cascade Lake architectures.
- Achieved single-threaded speedups of 2.3x / 2.7x / 2.5x and multi-threaded speedups of 3.4x / 14.6x / 14.3x over Intel MKL, ARMPL, LIBXSMM, and BLIS.
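The tiling idea above can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the IrGEMM implementation: the kernel set `KERNELS`, the 24-element register budget, the cost model (a tile of shape `mr x nr` loads `mr + nr` elements per step along k), and the function name `best_tiling` are all hypothetical, and the search is restricted to straight edge-to-edge cuts.

```python
from functools import lru_cache

# Hypothetical micro-kernel shapes (rows, cols): every shape whose accumulator
# tile fits an assumed budget of 24 elements. Real kernel sets will differ.
KERNELS = frozenset(
    (mr, nr) for mr in range(1, 9) for nr in range(1, 9) if mr * nr <= 24
)


@lru_cache(maxsize=None)
def best_tiling(m, n):
    """Return (cost, tiles) for covering an m x n block of C with kernels.

    Assumed cost model: a tile of shape (mr, nr) loads mr elements of A and
    nr elements of B per step along k, so its cost is mr + nr.
    """
    best = None
    # Option 1: the whole block is itself an available kernel shape.
    if (m, n) in KERNELS:
        best = (m + n, ((m, n),))
    # Option 2: cut horizontally (halves are symmetric, so stop at m // 2).
    for cut in range(1, m // 2 + 1):
        top, bottom = best_tiling(cut, n), best_tiling(m - cut, n)
        cand = (top[0] + bottom[0], top[1] + bottom[1])
        if best is None or cand[0] < best[0]:
            best = cand
    # Option 3: cut vertically.
    for cut in range(1, n // 2 + 1):
        left, right = best_tiling(m, cut), best_tiling(m, n - cut)
        cand = (left[0] + right[0], left[1] + right[1])
        if best is None or cand[0] < best[0]:
            best = cand
    return best


if __name__ == "__main__":
    cost, tiles = best_tiling(8, 8)
    print("cost per k-step:", cost)
    print("tiles:", tiles)
```

Because the 1 x 1 shape is in the assumed kernel set, every block size has at least one valid tiling, and memoization keeps the search cheap for the small blocks that irregular GEMM produces.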
IATF — Compact BLAS on ARMv8 · ICPP 2022 · PDF
- Proposed computing-kernel templates for GEMM and TRSM based on a SIMD-friendly data layout, choosing the kernel size and instructions according to the compute-to-memory-access ratio.
- Designed data-packing kernels so the computing kernel’s memory accesses are contiguous.
- Built an adaptive tuning framework that chooses the batch size from the L1 cache size and the matrix size, and selects the best packing and computing kernels from the properties of the input matrices.
- Achieved up to 4x (GEMM) and 5x (TRSM) speedup over ARMPL in double precision.
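A compact (interleaved) batch layout of the kind described above can be sketched as follows. This is an illustration under assumptions, not IATF's code: the function names, the zero padding of a partial group, and the cache-fitting rule in `batch_size_for_l1` are hypothetical. The default of 2 lanes corresponds to double precision in a 128-bit NEON vector; the 64 KiB L1 size is only a placeholder.

```python
def pack_compact(mats, lanes=2):
    """Interleave equally sized row-major matrices so that element (i, j) of
    `lanes` consecutive matrices sits in adjacent memory (one SIMD vector)."""
    rows, cols = len(mats[0]), len(mats[0][0])
    packed = []
    for start in range(0, len(mats), lanes):
        group = mats[start:start + lanes]
        for i in range(rows):
            for j in range(cols):
                for lane in range(lanes):
                    # Pad the last group with zeros if the batch is not a
                    # multiple of the lane count (an assumption of this sketch).
                    packed.append(group[lane][i][j] if lane < len(group) else 0.0)
    return packed


def batch_size_for_l1(m, n, k, l1_bytes=64 * 1024, elem_bytes=8, lanes=2):
    """Hypothetical heuristic: the largest multiple of `lanes` matrices whose
    A (m x k), B (k x n) and C (m x n) operands fit in L1 together."""
    per_matrix = (m * k + k * n + m * n) * elem_bytes
    fit = l1_bytes // per_matrix
    return max(lanes, fit - fit % lanes)


if __name__ == "__main__":
    a0 = [[1.0, 2.0], [3.0, 4.0]]
    a1 = [[5.0, 6.0], [7.0, 8.0]]
    print(pack_compact([a0, a1]))
    print(batch_size_for_l1(4, 4, 4))
```

With this layout a single vector load fetches the same element from several matrices, so one vector instruction advances several small problems at once.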
LBBGEMM — Load-Balanced Batch GEMM on ARM CPUs · HPCC 2022 · PDF
- Designed high-performance small-GEMM kernels that skip data packing, greatly reducing memory-access overhead.
- Presented a load-balanced multithreaded task-scheduling strategy for batch GEMM.
- Achieved DGEMM_Batch speedups of 2.3x on a single thread and 4.2x on 48 threads over ARMPL.
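One simple way to balance a batch of differently sized GEMMs across threads is sketched below. It uses the generic greedy longest-processing-time heuristic with a `2*m*n*k` FLOP estimate, which is a stand-in chosen for illustration and not necessarily the strategy used in LBBGEMM; the function name `schedule_batch` is hypothetical.

```python
import heapq


def schedule_batch(shapes, num_threads):
    """Assign GEMMs given as (m, n, k) to threads: heaviest first, each one
    to the thread with the smallest accumulated FLOP estimate so far."""
    work = sorted(
        ((2 * m * n * k, idx) for idx, (m, n, k) in enumerate(shapes)),
        reverse=True,
    )
    heap = [(0, t) for t in range(num_threads)]  # (load, thread id)
    heapq.heapify(heap)
    plan = [[] for _ in range(num_threads)]
    loads = [0] * num_threads
    for flops, idx in work:
        load, t = heapq.heappop(heap)  # least-loaded thread
        plan[t].append(idx)
        loads[t] = load + flops
        heapq.heappush(heap, (loads[t], t))
    return plan, loads


if __name__ == "__main__":
    shapes = [(8, 8, 8), (4, 4, 4), (4, 4, 4), (2, 2, 2)]
    plan, loads = schedule_batch(shapes, num_threads=2)
    print(plan, loads)
```

A static plan like this avoids runtime synchronization, at the cost of relying on the FLOP estimate being a fair proxy for the actual kernel time of small matrices.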
SA_TRSM — Shape-Aware Auto-Tuning for Irregular TRSM · ICPADS 2023
- Built a shape-aware auto-tuning framework for small, irregularly shaped TRSM that automatically selects kernels and blocking strategies from the shape of the triangular matrix to maximize performance.