LLM inference on mobile GPUs
Jul 2023 – Jan 2024
Accelerated LLaMA-7B inference on mobile GPUs (Qualcomm Adreno 740) by co-designing computation scheduling and memory optimizations for the two inference phases, prefill and decode.
- Optimized the tall-and-skinny matrix-multiplication kernels that bottleneck the prefill phase, achieving a 4.0x improvement over the CLBlast baseline through tiling and use of on-chip memory.
- Improved GEMV efficiency in the decode phase, reaching >90% of peak memory bandwidth through algorithmic and hardware-aware optimizations.