LLM inference on mobile GPUs

Accelerated LLaMA-7B inference on mobile GPUs (Qualcomm Adreno 740) by co-designing computation scheduling and memory optimization for the two inference phases, prefill and decode.

Optimized the tall-and-skinny matrix-multiplication kernels that bottleneck the prefill phase, achieving a 4.0x speedup over the CLBlast baseline through tiling and deliberate use of on-chip memory.
Improved decode-phase GEMV efficiency, reaching over 90% of peak memory bandwidth through algorithmic and hardware-aware optimizations.
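
The split between the two phases can be illustrated with a rough arithmetic-intensity estimate. This is a minimal sketch, not the project's actual analysis. It assumes fp16 storage (2 bytes per element), a single 4096 x 4096 weight matrix (4096 being the LLaMA-7B hidden size), a hypothetical 128-token prompt, and that each matrix is moved to or from memory exactly once. The function names and all measurement values are illustrative placeholders.

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Assumes each matrix crosses the memory bus exactly once
    (an idealized model; real kernels may re-read tiles).
    """
    flops = 2 * m * n * k  # one multiply and one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def bandwidth_utilization(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of peak memory bandwidth achieved by a kernel."""
    return bytes_moved / seconds / peak_bytes_per_s


HIDDEN = 4096        # LLaMA-7B hidden size
PROMPT_TOKENS = 128  # assumed prompt length, for illustration only

# Prefill: all prompt tokens are multiplied against the weights at once,
# giving a tall-and-skinny activation matrix (tokens x hidden).
prefill_ai = arithmetic_intensity(PROMPT_TOKENS, HIDDEN, HIDDEN)

# Decode: one token per step, so the product degenerates to a GEMV.
decode_ai = arithmetic_intensity(1, HIDDEN, HIDDEN)

# Placeholder measurement (NOT real Adreno 740 figures): a kernel that
# moves 9 GB in 0.1 s on a device with a 100 GB/s peak.
example_util = bandwidth_utilization(9e9, 0.1, 100e9)

if __name__ == "__main__":
    print(f"prefill intensity: {prefill_ai:.1f} FLOP/byte")
    print(f"decode intensity:  {decode_ai:.2f} FLOP/byte")
    print(f"example bandwidth utilization: {example_util:.0%}")
```

Under these assumptions the prefill product performs on the order of a hundred FLOPs per byte moved, while the decode GEMV performs about one. This is why the prefill kernels are judged by speedup from tiling and on-chip data reuse, and the decode kernels by the fraction of peak memory bandwidth they reach.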