Mirage Persistent Kernel (MPK)

MPK automatically transforms LLM inference into a single megakernel — a fused GPU kernel that performs all necessary computation and communication within one kernel launch. MPK uses an SM-level task graph to capture dependencies at streaming-multiprocessor granularity, enabling optimizations such as cross-operator pipelining and overlapping computation with communication. The compiler generates optimized CUDA code, while the runtime executes tasks inside the single kernel using decentralized scheduling, with no host-side launches between operators.
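The scheduling idea can be illustrated with a small host-side simulation (this is a conceptual sketch, not MPK's actual API or CUDA runtime): each task records how many prerequisites remain; whenever a task finishes, it decrements its dependents' counters and enqueues any task that becomes ready, so execution order emerges from the graph itself rather than from a central scheduler.

```python
from collections import deque

def run_task_graph(tasks, deps):
    """Simulate decentralized execution of an SM-level task graph.

    tasks: list of task names.
    deps:  dict mapping a task to the list of tasks it depends on.
    Returns the order in which tasks were executed.
    """
    # Unmet-dependency counter per task (what a megakernel would keep
    # in device memory and update with atomics).
    remaining = {t: len(deps.get(t, [])) for t in tasks}

    # Reverse edges: who to notify when a task completes.
    dependents = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(t)

    # Tasks with no prerequisites are immediately ready.
    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        # Each iteration stands in for one SM pulling the next ready task.
        task = ready.popleft()
        order.append(task)
        for d in dependents[task]:
            remaining[d] -= 1
            if remaining[d] == 0:
                ready.append(d)  # dependent became ready; enqueue it
    return order

# Example: attention output feeds the MLP, which feeds the residual add.
order = run_task_graph(
    ["attn", "mlp", "residual"],
    {"mlp": ["attn"], "residual": ["mlp"]},
)
print(order)  # ['attn', 'mlp', 'residual']
```

Because readiness is decided locally by counter updates, independent tasks from different operators can interleave freely — the property that lets a megakernel pipeline across operator boundaries instead of waiting for each kernel launch to drain.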

MPK reduces inference latency by up to 1.7x compared to conventional kernel-per-operator systems, and can compile an LLM from Hugging Face into a megakernel with only a few dozen lines of Python.

Reference Paper

Resources