Mirage Persistent Kernel (MPK)

MPK automatically transforms LLM inference into a single megakernel — a fused GPU kernel that performs all necessary computation and communication within one kernel launch. MPK uses an SM-level task graph to capture dependencies at streaming-multiprocessor granularity, enabling optimizations such as cross-operator pipelining and overlapping computation with communication. The compiler generates optimized CUDA code, while the runtime executes tasks inside the single kernel using decentralized scheduling, with no host-side launches between operators.
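The scheduling idea can be illustrated with a small host-side simulation (this is a conceptual sketch, not MPK's actual API or CUDA runtime): each task records how many prerequisites remain; whenever a task finishes, it decrements its dependents' counters and enqueues any task that becomes ready, so execution order emerges from the graph itself rather than from a central scheduler.

```python
from collections import deque

def run_task_graph(tasks, deps):
    """Simulate decentralized execution of an SM-level task graph.

    tasks: list of task names.
    deps:  dict mapping a task to the list of tasks it depends on.
    Returns the order in which tasks were executed.
    """
    # Unmet-dependency counter per task (what a megakernel would keep
    # in device memory and update with atomics).
    remaining = {t: len(deps.get(t, [])) for t in tasks}

    # Reverse edges: who to notify when a task completes.
    dependents = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(t)

    # Tasks with no prerequisites are immediately ready.
    ready = deque(t for t in tasks if remaining[t] == 0)
    order = []
    while ready:
        # Each iteration stands in for one SM pulling the next ready task.
        task = ready.popleft()
        order.append(task)
        for d in dependents[task]:
            remaining[d] -= 1
            if remaining[d] == 0:
                ready.append(d)  # dependent became ready; enqueue it
    return order

# Example: attention output feeds the MLP, which feeds the residual add.
order = run_task_graph(
    ["attn", "mlp", "residual"],
    {"mlp": ["attn"], "residual": ["mlp"]},
)
print(order)  # ['attn', 'mlp', 'residual']
```

Because readiness is decided locally by counter updates, independent tasks from different operators can interleave freely — the property that lets a megakernel pipeline across operator boundaries instead of waiting for each kernel launch to drain.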

MPK reduces inference latency by up to 1.7x compared to conventional kernel-per-operator systems, and can compile an LLM from Hugging Face into a megakernel with only a few dozen lines of Python.

Reference Paper

Resources