Publications


2025

May
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze.
MLSys 2025.

May
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models.
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen.
MLSys 2025.

April
MagicPIG: LSH Sampling for Efficient LLM Generation.
Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen.
ICLR 2025 (Spotlight).

April
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding.
Jian Chen*, Vashisth Tiwari*, Ranajoy Sadhukhan*, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, and Beidi Chen.
ICLR 2025.

April
APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding.
Xinyu Yang, Tianqi Chen, and Beidi Chen.
ICLR 2025.

April
Memory Mosaics.
Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, and Léon Bottou.
ICLR 2025.

April
Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity.
Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, and Zhaozhuo Xu.
ICLR 2025.

March
Relax: Composable Abstractions for End-to-End Dynamic Machine Learning.
Ruihang Lai, Junru Shao, Siyuan Feng, Steven S. Lyubomirsky, Bohan Hou, Wuwei Lin, Zihao Ye, Hongyi Jin, Yuchen Jin, Jiawei Liu, Lesheng Jin, Yaxing Cai, Ziheng Jiang, Yong Wu, Sunghyun Park, Prakalp Srivastava, Jared G. Roesch, Todd C. Mowry, and Tianqi Chen.
ASPLOS 2025.

March
GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism.
Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, and Zhihao Jia.
ASPLOS 2025.

2024

December
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding.
Zhuoming Chen*, Avner May*, Ruslan Svirschevski*, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen.
NeurIPS 2024 (Spotlight).

December
Sirius: Contextual Sparsity with Correction for Efficient LLMs.
Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, and Beidi Chen.
NeurIPS 2024.

December
S2FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity.
Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, and Beidi Chen.
NeurIPS 2024.

December
Learn To Be Efficient: Build Structured Sparsity in Large Language Models.
Haizhong Zheng, Xiaoyan Bai, Beidi Chen, Fan Lai, and Atul Prakash.
NeurIPS 2024 (Spotlight).

December
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices.
Ruslan Svirschevski, Avner May, Zhuoming Chen*, Beidi Chen, Zhihao Jia, and Max Ryabinin.
NeurIPS 2024.

December
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length.
Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou.
NeurIPS 2024.

December
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training.
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, and Anima Anandkumar.
NeurIPS 2024.

December
Who Needs Features? On the Surprising Effectiveness of Attention Transfer for Vision Transformers.
Alexander Li, Cong Li, Yuandong Tian, Beidi Chen, Deepak Pathak, and Xinlei Chen.
NeurIPS 2024.

December
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution.
Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, and Xi Victoria Lin.
NeurIPS 2024.

December
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding.
Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, and Zhangyang Wang.
NeurIPS 2024.

October
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding.
Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen.
COLM 2024.

October
Prompt-Prompted Mixture of Experts for Efficient LLM Generation.
Harry Dong, Beidi Chen, and Yuejie Chi.
COLM 2024.

August
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding.
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, and Carole-Jean Wu.
ACL 2024.

July
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian.
ICML 2024 (Oral).

July
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt.
Zhaozhuo Xu, Zirui Liu, Beidi Chen, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, and Anshumali Shrivastava.
ICML 2024.

July
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu.
ICML 2024.

July
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment.
Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan.
ICML 2024.

July
LoCoCo: Dropping In Convolutions for Long Context Compression.
Ruisi Cai, Yuandong Tian, Zhangyang Wang, and Beidi Chen.
ICML 2024.

July
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference.
Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen.
ICML 2024.

May
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving.
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci.
MLSys 2024.

May
JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention.
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Du.
ICLR 2024.

May
Efficient Streaming Language Models with Attention Sinks.
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis.
ICLR 2024.

May
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache.
Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, and Atlas Wang.
MLSys 2024.

May
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time.
Pratik Fegade, Tianqi Chen, Phillip Gibbons, and Todd Mowry.
MLSys 2024.

2023

December
Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances.
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, and Zhihao Jia.
NSDI 2024.

December
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen.
NeurIPS 2023.

December
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer.
Yuandong Tian, Yiping Wang, Beidi Chen, and Simon Du.
NeurIPS 2023.

December
Laughing Hyena Distillery: Extracting Compact Recurrences from Convolutions.
Stefano Massaroli, Michael Poli, Dan Fu, Hermann Kumbong, Rom Parnichkun, David Romero, Aman Timalsina, Quinn McIntyre, Beidi Chen, Atri Rudra, Ce Zhang, Christopher Ré, Stefano Ermon, and Yoshua Bengio.
NeurIPS 2023.

November
SpotServe: Serving Generative Large Language Models on Preemptible Instances.
Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia.
ASPLOS 2024.

July
Fast Algorithms for a New Relaxation of Optimal Transport.
Moses Charikar, Beidi Chen, Christopher Ré, and Erik Waingarten.
COLT 2023.

April
SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training.
Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, and Zhihao Jia.
VLDB 2023.

March
EinNet: Optimizing Tensor Programs with Derivation-Based Transformations.
Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, and Zhihao Jia.
OSDI 2023.

March
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning.
Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, and Luis Ceze.
ASPLOS 2023.

March
TensorIR: An Abstraction for Automatic Tensorized Program Optimization.
Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen.
ASPLOS 2023.

2022

October
Collage: Seamless Integration of Deep Learning Backends with Automatic Placement.
Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, and Zhihao Jia.
PACT 2022.

September
Tensor Program Optimization with Probabilistic Programs.
Junru Shao, Xiyou Zhou, Siyuan Feng, Bohan Hou, Ruihang Lai, Hongyi Jin, Wuwei Lin, Masahiro Masuda, Cody Hao Yu, and Tianqi Chen.
NeurIPS 2022.

July
Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization.
Zhihao Jia, Colin Unger, Wei Wu, Sina Lin, Mandeep Baines, Vinay Ramakrishnaiah, Carlos Efrain Quintero Narvaez, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken.
OSDI 2022.

June
Quartz: Superoptimization of Quantum Circuits.
Mingkuan Xu, Zikun Li, Oded Padon, Sina Lin, Jessica Pointing, Auguste Hirth, Henry Ma, Jens Palsberg, Alex Aiken, Umut A. Acar, and Zhihao Jia.
PLDI 2022.

April
GradSign: Model Performance Inference with Theoretical Insights.
Zhihao Zhang and Zhihao Jia.
ICLR 2022.

March
The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding.
Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, and Todd C. Mowry.
MLSys 2022.

March
DietCode: Automatic Optimization for Dynamic Tensor Programs.
Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko.
MLSys 2022.

2021

July
PET: Optimizing Tensor Programs with Partially Equivalent Transformation and Automated Correction.
Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia.
OSDI 2021.

April
IOS: Inter-Operator Scheduler for CNN Acceleration.
Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, and Song Han.
MLSys 2021.

April
Cortex: A Compiler for Recursive Deep Learning Models.
Pratik Fegade, Tianqi Chen, Phil Gibbons, and Todd Mowry.
MLSys 2021.

2020

August
Redundancy-Free Computation Graphs for Graph Neural Networks.
Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, and Alex Aiken.
KDD 2020.

March
Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc.
Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken.
MLSys 2020.

February
Automating Generation of Low Precision Deep Learning Operators.
Meghan Cowan, Thierry Moreau, Tianqi Chen, and Luis Ceze.
CGO 2020.

2019

November
TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions.
Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken.
SOSP 2019.

September
A Hardware-Software Blueprint for Flexible Deep Learning Specialization.
Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy.
IEEE Micro 39(5).

April
Beyond Data and Model Parallelism for Deep Neural Networks.
Zhihao Jia, Matei Zaharia, and Alex Aiken.
SysML 2019.

April
Optimizing DNN Computation with Relaxed Graph Substitutions.
Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken.
SysML 2019.

2018

December
Learning to Optimize Tensor Programs.
Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy.
NeurIPS 2018.

October
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy.
OSDI 2018.

July
Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks.
Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken.
ICML 2018.

2017

November
A Distributed Multi-GPU System for Fast Graph Processing.
Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez, and Alex Aiken.
VLDB 11(3).

2016

August
XGBoost: A Scalable Tree Boosting System.
Tianqi Chen and Carlos Guestrin.
KDD 2016.

2015

December
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems.
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang.
LearningSys Workshop at NeurIPS 2015.