2025
May
|
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze.
MLSys 2025.
|
2025
May
|
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models.
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen.
MLSys 2025.
|
2025
April
|
MagicPIG: LSH Sampling for Efficient LLM Generation.
Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen.
ICLR 2025
(Spotlight).
|
2025
April
|
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding.
Jian Chen*, Vashisth Tiwari*, Ranajoy Sadhukhan*, Zhuoming Chen, Jinyuan Shi, Ian En-Hsu Yen, and Beidi Chen.
ICLR 2025.
|
2025
April
|
APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding.
Xinyu Yang, Tianqi Chen, and Beidi Chen.
ICLR 2025.
|
2025
April
|
Memory Mosaics.
Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, and Léon Bottou.
ICLR 2025.
|
2025
April
|
Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity.
Wentao Guo, Jikai Long, Yimeng Zeng, Zirui Liu, Xinyu Yang, Yide Ran, Jacob R. Gardner, Osbert Bastani, Christopher De Sa, Xiaodong Yu, Beidi Chen, and Zhaozhuo Xu.
ICLR 2025.
|
2025
March
|
Relax: Composable Abstractions for End-to-End Dynamic Machine Learning.
Ruihang Lai, Junru Shao, Siyuan Feng, Steven S. Lyubomirsky, Bohan Hou, Wuwei Lin, Zihao Ye, Hongyi Jin, Yuchen Jin, Jiawei Liu, Lesheng Jin, Yaxing Cai, Ziheng Jiang, Yong Wu, Sunghyun Park, Prakalp Srivastava, Jared G. Roesch, Todd C. Mowry, and Tianqi Chen.
ASPLOS 2025.
|
2025
March
|
GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism.
Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, and Zhihao Jia.
ASPLOS 2025.
|
2024
December
|
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding.
Zhuoming Chen*, Avner May*, Ruslan Svirschevski*, Yuhsun Huang, Max Ryabinin, Zhihao Jia, and Beidi Chen.
NeurIPS 2024
(Spotlight).
|
2024
December
|
Sirius: Contextual Sparsity with Correction for Efficient LLMs.
Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, and Beidi Chen.
NeurIPS 2024.
|
2024
December
|
S2FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity.
Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, and Beidi Chen.
NeurIPS 2024.
|
2024
December
|
Learn To Be Efficient: Build Structured Sparsity in Large Language Models.
Haizhong Zheng, Xiaoyan Bai, Beidi Chen, Fan Lai, and Atul Prakash.
NeurIPS 2024
(Spotlight).
|
2024
December
|
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices.
Ruslan Svirschevski, Avner May, Zhuoming Chen*, Beidi Chen, Zhihao Jia, and Max Ryabinin.
NeurIPS 2024.
|
2024
December
|
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length.
Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou.
NeurIPS 2024.
|
2024
December
|
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training.
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, and Anima Anandkumar.
NeurIPS 2024.
|
2024
December
|
Who Needs Features? On the Surprising Effectiveness of Attention Transfer for Vision Transformers.
Alexander Cong Li, Yuandong Tian, Beidi Chen, Deepak Pathak, and Xinlei Chen.
NeurIPS 2024.
|
2024
December
|
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution.
Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, and Xi Victoria Lin.
NeurIPS 2024.
|
2024
December
|
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding.
Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, and Zhangyang Wang.
NeurIPS 2024.
|
2024
October
|
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding.
Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen.
COLM 2024.
|
2024
October
|
Prompt-Prompted Mixture of Experts for Efficient LLM Generation.
Harry Dong, Beidi Chen, and Yuejie Chi.
COLM 2024.
|
2024
August
|
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding.
Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, and Carole-Jean Wu.
ACL 2024.
|
2024
July
|
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection.
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian.
ICML 2024
(Oral).
|
2024
July
|
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt.
Zhaozhuo Xu, Zirui Liu, Beidi Chen, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, and Anshumali Shrivastava.
ICML 2024.
|
2024
July
|
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu.
ICML 2024.
|
2024
July
|
HexGen: Generative Inference of Large Language Model over Heterogeneous Environment.
Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan.
ICML 2024.
|
2024
July
|
LoCoCo: Dropping In Convolutions for Long Context Compression.
Ruisi Cai, Yuandong Tian, Zhangyang Wang, and Beidi Chen.
ICML 2024.
|
2024
July
|
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference.
Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen.
ICML 2024.
|
2024
May
|
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving.
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci.
MLSys 2024.
|
2024
May
|
JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention.
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Du.
ICLR 2024.
|
2024
May
|
Efficient Streaming Language Models with Attention Sinks.
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis.
ICLR 2024.
|
2024
May
|
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache.
Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, and Atlas Wang.
MLSys 2024.
|
2024
May
|
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time.
Pratik Fegade, Tianqi Chen, Phillip Gibbons, and Todd Mowry.
MLSys 2024.
|
2023
December
|
Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances.
Jiangfei Duan, Ziang Song, Xupeng Miao, Xiaoli Xi, Dahua Lin, Harry Xu, Minjia Zhang, and Zhihao Jia.
NSDI 2024.
|
2023
December
|
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen.
NeurIPS 2023.
|
2023
December
|
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer.
Yuandong Tian, Yiping Wang, Beidi Chen, and Simon Du.
NeurIPS 2023.
|
2023
December
|
Laughing Hyena Distillery: Extracting Compact Recurrences from Convolutions.
Stefano Massaroli, Michael Poli, Dan Fu, Hermann Kumbong, Rom Parnichkun, David Romero, Aman Timalsina, Quinn McIntyre, Beidi Chen, Atri Rudra, Ce Zhang, Christopher Ré, Stefano Ermon, and Yoshua Bengio.
NeurIPS 2023.
|
2023
November
|
SpotServe: Serving Generative Large Language Models on Preemptible Instances.
Xupeng Miao, Chunan Shi, Jiangfei Duan, Xiaoli Xi, Dahua Lin, Bin Cui, and Zhihao Jia.
ASPLOS 2024.
|
2023
July
|
Fast Algorithms for a New Relaxation of Optimal Transport.
Moses Charikar, Beidi Chen, Christopher Ré, and Erik Waingarten.
COLT 2023.
|
2023
April
|
SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training.
Xupeng Miao, Yining Shi, Zhi Yang, Bin Cui, and Zhihao Jia.
VLDB 2023.
|
2023
March
|
EinNet: Optimizing Tensor Programs with Derivation-Based Transformations.
Liyan Zheng, Haojie Wang, Jidong Zhai, Muyan Hu, Zixuan Ma, Tuowei Wang, Shuhong Huang, Xupeng Miao, Shizhi Tang, Kezhao Huang, and Zhihao Jia.
OSDI 2023.
|
2023
March
|
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning.
Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, and Luis Ceze.
ASPLOS 2023.
|
2023
March
|
TensorIR: An Abstraction for Automatic Tensorized Program Optimization.
Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen.
ASPLOS 2023.
|
2022
October
|
Collage: Seamless Integration of Deep Learning Backends with Automatic Placement.
Byungsoo Jeon, Sunghyun Park, Peiyuan Liao, Sheng Xu, Tianqi Chen, and Zhihao Jia.
PACT 2022.
|
2022
September
|
Tensor Program Optimization with Probabilistic Programs.
Junru Shao, Xiyou Zhou, Siyuan Feng, Bohan Hou, Ruihang Lai, Hongyi Jin, Wuwei Lin, Masahiro Masuda, Cody Hao Yu, and Tianqi Chen.
NeurIPS 2022.
|
2022
July
|
Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization.
Zhihao Jia, Colin Unger, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken.
OSDI 2022.
|
2022
June
|
Quartz: Superoptimization of Quantum Circuits.
Mingkuan Xu, Zikun Li, Oded Padon, Sina Lin, Jessica Pointing, Auguste Hirth, Henry Ma, Jens Palsberg, Alex Aiken, Umut A. Acar, and Zhihao Jia.
PLDI 2022.
|
2022
April
|
GradSign: Model Performance Inference with Theoretical Insights.
Zhihao Zhang and Zhihao Jia.
ICLR 2022.
|
2022
March
|
The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding.
Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, and Todd C. Mowry.
MLSys 2022.
|
2022
March
|
DietCode: Automatic Optimization for Dynamic Tensor Programs.
Bojian Zheng, Ziheng Jiang, Cody Hao Yu, Haichen Shen, Joshua Fromm, Yizhi Liu, Yida Wang, Luis Ceze, Tianqi Chen, and Gennady Pekhimenko.
MLSys 2022.
|
2021
July
|
PET: Optimizing Tensor Programs with Partially Equivalent Transformation and Automated Correction.
Haojie Wang, Jidong Zhai, Mingyu Gao, Zixuan Ma, Shizhi Tang, Liyan Zheng, Yuanzhi Li, Kaiyuan Rong, Yuanyong Chen, and Zhihao Jia.
OSDI 2021.
|
2021
April
|
IOS: Inter-Operator Scheduler for CNN Acceleration.
Yaoyao Ding, Ligeng Zhu, Zhihao Jia, Gennady Pekhimenko, and Song Han.
MLSys 2021.
|
2021
April
|
Cortex: A Compiler for Recursive Deep Learning Models.
Pratik Fegade, Tianqi Chen, Phil Gibbons, and Todd Mowry.
MLSys 2021.
|
2020
August
|
Redundancy-Free Computation Graphs for Graph Neural Networks.
Zhihao Jia, Sina Lin, Rex Ying, Jiaxuan You, Jure Leskovec, and Alex Aiken.
KDD 2020.
|
2020
March
|
Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc.
Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken.
MLSys 2020.
|
2020
February
|
Automating Generation of Low Precision Deep Learning Operators.
Meghan Cowan, Thierry Moreau, Tianqi Chen, and Luis Ceze.
CGO 2020.
|
2019
November
|
TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions.
Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken.
SOSP 2019.
|
2019
September
|
A Hardware-Software Blueprint for Flexible Deep Learning Specialization.
Thierry Moreau, Tianqi Chen, Luis Vega, Jared Roesch, Eddie Yan, Lianmin Zheng, Josh Fromm, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy.
IEEE Micro 39(5).
|
2019
April
|
Beyond Data and Model Parallelism for Deep Neural Networks.
Zhihao Jia, Matei Zaharia, and Alex Aiken.
SysML 2019.
|
2019
April
|
Optimizing DNN Computation with Relaxed Graph Substitutions.
Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken.
SysML 2019.
|
2018
December
|
Learning to Optimize Tensor Programs.
Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy.
NeurIPS 2018.
|
2018
October
|
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy.
OSDI 2018.
|
2018
July
|
Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks.
Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken.
ICML 2018.
|
2017
November
|
A Distributed Multi-GPU System for Fast Graph Processing.
Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez, and Alex Aiken.
VLDB 11(3).
|
2016
August
|
XGBoost: A Scalable Tree Boosting System.
Tianqi Chen and Carlos Guestrin.
KDD 2016.
|
2015
December
|
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems.
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang.
LearningSys Workshop at Neural Information Processing Systems 2015.
|