FlexFlow Serve is a low-latency, high-performance framework for serving generative large language models (LLMs), built on top of FlexFlow. The high computational and memory requirements of LLMs make it challenging to serve them quickly and cheaply. FlexFlow Serve is an open-source system that includes an automated tensor program compiler and an efficient distributed multi-GPU runtime for accelerating LLM inference. FlexFlow Serve provides the following key features:
CPU Offloading. FlexFlow Serve also offers offloading-based inference for running large models (e.g., llama-7B) on a single GPU. With CPU offloading, tensors are kept in CPU memory and copied to the GPU only when they are needed for computation. Note that FlexFlow Serve currently offloads only the largest weight tensors (the weight tensors of the Linear and Attention layers).
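To illustrate the general idea (this is a minimal PyTorch sketch of the offloading pattern, not FlexFlow Serve's actual implementation; the class and method names are ours), the layer below keeps its weight in pinned CPU memory and stages it onto the GPU only for the duration of the forward pass:

```python
import torch

class OffloadedLinear(torch.nn.Module):
    """Keeps its weight in pinned CPU memory and copies it to the GPU only
    while the layer is being computed -- the same idea FlexFlow Serve applies
    to its largest weight tensors (Linear / Attention)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Weight lives on the CPU; pin_memory() enables fast async host-to-device copies.
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features).pin_memory(),
            requires_grad=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Copy the weight to the GPU just for this call; it can be freed afterwards.
        w_gpu = self.weight.to(x.device, non_blocking=True)
        return torch.nn.functional.linear(x, w_gpu)

if torch.cuda.is_available():
    layer = OffloadedLinear(4096, 4096)
    x = torch.randn(1, 4096, device="cuda")
    y = layer(x)  # the weight is resident on the GPU only during this call
    print(y.shape)
```

The trade-off is extra host-to-device traffic per layer in exchange for a much smaller resident GPU footprint, which is what makes single-GPU inference of large models feasible.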
Quantization. FlexFlow Serve supports int4 and int8 quantization. The compressed tensors are stored in CPU memory; once copied to the GPU, they are decompressed and converted back to their original precision.
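As a rough illustration of this compress-on-CPU, decompress-on-GPU flow (a simple absmax int8 scheme is used here purely for illustration; FlexFlow Serve's actual quantization format and kernels may differ, and the function names are ours):

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Absmax int8 quantization: store int8 values plus a per-row scale, kept on the CPU."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q.cpu(), scale.cpu()

def dequantize_on_gpu(q: torch.Tensor, scale: torch.Tensor, dtype=torch.float16):
    """Copy the compressed tensor to the GPU and expand it back to its working precision."""
    return q.to("cuda").to(dtype) * scale.to("cuda").to(dtype)

if torch.cuda.is_available():
    w = torch.randn(4096, 4096)          # full-precision weight
    q, scale = quantize_int8(w)          # ~4x smaller, stored in CPU memory
    w_gpu = dequantize_on_gpu(q, scale)  # restored (approximately) on the GPU
    print((w_gpu.cpu().float() - w).abs().max())
```

Storing the quantized copy on the CPU keeps GPU memory free for activations and the KV cache, at the cost of a dequantization step after each transfer.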
More information about FlexFlow Serve is available at https://flexflow.ai.