FlexFlow Serve

FlexFlow Serve is a low-latency, high-performance serving framework for generative large language models (LLMs), built on top of FlexFlow. The high computational and memory requirements of LLMs make it challenging to serve them quickly and cheaply. FlexFlow Serve is an open-source system that includes an automated tensor program compiler and an efficient distributed multi-GPU runtime for accelerating LLM inference. FlexFlow Serve provides the following key features:

[Figure: FlexFlow Serve key features]

More information about FlexFlow Serve is available at https://flexflow.ai.

Publication