CS seminar: H2O & deja vu: Sparsity for efficient long sequence generation

This event is in the past.

September 19, 2023
11:30 a.m. to 12:30 p.m.
Science Hall #1117
5045 Cass Ave
Detroit, MI 48202
Event category: Seminar


Dr. Beidi Chen, Carnegie Mellon University


LLMs have sparked a new wave of exciting AI applications, but they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, forgo LLMs' in-context learning ability, or do not yield wall-clock speedup on modern hardware. In this talk, I will show how sparsity can help overcome two major bottlenecks in LLM inference: model weight IOs and KV cache IOs.

First, we present Dejavu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given inputs to each layer, along with an asynchronous and hardware-aware implementation that reduces model weight loading IOs. Dejavu can reduce the inference latency of OPT-175B by over 2x compared to the state-of-the-art FasterTransformer, and over 6x compared to the widely used Hugging Face implementation, without compromising model quality.

Second, we present Heavy-Hitter Oracle (H2O), a KV cache eviction policy that drastically reduces the memory footprint of these transient states. Our approach is based on the noteworthy observation that a small portion of tokens, the Heavy-Hitters, contributes most of the value when computing attention scores. H2O improves the throughput over three leading inference systems, DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen, by up to 29x, 29x, and 3x, respectively, on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the latency by up to 1.9x.
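
The eviction idea described above can be illustrated with a small sketch. This is not the speaker's implementation. It is a toy, pure-Python model that assumes each cached token carries a running sum of the attention weights it has received, and that the cache keeps a fixed window of the most recent tokens plus the highest-scoring older tokens. The names `h2o_evict`, `simulate`, `budget`, and `recent` are hypothetical, and the recent-token window is an assumption of this sketch rather than something stated in the abstract.

```python
def h2o_evict(acc_scores, budget, recent):
    """Return the sorted cache indices to keep (hypothetical helper).

    acc_scores: accumulated attention score per cached token, in position order.
    budget: maximum number of tokens the cache may hold.
    recent: number of most recent tokens that are always kept (assumption).
    """
    n = len(acc_scores)
    if recent > budget:
        raise ValueError("recent must not exceed budget")
    if n <= budget:
        return list(range(n))
    recent_start = n - recent
    recent_idx = list(range(recent_start, n))
    # Rank older tokens by accumulated score, highest first;
    # ties are broken in favour of the earlier token.
    older = sorted(range(recent_start), key=lambda i: (-acc_scores[i], i))
    heavy = older[: budget - recent]
    return sorted(heavy + recent_idx)


def simulate(attention_rows, budget, recent):
    """Run the eviction rule over a sequence of decoding steps.

    attention_rows[t] holds the attention weights at step t over the tokens
    currently in the cache, followed by the weight on the new token.
    Returns the original positions of the tokens left in the cache.
    """
    positions = []  # original position of each cached token
    acc = []        # accumulated attention score of each cached token
    for t, row in enumerate(attention_rows):
        positions.append(t)
        acc.append(0.0)
        if len(row) != len(acc):
            raise ValueError("row length must match cache size")
        acc = [a + w for a, w in zip(acc, row)]
        keep = h2o_evict(acc, budget, recent)
        positions = [positions[i] for i in keep]
        acc = [acc[i] for i in keep]
    return positions


# Toy attention weights, made up for illustration only.
rows = [
    [1.0],
    [0.9, 0.1],
    [0.7, 0.1, 0.2],
    [0.6, 0.05, 0.25, 0.1],
    [0.5, 0.2, 0.1, 0.2],
]
kept = simulate(rows, budget=3, recent=1)
```

With a budget of three tokens and one reserved recent slot, `kept` ends up as `[0, 2, 4]`: position 0 survives because it keeps receiving most of the attention mass, position 2 survives as the next-heaviest older token, and position 4 is the most recent token.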


Dr. Beidi Chen is an Assistant Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. She is a Visiting Research Scientist at FAIR, Meta. Before that, she was a postdoctoral scholar at Stanford University. She received her Ph.D. from Rice University in 2020 and B.S. from UC Berkeley in 2015. Her research focuses on efficient machine learning. Specifically, she designs and optimizes algorithms and models on modern hardware to accelerate large machine learning systems. Her work has won a best paper runner-up at ICML 2022, a best paper award at IISA 2018, and a best paper award at USENIX LISA 2014. She was selected as a Rising Star in EECS by MIT in 2019 and UIUC in 2021.
