Thu, April 10, 11:30 AM
60 MINUTES
The Art of LLM Compression

Large language models have become the cornerstone of natural language processing, but they also come with extremely high computational and storage costs. Due to their massive size, even inference for large, highly accurate models may require multiple performant GPUs, which limits their usability. This has led to a race to reduce inference costs and enable efficient local computation. In this talk, we first discuss the main computational bottlenecks of such models. Then, we present state-of-the-art compression techniques (such as quantization and sparsification) that reduce the inference cost of these models in practice. Finally, we discuss existing opportunities for future work.
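
To give a rough feel for the quantization idea mentioned in the abstract, here is a minimal sketch (not the method presented in the talk) of naive round-to-nearest INT8 weight quantization in Python; all names and shapes are illustrative.

    # Illustrative sketch only: per-tensor round-to-nearest INT8 quantization.
    import numpy as np

    def quantize_int8(weights: np.ndarray):
        """Quantize a float weight matrix to INT8 with a single scale factor."""
        scale = np.abs(weights).max() / 127.0   # map largest magnitude to 127
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover an approximate float matrix for use at inference time."""
        return q.astype(np.float32) * scale

    # Example: the INT8 copy needs 4x less memory than FP32 at a small error.
    w = np.random.randn(4096, 4096).astype(np.float32)
    q, s = quantize_int8(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())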

Saleh Ashkboos

PhD Student @ ETH Zürich

Saleh is a Ph.D. student in the Computer Science Department at ETH Zürich, advised by Torsten Hoefler and Dan Alistarh. His research focuses on accelerating deep neural network training, and he also works on developing systems and algorithms for large-scale graph processing.