Distributed Speculative Inference of Large Language Models

Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel

Research output: Contribution to journal › Conference article › peer-review

Abstract

Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [Leviathan et al., 2023, Chen et al., 2023, Miao et al., 2023] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but they require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often lack matching drafters that are sufficiently fast and accurate. We expose this gap: SI becomes slower than non-SI when the drafter is slower or less accurate. We close the gap by proving that DSI is faster than both SI and non-SI, given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI. Our code is open-sourced: github.com/keyboardAnt/Distributed-Speculative-Inference.
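
The abstract describes DSI only at a high level. As a rough illustration of the core idea, the following is a minimal, self-contained Python sketch in which target verifications of several speculative prefixes run in parallel, hiding verification latency. Everything in it is hypothetical: target_model and drafter_model are toy stand-ins for frozen LLMs, greedy (deterministic) decoding replaces the paper's distribution-preserving speculative sampling, and a thread pool stands in for the separate target instances that DSI orchestrates. See the paper and the linked repository for the actual algorithm.

from concurrent.futures import ThreadPoolExecutor


def target_model(prefix):
    # Toy stand-in for the (slow) target LLM: deterministically maps a
    # prefix of integer "tokens" to the true next token.
    return (prefix[-1] + prefix[-2]) % 10


def drafter_model(prefix):
    # Toy stand-in for the (fast) drafter: agrees with the target most of
    # the time but is deliberately wrong now and then.
    guess = (prefix[-1] + prefix[-2]) % 10
    return guess if prefix[-1] % 3 != 0 else (guess + 1) % 10


def dsi_generate(prompt, num_tokens, lookahead=4, workers=4):
    tokens = list(prompt)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(tokens) - len(prompt) < num_tokens:
            # 1. The drafter speculates a block of `lookahead` tokens.
            draft, prefix = [], list(tokens)
            for _ in range(lookahead):
                t = drafter_model(prefix)
                draft.append(t)
                prefix.append(t)
            # 2. Launch one target verification per speculative prefix, all
            #    in parallel. In DSI proper these run on separate target
            #    instances (processes/devices), not threads; each prefix is
            #    copied at submission time, so the tasks are independent.
            futures = [
                pool.submit(target_model, tokens + draft[:i])
                for i in range(lookahead)
            ]
            # 3. Accept drafted tokens while they match the target; on the
            #    first mismatch, take the target's token instead and resume
            #    drafting from there.
            for i, fut in enumerate(futures):
                true_token = fut.result()
                if draft[i] == true_token:
                    tokens.append(draft[i])
                else:
                    tokens.append(true_token)
                    break
    return tokens


if __name__ == "__main__":
    print(dsi_generate([1, 1], num_tokens=10))

In this sketch, every iteration of the outer loop appends at least one target-verified token, so progress never depends on the drafter being accurate; per the abstract, the paper proves latency guarantees for such concurrent schedules given any drafters.
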

Original language: English
Pages (from-to): 336-354
Number of pages: 19
Journal: Proceedings of Machine Learning Research
Volume: 262
State: Published - 2024
Event: 4th NeurIPS Efficient Natural Language and Speech Processing Workshop - Vancouver, Canada
Duration: 14 Dec 2024 → …

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
