Expo Demonstration
Parallel generation with verification on device
Ron Tindall
Upper Level Room 29A-D
In this work, we address the challenges of efficiently generating and verifying multiple responses from large language models (LLMs) directly on device. While sampling with non-zero temperature often yields improved responses compared to greedy approaches, selecting the best response requires generating several candidates and evaluating them without incurring significant latency or resource overhead. Cloud-based solutions often rely on separate verification models, which are impractical for on-device deployment due to resource constraints. Our proposed solution leverages multi-stream execution graphs and parallel LLM generation, enabling joint generation and verification within a unified framework. Combined with post-processing techniques such as majority voting, this approach minimizes latency and optimizes the selection of high-quality responses, paving the way for more effective on-device LLM inference.

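As a rough illustration of the generate-then-vote pattern described above, the sketch below uses the Hugging Face transformers API as a stand-in for the on-device runtime. The model, prompt, temperature, and candidate count are all illustrative assumptions, and a single batched generate call stands in for the multi-stream execution graphs of the actual system.

```python
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM works the same way here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: What is 17 + 25? A:"
inputs = tokenizer(prompt, return_tensors="pt")

# Draw several candidates with non-zero temperature in one batched call;
# the demonstrated system would instead run candidates on parallel
# execution streams to hide the extra latency.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    num_return_sequences=4,
    max_new_tokens=16,
    pad_token_id=tokenizer.eos_token_id,
)

prompt_len = inputs["input_ids"].shape[1]
candidates = [
    tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
    for seq in outputs
]

# Majority voting as a lightweight, verifier-free selection rule:
# the answer produced most often wins.
best, votes = Counter(candidates).most_common(1)[0]
print(f"{votes}/{len(candidates)} votes for: {best}")
```

Majority voting avoids running a separate verification model, which is what makes it attractive under on-device resource constraints.
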
Specific challenge that we tackle (research/implementation-wise)

Sampling with non-zero temperature can yield higher-quality responses than greedy decoding, though this is not guaranteed. Getting the best output therefore typically requires generating multiple candidate responses and selecting the most suitable one for the user, a technique widely used to improve inference-time performance. When implemented on device, however, it presents two primary challenges: minimizing the latency of generating several responses, and finding a resource-efficient way to select the best response from the generated set.
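To make the role of temperature concrete, here is a minimal PyTorch sketch (with toy logits as an assumed input) of how zero versus non-zero temperature changes next-token selection:

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> int:
    """Choose the next token id from a vector of logits.

    temperature == 0 reduces to greedy decoding; temperature > 0 samples
    from the softened distribution, which is what produces the candidate
    diversity that makes response selection worthwhile.
    """
    if temperature == 0.0:
        return int(torch.argmax(logits))
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy logits over a four-token vocabulary with one dominant token.
logits = torch.tensor([2.0, 1.5, 0.3, -1.0])
print(sample_next_token(logits, temperature=0.0))                       # always 0
print([sample_next_token(logits, temperature=1.0) for _ in range(5)])   # varies across runs
```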