Spotlight Poster
Time-Reversal Provides Unsupervised Feedback to LLMs
Yerram Varun · Rahul Madhavan · Sravanti Addepalli · Arun Suggala · Karthikeyan Shanmugam · Prateek Jain
West Ballroom A-D #7205
Large Language Models (LLMs) show a remarkable ability to learn in-context from instructions and few-shot examples. However, for many tasks such as long-form answering, these models can be insufficiently conditioned on their queries. Benchmarks like AlpacaEval evaluate LLM abilities on general question answering. Given this context, we aim to address the question: Can we leverage unsupervised feedback, using only the pre-training data, to improve the generations of language models given additional inference-time compute? To this end, we introduce Time-Reversed Language Models (TRLMs). These models score and generate queries given responses, effectively functioning in the reverse direction compared to regular LLMs. Further, to infer effectively in the response-to-query direction, we pre-train a language model (TRLM-Ba) in the reverse token order. Notably, using this pre-training method and the response-to-query scoring direction, TRLM obtains up to a 5% gain in length-controlled win rate on the AlpacaEval Leaderboard over the baseline of best-of-N self-scoring with log-perplexity. We also show that TRLM scoring significantly outperforms conventional forward scoring on tasks such as citation generation and passage retrieval for short queries on popular datasets. In almost all cases, the TRLM-Ba variant with token-reversed pre-training dominates the other variants. Further, we leverage the generative ability of TRLM to augment safety filters on LLMs and demonstrate a drastic reduction in false negative rate without increasing the false positive rate when combined with the input filter on JailbreakBench.
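As a rough illustration of the reverse-scoring idea in the abstract, the sketch below contrasts best-of-N selection by forward self-scoring (log-probability of the response given the query) with TRLM-style reranking that scores the original query given each candidate response. The names `forward_lm`, `reverse_lm`, and `log_prob` are hypothetical placeholders for a standard LLM and a reverse (TRLM-style) scorer, not the paper's actual implementation or any real library API.

```python
# Minimal sketch of best-of-N reranking with reverse (response -> query) scoring.
# Assumption: `forward_lm` and `reverse_lm` each expose log_prob(target, context),
# returning the total log-probability of `target` conditioned on `context`.

from typing import List


def best_of_n_self_score(forward_lm, query: str, candidates: List[str]) -> str:
    """Baseline: pick the candidate with the highest forward log-probability
    (equivalently, lowest log-perplexity) of the response given the query."""
    return max(candidates, key=lambda r: forward_lm.log_prob(target=r, context=query))


def best_of_n_reverse_score(reverse_lm, query: str, candidates: List[str]) -> str:
    """TRLM-style reranking: pick the candidate under which the original query
    is most probable, i.e. score in the response-to-query direction."""
    return max(candidates, key=lambda r: reverse_lm.log_prob(target=query, context=r))


# Usage (assuming a base LLM that generates N candidate responses):
# candidates = [base_llm.generate(query) for _ in range(N)]
# best = best_of_n_reverse_score(reverse_lm, query, candidates)
```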