

Poster

Stealth edits to large language models

Oliver Sutton · Qinghua Zhou · Wei Wang · Desmond Higham · Alexander N Gorban · Alexander Bastounis · Ivan Tyukin

East Exhibit Hall A-C #4410
[ Project Page ]
Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We present a new, computationally efficient method for selectively editing large language models without retraining. The method offers a potential way forward for correcting hallucinations, but it also reveals a previously unrecognised vulnerability in many state-of-the-art families of large language models. At its heart is a mechanism that harnesses the inherent non-linearity and high dimensionality of the language model to directly control the selectivity of an edit. Surprisingly, this can be done without accessing the knowledge the model acquired through training, and without modifying the original training set. We also identify a fundamental metric that determines whether a specific language model can be edited: a separability-based notion of the intrinsic dimension of the model's feature space. Extensive experimental results illustrate and support the method and its theoretical underpinnings.
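To make the selectivity mechanism concrete, here is a minimal, hypothetical sketch (not the authors' exact construction): a single ReLU "detector" neuron whose weight vector is the normalised feature of the prompt to be edited. In high dimensions, random features are nearly orthogonal to any fixed trigger, so the neuron fires on the edited prompt and stays silent elsewhere. The names trigger and detector and the tolerance theta are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096       # hypothetical feature dimension
theta = 0.05   # hypothetical detector tolerance

# Trigger feature: stands in for the hidden state produced by the prompt
# we want to edit (random here, purely for illustration).
trigger = rng.standard_normal(d)
trigger /= np.linalg.norm(trigger)

def detector(x, trigger, theta):
    """Fires (> 0) only when the normalised feature x is within an angular
    tolerance ~theta of the trigger; the ReLU keeps it silent otherwise."""
    x = x / np.linalg.norm(x)
    return max(0.0, float(x @ trigger) - (1.0 - theta))

print(detector(trigger, trigger, theta) > 0)   # True: fires on the trigger

# Random unit vectors in dimension 4096 have inner product ~ N(0, 1/d)
# with the trigger, far below the 1 - theta threshold, so the detector
# essentially never fires on unrelated inputs.
others = rng.standard_normal((1000, d))
print(sum(detector(x, trigger, theta) > 0 for x in others))   # 0
```

Scaling the detector's output by a chosen value vector and adding it to a layer's output would then overwrite the model's response for the trigger prompt alone, which is why such an edit can remain stealthy.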
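The editability metric can likewise be illustrated with a crude estimator in the spirit of Fisher separability, the notion used in the stochastic separation literature this work builds on. The paper's exact definition may differ, and inverting an assumed 2^{-n} pairwise non-separability rate is a simplifying assumption here, so treat the numbers as indicative only.

```python
import numpy as np

def fisher_nonseparability(X):
    """Fraction of ordered pairs (i, j), i != j, where x_i is NOT
    Fisher-separable from x_j, i.e. <x_i, x_j> >= <x_i, x_i>."""
    G = X @ X.T                        # all pairwise inner products
    fails = G >= np.diag(G)[:, None]   # x_j beyond x_i's separating hyperplane
    np.fill_diagonal(fails, False)     # ignore self-pairs
    m = X.shape[0]
    return fails.sum() / (m * (m - 1))

def separability_dimension(features):
    """Crude effective-dimension estimate: centre the feature cloud, then
    invert an assumed 2^{-n} decay of the pairwise non-separability rate.
    (The Fisher condition is scale-invariant, so no rescaling is needed.)"""
    X = features - features.mean(axis=0)
    p = fisher_nonseparability(X)
    return -np.log2(p) if p > 0 else float("inf")

# Higher-dimensional feature clouds separate better, so the estimate grows
# with the true dimension of a Gaussian cloud (roughly, for small d).
rng = np.random.default_rng(0)
for d in (2, 8, 16):
    print(d, round(separability_dimension(rng.standard_normal((2000, d))), 1))
```

On this reading, models whose feature spaces have higher intrinsic dimension admit more selective edits, which is both what makes cheap corrections possible and what creates the vulnerability the abstract describes.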
