Poster in Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning

In-Context Learning behaves as a greedy layer-wise gradient descent algorithm

Brian Chen ⋅ Tianyang Hu ⋅ Hui Jin ⋅ Hwee Lee ⋅ Kenji Kawaguchi


Abstract

In-context learning (ICL) is a powerful capability of large language models that has emerged in recent years. Despite its impact, the exact mechanism behind ICL is still understood only to a very limited extent. In this paper, we suggest that ICL on a single linearized self-attention layer is equivalent to a single step of gradient descent with a specific dataset. This property is shown without the additional assumptions on the model parameters that other work in the field requires. We then extend our setting to a more realistic multi-layer framework and observe that in-context learning resembles using a greedy layer-wise algorithm to update the weights of a multi-layer large language model. Finally, we extend our theoretical conclusions to the autoregressive setting, noting that many other works comparing ICL to gradient descent are restricted to very specific settings that do not include a causal mask.
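To make the single-layer claim concrete, the following is a minimal sketch of the commonly cited toy case, not the paper's own construction (the abstract does not specify its dataset or parameterization). It assumes scalar-output linear regression, a zero weight initialization, and a softmax-free attention readout whose keys are the in-context inputs and whose values are the in-context labels. All function names and the `lr` parameter are hypothetical and chosen only for illustration.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict on x_query after one gradient step from w = 0 on
    L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2."""
    n = len(xs)
    dim = len(x_query)
    # At w = 0 the gradient is -(1/N) * sum_i y_i * x_i.
    grad = [-sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(dim)]
    w = [-lr * g for g in grad]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention readout (assumed form): keys x_i,
    values y_i, query x_query, scaled by lr / N."""
    n = len(xs)
    scores = [dot(x, x_query) for x in xs]
    return (lr / n) * sum(s * y for s, y in zip(scores, ys))


def demo(seed=0, n=8, dim=3, lr=0.1):
    """Compare both predictions on a random linear regression task."""
    rng = random.Random(seed)
    w_true = [rng.gauss(0, 1) for _ in range(dim)]
    xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    ys = [dot(w_true, x) for x in xs]
    x_query = [rng.gauss(0, 1) for _ in range(dim)]
    return (
        gd_one_step_prediction(xs, ys, x_query, lr),
        linear_attention_prediction(xs, ys, x_query, lr),
    )
```

Calling `demo()` returns two numbers that agree up to floating-point rounding, since both functions compute (lr / N) * sum_i y_i (x_i . x_query) in a different order. The multi-layer and causal-mask settings discussed in the abstract are not covered by this sketch.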
