
Workshop: Socially Responsible Language Modelling Research (SoLaR)

Learning Inner Monologue and Its Utilization in Vision-Language Challenges

Diji Yang · Kezhen Chen · Jinmeng Rao · Xiaoyuan Guo · Yawen Zhang · Jie Yang · Yi Zhang


Inner monologue is an essential phenomenon for reasoning and insight mining in human cognition. In this work, we propose a novel approach to simulating inner monologue. Specifically, we explore how inner-monologue reasoning can be used to solve complex vision-language problems. Driven by the power of Large Language Models (LLMs), two prominent methods for vision-language tasks have emerged: (1) hybrid integration of LLMs and Vision-Language Models (VLMs), where visual inputs are first converted into language descriptions by VLMs, which then serve as inputs for LLMs to generate the final answer(s); (2) visual feature alignment in language space, where visual inputs are encoded as embeddings and projected into the LLMs' language space via further supervised fine-tuning. The first approach offers low training cost and interpretability but is hard to optimize end to end. The second approach delivers decent performance, but feature alignment usually requires large amounts of training data and lacks interpretability. With inner monologue simulation, our approach achieves competitive performance with less training data and promising interpretability compared with state-of-the-art models on two popular tasks.
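The hybrid pipeline described in method (1) can be sketched as a two-stage function: a VLM captions the image, then an LLM answers from the text alone. The sketch below is illustrative only; `vlm_describe` and `llm_answer` are hypothetical stand-ins (assumptions, not the paper's implementation) so the control flow is runnable without model weights.

```python
def vlm_describe(image) -> str:
    """Stand-in for a Vision-Language Model that captions the visual input.
    A real system would call a captioning model here."""
    return "a dog catching a red frisbee in a park"


def llm_answer(prompt: str) -> str:
    """Stand-in for an LLM that answers from a text-only prompt."""
    return "It is catching a frisbee." if "frisbee" in prompt else "Unsure."


def hybrid_vqa(image, question: str) -> str:
    # Stage 1: convert the visual input into a language description (VLM).
    caption = vlm_describe(image)
    # Stage 2: feed the description plus the question to the LLM,
    # which produces the final answer. The intermediate caption is
    # human-readable, which is the source of interpretability; the two
    # stages are trained separately, which is why the pipeline is hard
    # to optimize end to end.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return llm_answer(prompt)


print(hybrid_vqa(image=None, question="What is the dog doing?"))
```

Because the only interface between the two stages is text, the caption can be logged and inspected, but gradients cannot flow from the LLM's answer back into the VLM.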
