Poster
Allegator: Alleviating Attention Bias for Visual-Informed Text Generation
Miso Choi · Jinyoung Kim · Minseo Yoon · Ji Soo Lee · Hyunwoo Kim
East Exhibit Hall A-C #1510
Large Vision-Language Models (LVLMs) have shown remarkable performance in describing visual information with impressive linguistic ability, powering diverse applications. However, they often generate inaccurate descriptions of visual information, a failure referred to as “hallucination”; resolving this issue remains important for deploying LVLMs in real-world scenarios. Although various approaches have been proposed in the literature, mitigating hallucination in long-form generation remains challenging. We observe the Attention Bias phenomenon in LVLMs, where the model allocates a large amount of attention to a few specific tokens regardless of the input. Through a thorough analysis of the correlation between Attention Bias and hallucination, we attribute the cause of hallucination to the internal attention mechanism of Transformers. To ALLEviate hallucination in text GenerATOR (ALLEGATOR), we propose the Attention Moderator, which efficiently refines attention during training, and Attention Soft-Clipping, which guarantees a stable attention distribution for generating visually grounded text. We empirically show that our methods enable more accurate descriptions by adaptively attending to visual inputs with sufficient attention. Allegator achieves significant improvements on hallucination benchmarks including POPE, CHAIR, and AMBER, and is especially effective in long-form generation.
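The abstract does not spell out the Soft-Clipping formulation, but the underlying idea of smoothly bounding attention logits so that no single token can monopolize the distribution can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the tanh bound and the clipping scale tau are assumptions made purely for illustration.

    import torch
    import torch.nn.functional as F

    def soft_clip_attention(scores: torch.Tensor, tau: float = 3.0) -> torch.Tensor:
        # Smoothly bound raw attention logits to (-tau, tau) with tanh
        # before the softmax, so extreme logits cannot dominate.
        # `tau` is a hypothetical hyperparameter; the paper's exact
        # formulation may differ.
        clipped = tau * torch.tanh(scores / tau)
        return F.softmax(clipped, dim=-1)

    # Toy example: one "bias" token receives an extreme logit.
    logits = torch.tensor([12.0, 0.5, 0.3, 0.2])
    print(F.softmax(logits, dim=-1))      # attention collapses onto token 0
    print(soft_clip_attention(logits))    # mass is spread more evenly

In this toy case the unclipped softmax puts essentially all probability on the first token, while the soft-clipped version keeps the ordering of the logits but leaves non-negligible attention for the remaining tokens, which matches the abstract's stated goal of a stable distribution that still attends sufficiently to visual inputs.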