Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state-of-the-art, i.e., 49.5% using textual explanations and 48.5% using automatically annotated regions.
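Below is a minimal PyTorch sketch of the kind of self-critical objective the abstract describes: the gradient-based sensitivity of the correct answer to the influential regions is encouraged to exceed that of each competitive wrong answer via a hinge penalty. This is an illustration under stated assumptions, not the authors' released implementation; the names `region_feats`, `infl_mask`, `rival_answers`, and both functions are hypothetical.

```python
# Sketch of a self-critical loss over gradient-based region sensitivities.
# Assumptions (hypothetical, not from the paper's code release):
#   logits        (B, A): answer scores produced from region_feats
#   region_feats  (B, R, D): per-region visual features, requires_grad=True
#   infl_mask     (B, R): 1 for influential regions (from explanations), else 0
#   gt_answer     (B,): ground-truth answer indices
#   rival_answers (B, K): top-K competitive wrong-answer indices
import torch
import torch.nn.functional as F

def answer_sensitivity(logits, region_feats, answer_idx):
    """Influence of each region on the selected answers' logits: gradient of
    the chosen logits w.r.t. region features, summed over feature dims."""
    selected = logits.gather(1, answer_idx.unsqueeze(1)).sum()
    grads, = torch.autograd.grad(selected, region_feats, create_graph=True)
    return grads.sum(dim=-1)  # (B, R)

def self_critical_loss(logits, region_feats, gt_answer, rival_answers, infl_mask):
    """Hinge penalty whenever a rival answer is more sensitive to the
    influential regions than the ground-truth answer."""
    gt_sens = (answer_sensitivity(logits, region_feats, gt_answer) * infl_mask).sum(1)
    loss = logits.new_zeros(())
    for k in range(rival_answers.size(1)):
        rival_sens = (answer_sensitivity(logits, region_feats,
                                         rival_answers[:, k]) * infl_mask).sum(1)
        loss = loss + F.relu(rival_sens - gt_sens).mean()
    return loss
```

In training, `region_feats` would need `requires_grad_(True)` before the forward pass so the sensitivities stay differentiable, and this term would be added to the standard VQA classification loss with a small weight.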
Author Information
Jialin Wu (University of Texas at Austin)
Raymond Mooney (University of Texas at Austin)
Related Events (a corresponding poster, oral, or spotlight)
- 2019 Spotlight: Self-Critical Reasoning for Robust Visual Question Answering
  Thu. Dec 12th, 12:50 -- 12:55 AM, Room West Ballroom A + B
More from the Same Authors
- 2022: Zero-shot Video Moment Retrieval With Off-the-Shelf Models
  Anuj Diwan · Puyuan Peng · Raymond Mooney
- 2022: Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks
  Albert Yu · Raymond Mooney
- 2022: Language-guided Task Adaptation for Imitation Learning
  Prasoon Goyal · Raymond Mooney · Scott Niekum
- 2019: Poster session
  Candace Ross · Yassine Mrabet · Sanjay Subramanian · Geoffrey Cideron · Jesse Mu · Suvrat Bhooshan · Eda Okur Kavil · Jean-Benoit Delbrouck · Yen-Ling Kuo · Nicolas Lair · Gabriel Ilharco · T.S. Jayram · Alba María Herrera Palacio · Chihiro Fujiyama · Olivier Tieleman · Anna Potapenko · Guan-Lin Chao · Thomas Sutter · Olga Kovaleva · Farley Lai · Xin Wang · Vasu Sharma · Catalina Cangea · Nikhil Krishnaswamy · Yuta Tsuboi · Alexander Kuhnle · Khanh Nguyen · Dian Yu · Homagni Saha · Jiannan Xiang · Vijay Venkataraman · Ankita Kalra · Ning Xie · Derek Doran · Travis Goodwin · Asim Kadav · Shabnam Daghaghi · Jason Baldridge · Jialin Wu · Jingxiang Lin · Unnat Jain
- 2018: Learning to Understand Natural Language Instructions through Human-Robot Dialog
  Raymond Mooney
- 2017: Panel Discussion
  Felix Hill · Olivier Pietquin · Jack Gallant · Raymond Mooney · Sanja Fidler · Chen Yu · Devi Parikh
- 2017: Visually Grounded Language: Past, Present, and Future...
  Raymond Mooney
- 2015: Generating Natural-Language Video Descriptions using LSTM Recurrent Neural Networks
  Raymond Mooney
- 2011 Workshop: Integrating Language and Vision
  Raymond Mooney · Trevor Darrell · Kate Saenko