Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer-based LLMs often ignore negations in natural language, and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted from labeled music videos in AudioSet (Gemmeke et al., 2017), with the goal of systematically evaluating whether multimodal transformers can perform complex reasoning to recognize new concepts as negations of previously learned concepts. We show that, with a standard fine-tuning approach, multimodal transformers are still incapable of correctly interpreting negation, irrespective of model size. However, our experiments demonstrate that augmenting the original training task distributions with negated QA examples allows the model to reliably reason with negation. To do this, we describe a novel data generation procedure that prompts the 540B-parameter PaLM model to automatically generate negated QA examples as compositions of easily accessible video tags. The generated examples contain more natural linguistic patterns, and the gains compared to a template-based task augmentation approach are significant.
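To make the idea of template-based task augmentation concrete, here is a minimal sketch of how negated QA examples might be composed from video tags. It is an illustration only: the tag vocabulary, the question templates, the `make_qa_examples` helper, and the video ID are all hypothetical and do not come from the paper, and the PaLM-prompted generation procedure is not reproduced here.

```python
from typing import Dict, List, Set

# Hypothetical tag vocabulary; the benchmark's actual AudioSet-derived
# tags are not listed in the abstract.
VOCAB = ["guitar", "piano", "drums", "singing"]


def make_qa_examples(video_id: str, tags: Set[str], vocab: List[str]) -> List[Dict[str, str]]:
    """Build one affirmative and one negated yes/no question per vocabulary tag."""
    examples = []
    for tag in vocab:
        present = tag in tags
        # Affirmative template (assumed wording).
        examples.append({
            "video_id": video_id,
            "question": f"Does this music video contain {tag}?",
            "answer": "yes" if present else "no",
        })
        # Negated template (assumed wording): the answer flips.
        examples.append({
            "video_id": video_id,
            "question": f"Is this a music video without {tag}?",
            "answer": "no" if present else "yes",
        })
    return examples


examples = make_qa_examples("vid_001", {"guitar", "singing"}, VOCAB)
```

Each tag yields a pair of questions whose answers are opposites, which is the property an augmented training set would need so that a model cannot answer correctly while ignoring the negation word.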
Author Information
Yue Li (Google Research)
NLPer with research interests in few-shot learning, domain adaptation, and joint music-language learning
Aren Jansen (Google, Inc.)
Qingqing Huang (MIT)
Ravi Ganti (Google)
Joonseok Lee (Google Research)
Joonseok Lee is a research engineer at Google Research. He works mainly on content-based video recommendation and multimodal video representation learning. He earned his Ph.D. in Computer Science from Georgia Institute of Technology in August 2015, under the supervision of Dr. Guy Lebanon and Prof. Hongyuan Zha. His thesis is about local approaches for collaborative filtering, with recommendation systems as the main application. He did three internships during his Ph.D., at Amazon (Summer 2014), Microsoft Research (Spring 2014), and Google (Summer 2013). Before coming to Georgia Tech, he worked at NHN Corp. in Korea (2007-2010). He received his B.S. degree in computer science and engineering from Seoul National University, Korea. His paper "Local Collaborative Ranking" received the best student paper award from the ACM WWW (2014) and IEEE ICDM (2016) conferences. He has co-organized the YouTube-8M Large-Scale Video Understanding Workshop as a program chair since 2017, and served as the publicity chair for the AISTATS 2015 conference. He has served as a program committee member for many conferences including NIPS, ICML, ICLR, AAAI, CVPR, I/ECCV, WSDM, and CIKM, and journals including JMLR, ACM TIST, and IEEE TKDE. More information is available on his website (http://www.joonseok.net).
Dima Kuzmin (Google)
More from the Same Authors
- 2021: Deep-DFT: Physics-ML hybrid method to predict DFT energy using Transformer
  Youngwoo Cho · Seunghoon Yi · Jaegul Choo · Joonseok Lee · Sookyung Kim
- 2023 Poster: Mr. Sum: Large-scale Video Summarization Dataset and Benchmark
  Jinhwan Sul · Jihoon Han · Joonseok Lee
- 2023 Poster: VisAlign: Dataset for Measuring the Degree of Alignment between AI and Humans in Visual Perception
  Jiyoung Lee · Seungho Kim · Seunghyun Won · Joonseok Lee · Marzyeh Ghassemi · James Thorne · Jaeseok Choi · O-Kil Kwon · Edward Choi
- 2021 Poster: Attention Bottlenecks for Multimodal Fusion
  Arsha Nagrani · Shan Yang · Anurag Arnab · Aren Jansen · Cordelia Schmid · Chen Sun
- 2017: Towards Learning Semantic Audio Representations from Unlabeled Data
  Aren Jansen
- 2017: Ravi Ganti (Walmart Labs) on Exploiting Structure in Large Scale Bandit Problems
  Ravi Ganti
- 2016 Demonstration: Content-based Related Video Recommendations
  Joonseok Lee
- 2015 Poster: Matrix Completion Under Monotonic Single Index Models
  Ravi Ganti · Laura Balzano · Rebecca Willett
- 2015 Poster: Super-Resolution Off the Grid
  Qingqing Huang · Sham Kakade
- 2015 Spotlight: Super-Resolution Off the Grid
  Qingqing Huang · Sham Kakade
- 2012 Poster: Automatic Feature Induction for Stagewise Collaborative Filtering
  Joonseok Lee · Mingxuan Sun · Seungyeon Kim · Guy Lebanon