Multimodal Learning and Reasoning for Visual Question Answering
Ilija Ilievski · Jiashi Feng

Tue Dec 5th 06:30 -- 10:30 PM @ Pacific Ballroom #80

Reasoning about entities and their relationships from multimodal data is a key goal of Artificial General Intelligence. The visual question answering (VQA) problem is an excellent test of an AI model's reasoning capabilities and its multimodal representation learning. However, current VQA models are over-simplified deep neural networks, comprising a long short-term memory (LSTM) unit for question comprehension and a convolutional neural network (CNN) that produces a single image representation. We argue that this single visual representation captures only limited, generic information about the image contents and thus constrains the model's reasoning capabilities. In this work we introduce a modular neural network model that learns a multimodal and multifaceted representation of the image and the question. The proposed model learns to use this multimodal representation to reason about the image entities and achieves new state-of-the-art performance on both VQA benchmark datasets, VQA v1.0 and v2.0, by a wide margin.
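The baseline pipeline the abstract critiques can be sketched in a few lines: a single CNN image vector is fused with an LSTM question vector (commonly by element-wise multiplication) and mapped to a distribution over candidate answers. This is an illustrative NumPy sketch, not the authors' proposed model; the encoders are stubbed with fixed random projections, and the dimensions and weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64          # shared embedding size (hypothetical)
N_ANSWERS = 10  # answer vocabulary size (hypothetical)

# Stand-ins for learned parameters; in a real model these are trained.
W_img = rng.standard_normal((2048, D)) * 0.01   # projects CNN image features
W_q   = rng.standard_normal((300, D)) * 0.01    # projects LSTM question state
W_out = rng.standard_normal((D, N_ANSWERS)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer_distribution(cnn_feature, lstm_state):
    v = np.tanh(cnn_feature @ W_img)   # single image embedding
    q = np.tanh(lstm_state @ W_q)      # question embedding
    fused = v * q                      # element-wise multimodal fusion
    return softmax(fused @ W_out)      # distribution over answers

img = rng.standard_normal(2048)  # e.g. a pooled CNN feature vector
qst = rng.standard_normal(300)   # e.g. the final LSTM hidden state
p = answer_distribution(img, qst)
```

The single fused vector `fused` is the bottleneck the abstract points to: all visual evidence must pass through one generic image embedding, which is what the proposed multifaceted representation is designed to avoid.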

Author Information

Ilija Ilievski (National University of Singapore)

Ilija is a machine learning researcher building holistic models of unstructured data from multiple modalities. His diverse, six-year experience as a machine learning researcher includes projects on combining satellite images and census data for complex city models, utilizing movie metadata and watch statistics for recommender systems, and fusing image and text representations for visual question answering. Currently Ilija is developing a unified model of financial data from multiple sources, applied to portfolio optimization.

Jiashi Feng (National University of Singapore)
