Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such a capability lie in generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark that provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?), to provide diagnostic analyses of spatial, temporal, and causal understanding of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show significant gaps between their performance and that of humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move onward with goal-oriented video understanding and reasoning.
Author Information
Baoxiong Jia (UCLA)
Ting Lei (Peking University)
Song-Chun Zhu (UCLA)
Siyuan Huang (University of California, Los Angeles)
More from the Same Authors
-
2021 : Theorem-Aware Geometry Problem Solving with Symbolic Reasoning and Theorem Prediction »
Pan Lu · Ran Gong · Shibiao Jiang · Liang Qiu · Siyuan Huang · Xiaodan Liang · Song-Chun Zhu -
2021 : Towards Diagram Understanding and Cognitive Reasoning in Icon Question Answering »
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu -
2022 Poster: HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes »
Zan Wang · Yixin Chen · Tengyu Liu · Yixin Zhu · Wei Liang · Siyuan Huang -
2022 Poster: Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning »
Yuanpei Chen · Tianhao Wu · Shengjie Wang · Xidong Feng · Jiechuan Jiang · Zongqing Lu · Stephen McAleer · Hao Dong · Song-Chun Zhu · Yaodong Yang -
2022 : Learn to Select Good Examples with Reinforcement Learning for Semi-structured Mathematical Reasoning »
Pan Lu · Liang Qiu · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Tanmay Rajpurohit · Peter Clark · Ashwin Kalyan -
2022 : Neural-Symbolic Recursive Machine for Systematic Generalization »
Qing Li · Yixin Zhu · Yitao Liang · Ying Nian Wu · Song-Chun Zhu · Siyuan Huang -
2023 : Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models »
Pan Lu · Baolin Peng · Hao Cheng · Michel Galley · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Jianfeng Gao -
2023 Poster: Learning Energy-Based Prior Model with Diffusion-Amortized MCMC »
Peiyu Yu · Yaxuan Zhu · Sirui Xie · Xiaojian (Shawn) Ma · Ruiqi Gao · Song-Chun Zhu · Ying Nian Wu -
2023 Poster: Learning non-Markovian Decision-Making from State-only Sequences »
Aoyang Qin · Feng Gao · Qing Li · Song-Chun Zhu · Sirui Xie -
2023 Poster: Evaluating and Inducing Personality in Pre-trained Language Models »
Guangyuan Jiang · Manjie Xu · Song-Chun Zhu · Wenjuan Han · Chi Zhang · Yixin Zhu -
2023 Poster: Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models »
Pan Lu · Baolin Peng · Hao Cheng · Michel Galley · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Jianfeng Gao -
2023 Poster: Diplomat: A Dialogue Dataset for Situated PragMATic Reasoning »
Hengli Li · Song-Chun Zhu · Zilong Zheng -
2023 Poster: ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab »
Jieming Cui · Ziren Gong · Baoxiong Jia · Siyuan Huang · Zilong Zheng · Jianzhu Ma · Yixin Zhu -
2022 Spotlight: Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning »
Yuanpei Chen · Tianhao Wu · Shengjie Wang · Xidong Feng · Jiechuan Jiang · Zongqing Lu · Stephen McAleer · Hao Dong · Song-Chun Zhu · Yaodong Yang -
2022 Poster: Emergent Graphical Conventions in a Visual Communication Game »
Shuwen Qiu · Sirui Xie · Lifeng Fan · Tao Gao · Jungseock Joo · Song-Chun Zhu · Yixin Zhu -
2022 Poster: MATE: Benchmarking Multi-Agent Reinforcement Learning in Distributed Target Coverage Control »
Xuehai Pan · Mickel Liu · Fangwei Zhong · Yaodong Yang · Song-Chun Zhu · Yizhou Wang -
2022 Poster: Learning Probabilistic Models from Generator Latent Spaces with Hat EBM »
Mitch Hill · Erik Nijkamp · Jonathan Mitchell · Bo Pang · Song-Chun Zhu -
2022 Poster: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering »
Pan Lu · Swaroop Mishra · Tanglin Xia · Liang Qiu · Kai-Wei Chang · Song-Chun Zhu · Oyvind Tafjord · Peter Clark · Ashwin Kalyan -
2020 Poster: Learning Latent Space Energy-Based Prior Model »
Bo Pang · Tian Han · Erik Nijkamp · Song-Chun Zhu · Ying Nian Wu -
2019 : Extended Poster Session »
Travis LaCroix · Marie Ossenkopf · Mina Lee · Nicole Fitzgerald · Daniela Mihai · Jonathon Hare · Ali Zaidi · Alexander Cowen-Rivers · Alana Marzoev · Eugene Kharitonov · Luyao Yuan · Tomasz Korbak · Paul Pu Liang · Yi Ren · Roberto Dessì · Peter Potash · Shangmin Guo · Tatsunori Hashimoto · Percy Liang · Julian Zubek · Zipeng Fu · Song-Chun Zhu · Adam Lerer -
2019 Poster: Learning Perceptual Inference by Contrasting »
Chi Zhang · Baoxiong Jia · Feng Gao · Yixin Zhu · HongJing Lu · Song-Chun Zhu -
2019 Spotlight: Learning Perceptual Inference by Contrasting »
Chi Zhang · Baoxiong Jia · Feng Gao · Yixin Zhu · HongJing Lu · Song-Chun Zhu -
2019 Poster: PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points »
Siyuan Huang · Yixin Chen · Tao Yuan · Siyuan Qi · Yixin Zhu · Song-Chun Zhu -
2019 Poster: Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model »
Erik Nijkamp · Mitch Hill · Song-Chun Zhu · Ying Nian Wu -
2018 Poster: Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation »
Siyuan Huang · Siyuan Qi · Yinxue Xiao · Yixin Zhu · Ying Nian Wu · Song-Chun Zhu -
2013 Poster: Unsupervised Structure Learning of Stochastic And-Or Grammars »
Kewei Tu · Maria Pavlovskaia · Song-Chun Zhu -
2011 Poster: Image Parsing with Stochastic Scene Grammar »
Yibiao Zhao · Song-Chun Zhu