Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 benchmark datasets with task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA and its variants, a family of mathematical reasoning models fine-tuned on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40%, indicating room for improvement in general mathematical reasoning and understanding.
Author Information
Swaroop Mishra (Arizona State University)
Matthew Finlayson (AI2)
Matthew Finlayson is a pre-doctoral investigator at the Allen Institute for AI. He completed his bachelor's degree in Computer Science and Linguistics at Harvard in 2021.
Pan Lu (UCLA; AI2)
Leonard Tang (Harvard University)
Sean Welleck (University of Washington)
Chitta Baral (Arizona State University)
Tanmay Rajpurohit (Georgia Institute of Technology)
Oyvind Tafjord (Allen Institute for AI)
Ashish Sabharwal (Allen Institute for AI)
Peter Clark (Allen Institute for AI)
Ashwin Kalyan (AI2)
More from the Same Authors
-
2021 : IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning »
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu -
2021 : Theorem-Aware Geometry Problem Solving with Symbolic Reasoning and Theorem Prediction »
Pan Lu · Ran Gong · Shibiao Jiang · Liang Qiu · Siyuan Huang · Xiaodan Liang · Song-Chun Zhu -
2021 : Towards Grounded Natural Language Proof Generation »
Sean Welleck · Jiacheng Liu · Yejin Choi -
2021 : Towards Diagram Understanding and Cognitive Reasoning in Icon Question Answering »
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu -
2021 : PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures »
Dan Hendrycks · Andy Zou · Mantas Mazeika · Leonard Tang · Dawn Song · Jacob Steinhardt -
2022 : Benchmarking Counterfactual Reasoning Abilities about Implicit Physical Properties »
Maitreya Patel · Tejas Gokhale · Chitta Baral · 'YZ' Yezhou Yang -
2022 : Estimating Numbers without Regression »
Avijit Thawani · Jay Pujara · Ashwin Kalyan -
2022 : Learn to Select Good Examples with Reinforcement Learning for Semi-structured Mathematical Reasoning »
Pan Lu · Liang Qiu · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Tanmay Rajpurohit · Peter Clark · Ashwin Kalyan -
2022 : Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs »
Albert Jiang · Sean Welleck · Jin Peng Zhou · Timothee Lacroix · Jiacheng Liu · Wenda Li · Mateja Jamnik · Guillaume Lample · Yuhuai Wu -
2022 : ContextNER: Contextual Phrase Generation at Scale »
Himanshu Gupta · Shreyas Verma · Tarun Kumar · Swaroop Mishra · Tamanna Agrawal · Amogh Badugu · Himanshu Bhatt -
2022 : Towards Systematic Reasoning with Language Models »
Peter Clark -
2022 Workshop: MATH-AI: Toward Human-Level Mathematical Reasoning »
Pan Lu · Swaroop Mishra · Sean Welleck · Yuhuai Wu · Hannaneh Hajishirzi · Percy Liang -
2022 Poster: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering »
Pan Lu · Swaroop Mishra · Tanglin Xia · Liang Qiu · Kai-Wei Chang · Song-Chun Zhu · Oyvind Tafjord · Peter Clark · Ashwin Kalyan -
2021 : Poster Session 1 »
Jiaqi Chen · Tanglin Xia · Sean Welleck · Jiacheng Liu · Ran Gong · Shifeng Huang · Wei Yu · Tracy Jia Shen -
2021 Workshop: Math AI for Education (MATHAI4ED): Bridging the Gap Between Research and Smart Education »
Pan Lu · Yuhuai Wu · Sean Welleck · Xiaodan Liang · Eric Xing · James McClelland -
2020 : VAIDA: An Educative Benchmark Creation Paradigm using Visual Analytics for Interactively Discouraging Artifacts (by Anjana Arunkumar, Swaroop Mishra, Bhavdeep Sachdeva, Chitta Baral and Chris Bryan) »
Anjana Arunkumar · Swaroop Mishra · Chitta Baral -
2020 Poster: Belief Propagation Neural Networks »
Jonathan Kuck · Shuvam Chakraborty · Hao Tang · Rachel Luo · Jiaming Song · Ashish Sabharwal · Stefano Ermon -
2020 Poster: Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge »
Alon Talmor · Oyvind Tafjord · Peter Clark · Yoav Goldberg · Jonathan Berant -
2020 Spotlight: Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge »
Alon Talmor · Oyvind Tafjord · Peter Clark · Yoav Goldberg · Jonathan Berant -
2020 Poster: Language-Conditioned Imitation Learning for Robot Manipulation Tasks »
Simon Stepputtis · Joseph Campbell · Mariano Phielipp · Stefan Lee · Chitta Baral · Heni Ben Amor -
2020 Spotlight: Language-Conditioned Imitation Learning for Robot Manipulation Tasks »
Simon Stepputtis · Joseph Campbell · Mariano Phielipp · Stefan Lee · Chitta Baral · Heni Ben Amor -
2019 Poster: Approximating the Permanent by Sampling from Adaptive Partitions »
Jonathan Kuck · Tri Dao · Hamid Rezatofighi · Ashish Sabharwal · Stefano Ermon -
2018 Poster: Expanding Holographic Embeddings for Knowledge Completion »
Yexiang Xue · Yang Yuan · Zhitian Xu · Ashish Sabharwal -
2016 Poster: Adaptive Concentration Inequalities for Sequential Decision Problems »
Shengjia Zhao · Enze Zhou · Ashish Sabharwal · Stefano Ermon -
2014 Workshop: 4th Workshop on Automated Knowledge Base Construction (AKBC) »
Sameer Singh · Fabian M Suchanek · Sebastian Riedel · Partha Pratim Talukdar · Kevin Murphy · Christopher Ré · William Cohen · Tom Mitchell · Andrew McCallum · Jason E Weston · Ramanathan Guha · Boyan Onyshkevych · Hoifung Poon · Oren Etzioni · Ari Kobren · Arvind Neelakantan · Peter Clark