Workshop
MATH-AI: Toward Human-Level Mathematical Reasoning
Pan Lu · Swaroop Mishra · Sean Welleck · Yuhuai Wu · Hannaneh Hajishirzi · Percy Liang

Sat Dec 03 06:55 AM -- 03:00 PM (PST) @ Room 293 - 294

Mathematical reasoning is a unique aspect of human intelligence and a fundamental building block for scientific and intellectual pursuits. However, learning mathematics is often a challenging human endeavor that relies on expert instructors to create, teach and evaluate mathematical material. From an educational perspective, AI systems that aid in this process offer increased inclusion and accessibility, efficiency, and understanding of mathematics. Moreover, building systems capable of understanding, creating, and using mathematics offers a unique setting for studying reasoning in AI. This workshop will investigate the intersection of mathematics education and AI.

Sat 6:55 a.m. - 7:00 a.m. Introduction and Opening Remarks (Opening Remarks)

Sat 7:00 a.m. - 7:30 a.m. Reasoning and Abstraction as Challenges for AI (Invited Talk)
Cezary Kaliszyk

Sat 7:30 a.m. - 8:00 a.m. Length Generalization in Quantitative Reasoning (Invited Talk)
Behnam Neyshabur

Sat 8:00 a.m. - 8:30 a.m. Has Progress on Math been Surprising? (Invited Talk)
In 2021, we commissioned forecasters to predict progress on ML benchmarks, including the MATH dataset for mathematical problem-solving. Progress on MATH ended up being much faster than predicted. I'll discuss what we should and shouldn't take away from this, my own predictions for future progress, and general implications for predicting future developments in ML.
Jacob Steinhardt

Sat 8:30 a.m. - 10:00 a.m. Poster Session

Sat 10:00 a.m. - 11:00 a.m. Lunch Break (Break)

Sat 11:00 a.m. - 11:20 a.m. Teaching Algorithmic Reasoning via In-context Learning (Contributed Talk)
Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. (2022) showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition) and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as algorithmic prompting.
We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x and 2x respectively compared to the best available baselines.
Hattie Zhou · Azade Nova · Aaron Courville · Hugo Larochelle · Behnam Neyshabur · Hanie Sedghi

Sat 11:20 a.m. - 11:40 a.m. Solving Math Word Problems with Process-based and Outcome-based Feedback (Contributed Talk)
Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the finetuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% → 12.7% final-answer error and from 14.0% → 3.4% reasoning error among final-answer-correct solutions.
Jonathan Uesato · Nate Kushman · Ramana Kumar · H. Francis Song · Noah Siegel · Lisa Wang · Antonia Creswell · Geoffrey Irving · Irina Higgins

Sat 11:40 a.m. - 12:00 p.m. ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems (Contributed Talk)
Zhangir Azerbayev · Bartosz Piotrowski · Jeremy Avigad

Sat 12:00 p.m. - 12:30 p.m. Towards Systematic Reasoning with Language Models (Invited Talk)
Mathematics requires systematic reasoning, namely the step-wise application of knowledge in a sound manner to reach a conclusion. Can language models (LMs) perform this kind of systematic reasoning with knowledge provided to them? Or, even more ambitiously, can LMs reason systematically with their own internal knowledge acquired during pretraining? In this talk, I'll attempt to answer these questions, illustrated with our recent work on using LMs for logical deduction, proof generation, and multistep textual entailment problems. While progress has been made, there is still a way to go. To illustrate this, I'll conclude by posing a (currently unsolved) grand challenge - answering Fermi problems - to the math reasoning community, requiring combining systematic reasoning, mathematics, and world knowledge together.
Peter Clark

Sat 12:30 p.m. - 1:00 p.m. Coffee Break (Break)

Sat 1:00 p.m. - 1:30 p.m. Leveraging Maths to Understand Transformers (Invited Talk)
Francois Charton

Sat 1:30 p.m. - 2:00 p.m. Learning Mathematical Reasoning for Education (Invited Talk)
Noah Goodman

Sat 2:00 p.m. - 2:55 p.m. MATH-AI: Toward Human-Level Mathematical Reasoning (Discussion Panel)
Francois Charton · Noah Goodman · Behnam Neyshabur · Talia Ringer · Daniel Selsam

Sat 2:55 p.m. - 3:00 p.m. Closing Remarks

- Neural Combinatorial Logic Circuit Synthesis from Input-Output Examples (Poster)
We propose a novel, fully explainable neural approach to synthesis of combinatorial logic circuits from input-output examples. The carrying advantage of our method is that it readily extends to inductive scenarios, where the set of examples is incomplete but still indicative of the desired behaviour.
Our method can be employed for a virtually arbitrary choice of atoms - from logic gates to FPGA blocks - as long as they can be formulated in a differentiable fashion, and consistently yields good results for synthesis of practical circuits of increasing size. In particular, we succeed in learning a number of arithmetic, bitwise, and signal-routing operations, and even generalise towards the correct behaviour in inductive scenarios. Our method, attacking a discrete logical synthesis problem with an explainable neural approach, hints at a wider promise for synthesis and reasoning-related tasks.
Peter Belcak · Roger Wattenhofer

- Automatic Generation of Socratic Questions for Learning to Solve Math Word Problems (Poster)
Socratic questioning is an educational method that allows students to discover answers to complex problems by asking them a series of thoughtful questions. Generation of didactically sound questions is challenging, requiring an understanding of the reasoning process involved in the problem. We hypothesize that such a questioning strategy can not only enhance human performance but also assist math word problem (MWP) solvers. In this work, we explore the ability of large language models (LMs) in generating sequential questions for guiding math word problem-solving. We propose various guided question generation schemes based on input conditioning and reinforcement learning. On both automatic and human quality evaluations, we find that LMs constrained with desirable question properties generate superior questions and improve the overall performance of a math word problem solver.
Kumar Shridhar · Jakub Macina · Menna El-Assady · Tanmay Sinha · Mrinmaya Sachan

- Generating Reflexive Polytopes via Sequence Modeling (Poster)
We train neural network sequence models to generate reflexive lattice polytopes. We demonstrate that they can generate mathematical objects satisfying various geometric properties.
We use the completeness of our datasets to give evidence that the models are understanding some underlying structure of the data.
Bernt Ivar Utstøl Nødland

- A Causal Framework to Quantify Robustness of Mathematical Reasoning with Language Models (Poster)
We have recently witnessed a number of impressive results on hard mathematical reasoning problems with large language models (LLMs). At the same time, the robustness of these models has also been called into question. Building on the idea of behavioral testing, we propose a novel framework, which pins down the causal effect of each factor in the input, e.g., the surface form of the problem text, the operands, and math operators, on the output. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of LLMs in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework on a test bed of bivariate math word problems. Our analysis shows that robustness does not appear to continuously improve as a function of scale, but that the recent LLM, GPT-3-Instruct (175B), achieves a dramatic improvement in both robustness and sensitivity, compared to all other GPT variants.
Alessandro Stolfo · Zhijing Jin · Kumar Shridhar · Bernhard Schölkopf · Mrinmaya Sachan

- What is my math transformer doing? Three results on interpretability and generalization (Poster)
We investigate the failure cases and out-of-distribution behavior of transformers trained on matrix inversion, eigen decomposition and eigenvalue calculation. We show that incorrect model predictions still retain deep mathematical properties of the solution (e.g. correct eigenvalues, unit norm of eigenvectors), and that almost all model failures can be attributed to, and predicted from, properties of the problem or solution.
This demonstrates that, when in doubt, math transformers do not hallucinate crazy solutions (as was sometimes proposed) but remain "roughly right". We also show that the careful choice of a training dataset can accelerate training, while allowing the model to generalize way out of its training distribution, invalidating the idea that transformers "merely interpolate" from memorized examples.
Francois Charton

- Learning to Understand Plane Geometry Diagram (Poster)
Geometry diagram parsing plays a key role in geometry problem solving, wherein the primitive extraction and relation parsing remain challenging due to the complex layout and between-primitive relationship. In this paper, we propose a powerful diagram parser based on deep learning and graph reasoning. Specifically, a modified instance segmentation method is proposed to extract geometric primitives, and the graph neural network (GNN) is leveraged to realize relation parsing and primitive classification incorporating geometric features and prior knowledge. All the modules are integrated into an end-to-end model called PGDPNet to perform all the sub-tasks simultaneously. In addition, we build a new large-scale geometry diagram dataset named PGDP5K with primitive level annotations. Experiments on PGDP5K and an existing dataset IMP-Geometry3K show that our model outperforms state-of-the-art methods in four sub-tasks remarkably. The full version of this paper has been accepted by IJCAI 2022.
Mlingliang Zhang · Fei Yin · Yihan Hao · Cheng-lin Liu

- Lemma: Bootstrapping High-Level Mathematical Reasoning with Learned Symbolic Abstractions (Poster)
Humans tame the complexity of mathematical reasoning by developing hierarchies of abstractions. With proper abstractions, solutions to hard problems can be expressed concisely, thus making them more likely to be found. In this paper, we propose Learning Mathematical Abstractions (LEMMA): an algorithm that implements this idea for reinforcement learning agents in mathematical domains. LEMMA augments Expert Iteration with an abstraction step, where solutions found so far are revisited and rewritten in terms of new higher-level actions, which then become available to solve new problems. We evaluate LEMMA on two mathematical reasoning tasks (equation solving and fraction simplification) in a step-by-step fashion. In these two domains, LEMMA improves the ability of an existing agent, both solving more problems and generalizing more effectively to harder problems than those seen during training.
Zhening Li · Gabriel Poesia Reis e Silva · Omar Costilla Reyes · Noah Goodman · Armando Solar-Lezama

- MWP-BERT: A Numeracy-augmented Pre-trained Encoder for Math Word Problems (Poster)
Math word problem (MWP) solving faces a dilemma in number representation learning. In order to avoid the number representation issue and reduce the search space of feasible solutions, existing works striving for MWP solving usually replace real numbers with symbolic placeholders to focus on logic reasoning. However, instead of the number value itself, it is the reusable numerical property that matters more in numerical reasoning. Therefore, we argue that injecting numerical properties into symbolic placeholders with contextualized representation learning schema can provide a way out of the dilemma in the number representation issue here.
In this work, we introduce this idea to the popular pre-training language model (PLM) techniques and build MWP-BERT, an effective contextual number representation PLM. We demonstrate the effectiveness of our MWP-BERT on MWP solving and several MWP-specific understanding tasks on both English and Chinese benchmarks.
Zhenwen Liang · Jipeng ZHANG · Lei Wang · Wei QIN · Jie Shao · Xiangliang Zhang

- Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems (Poster)
Recent language models have struggled to generalize to a large range of numbers in numerical reasoning. In this paper, we propose a novel method that leverages simple numbers as anchors to characterize the implicitly inferred arithmetic expressions from language models, and then explicitly applies the expressions to original numbers to get the answers. Experimental results on several numerical reasoning benchmarks demonstrate that our approach is highly effective. More importantly, our approach works in the inference phase without extra model training, making it highly portable and achieving significant and consistent performance benefits across a variety of language models in zero-shot, few-shot, and fine-tuning scenarios.
Fan Zhou · Haoyu Dong · Qian Liu · Zhoujun Cheng · Shi Han · Dongmei Zhang

- EuclidNet: Deep Visual Reasoning for Constructible Problems in Geometry (Poster)
In this paper, we present a deep learning-based framework for solving geometric construction problems through visual reasoning, which is useful for automated geometry theorem proving. Constructible problems in geometry often ask for the sequence of straightedge-and-compass constructions to construct a given goal given some initial setup.
Our EuclidNet framework leverages the neural network architecture Mask R-CNN to extract the visual features from the initial setup and goal configuration with extra points of intersection, and then generate possible construction steps as intermediary data models that are used as feedback in the training process for further refinement of the construction step sequence. This process is repeated recursively until either a solution is found, in which case we backtrack the path for a step-by-step construction guide, or the problem is identified as unsolvable. Our EuclidNet framework is validated on complex Japanese Sangaku geometry problems, demonstrating its capacity to leverage backtracking for deep visual reasoning of challenging problems.
Man Fai Wong · Xintong Qi · Chee-Wei Tan

- Estimating Numbers without Regression (Poster)
Despite recent successes in language models, their ability to represent numbers is insufficient. Humans conceptualize numbers based on their magnitudes, effectively projecting them on a number line; whereas subword tokenization fails to explicitly capture magnitude by splitting numbers into arbitrary chunks. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods change either the (1) notation in which numbers are written (e.g., scientific vs. decimal), the (2) vocabulary used to represent numbers or the entire (3) architecture of the underlying language model, to directly regress to a desired number. In this work, we show that a potential trade-off to the more complex architectural changes is to simply change the model's vocabulary instead, e.g., introduce a new token for numbers in range 10-100.
In the context of masked number prediction, we find that a carefully designed tokenization scheme is both the simplest to implement and sufficient, i.e., with similar performance to the state-of-the-art approach that requires making significant architectural changes. Finally, we evaluate the various number representation schemes on the downstream task of numerical fact estimation (for Fermi Problems) in a zero-shot setting and find similar trends, i.e., changes at the tokenization level achieve near state-of-the-art results while requiring minimal resources compared to other number representation schemes.
Avijit Thawani · Jay Pujara · Ashwin Kalyan

- Learn to Select Good Examples with Reinforcement Learning for Semi-structured Mathematical Reasoning (Poster)
Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if models can handle more complex problems that involve heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain problems that require mathematical reasoning on both textual and tabular data, where each question is aligned with a tabular context. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This issue is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select good in-context examples from a small amount of training data.
Experimental results show that our method outperforms the best baseline by 5.31% in accuracy and reduces the prediction variance significantly compared to random selection.
Pan Lu · Liang Qiu · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Tanmay Rajpurohit · Peter Clark · Ashwin Kalyan

- Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs (Poster)
The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.
Albert Jiang · Sean Welleck · Jin Peng Zhou · Timothee Lacroix · Jiacheng Liu · Wenda Li · Mateja Jamnik · Guillaume Lample · Yuhuai Wu

- Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic (Poster)
Through their transfer learning abilities, highly-parameterized large pre-trained language models have dominated the NLP landscape for a multitude of downstream language tasks.
Though linguistically proficient, the inability of these models to incorporate the learning of non-linguistic entities (numerals and arithmetic reasoning) limits their usage for tasks that require numeric comprehension or strict mathematical reasoning. However, as we illustrate in this paper, building a general purpose language model that also happens to be proficient in mathematical reasoning is not as straightforward as training it on a numeric dataset. In this work, we develop a novel framework that enables language models to be mathematically proficient while retaining their linguistic prowess. Specifically, we offer information-theoretic interventions to overcome the catastrophic forgetting of linguistic skills that occurs while injecting non-linguistic skills into language models.
Mandar Sharma · Nikhil Muralidhar · Naren Ramakrishnan

- Teaching Algorithmic Reasoning via In-context Learning (Poster)
Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. (2022) showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition) and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as algorithmic prompting. We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques.
In particular, for long parity, addition, multiplication and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x and 2x respectively compared to the best available baselines.
Hattie Zhou · Azade Nova · Aaron Courville · Hugo Larochelle · Behnam Neyshabur · Hanie Sedghi

- Broken Neural Scaling Laws (Poster)
We present a smoothly broken power law functional form that accurately models the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision and unsupervised language tasks, arithmetic, and reinforcement learning. This functional form yields extrapolations of scaling behavior that often are an order of magnitude more accurate than the ones obtained by other functional forms for neural scaling behavior. Moreover, this functional form accurately models the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior.
Ethan Caballero · Kshitij Gupta · Irina Rish · David Krueger

- Towards automating formalisation of theorem statements using large language models (Poster)
Mathematics formalisation is the task of translating mathematics (i.e., definitions, theorem statements, proofs) written in natural language, as found in books and papers, into a formal language that can then be checked for correctness by a program. It is a thriving activity today; however, formalisation remains cumbersome.
In this paper, we explore the abilities of a large language model (Codex) to help with formalisation in the Lean theorem prover. We find that with careful input-dependent prompt selection and postprocessing, Codex is able to formalise short mathematical statements at undergraduate level with nearly 75% accuracy for 120 theorem statements.
Siddhartha Gadgil · Anand Tadipatri · Navin Goyal · Ayush Agrawal · Ashvni Narayanan

- Graph neural networks for Ramsey graphs (Poster)
Ramsey-like problems are ubiquitous in extremal combinatorics and occupy a central place in the field. In simple terms, Ramsey theory wishes to find the minimum size of a large graph structure such that some sought substructure - generally a clique or an independent set - is guaranteed to exist. Due to considerations of computational complexity, brute force approaches to solving these problems are usually not very feasible, as the substructures cannot be checked in polynomial time. At the same time, we seek extremal graphs that completely avoid such substructures to better understand the graph theory governing their occurrence. We investigate the feasibility of Graph Neural Networks (GNNs) in terms of indicating and refining search procedures for finding these special classes of Ramsey-extremal graphs, which are of interest to mathematicians.
Amur Ghose · Amit Levi · Yingxueff Zhang

- Improving Compositional Generalization in Math Word Problem Solving (Poster)
Compositional generalization refers to a model's capability to generalize to newly composed input data based on the data components observed during training. It has triggered a series of compositional generalization analyses on different tasks, as generalization is an important aspect of language and problem solving skills. However, similar discussion of math word problems (MWPs) is limited. In this manuscript, we study compositional generalization in MWP solving.
Specifically, we first introduce a data splitting method to create compositional splits from existing MWP datasets. Meanwhile, we synthesize data to isolate the effect of compositions. To improve the compositional generalization in MWP solving, we propose an iterative data augmentation method that incorporates diverse compositional variation into training data and could collaborate with MWP methods. During the evaluation, we examine a set of methods and find all of them encounter severe performance loss on the evaluated datasets. We also find our data augmentation method could significantly improve the compositional generalization of general MWP methods.
Yunshi Lan · Lei Wang · Jing Jiang · Ee-peng Lim

- ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems (Poster)
We introduce ProofNet, a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The ProofNet benchmark consists of 297 theorem statements expressed in both natural language and the Lean 3 theorem prover, 100 of which are also accompanied by natural language proofs. The problems are primarily drawn from popular undergraduate pure mathematics textbooks, and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology. We intend for ProofNet to be a challenging benchmark that will drive progress in autoformalization and automatic theorem proving. We report baseline results on the autoformalization of statements using few-shot learning with large language models.
Zhangir Azerbayev · Bartosz Piotrowski · Jeremy Avigad

- Learning to Reason With Relational Abstractions (Poster)
Large language models have recently shown promising progress in mathematical reasoning when fine-tuned with human-generated sequences walking through a sequence of solution steps.
However, the solution sequences are not formally structured and the resulting model-generated sequences may not reflect the kind of systematic reasoning we might expect an expert human to produce. In this paper, we study how to build stronger reasoning capability in language models using the idea of relational abstractions. We introduce new types of sequences that more explicitly provide an abstract characterization of the transitions through intermediate solution steps to the goal state. We find that models that are supplied with such sequences as prompts can solve tasks with a significantly higher accuracy, and models that are trained to produce such sequences solve problems better than those that are trained with previously used human-generated sequences and other baselines. Our work thus takes several steps toward elucidating and improving how language models perform on tasks requiring multi-step mathematical reasoning.
Andrew Nam · James McClelland · Mengye Ren · Chelsea Finn

- Out-of-Distribution Generalization in Algorithmic Reasoning Through Curriculum Learning (Poster)
Out-of-distribution generalization (OODG) is a longstanding challenge for neural networks, and is quite apparent in tasks with well-defined variables and rules, where explicit use of the rules can solve problems independently of the particular values of the variables. Large transformer-based language models have pushed the boundaries on how well neural networks can generalize to novel inputs, but their complexity obfuscates how they achieve such robustness. As a step toward understanding how transformer-based systems generalize, we explore the question of OODG in smaller scale transformers. Using a reasoning task based on the puzzle Sudoku, we show that OODG can occur on complex problems if the training set includes examples sampled from the whole distribution of simpler component tasks.
Andrew Nam · Mustafa Abdool · Trevor Maxfield · James McClelland

- On the Abilities of Mathematical Extrapolation with Implicit Models (Poster)
Deep neural networks excel on a variety of different tasks, often surpassing human intelligence. However, when presented with out-of-distribution data, these models tend to break down even on the simplest tasks. In this paper, we compare implicitly-defined and classical deep learning models on a series of mathematical extrapolation tasks, where the models are tested with out-of-distribution samples during inference time. Throughout our experiments, implicit models greatly outperform classical deep learning networks that overfit the training distribution. We showcase implicit models' unique advantages for extrapolation thanks to their flexible and selective framework. Thanks to their potentially unlimited depth, implicit models not only adapt well to out-of-distribution inputs but also understand the underlying structure of inputs much better.
Alicia Tsai · Juliette Decugis · Ashwin Ganesh · Max Emerling · Laurent El Ghaoui

- Program Synthesis for Integer Sequence Generation (Poster)
Recent advances in program synthesis have shown success with methods that employ deep learning on synthetic data generated from domain specific languages (DSLs). In this work, we propose an algorithm for program synthesis that extends these methods. It uses transfer learning from pre-trained language models, and employs a policy improvement operator based on policy-guided search. This hybrid approach combats the challenges of searching a large language space with sparse rewards. We show its effectiveness on the task of integer sequence generation, a special case of programming-by-examples with fixed inputs. Our preliminary results demonstrate that the inclusion of policy-guided search leads to a 1.6% increase in the number of correct programs compared to supervised baselines.
Natasha Butt · Auke Wiggers · Taco Cohen · Max Welling

- LILA: A Unified Benchmark for Mathematical Reasoning (Poster)
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA and its variants, a family of mathematical reasoning models fine-tuned on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs single-task models), while the best performing model only obtains 60.40%, indicating the room for improvement in general mathematical reasoning and understanding.
Swaroop Mishra · Matthew Finlayson · Pan Lu · Leonard Tang · Sean Welleck · Chitta Baral · Tanmay Rajpurohit · Oyvind Tafjord · Ashish Sabharwal · Peter Clark · Ashwin Kalyan

- Solving Math Word Problems with Process-based and Outcome-based Feedback (Poster)
Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks.
When moving beyond prompting, this raises the question of how we should supervise the finetuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% to 12.7% final-answer error and from 14.0% to 3.4% reasoning error among final-answer-correct solutions.
Jonathan Uesato · Nate Kushman · Ramana Kumar · H. Francis Song · Noah Siegel · Lisa Wang · Antonia Creswell · Geoffrey Irving · Irina Higgins
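
The two metrics quoted in the abstract above (final-answer error, and reasoning error among final-answer-correct solutions) can be illustrated with a minimal sketch. The `GradedSolution` record, the function names, and the sample data are hypothetical and chosen only for illustration; they are not taken from the paper, and the way steps are labeled here is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GradedSolution:
    """Hypothetical record for one model solution to a word problem."""
    final_answer_correct: bool  # outcome label: does the final answer match?
    steps_correct: List[bool]   # process labels: is each reasoning step valid?


def final_answer_error(solutions: List[GradedSolution]) -> float:
    """Fraction of solutions whose final answer is wrong."""
    wrong = sum(not s.final_answer_correct for s in solutions)
    return wrong / len(solutions)


def reasoning_error_among_correct(solutions: List[GradedSolution]) -> float:
    """Fraction of final-answer-correct solutions with at least one bad step."""
    correct = [s for s in solutions if s.final_answer_correct]
    if not correct:
        return 0.0
    flawed = sum(not all(s.steps_correct) for s in correct)
    return flawed / len(correct)


# Made-up graded solutions (illustrative only, not data from the paper).
samples = [
    GradedSolution(True, [True, True]),
    GradedSolution(True, [True, False, True]),  # right answer, flawed reasoning
    GradedSolution(False, [True, False]),
    GradedSolution(True, [True]),
]

print(final_answer_error(samples))
print(reasoning_error_among_correct(samples))
```

The second sample shows why the two numbers can diverge: an outcome-only check counts it as correct, while step-level labels flag its reasoning.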