Mathematical reasoning is a unique aspect of human intelligence and a fundamental building block for scientific and intellectual pursuits. However, learning mathematics is often a challenging human endeavor that relies on expert instructors to create, teach and evaluate mathematical material. From an educational perspective, AI systems that aid in this process offer increased inclusion and accessibility, efficiency, and understanding of mathematics. Moreover, building systems capable of understanding, creating, and using mathematics offers a unique setting for studying reasoning in AI. This workshop will investigate the intersection of mathematics education and AI.
Sat 6:55 a.m. - 7:00 a.m.

Introduction and Opening Remarks
(Opening Remarks)

Sat 7:00 a.m. - 7:30 a.m.

Reasoning and Abstraction as Challenges for AI
(Invited Talk)
Cezary Kaliszyk

Sat 7:30 a.m. - 8:00 a.m.

Length Generalization in Quantitative Reasoning
(Invited Talk)
Behnam Neyshabur

Sat 8:00 a.m. - 8:30 a.m.

Has Progress on Math been Surprising?
(Invited Talk)
In 2021, we commissioned forecasters to predict progress on ML benchmarks, including the MATH dataset for mathematical problem-solving. Progress on MATH ended up being much faster than predicted. I'll discuss what we should and shouldn't take away from this, my own predictions for future progress, and general implications for predicting future developments in ML.
Jacob Steinhardt

Sat 8:30 a.m. - 10:00 a.m.

Poster Session

Sat 10:00 a.m. - 11:00 a.m.

Lunch Break
(Break)

Sat 11:00 a.m. - 11:20 a.m.

Teaching Algorithmic Reasoning via In-context Learning
(Contributed Talk)
Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. (2022) showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition) and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as algorithmic prompting. We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x and 2x respectively compared to the best available baselines.
Hattie Zhou · Azade Nova · Aaron Courville · Hugo Larochelle · Behnam Neyshabur · Hanie Sedghi

Sat 11:20 a.m. - 11:40 a.m.

Solving Math Word Problems with Process-based and Outcome-based Feedback
(Contributed Talk)
Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the fine-tuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% → 12.7% final-answer error and from 14.0% → 3.4% reasoning error among final-answer-correct solutions.
Jonathan Uesato · Nate Kushman · Ramana Kumar · H. Francis Song · Noah Siegel · Lisa Wang · Antonia Creswell · Geoffrey Irving · Irina Higgins

Sat 11:40 a.m. - 12:00 p.m.

ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems
(Contributed Talk)
Zhangir Azerbayev · Bartosz Piotrowski · Jeremy Avigad

Sat 12:00 p.m. - 12:30 p.m.

Towards Systematic Reasoning with Language Models
(Invited Talk)
Mathematics requires systematic reasoning, namely the stepwise application of knowledge in a sound manner to reach a conclusion. Can language models (LMs) perform this kind of systematic reasoning with knowledge provided to them? Or, even more ambitiously, can LMs reason systematically with their own internal knowledge acquired during pretraining? In this talk, I'll attempt to answer these questions, illustrated with our recent work on using LMs for logical deduction, proof generation, and multi-step textual entailment problems. While progress has been made, there is still a way to go. To illustrate this, I'll conclude by posing a (currently unsolved) grand challenge, answering Fermi problems, to the math reasoning community, one that requires combining systematic reasoning, mathematics, and world knowledge.
Peter Clark

Sat 12:30 p.m. - 1:00 p.m.

Coffee Break
(Break)

Sat 1:00 p.m. - 1:30 p.m.

Leveraging Maths to Understand Transformers
(Invited Talk)
Francois Charton

Sat 1:30 p.m. - 2:00 p.m.

Learning Mathematical Reasoning for Education
(Invited Talk)
Noah Goodman

Sat 2:00 p.m. - 2:55 p.m.

MATH-AI: Toward Human-Level Mathematical Reasoning
(Discussion Panel)
Francois Charton · Noah Goodman · Behnam Neyshabur · Talia Ringer · Daniel Selsam

Sat 2:55 p.m. - 3:00 p.m.

Closing Remarks


Neural Combinatorial Logic Circuit Synthesis from Input-Output Examples
(Poster)
We propose a novel, fully explainable neural approach to synthesis of combinatorial logic circuits from input-output examples. The carrying advantage of our method is that it readily extends to inductive scenarios, where the set of examples is incomplete but still indicative of the desired behaviour. Our method can be employed for a virtually arbitrary choice of atoms, from logic gates to FPGA blocks, as long as they can be formulated in a differentiable fashion, and consistently yields good results for synthesis of practical circuits of increasing size. In particular, we succeed in learning a number of arithmetic, bitwise, and signal-routing operations, and even generalise towards the correct behaviour in inductive scenarios. Our method, attacking a discrete logical synthesis problem with an explainable neural approach, hints at a wider promise for synthesis and reasoning-related tasks.
Peter Belcak · Roger Wattenhofer


Automatic Generation of Socratic Questions for Learning to Solve Math Word Problems
(Poster)
Socratic questioning is an educational method that allows students to discover answers to complex problems by asking them a series of thoughtful questions. Generation of didactically sound questions is challenging, requiring an understanding of the reasoning process involved in the problem. We hypothesize that such a questioning strategy can not only enhance human performance but also assist the math word problem (MWP) solvers. In this work, we explore the ability of large language models (LMs) in generating sequential questions for guiding math word problem-solving. We propose various guided question generation schemes based on input conditioning and reinforcement learning. On both automatic and human quality evaluations, we find that LMs constrained with desirable question properties generate superior questions and improve the overall performance of a math word problem solver.
Kumar Shridhar · Jakub Macina · Menna El-Assady · Tanmay Sinha · Mrinmaya Sachan


Generating Reflexive Polytopes via Sequence Modeling
(Poster)
We train neural network sequence models to generate reflexive lattice polytopes. We demonstrate that they can generate mathematical objects satisfying various geometric properties. We use the completeness of our datasets to give evidence that the models are understanding some underlying structure of the data.
Bernt Ivar Utstøl Nødland


A Causal Framework to Quantify Robustness of Mathematical Reasoning with Language Models
(Poster)
We have recently witnessed a number of impressive results on hard mathematical reasoning problems with large language models (LLMs). At the same time, the robustness of these models has also been called into question. Building on the idea of behavioral testing, we propose a novel framework, which pins down the causal effect of each factor in the input, e.g., the surface form of the problem text, the operands, and math operators, on the output. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of LLMs in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework on a test bed of bivariate math word problems. Our analysis shows that robustness does not appear to continuously improve as a function of scale, but that the recent LLM, GPT-3-Instruct (175B), achieves a dramatic improvement in both robustness and sensitivity, compared to all other GPT variants.
Alessandro Stolfo · Zhijing Jin · Kumar Shridhar · Bernhard Schölkopf · Mrinmaya Sachan


What is my math transformer doing? Three results on interpretability and generalization
(Poster)
We investigate the failure cases and out-of-distribution behavior of transformers trained on matrix inversion, eigen decomposition and eigenvalue calculation. We show that incorrect model predictions still retain deep mathematical properties of the solution (e.g. correct eigenvalues, unit norm of eigenvectors), and that almost all model failures can be attributed to, and predicted from, properties of the problem or solution. This demonstrates that, when in doubt, math transformers do not hallucinate crazy solutions (as was sometimes proposed) but remain
Francois Charton


Learning to Understand Plane Geometry Diagram
(Poster)
Geometry diagram parsing plays a key role in geometry problem solving, wherein the primitive extraction and relation parsing remain challenging due to the complex layout and between-primitive relationship. In this paper, we propose a powerful diagram parser based on deep learning and graph reasoning. Specifically, a modified instance segmentation method is proposed to extract geometric primitives, and the graph neural network (GNN) is leveraged to realize relation parsing and primitive classification incorporating geometric features and prior knowledge. All the modules are integrated into an end-to-end model called PGDPNet to perform all the subtasks simultaneously. In addition, we build a new large-scale geometry diagram dataset named PGDP5K with primitive level annotations. Experiments on PGDP5K and an existing dataset IMPGeometry3K show that our model outperforms state-of-the-art methods in four subtasks remarkably. The full version of this paper has been accepted by IJCAI 2022.
Mlingliang Zhang · Fei Yin · Yihan Hao · Chenglin Liu


Lemma: Bootstrapping High-Level Mathematical Reasoning with Learned Symbolic Abstractions
(Poster)
Humans tame the complexity of mathematical reasoning by developing hierarchies of abstractions. With proper abstractions, solutions to hard problems can be expressed concisely, thus making them more likely to be found. In this paper, we propose Learning Mathematical Abstractions (LEMMA): an algorithm that implements this idea for reinforcement learning agents in mathematical domains. LEMMA augments Expert Iteration with an abstraction step, where solutions found so far are revisited and rewritten in terms of new higher-level actions, which then become available to solve new problems. We evaluate LEMMA on two mathematical reasoning tasks, equation solving and fraction simplification, in a step-by-step fashion. In these two domains, LEMMA improves the ability of an existing agent, both solving more problems and generalizing more effectively to harder problems than those seen during training.
Zhening Li · Gabriel Poesia Reis e Silva · Omar Costilla Reyes · Noah Goodman · Armando Solar-Lezama


MWP-BERT: A Numeracy-augmented Pre-trained Encoder for Math Word Problems
(Poster)
Math word problem (MWP) solving faces a dilemma in number representation learning. In order to avoid the number representation issue and reduce the search space of feasible solutions, existing works striving for MWP solving usually replace real numbers with symbolic placeholders to focus on logic reasoning. However, instead of the number value itself, it is the reusable numerical property that matters more in numerical reasoning. Therefore, we argue that injecting numerical properties into symbolic placeholders with contextualized representation learning schema can provide a way out of the dilemma in the number representation issue here. In this work, we introduce this idea to the popular pre-training language model (PLM) techniques and build MWP-BERT, an effective contextual number representation PLM. We demonstrate the effectiveness of our MWP-BERT on MWP solving and several MWP-specific understanding tasks on both English and Chinese benchmarks.
Zhenwen Liang · Jipeng ZHANG · Lei Wang · Wei QIN · Jie Shao · Xiangliang Zhang


Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems
(Poster)
Recent language models have struggled to generalize to a large range of numbers in numerical reasoning. In this paper, we propose a novel method that leverages simple numbers as anchors to characterize the implicitly inferred arithmetic expressions from language models, and then explicitly applies the expressions to original numbers to get the answers. Experimental results on several numerical reasoning benchmarks demonstrate that our approach is highly effective. More importantly, our approach works in the inference phase without extra model training, making it highly portable and achieving significant and consistent performance benefits across a variety of language models in zero-shot, few-shot, and fine-tuning scenarios.
Fan Zhou · Haoyu Dong · Qian Liu · Zhoujun Cheng · Shi Han · Dongmei Zhang
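
The anchor idea in the abstract above can be illustrated with a toy sketch. It is not the authors' implementation: `mock_model` is a hypothetical stand-in for a language model, the hidden expression is assumed to be linear in two operands, and the anchor values are chosen arbitrarily. Two queries on small anchor numbers give a 2x2 linear system whose solution is the implied expression, which is then applied exactly to the original large numbers.

```python
from fractions import Fraction

def mock_model(a, b):
    """Hypothetical stand-in for a language model that is assumed to be
    reliable only on small operands. Hidden expression: 3*a + 2*b."""
    return 3 * a + 2 * b

def recover_coefficients(model, anchors):
    """Recover (w_a, w_b) from two anchor queries by Cramer's rule,
    assuming the model's answer has the form y = w_a*a + w_b*b."""
    (a1, b1), (a2, b2) = anchors
    y1, y2 = model(a1, b1), model(a2, b2)
    det = Fraction(a1 * b2 - a2 * b1)
    if det == 0:
        raise ValueError("anchor points must be linearly independent")
    w_a = (y1 * b2 - y2 * b1) / det
    w_b = (a1 * y2 - a2 * y1) / det
    return w_a, w_b

def answer_with_anchors(model, a, b, anchors=((1, 2), (3, 1))):
    """Query the model on small anchors only, then apply the recovered
    expression to the original operands with exact arithmetic."""
    w_a, w_b = recover_coefficients(model, anchors)
    return w_a * a + w_b * b

result = answer_with_anchors(mock_model, 123456, 789012)
```

The model is never queried on the large operands; only the recovered coefficients touch them, which is why exact rational arithmetic is used for the final step.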


EuclidNet: Deep Visual Reasoning for Constructible Problems in Geometry
(Poster)
In this paper, we present a deep learning-based framework for solving geometric construction problems through visual reasoning, which is useful for automated geometry theorem proving. Constructible problems in geometry often ask for the sequence of straightedge-and-compass constructions to construct a given goal given some initial setup. Our EuclidNet framework leverages the neural network architecture Mask R-CNN to extract the visual features from the initial setup and goal configuration with extra points of intersection, and then generate possible construction steps as intermediary data models that are used as feedback in the training process for further refinement of the construction step sequence. This process is repeated recursively until either a solution is found, in which case we backtrack the path for a step-by-step construction guide, or the problem is identified as unsolvable. Our EuclidNet framework is validated on complex Japanese Sangaku geometry problems, demonstrating its capacity to leverage backtracking for deep visual reasoning of challenging problems.
Man Fai Wong · Xintong Qi · Chee-Wei Tan


Estimating Numbers without Regression
(Poster)
Despite recent successes in language models, their ability to represent numbers is insufficient. Humans conceptualize numbers based on their magnitudes, effectively projecting them on a number line; whereas subword tokenization fails to explicitly capture magnitude by splitting numbers into arbitrary chunks. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods change either the (1) notation in which numbers are written (e.g., scientific vs. decimal), the (2) vocabulary used to represent numbers or the entire (3) architecture of the underlying language model, to directly regress to a desired number. In this work, we show that a potential trade-off to the more complex architectural changes is to simply change the model's vocabulary instead, e.g., introduce a new token for numbers in range 10-100. In the context of masked number prediction, we find that a carefully designed tokenization scheme is both the simplest to implement and sufficient, i.e., with similar performance to the state-of-the-art approach that requires making significant architectural changes. Finally, we evaluate the various number representation schemes on the downstream task of numerical fact estimation (for Fermi Problems) in a zero-shot setting and find similar trends, i.e., changes at the tokenization level achieve near state-of-the-art results while requiring minimal resources compared to other number representation schemes.
Avijit Thawani · Jay Pujara · Ashwin Kalyan
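
As a rough illustration of a vocabulary-level change like the one described above, the sketch below maps each non-negative integer in a text to one token per order of magnitude, so that every number from 10 to 99 shares a token. The token names (`[NUM_1e1]` and so on) and the whitespace tokenizer are assumptions made for this example and are not taken from the paper.

```python
def magnitude_token(value):
    """Map a non-negative integer to a hypothetical order-of-magnitude token.
    All integers with the same number of digits share one token."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    if value == 0:
        return "[NUM_0]"
    exponent = len(str(value)) - 1  # 10..99 -> 1, 100..999 -> 2, ...
    return f"[NUM_1e{exponent}]"

def tokenize_numbers(text):
    """Replace each whitespace-separated integer with its magnitude token,
    leaving all other words unchanged."""
    return " ".join(
        magnitude_token(int(tok)) if tok.isdecimal() else tok
        for tok in text.split()
    )

example = tokenize_numbers("The bridge is 120 meters long and 7 meters wide")
```

Counting digits rather than taking a floating-point logarithm keeps the bin boundaries exact for integers of any size.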


Learn to Select Good Examples with Reinforcement Learning for Semi-structured Mathematical Reasoning
(Poster)
Recent large pretrained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if models can handle more complex problems that involve heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain problems that require mathematical reasoning on both textual and tabular data, where each question is aligned with a tabular context. We evaluate different pretrained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This issue is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select good in-context examples from a small amount of training data. Experimental results show that our method outperforms the best baseline by 5.31% in accuracy and reduces the prediction variance significantly compared to random selection.
Pan Lu · Liang Qiu · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Tanmay Rajpurohit · Peter Clark · Ashwin Kalyan


Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
(Poster)
The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier subproblems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.
Albert Jiang · Sean Welleck · Jin Peng Zhou · Timothee Lacroix · Jiacheng Liu · Wenda Li · Mateja Jamnik · Guillaume Lample · Yuhuai Wu


Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic
(Poster)
Through their transfer learning abilities, highly-parameterized large pretrained language models have dominated the NLP landscape for a multitude of downstream language tasks. Though linguistically proficient, the inability of these models to incorporate the learning of non-linguistic entities (numerals and arithmetic reasoning) limits their usage for tasks that require numeric comprehension or strict mathematical reasoning. However, as we illustrate in this paper, building a general purpose language model that also happens to be proficient in mathematical reasoning is not as straightforward as training it on a numeric dataset. In this work, we develop a novel framework that enables language models to be mathematically proficient while retaining their linguistic prowess. Specifically, we offer information-theoretic interventions to overcome the catastrophic forgetting of linguistic skills that occurs while injecting non-linguistic skills into language models.
Mandar Sharma · Nikhil Muralidhar · Naren Ramakrishnan


Teaching Algorithmic Reasoning via In-context Learning
(Poster)
Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. (2022) showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition) and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as algorithmic prompting. We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x and 2x respectively compared to the best available baselines.
Hattie Zhou · Azade Nova · Aaron Courville · Hugo Larochelle · Behnam Neyshabur · Hanie Sedghi


Broken Neural Scaling Laws
(Poster)
We present a smoothly broken power law functional form that accurately models the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision and unsupervised language tasks, arithmetic, and reinforcement learning. This functional form yields extrapolations of scaling behavior that often are an order of magnitude more accurate than the ones obtained by other functional forms for neural scaling behavior. Moreover, this functional form accurately models the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior.
Ethan Caballero · Kshitij Gupta · Irina Rish · David Krueger


Towards automating formalisation of theorem statements using large language models
(Poster)
Mathematics formalisation is the task of translating mathematics (i.e., definitions, theorem statements, proofs) written in natural language, as found in books and papers, into a formal language that can then be checked for correctness by a program. It is a thriving activity today; however, formalisation remains cumbersome. In this paper, we explore the abilities of a large language model (Codex) to help with formalisation in the Lean theorem prover. We find that with careful input-dependent prompt selection and post-processing, Codex is able to formalise short mathematical statements at undergrad level with nearly 75% accuracy for 120 theorem statements.
Siddhartha Gadgil · Anand Tadipatri · Navin Goyal · Ayush Agrawal · Ashvni Narayanan


Graph neural networks for Ramsey graphs
(Poster)
Ramsey-like problems are ubiquitous in extremal combinatorics and occupy a central place in the field. In simple terms, Ramsey theory wishes to find the minimum size of a large graph structure such that some sought substructure (generally a clique or an independent set) is guaranteed to exist. Due to considerations of computational complexity, brute force approaches to solving these problems are usually not very feasible, as the substructures cannot be checked in polynomial time. At the same time, we seek extremal graphs that completely avoid such substructures to better understand the graph theory governing their occurrence. We investigate the feasibility of Graph Neural Networks (GNNs) in terms of indicating and refining search procedures for finding these special classes of Ramsey-extremal graphs, which are of interest to mathematicians.
Amur Ghose · Amit Levi · Yingxueff Zhang
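
The brute-force check mentioned in the abstract above can be sketched in a few lines. This is only the naive baseline, not the GNN-guided search the poster investigates, and the function name is chosen for illustration. It tests every k-vertex subset for a clique or an independent set and confirms the standard fact that the 5-cycle avoids both for k = 3.

```python
from itertools import combinations

def has_clique_or_independent_set(n, edges, k):
    """Return True if the graph on vertices 0..n-1 contains a clique or an
    independent set of size k. Checks all C(n, k) subsets, so the cost
    grows combinatorially with n and k."""
    edge_set = {frozenset(e) for e in edges}
    for subset in combinations(range(n), k):
        pairs = [frozenset(p) for p in combinations(subset, 2)]
        present = sum(p in edge_set for p in pairs)
        if present == len(pairs) or present == 0:
            return True
    return False

# The 5-cycle has no triangle and no independent set of size 3.
cycle5 = [(i, (i + 1) % 5) for i in range(5)]
cycle5_is_extremal = not has_clique_or_independent_set(5, cycle5, 3)
```

The number of subsets examined is C(n, k), which is what makes exhaustive checking impractical for the larger graphs the abstract has in mind.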


Improving Compositional Generalization in Math Word Problem Solving
(Poster)
Compositional generalization refers to a model's capability to generalize to newly composed input data based on the data components observed during training. It has triggered a series of compositional generalization analyses on different tasks, as generalization is an important aspect of language and problem-solving skills. However, similar discussion on math word problems (MWPs) is limited. In this manuscript, we study compositional generalization in MWP solving. Specifically, we first introduce a data splitting method to create compositional splits from existing MWP datasets. Meanwhile, we synthesize data to isolate the effect of compositions. To improve the compositional generalization in MWP solving, we propose an iterative data augmentation method that includes diverse compositional variation into training data and could collaborate with MWP methods. During the evaluation, we examine a set of methods and find all of them encounter severe performance loss on the evaluated datasets. We also find our data augmentation method could significantly improve the compositional generalization of general MWP methods.
Yunshi Lan · Lei Wang · Jing Jiang · Eepeng Lim


ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems
(Poster)
We introduce ProofNet, a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The ProofNet benchmark consists of 297 theorem statements expressed in both natural language and the Lean 3 theorem prover, 100 of which are also accompanied by natural language proofs. The problems are primarily drawn from popular undergraduate pure mathematics textbooks, and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology. We intend for ProofNet to be a challenging benchmark that will drive progress in autoformalization and automatic theorem proving. We report baseline results on the autoformalization of statements using few-shot learning with large language models.
Zhangir Azerbayev · Bartosz Piotrowski · Jeremy Avigad


Learning to Reason With Relational Abstractions
(Poster)
Large language models have recently shown promising progress in mathematical reasoning when fine-tuned with human-generated sequences walking through a sequence of solution steps. However, the solution sequences are not formally structured and the resulting model-generated sequences may not reflect the kind of systematic reasoning we might expect an expert human to produce. In this paper, we study how to build stronger reasoning capability in language models using the idea of relational abstractions. We introduce new types of sequences that more explicitly provide an abstract characterization of the transitions through intermediate solution steps to the goal state. We find that models that are supplied with such sequences as prompts can solve tasks with a significantly higher accuracy, and models that are trained to produce such sequences solve problems better than those that are trained with previously used human-generated sequences and other baselines. Our work thus takes several steps toward elucidating and improving how language models perform on tasks requiring multi-step mathematical reasoning.
Andrew Nam · James McClelland · Mengye Ren · Chelsea Finn


Out-of-Distribution Generalization in Algorithmic Reasoning Through Curriculum Learning
(Poster)
Out-of-distribution generalization (OODG) is a longstanding challenge for neural networks, and is quite apparent in tasks with well-defined variables and rules, where explicit use of the rules can solve problems independently of the particular values of the variables. Large transformer-based language models have pushed the boundaries on how well neural networks can generalize to novel inputs, but their complexity obfuscates how they achieve such robustness. As a step toward understanding how transformer-based systems generalize, we explore the question of OODG in smaller scale transformers. Using a reasoning task based on the puzzle Sudoku, we show that OODG can occur on complex problems if the training set includes examples sampled from the whole distribution of simpler component tasks.
Andrew Nam · Mustafa Abdool · Trevor Maxfield · James McClelland


On the Abilities of Mathematical Extrapolation with Implicit Models
(Poster)
Deep neural networks excel on a variety of different tasks, often surpassing human intelligence. However, when presented with out-of-distribution data, these models tend to break down even on the simplest tasks. In this paper, we compare implicitly-defined and classical deep learning models on a series of mathematical extrapolation tasks, where the models are tested with out-of-distribution samples during inference time. Throughout our experiments, implicit models greatly outperform classical deep learning networks that overfit the training distribution. We showcase implicit models' unique advantages for extrapolation thanks to their flexible and selective framework. Thanks to their potentially unlimited depth, implicit models not only adapt well to out-of-distribution inputs but also understand the underlying structure of inputs much better.
Alicia Tsai · Juliette Decugis · Ashwin Ganesh · Max Emerling · Laurent El Ghaoui


Program Synthesis for Integer Sequence Generation
(Poster)
Recent advances in program synthesis have shown success with methods that employ deep learning on synthetic data generated from domain-specific languages (DSLs). In this work, we propose an algorithm for program synthesis that extends these methods. It uses transfer learning from pre-trained language models, and employs a policy improvement operator based on policy-guided search. This hybrid approach combats the challenges of searching a large language space with sparse rewards. We show its effectiveness on the task of integer sequence generation, a special case of programming-by-examples with fixed inputs. Our preliminary results demonstrate that the inclusion of policy-guided search leads to a 1.6% increase in the number of correct programs compared to supervised baselines.
Natasha Butt · Auke Wiggers · Taco Cohen · Max Welling


LILA: A Unified Benchmark for Mathematical Reasoning
(Poster)
SlidesLive Video » Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g. arithmetic, calculus; (ii) language format, e.g. question-answering, fill-in-the-blanks; (iii) language diversity, e.g. no language, simple language; (iv) external knowledge, e.g. commonsense, physics. We construct our benchmark by extending 20 datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA and its variants, a family of mathematical reasoning models fine-tuned on LILA. Importantly, we find that multitasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best performing model only obtains 60.40%, indicating room for improvement in general mathematical reasoning and understanding. 
Swaroop Mishra · Matthew Finlayson · Pan Lu · Leonard Tang · Sean Welleck · Chitta Baral · Tanmay Rajpurohit · Oyvind Tafjord · Ashish Sabharwal · Peter Clark · Ashwin Kalyan



Solving Math Word Problems with Process-based and Outcome-based Feedback
(Poster)
Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the fine-tuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% to 12.7% final-answer error and from 14.0% to 3.4% reasoning error among final-answer-correct solutions. 
Jonathan Uesato · Nate Kushman · Ramana Kumar · H. Francis Song · Noah Siegel · Lisa Wang · Antonia Creswell · Geoffrey Irving · Irina Higgins 🔗 
Author Information
Pan Lu (UCLA; AI2)
Swaroop Mishra (Arizona State University)
Sean Welleck (University of Washington)
Yuhuai Wu (Google)
Hannaneh Hajishirzi (University of Washington)
Percy Liang (Stanford University)
More from the Same Authors

2020 : Invited Talk 8 Presentation - Percy Liang - Semantic Parsing for Natural Language Interfaces »
Percy Liang 
2021 : NaturalProofs: Mathematical Theorem Proving in Natural Language »
Sean Welleck · Jiacheng Liu · Ronan Le Bras · Hanna Hajishirzi · Yejin Choi · Kyunghyun Cho 
2021 : IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning »
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu 
2021 : Theorem-Aware Geometry Problem Solving with Symbolic Reasoning and Theorem Prediction »
Pan Lu · Ran Gong · Shibiao Jiang · Liang Qiu · Siyuan Huang · Xiaodan Liang · Song-Chun Zhu 
2021 : Towards Diagram Understanding and Cognitive Reasoning in Icon Question Answering »
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu 
2022 : Learn to Select Good Examples with Reinforcement Learning for Semi-structured Mathematical Reasoning »
Pan Lu · Liang Qiu · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Tanmay Rajpurohit · Peter Clark · Ashwin Kalyan 
2022 : LILA: A Unified Benchmark for Mathematical Reasoning »
Swaroop Mishra · Matthew Finlayson · Pan Lu · Leonard Tang · Sean Welleck · Chitta Baral · Tanmay Rajpurohit · Oyvind Tafjord · Ashish Sabharwal · Peter Clark · Ashwin Kalyan 
2022 : ContextNER: Contextual Phrase Generation at Scale »
Himanshu Gupta · Shreyas Verma · Tarun Kumar · Swaroop Mishra · Tamanna Agrawal · Amogh Badugu · Himanshu Bhatt 
2022 : Out-of-Distribution Robustness via Targeted Augmentations »
Irena Gao · Shiori Sagawa · Pang Wei Koh · Tatsunori Hashimoto · Percy Liang 
2022 : Surgical Fine-Tuning Improves Adaptation to Distribution Shifts »
Yoonho Lee · Annie Chen · Fahim Tajwar · Ananya Kumar · Huaxiu Yao · Percy Liang · Chelsea Finn 
2022 : Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search »
Michał Zawalski · Michał Tyrolski · Konrad Czechowski · Damian Stachura · Piotr Piękos · Tomasz Odrzygóźdź · Yuhuai Wu · Łukasz Kuciński · Piotr Miłoś 
2022 : Fine-Tuning without Distortion: Improving Robustness to Distribution Shifts »
Percy Liang · Ananya Kumar 
2022 Poster: Autoformalization with Large Language Models »
Yuhuai Wu · Albert Qiaochu Jiang · Wenda Li · Markus Rabe · Charles Staats · Mateja Jamnik · Christian Szegedy 
2022 Poster: Patching open-vocabulary models by interpolating weights »
Gabriel Ilharco · Mitchell Wortsman · Samir Yitzhak Gadre · Shuran Song · Hannaneh Hajishirzi · Simon Kornblith · Ali Farhadi · Ludwig Schmidt 
2022 Poster: What Can Transformers Learn In-Context? A Case Study of Simple Function Classes »
Shivam Garg · Dimitris Tsipras · Percy Liang · Gregory Valiant 
2022 Poster: Insights into Pretraining via Simpler Synthetic Tasks »
Yuhuai Wu · Felix Li · Percy Liang 
2022 Poster: COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics »
Lianhui Qin · Sean Welleck · Daniel Khashabi · Yejin Choi 
2022 Poster: Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers »
Albert Qiaochu Jiang · Wenda Li · Szymon Tworkowski · Konrad Czechowski · Tomasz Odrzygóźdź · Piotr Miłoś · Yuhuai Wu · Mateja Jamnik 
2022 Poster: Deep Bidirectional Language-Knowledge Graph Pretraining »
Michihiro Yasunaga · Antoine Bosselut · Hongyu Ren · Xikun Zhang · Christopher D Manning · Percy Liang · Jure Leskovec 
2022 Poster: Decentralized Training of Foundation Models in Heterogeneous Environments »
Binhang Yuan · Yongjun He · Jared Davis · Tianyi Zhang · Tri Dao · Beidi Chen · Percy Liang · Christopher Ré · Ce Zhang 
2022 Poster: Diffusion-LM Improves Controllable Text Generation »
Xiang Li · John Thickstun · Ishaan Gulrajani · Percy Liang · Tatsunori Hashimoto 
2022 Poster: STaR: Bootstrapping Reasoning With Reasoning »
Eric Zelikman · Yuhuai Wu · Jesse Mu · Noah Goodman 
2022 Poster: Picking on the Same Person: Does Algorithmic Monoculture lead to Outcome Homogenization? »
Rishi Bommasani · Kathleen A. Creel · Ananya Kumar · Dan Jurafsky · Percy Liang 
2022 Poster: Exploring Length Generalization in Large Language Models »
Cem Anil · Yuhuai Wu · Anders Andreassen · Aitor Lewkowycz · Vedant Misra · Vinay Ramasesh · Ambrose Slone · Guy Gur-Ari · Ethan Dyer · Behnam Neyshabur 
2022 Poster: Improving Self-Supervised Learning by Characterizing Idealized Representations »
Yann Dubois · Stefano Ermon · Tatsunori Hashimoto · Percy Liang 
2022 Poster: Solving Quantitative Reasoning Problems with Language Models »
Aitor Lewkowycz · Anders Andreassen · David Dohan · Ethan Dyer · Henryk Michalewski · Vinay Ramasesh · Ambrose Slone · Cem Anil · Imanol Schlag · Theo Gutman-Solo · Yuhuai Wu · Behnam Neyshabur · Guy Gur-Ari · Vedant Misra 
2022 Poster: QUARK: Controllable Text Generation with Reinforced Unlearning »
Ximing Lu · Sean Welleck · Jack Hessel · Liwei Jiang · Lianhui Qin · Peter West · Prithviraj Ammanabrolu · Yejin Choi 
2022 Poster: Path Independent Equilibrium Models Can Better Exploit Test-Time Computation »
Cem Anil · Ashwini Pokle · Kaiqu Liang · Johannes Treutlein · Yuhuai Wu · Shaojie Bai · J. Zico Kolter · Roger Grosse 
2022 Poster: NaturalProver: Grounded Mathematical Proof Generation with Language Models »
Sean Welleck · Jiacheng Liu · Ximing Lu · Hannaneh Hajishirzi · Yejin Choi 
2022 Poster: Block-Recurrent Transformers »
DeLesley Hutchins · Imanol Schlag · Yuhuai Wu · Ethan Dyer · Behnam Neyshabur 
2022 Poster: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering »
Pan Lu · Swaroop Mishra · Tanglin Xia · Liang Qiu · Kai-Wei Chang · Song-Chun Zhu · Oyvind Tafjord · Peter Clark · Ashwin Kalyan 
2021 Workshop: Math AI for Education (MATHAI4ED): Bridging the Gap Between Research and Smart Education »
Pan Lu · Yuhuai Wu · Sean Welleck · Xiaodan Liang · Eric Xing · James McClelland 
2021 Workshop: Distribution shifts: connecting methods and applications (DistShift) »
Shiori Sagawa · Pang Wei Koh · Fanny Yang · Hongseok Namkoong · Jiashi Feng · Kate Saenko · Percy Liang · Sarah Bird · Sergey Levine 
2021 Poster: Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals »
Lang Liu · Krishna Pillutla · Sean Welleck · Sewoong Oh · Yejin Choi · Zaid Harchaoui 
2021 Poster: MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers »
Krishna Pillutla · Swabha Swayamdipta · Rowan Zellers · John Thickstun · Sean Welleck · Yejin Choi · Zaid Harchaoui 
2021 Oral: MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers »
Krishna Pillutla · Swabha Swayamdipta · Rowan Zellers · John Thickstun · Sean Welleck · Yejin Choi · Zaid Harchaoui 
2020 : Invited Talk 8 Q/A - Percy Liang »
Percy Liang 
2020 : VAIDA: An Educative Benchmark Creation Paradigm using Visual Analytics for Interactively Discouraging Artifacts (by Anjana Arunkumar, Swaroop Mishra, Bhavdeep Sachdeva, Chitta Baral and Chris Bryan) »
Anjana Arunkumar · Swaroop Mishra · Chitta Baral 
2020 Poster: Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming »
Sumanth Dathathri · Krishnamurthy Dvijotham · Alexey Kurakin · Aditi Raghunathan · Jonathan Uesato · Rudy Bunel · Shreya Shankar · Jacob Steinhardt · Ian Goodfellow · Percy Liang · Pushmeet Kohli 
2019 : Extended Poster Session »
Travis LaCroix · Marie Ossenkopf · Mina Lee · Nicole Fitzgerald · Daniela Mihai · Jonathon Hare · Ali Zaidi · Alexander Cowen-Rivers · Alana Marzoev · Eugene Kharitonov · Luyao Yuan · Tomasz Korbak · Paul Pu Liang · Yi Ren · Roberto Dessì · Peter Potash · Shangmin Guo · Tatsunori Hashimoto · Percy Liang · Julian Zubek · Zipeng Fu · Song-Chun Zhu · Adam Lerer 
2019 Poster: SPoC: Search-based Pseudocode to Code »
Sumith Kulal · Panupong Pasupat · Kartik Chandra · Mina Lee · Oded Padon · Alex Aiken · Percy Liang 
2019 Poster: On the Accuracy of Influence Functions for Measuring Group Effects »
Pang Wei Koh · Kai-Siang Ang · Hubert Teo · Percy Liang 
2019 Poster: Verified Uncertainty Calibration »
Ananya Kumar · Percy Liang · Tengyu Ma 
2019 Spotlight: Verified Uncertainty Calibration »
Ananya Kumar · Percy Liang · Tengyu Ma 
2018 : Natural Language Supervision »
Percy Liang 
2018 Poster: Loss Functions for Multiset Prediction »
Sean Welleck · Zixin Yao · Yu Gai · Jialin Mao · Zheng Zhang · Kyunghyun Cho 
2018 Poster: Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss »
Stephen Mussmann · Percy Liang 
2018 Poster: Semidefinite relaxations for certifying robustness to adversarial examples »
Aditi Raghunathan · Jacob Steinhardt · Percy Liang 
2018 Poster: A Retrieve-and-Edit Framework for Predicting Structured Outputs »
Tatsunori Hashimoto · Kelvin Guu · Yonatan Oren · Percy Liang 
2018 Oral: A Retrieve-and-Edit Framework for Predicting Structured Outputs »
Tatsunori Hashimoto · Kelvin Guu · Yonatan Oren · Percy Liang 
2017 : (Invited Talk) Percy Liang: Learning with Adversaries and Collaborators »
Percy Liang 
2017 Demonstration: Babble Labble: Learning from Natural Language Explanations »
Braden Hancock · Paroma Varma · Percy Liang · Christopher Ré · Stephanie Wang 
2017 Poster: Learning Overcomplete HMMs »
Vatsal Sharan · Sham Kakade · Percy Liang · Gregory Valiant 
2017 Poster: Certified Defenses for Data Poisoning Attacks »
Jacob Steinhardt · Pang Wei Koh · Percy Liang 
2017 Poster: Saliency-based Sequential Image Attention with Multiset Prediction »
Sean Welleck · Jialin Mao · Kyunghyun Cho · Zheng Zhang 
2017 Poster: Unsupervised Transformation Learning via Convex Relaxations »
Tatsunori Hashimoto · Percy Liang · John Duchi 
2016 Workshop: Deep Learning for Action and Interaction »
Chelsea Finn · Raia Hadsell · David Held · Sergey Levine · Percy Liang 
2016 Workshop: Nonconvex Optimization for Machine Learning: Theory and Practice »
Hossein Mobahi · Anima Anandkumar · Percy Liang · Stefanie Jegelka · Anna Choromanska 
2016 Workshop: Reliable Machine Learning in the Wild »
Dylan Hadfield-Menell · Adrian Weller · David Duvenaud · Jacob Steinhardt · Percy Liang 
2016 Poster: Unsupervised Risk Estimation Using Only Conditional Independence Structure »
Jacob Steinhardt · Percy Liang 
2015 : Sharing the "How" (and not the "What") »
Percy Liang 
2015 Workshop: Nonconvex Optimization for Machine Learning: Theory and Practice »
Anima Anandkumar · Niranjan Uma Naresh · Kamalika Chaudhuri · Percy Liang · Sewoong Oh 
2015 Demonstration: CodaLab Worksheets for Reproducible, Executable Papers »
Percy Liang · Evelyne Viegas 
2015 Poster: On-the-Job Learning with Bayesian Decision Theory »
Keenon Werling · Arun Tejasvi Chaganty · Percy Liang · Christopher Manning 
2015 Spotlight: On-the-Job Learning with Bayesian Decision Theory »
Keenon Werling · Arun Tejasvi Chaganty · Percy Liang · Christopher Manning 
2015 Poster: Estimating Mixture Models via Mixtures of Polynomials »
Sida Wang · Arun Tejasvi Chaganty · Percy Liang 
2015 Poster: Learning with Relaxed Supervision »
Jacob Steinhardt · Percy Liang 
2015 Poster: Calibrated Structured Prediction »
Volodymyr Kuleshov · Percy Liang 
2014 Workshop: Challenges in Machine Learning workshop (CiML 2014) »
Isabelle Guyon · Evelyne Viegas · Percy Liang · Olga Russakovsky · Rinat Sergeev · Gábor Melis · Michele Sebag · Gustavo Stolovitzky · Jaume Bacardit · Michael S Kim · Ben Hamner 
2014 Poster: Altitude Training: Strong Bounds for Single-Layer Dropout »
Stefan Wager · William S Fithian · Sida Wang · Percy Liang 
2014 Poster: Simple MAP Inference via Low-Rank Relaxations »
Roy Frostig · Sida Wang · Percy Liang · Christopher D Manning 
2013 Poster: Dropout Training as Adaptive Regularization »
Stefan Wager · Sida Wang · Percy Liang 
2013 Spotlight: Dropout Training as Adaptive Regularization »
Stefan Wager · Sida Wang · Percy Liang 
2012 Poster: Identifiability and Unmixing of Latent Parse Trees »
Percy Liang · Sham M Kakade · Daniel Hsu 
2009 Workshop: The Generative and Discriminative Learning Interface »
Simon Lacoste-Julien · Percy Liang · Guillaume Bouchard 
2009 Poster: Asymptotically Optimal Regularization in Smooth Parametric Models »
Percy Liang · Francis Bach · Guillaume Bouchard · Michael Jordan 
2008 Workshop: Speech and Language: Unsupervised LatentVariable Models »
Slav Petrov · Aria Haghighi · Percy Liang · Dan Klein 
2007 Poster: Agreement-Based Learning »
Percy Liang · Dan Klein · Michael Jordan 
2007 Spotlight: Agreement-Based Learning »
Percy Liang · Dan Klein · Michael Jordan 
2007 Poster: A Probabilistic Approach to Language Change »
Alexandre Bouchard-Côté · Percy Liang · Tom Griffiths · Dan Klein