Mathematical reasoning is a fundamental aspect of human cognition that has been studied by scholars ranging from philosophers to cognitive scientists and neuroscientists. It involves analyzing complex information, identifying patterns and relationships, and drawing logical conclusions from evidence. It is central to many applications in science, engineering, finance, and everyday contexts. Recent advancements in large language models (LLMs) have unlocked new opportunities at the intersection of artificial intelligence and mathematical reasoning, ranging from new methods that solve complex problems or prove theorems, to new forms of human-machine collaboration in mathematics and beyond. Our proposed workshop is centered on the intersection of deep learning and mathematical reasoning, with an emphasis on, but not limited to, large language models. Our guiding theme is: "To what extent can machine learning models comprehend mathematics, and what applications could arise from this capability?" To address this question, we aim to bring together a diverse group of scholars from different backgrounds, institutions, and disciplines in our workshop. By hosting this workshop, we hope to stimulate insightful discussions that will guide future research and applications in this rapidly expanding field.
Fri 7:00 a.m. - 7:30 a.m. | Invited Talk (Talk)
Fri 7:30 a.m. - 8:00 a.m. | Invited Talk 2 (Talk)
Fri 8:00 a.m. - 8:30 a.m. | Invited Talk 3 (Talk)
Fri 8:30 a.m. - 9:30 a.m. | Panel Discussion (Panel)
Fri 9:30 a.m. - 11:00 a.m. | Poster Session 1 (Poster Session)
Fri 12:00 p.m. - 12:30 p.m. | Invited Talk 4 (Talk)
Fri 1:00 p.m. - 1:30 p.m. | Invited Talk 5 (Talk)
Fri 1:30 p.m. - 2:00 p.m. | Invited Talk 6 (Talk)
Fri 2:00 p.m. - 3:00 p.m. | Poster Session 2 (Poster Session)
- Navigating Beyond the Dead End: A Math Problem Solving Framework by Switching among Diverse Reasoning Thoughts (Poster)
As large language models (LLMs) have shown effectiveness with different prompting methods, such as Chain of Thought and Program of Thought, we find that these methods are highly complementary on math reasoning tasks. In this work, we propose XoT, an automatic problem-solving framework that prompts LLMs with diverse reasoning thoughts. For each question, XoT begins by selecting the most suitable method and then executes each method iteratively. Within each iteration, XoT actively checks the validity of the generated answer and incorporates feedback from external executors, allowing it to dynamically switch among different prompting methods. Through extensive experiments on 9 popular math reasoning datasets, we demonstrate the effectiveness of our proposed approach and thoroughly analyze the strengths of each module. Furthermore, empirical results suggest that our framework is orthogonal to recent work that improves single reasoning methods. By allowing method switching, XoT provides a fresh perspective on the collaborative integration of diverse reasoning thoughts in a unified framework.
Tengxiao Liu · Qipeng Guo · Yuqing Yang · Xiangkun Hu · Yue Zhang · Xipeng Qiu · Zheng Zhang
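A minimal sketch of the switching loop this abstract describes, not the authors' implementation: prompt_llm, run_executor, and answer_is_valid are hypothetical stand-ins for the LLM call, the external executor, and the answer check, and the method list is illustrative.

METHODS = ["chain_of_thought", "program_of_thought", "equation_of_thought"]  # illustrative only

def solve_with_switching(question, prompt_llm, run_executor, answer_is_valid):
    """Try prompting methods in turn, switching whenever the checked answer fails."""
    last_answer = None
    for method in METHODS:
        reasoning = prompt_llm(question, method=method)   # draft a solution in this format
        last_answer, feedback = run_executor(reasoning)   # e.g. run generated code externally
        if answer_is_valid(last_answer, feedback):        # accept only validated answers
            return last_answer, method
    return last_answer, METHODS[-1]                       # fall back to the final attempt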
- Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search (Poster)
This work studies a central extremal graph theory problem inspired by a 1975 conjecture of Erdős, which aims to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this problem as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum---jump-starting the search for larger graphs using good graphs found at smaller sizes---we improve the state-of-the-art lower bounds for several sizes. Lastly, we propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs. |
Abbas Mehrabian · Ankit Anand · Hyunjik Kim · Nicolas Sonnerat · Tudor Berariu · Matej Balog · Gheorghe Comanici · Andrew Lee · Anian Ruoss · Anna Bulanova · Daniel Toyama · Sam Blackwell · Bernardino Romera-Paredes · Laurent Orseau · Petar Veličković · Anurag Murty Naredla · Joonkyung Lee · Adam Wagner · Doina Precup
- Augmenting Large Language Models with Symbolic Rule Learning for Robust Numerical Reasoning (Poster)
While some prompting strategies have been proposed to elicit reasoning in Large Language Models (LLMs), numerical reasoning for machine reading comprehension remains a difficult challenge. We propose a neuro-symbolic approach that uses in-context learning with LLMs to decompose complex questions into simpler ones and symbolic learning methods to learn rules for recomposing partial answers. We evaluate it on different numerical subsets of the DROP benchmark; results show that it is competitive with DROP-specific SOTA models and significantly improves results over pure LLM prompting methods. Our approach boasts data efficiency, since it does not involve any additional training or fine-tuning. Additionally, the neuro-symbolic approach facilitates robust numerical reasoning; the model is faithful to the passage it has been presented, and provides interpretable and verifiable reasoning traces. |
Hadeel Al-Negheimish · Pranava Madhyastha · Alessandra Russo
- Understanding Length Generalization by Thinking Like Transformers - Poster (Poster)
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. In this work, we focus on length generalization, and we propose a unifying framework to understand when and how Transformers can be expected to length generalize on a given task. First, we show that there exist algorithmic tasks for which standard decoder-only Transformers trained from scratch naturally exhibit strong length generalization. For these tasks, we leverage the RASP programming language (Weiss et al., 2021) to show that the correct algorithmic solution which solves the task can be represented by a simple Transformer. We thus propose and give evidence for the RASP-Generalization Conjecture: Transformers tend to learn a length-generalizing solution if there exists a short RASP-L program that works for all input lengths. We then leverage our insights to give new scratchpad formats which yield strong length generalization on traditionally hard tasks (such as parity and addition). Overall, our work provides a novel perspective on the mechanisms of length generalization and the algorithmic capabilities of Transformers. |
Hattie Zhou · Arwen Bradley · Etai Littwin · Noam Razin · Omid Saremi · Joshua Susskind · Samy Bengio · Preetum Nakkiran
- Temperature-scaled large language models for Lean proofstep prediction (Poster)
Leveraging the reasoning capabilities of large language models (LLMs) for theorem proving is a promising but challenging task because it requires in-domain finetunings on which LLMs are known to be prone to overfit. This issue is exacerbated by two properties that set theorem proving apart from more mainstream applications of LLMs: training data in formal environments like Lean or Isabelle is very scarce and evaluation benchmarks are prohibitively costly to be used extensively for hyperparameter search and model selection. In this work, we propose temperature scaling as a regularization method for multi-epoch training on small datasets. We explain its theoretical purpose heuristically and demonstrate its effectiveness empirically, obtaining state-of-the-art supervised tactic generation models for Lean 3 of sizes 1.5B, 7B and 13B parameters. Model selection based on temperature-scaled perplexity increases scores on theorem proving benchmarks by up to four percentage points. We provide detailed ablations and analyses of the proof search behaviors of the resulting models, allowing practitioners to pick optimal model sizes for their respective use cases. |
Fabian Gloeckle · Baptiste Roziere · Amaury Hayat · Gabriel Synnaeve
- Probabilistic Abduction for Visual Abstract Reasoning via Learning Rules in Vector-symbolic Architectures (Poster)
Abstract reasoning is a cornerstone of human intelligence, and replicating it with artificial intelligence (AI) presents an ongoing challenge. This study focuses on efficiently solving Raven's progressive matrices (RPM), a visual test for assessing abstract reasoning abilities, by using distributed computation and operators provided by vector-symbolic architectures (VSA). Instead of hard-coding the rule formulations associated with RPMs, our approach can learn the VSA rule formulations (hence the name Learn-VRF) with just one pass through the training data. Yet, our approach, with compact parameters, remains transparent and interpretable. Learn-VRF yields accurate predictions on I-RAVEN's in-distribution data, and exhibits strong out-of-distribution capabilities concerning unseen attribute-rule pairs, significantly outperforming pure connectionist baselines including large language models. |
Michael Hersche · Francesco Di Stefano · Thomas Hofmann · Abu Sebastian · Abbas Rahimi
- Reinforcement Learning in Control Theory: A New Approach to Mathematical Problem Solving (Poster)
One of the central questions in control theory is achieving stability through feedback control. This paper introduces a novel approach that combines Reinforcement Learning (RL) with mathematical analysis to address this challenge, with a specific focus on the Sterile Insect Technique (SIT) system. The objective is to find a feedback control that stabilizes the mosquito population model. Despite the mathematical complexities and the absence of known solutions for this specific problem, our RL approach identifies a candidate solution. This study underscores the synergy between AI and mathematics, opening new avenues for tackling intricate mathematical problems. |
Kala Bidi · Jean-Michel Coron · Amaury Hayat · Nathan Lichtlé
- Can We Count on Deep Learning: Exploring and Characterizing Combinatorial Structures Using Machine Learning (Poster)
With its exceptional pattern matching ability, deep learning has proven to be a powerful tool in a range of scientific domains. This is increasingly true in research mathematics, where recent work has demonstrated deep learning's ability to highlight subtle connections between mathematical objects that might escape a human expert. In this work we describe a simple method to help domain experts characterize a set of mathematical objects using deep learning. Such characterization problems often occur when some particular class of function, space, linear representation, etc. naturally emerges in calculations or other means but lacks a simple description. The goal is to find simple rules that also ideally shed light on the underlying mathematics. Our method, which we call Feature Attribution Clustering for Exploration (FACE), clusters the feature attribution representations extracted from a trained model, arriving at a short list of prototype attributions that the domain expert can then try to convert into formal and rigorous rules. As a case study, we use our method to derive a new result in combinatorics by characterizing a subset of 0-1 matrices that corresponds to certain representations of permutations known as two-sided ordered words. |
Helen Jenne · Herman Chau · Davis Brown · Jackson Warley · Tim Doster · Henry Kvinge
- Solving Math Word Problems by Combining Language Models With Symbolic Solvers (Poster)
Automatically generating high-quality step-by-step solutions to math word problems has many applications in education. Recently, combining large language models (LLMs) with external tools to perform complex reasoning and calculation has emerged as a promising direction for solving math word problems, but prior approaches such as Program-Aided Language model (PAL) are biased towards simple procedural problems and less effective for problems that require declarative reasoning. We propose an approach that combines an LLM that can incrementally formalize word problems as a set of variables and equations with an external symbolic solver that can solve the equations. Our approach achieves comparable accuracy to the original PAL on the GSM8K benchmark of math word problems and outperforms PAL by an absolute 20% on ALGEBRA, a new dataset of more challenging word problems extracted from Algebra textbooks. Our work highlights the benefits of using declarative and incremental representations when interfacing with an external tool for solving complex math word problems. Our data and prompts are publicly available at this https URL. |
Joy He-Yueya · Gabriel Poesia · Rose Wang · Noah Goodman
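A minimal sketch of the declarative pipeline described above, assuming the LLM has already emitted variables and equations; the toy word problem, variable names, and equations here are ours, and only the SymPy call is standard.

import sympy

# Suppose the LLM formalized "Ann has 3 more apples than Bob; together they have 11":
ann, bob = sympy.symbols("ann bob")
equations = [
    sympy.Eq(ann, bob + 3),   # "Ann has 3 more apples than Bob"
    sympy.Eq(ann + bob, 11),  # "together they have 11 apples"
]

# The external symbolic solver handles the actual equation solving:
print(sympy.solve(equations, [ann, bob], dict=True))  # [{ann: 7, bob: 4}]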
- WAMP: A Competition-Level Dataset for Assessing the Mathematical Reasoning Capabilities of LLMs (Poster)
Recent large language models (LLMs) have shown indications of nontrivial mathematical reasoning ability. However, it is unclear how, especially for challenging math problems, the reasoning procedures are generated inside a model. In this paper, we propose a challenging dataset for better understanding the mathematical reasoning capabilities of LLMs. The Web of Annotated Math Problems, or WAMP, consists of competition-level math problems annotated with ''knowledge pieces'', or general math facts, and ''hints'', or problem-specific tricks. This ''web'' of problems is structured such that each problem is represented by a node in the graph, and is connected to its associated nodes of knowledge pieces and hints. In addition to graph construction, we manually annotate step-wise solutions for each problem. As an initial step, we apply this dataset to investigate whether augmenting the problem with relevant knowledge pieces and hints can improve model performance. |
Yujun Mao · Yoon Kim · Yilun Zhou
- AI for Mathematics: A Cognitive Science Perspective (Poster)
Mathematics is one of the most powerful conceptual systems developed and used by the human species. Dreams of automated mathematicians have a storied history in artificial intelligence (AI). Rapid progress in AI, particularly propelled by advances in large language models (LLMs), has sparked renewed, widespread interest in building such systems. In this work, we reflect on these goals from a cognitive science perspective. We call attention to several classical and ongoing research directions from cognitive science, which we believe are valuable for AI practitioners to consider when seeking to build truly human (or superhuman)-level mathematical systems. We close with open discussions and questions that we believe necessitate a multi-disciplinary perspective, with cognitive scientists working in tandem with AI researchers and mathematicians, as we move toward better mathematical AI systems that not only help us push the frontier of mathematics, but also offer glimpses into how we as humans are even capable of such great cognitive feats.
Cedegao (Ced) Zhang · Katie Collins · Adrian Weller · Josh Tenenbaum
- A Language-Agent Approach to Formal Theorem-Proving (Poster)
Language agents, which use a large language model (LLM) capable of in-context learning to interact with an external environment, have emerged as a promising approach to control tasks. We present a language-agent approach that offers state-of-the-art performance in formal theorem-proving. Our method, COPRA, uses a high-capacity, black-box LLM (GPT-4) as part of a policy for a stateful backtracking search. During the search, the policy can select proof tactics and retrieve lemmas and definitions from an external database. Each selected tactic is executed in the underlying proof framework, and the execution feedback is used to build the prompt for the next policy invocation. The search also tracks selected information from its history and uses it to reduce hallucinations and unnecessary LLM queries. We evaluate COPRA on the miniF2F benchmark for Lean and a set of Coq tasks from the CompCert project. On these benchmarks, COPRA is significantly better than one-shot invocations of GPT-4, as well as state-of-the-art models fine-tuned on proof data, at finding correct proofs quickly.
Amitayush Thakur · Yeming Wen · Swarat Chaudhuri
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts (Poster)
Although Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive skills in various domains, their ability for mathematical reasoning within visual contexts has not been formally examined. Equipping LLMs and LMMs with this capability is vital for general-purpose AI assistants and showcases promising potential in education, data analysis, and scientific discovery. To bridge this gap, we present MathVista, a benchmark designed to amalgamate challenges from diverse mathematical and visual tasks. We first taxonomize the key task types, reasoning skills, and visual contexts from the literature to guide our selection from 28 existing math-focused and visual question answering datasets. Then, we construct three new datasets, IQTest, FunctionQA, and PaperQA, to accommodate for missing types of visual contexts. The problems featured often require deep visual understanding beyond OCR or image captioning, and compositional reasoning with rich domain-specific tools, thus posing a notable challenge to existing models. We conduct a comprehensive evaluation of 11 prominent open-source and proprietary foundation models (LLMs, LLMs augmented with tools, and LMMs), and early experiments with GPT-4V. The best-performing model, Multimodal Bard, achieves only 58% of human performance (34.8% vs 60.3%), indicating ample room for further improvement. Given this significant gap, MathVista fuels future research in the development of general-purpose AI agents capable of tackling mathematically intensive and visually rich real-world tasks. Preliminary tests show that MathVista also presents challenges to GPT-4V, underscoring the benchmark's importance. |
Pan Lu · Hritik Bansal · Tanglin Xia · Jiacheng Liu · Chunyuan Li · Hannaneh Hajishirzi · Hao Cheng · Kai-Wei Chang · Michel Galley · Jianfeng Gao
- ToolDec: Syntax Error-Free and Generalizable Tool Use for LLMs via Finite-State Decoding (Poster)
Large Language Models (LLMs) have shown promising capabilities in using external tools. However, existing approaches rely on fine-tuning or in-context learning to use tools, which make syntactic mistakes and are difficult to generalize. In this paper, we propose ToolDec, a finite-state machine-guided decoding algorithm for tool-augmented LLMs. ToolDec eliminates tool-related errors by ensuring valid tool names and type-conforming arguments. Furthermore, ToolDec enables LLMs to effectively select tools using only the information contained in their names, with no need for tool-specific fine-tuning. Our experiments on multiple word problem datasets show that ToolDec reduces syntactic errors to zero, consequently achieving significantly better performance and as much as a 2x speedup. We also show that ToolDec achieves superior generalization performance on unseen tools, performing up to 8x better than the baseline.
Hongqiao Chen · Kexun Zhang · Lei Li · William Yang Wang
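A simplified, character-level sketch of the constrained-decoding idea; the paper works at the token level with a finite-state machine, and this toy version, the tool vocabulary, and the score_next_char scorer are all hypothetical illustrations.

TOOLS = ["add", "subtract", "multiply", "divide"]  # hypothetical tool vocabulary

def allowed_next_chars(prefix):
    """Characters that keep the partial tool name consistent with some valid tool."""
    return {tool[len(prefix)] for tool in TOOLS
            if tool.startswith(prefix) and len(tool) > len(prefix)}

def constrained_decode(score_next_char):
    """Greedy decoding restricted so that only valid tool names can be produced."""
    prefix = ""
    while prefix not in TOOLS:
        candidates = allowed_next_chars(prefix)
        if not candidates:
            break
        # score_next_char(prefix, c) stands in for the model's log-probability of c
        prefix += max(candidates, key=lambda c: score_next_char(prefix, c))
    return prefix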
- Teaching small transformers to rewrite ZX diagrams (Poster)
ZX calculus is a graphical language for reasoning about linear maps. Maps are represented as graphs, and reasoning amounts to graph rewrites. The main applications of ZX calculus are in quantum computation. We train small transformers to simplify ZX graphs, i.e. perform resource optimisation of quantum circuits. Preliminary experiments show that transformers can be trained to simplify CNOT and Clifford circuits with high accuracy. These are the simplest kinds of ZX graphs, in the sense that there exists an efficient rewrite strategy. We also show evidence that transformers learn to simplify the more complex Clifford+T graphs, for which in general there does not exist an efficient simplification algorithm. |
Francois Charton · Alexandre Krajenbrink · Konstantinos Meichanetzidis · Richie Yeung
- Teaching Arithmetic to Small Transformers (Poster)
Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how even small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and parameter scaling. Additionally, we discuss the challenges associated with length generalization. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction loss for rapidly eliciting arithmetic capabilities. |
Nayoung Lee · Kartik Sreenivasan · Jason Lee · Kangwook Lee · Dimitris Papailiopoulos
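One concrete example of the kind of formatting change studied in this line of work is writing the answer with its digits reversed, so the model emits the least-significant digit first; the format below is our own toy version, not necessarily the exact one used in the paper.

import random

def reversed_addition_example(max_digits=3):
    """One training string for addition, with the answer written least-significant digit first."""
    a = random.randrange(10 ** max_digits)
    b = random.randrange(10 ** max_digits)
    return f"{a}+{b}={str(a + b)[::-1]}"

random.seed(0)
print([reversed_addition_example() for _ in range(3)])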
- Magnushammer: A Transformer-Based Approach to Premise Selection (Poster)
We present Magnushammer: a novel approach to premise selection -- a crucial task in automated theorem proving. Traditionally, symbolic methods that rely on domain knowledge and engineering effort are applied to this task. In contrast, this work demonstrates that contrastive training with the transformer architecture can achieve higher-quality retrieval of relevant premises, without the domain knowledge or feature engineering overhead. Magnushammer outperforms the most advanced and widely used automation tool in interactive theorem proving: Sledgehammer. On the PISA and miniF2F benchmarks Magnushammer achieves $59.5\%$ (against $38.3\%$) and $34.0\%$ (against $20.9\%$) success rates, respectively. By combining Magnushammer with a language-model-based theorem prover, we further improve the state-of-the-art proof success rate from $57.0\%$ to $71.0\%$ on the PISA benchmark. Moreover, we develop and open source a novel, large dataset for premise selection.
Maciej Mikuła · Szymon Antoniak · Szymon Tworkowski · Bartosz Piotrowski · Albert Q. Jiang · Jin Zhou · Christian Szegedy · Łukasz Kuciński · Piotr Miłoś · Yuhuai Wu
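A generic dense-retrieval sketch in the spirit of the approach, not the authors' code: embed stands in for the trained contrastive encoder, and premises are ranked by cosine similarity to the current proof state.

import numpy as np

def select_premises(proof_state, premises, embed, k=5):
    """Rank candidate premises by cosine similarity between embeddings."""
    state_vec = np.asarray(embed(proof_state), dtype=float)               # shape (d,)
    premise_mat = np.asarray([embed(p) for p in premises], dtype=float)   # shape (n, d)
    state_vec /= np.linalg.norm(state_vec)
    premise_mat /= np.linalg.norm(premise_mat, axis=1, keepdims=True)
    scores = premise_mat @ state_vec                                      # cosine similarities
    return [premises[i] for i in np.argsort(-scores)[:k]]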
- SIRD: Symbolic Integration Rules Dataset (Poster)
Advancements in neural networks and computer hardware lead to new use cases for deep learning in the natural sciences every day. Even though symbolic mathematics tasks have been explored, symbolic integration only has a few studies using black box models and currently lacks explainability. Symbolic integration is a challenging search problem and we obtain the final result by applying different integration rules such as integration by parts or u-substitution. We address this by proposing a novel and interpretable approach to perform symbolic integration using deep learning through integral rule prediction. To our knowledge, we contribute the first Symbolic Integration Rules Dataset (SIRD), comprising 2 million distinct functions and integration rule pairs. For complex rules such as u-substitution and integration by parts, our dataset also includes the expression to be substituted or used in the rule application. We also train a transformer model on our proposed dataset and incorporate it into Sympy's integral_steps function, resulting in 6 times fewer branches explored by allowing our model to guide the depth-first-search procedure. |
Vaibhav Sharma · Abhinav Nagpal · Muhammed Fatih Balin
- MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning (Poster)
Reasoning in mathematical domains remains a significant challenge for relatively small language models (LMs). Many current methods focus on specializing LMs in mathematical reasoning and rely heavily on knowledge distillation from powerful but inefficient large LMs (LLMs). In this work, we explore a new direction that avoids over-reliance on LLM teachers, introducing a multi-view fine-tuning method that efficiently exploits existing mathematical problem datasets with diverse annotation styles. Our approach uniquely considers the various annotation formats as different "views" and leverages them in training the model. By postpending distinct instructions to input questions, models can learn to generate solutions in diverse formats in a flexible manner. Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches that utilize knowledge distillation, as well as carefully established baselines. Additionally, the proposed method grants the models promising generalization ability across various views and datasets, and the capability to learn from inaccurate or incomplete noisy data. We hope our multi-view training paradigm could inspire future studies in other machine reasoning domains. |
Zhenwen Liang · Dian Yu · Xiaoman Pan · Wenlin Yao · Qingkai Zeng · Xiangliang Zhang · Dong Yu
- Exploration with Principles for Diverse AI Supervision (Poster)
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI. While this generative AI approach has produced impressive results, it heavily leans on human supervision. Even state-of-the-art AI models like ChatGPT depend on fine-tuning through human demonstrations, demanding extensive human input and domain expertise. This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation. To address this limitation, we propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data. Drawing inspiration from the principles of unsupervised reinforcement learning (RL) pretraining, EAI achieves exploration within the natural language space. We accomplish this by harnessing large language models to assess the novelty of generated content. Our approach employs two key components: an actor that generates novel content and a critic that evaluates the generated content, offering critiques to guide the actor. Empirical evaluations demonstrate that EAI significantly boosts model performance on complex reasoning tasks, addressing the limitations of human-intensive supervision. |
Hao Liu · Matei A Zaharia · Pieter Abbeel
- EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning (Poster)
Language models are achieving impressive performance on various tasks by aggressively adopting inference-time prompting techniques, such as zero-shot and few-shot prompting. In this work, we introduce EchoPrompt, a simple yet effective approach that prompts the model to rephrase its queries before answering them. EchoPrompt is adapted for both zero-shot and few-shot in-context learning with standard and chain-of-thought prompting. Experimental results show that EchoPrompt yields substantial improvements across all these settings for four families of causal language models. These improvements are observed across various numerical reasoning (e.g. GSM8K, SVAMP), reading comprehension (e.g. DROP), and logical reasoning (e.g. Coin Flipping) tasks. On average, EchoPrompt improves the Zero-shot-CoT performance of code-davinci-002 by 5% in numerical tasks and 13% in reading comprehension tasks. Our empirical results indicate that EchoPrompt is an effective technique that enhances in-context learning performance. We recommend incorporating EchoPrompt into various baseline prompting strategies to achieve performance boosts. |
Raja Sekhar Reddy Mekala · Yasaman Razeghi · Sameer Singh
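A minimal sketch of the rephrase-before-answering pattern; the instruction wording and the generate callable are illustrative placeholders, not necessarily the paper's exact template.

def echo_prompt(question):
    """Zero-shot prompt asking the model to restate the query before reasoning."""
    return (
        f"Q: {question}\n"
        "A: Let's repeat the question and also think step by step.\n"
    )

# With a hypothetical LLM wrapper `generate`, usage would look like:
# answer = generate(echo_prompt("A pencil costs 3 dollars and a pen costs twice as much. What do both cost together?"))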
- TinyGSM: achieving 80% on GSM8k with one billion parameters (Poster)
Small models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question. This work studies the performance of small models on mathematical reasoning. Specifically, for solving math word problems, we find that a 1.3B model can achieve 80.1% accuracy on GSM8K, outperforming existing models that are orders of magnitude larger, and even rivaling the performance of the GPT-3.5-turbo teacher model from which the training data is generated. Our approach is simple and has two key components: The first is the use of a GPT-3.5-turbo-generated synthetic dataset of math word problems with solutions, which we will fully release. The second component is the use of a verifier, which selects the final outputs from multiple candidate generations.
Bingbin Liu · Sebastien Bubeck · Ronen Eldan · Janardhan Kulkarni · Yuanzhi Li · Anh Nguyen · Rachel Ward · Yi Zhang
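A minimal sketch of the verifier-based selection described as the second component; generate_candidates and verifier_score are hypothetical stand-ins for sampling from the solver model and scoring with the trained verifier.

def select_answer(question, generate_candidates, verifier_score, n=16):
    """Sample n candidate solutions and keep the one the verifier scores highest."""
    candidates = generate_candidates(question, n=n)   # n sampled solutions
    return max(candidates, key=lambda c: verifier_score(question, c))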
- Basic Arithmetic Properties in the Space of Language Model Prompts (Poster)
Large pre-trained neural language models (LLMs) that can effectively utilize enormous amounts of unlabeled textual data have recently changed the whole field of Natural Language Processing. By utilizing prompting techniques enabled by the in-context learning capabilities, LLMs have been shown to perform on par with dedicated models trained for downstream tasks. One such task is numerical reasoning and, in particular, the ability to conduct basic arithmetic operations. The question we wish to answer is whether the basic properties of arithmetic operations, such as the commutative property, hold in the space of LLM prompts - does asking the LLM model to compute 13+37 vs 37+13 result, on average, in the same outcome? In contrast to previous works, which reported Accuracy only, we take a closer look (MAE, Pearson's R) at the error distribution to better understand the performance with regards to scaling laws. |
Mateusz Krubiński
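A small sketch of the kind of probe the abstract describes, assuming a hypothetical ask_sum(a, b) helper that returns the model's numeric answer to "a+b"; the operand range and sample size are arbitrary choices of ours.

import random
import numpy as np

def commutativity_probe(ask_sum, n_pairs=100, lo=10, hi=99):
    """Compare a model's answers to a+b and b+a over random operand pairs."""
    xs, ys = [], []
    for _ in range(n_pairs):
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        xs.append(ask_sum(a, b))   # model's answer to "a+b"
        ys.append(ask_sum(b, a))   # model's answer to "b+a"
    mae = float(np.mean(np.abs(np.array(xs) - np.array(ys))))
    pearson_r = float(np.corrcoef(xs, ys)[0, 1])
    return mae, pearson_r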
- llmstep: LLM proofstep suggestions in Lean (Poster)
We present $\texttt{llmstep}$, a tool for suggesting proof steps with a language model in the Lean 4 proof assistant. $\texttt{llmstep}$ is a Lean 4 tactic that sends a user's proof state to a server hosting a language model. The language model generates suggestions, which are checked in Lean and displayed to a user in their development environment. We provide a baseline language model, along with code for fine-tuning and evaluation to support further development. We provide server implementations that run on CPU, a CUDA GPU, or a Google Colab notebook, as a step towards fast, effective language model suggestions for any user.
Sean Welleck · Rahul Saha
- Lemur: Integrating Large Language Models in Automated Program Verification (Oral)
The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that typically demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks. |
Haoze Wu · Clark Barrett · Nina Narodytska
- Learning Multi-Step Reasoning by Solving Arithmetic Tasks (Poster)
The original version of this paper has been accepted at ACL 2023 as a main conference paper. This paper is modified based on the ACL version to meet the requirements of the NeurIPS 2023 Workshop MATH-AI. Mathematical reasoning is regarded as a necessary ability for Language Models (LMs). Recent works demonstrate large LMs' impressive performance in solving math problems. The success is attributed to their Chain-of-Thought (CoT) reasoning abilities, i.e., the ability to decompose complex questions into step-by-step reasoning chains, but such ability seems only to emerge from models with abundant parameters. This work investigates how to equip relatively small LMs with multi-step reasoning capabilities. We propose to inject such abilities by continually pre-training LMs on a synthetic dataset, MsAT, which is composed of Multi-step Arithmetic Tasks. Our experiments on four math word problem datasets show the effectiveness of the proposed method in enhancing LMs' math reasoning abilities.
Tianduo Wang · Wei Lu
- Continual Learning and Out of Distribution Generalization in a Systematic Reasoning Task (Poster)
Humans have the remarkable ability to rapidly learn new problem solving strategies from a narrow range of examples and extend to examples out of the distribution (OOD) used in learning, but such generalization remains a challenge for neural networks. This seems especially important for learning new mathematical techniques, which apply to huge problem spaces (e.g. all real numbers). We explore this limitation by training neural networks on strategies for solving specified cells in $6\times6$ Sudoku puzzles using a novel curriculum of tasks that build upon each other. We train transformers sequentially on two preliminary tasks, then assess OOD generalization of a more complex solution strategy from a range of restricted training distributions. Baseline models master the training distribution, but fail to generalize to OOD data. However, we find that a combination of extensions is sufficient to support highly accurate and reliable OOD generalization. These results suggest directions for improving the robustness of larger transformer models under the highly imbalanced data distributions provided by natural data sets.
Mustafa Abdool · Andrew Nam · James McClelland
- Vertical AI-driven Scientific Discovery (Poster)
Automating scientific discovery has been a grand goal of Artificial Intelligence (AI) and will bring tremendous societal impact if it succeeds. Despite exciting progress, most endeavor in learning scientific equations from experiment data focuses on the horizontal discovery paths, i.e., they directly search for the best equation in the full hypothesis space. Horizontal paths are challenging because of the associated exponentially large search space. Our work explores an alternative vertical path, which builds scientific equations in an incremental way, starting from one that models data in control variable experiments in which most variables are held as constants. It then extends expressions learned in previous generations via adding new independent variables, using new control variable experiments in which these variables are allowed to vary. This vertical path was motivated by human scientific discovery processes. Experimentally, we demonstrate that such vertical discovery paths expedite symbolic regression. It also improves learning physics models describing nano-structure evolution in computational materials science. |
Yexiang Xue
- Spoken Language Understanding Evaluations for Home-Based Basic Math Learning (Poster)
Enriching the quality of early childhood education with interactive math learning at home systems, empowered by recent advances in conversational AI technologies, is slowly becoming a reality. With this motivation, we implement a multimodal dialogue system to support play-based learning experiences at home, guiding kids to master basic math concepts. This work explores Spoken Language Understanding (SLU) pipeline within a task-oriented dialogue system, with cascading Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) components evaluated on our home deployment data with kids going through gamified math learning activities. We validate the advantages of a multi-task architecture for NLU and experiment with a diverse set of pretrained language representations for Intent Recognition and Entity Extraction in the math learning domain. To recognize kids' speech in realistic home environments, we investigate several ASR systems, including the Google Cloud and the latest open-source Whisper solutions with varying model sizes. We evaluate the SLU pipeline by testing our best-performing NLU models on noisy ASR output to inspect the challenges of understanding children for math learning in authentic homes. |
Eda Okur · Saurav Sahay · Nachman
- LLMs vs ITP (Poster)
Previous work has shown that large language models (LLMs) can function as searchable mathematical knowledge bases, as during their pre/finetuning phase, they ingest a large number of mathematical books and articles. The only other searchable mathematical knowledge bases of comparable size and range of mathematical difficulty levels are the libraries that are associated with interactive theorem provers (ITPs). This work is the first that compares these two forms of mathematical knowledge bases. Aside from its intrinsic interest, such a comparison is necessary to lay the groundwork for the accelerating trend of merging LLM and ITP technology. We focus on GPT-4 as the LLM of choice and use Claude 2 as a fallback if GPT-4 cannot solve a task, and compare the combined mathematical knowledge bases of GPT-4 and Claude 2 to proof libraries associated with ITPs like Isabelle, Coq, and Lean. We release our dataset publicly.
Simon Frieder · Martin Trimmel · Rashid Alawadhi · Klaus Gy
- Towards Large Language Models as Copilots for Theorem Proving in Lean (Poster)
Theorem proving is an important challenge for large language models (LLMs), as formal proofs can be checked rigorously by proof assistants such as Lean, leaving no room for hallucination. Existing LLM-based provers try to prove theorems in a fully autonomous mode without human intervention. In this mode, they struggle with novel and challenging theorems, for which human insights may be critical. In this paper, we explore LLMs as copilots that assist humans in proving theorems. We introduce Lean Copilot, a framework for running neural network inference in Lean. It enables programmers to build various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users. Using Lean Copilot, we build tools for suggesting proof steps and completing intermediate proof goals using LLMs. Experimental results demonstrate the effectiveness of our method in assisting humans compared to existing rule-based proof automation in Lean. |
Peiyang Song · Kaiyu Yang · Animashree Anandkumar
- CNN models' sensitivity to numerosity concepts (Poster)
The nature of number is a classic question in the philosophy of mathematics. Cognitive science research shows that numbers are mentally represented as magnitudes organized as a mental number line (MNL). Here we ask whether CNN models, in learning to classify images, learn about number and numerosity 'for free'. This was the case. The models show the distance, size, and ratio effects that are the signatures of magnitude representations in humans. An MDS analysis of their latent representations found a close resemblance to the MNL documented in people. These findings challenge the developmental science proposal that numbers are part of the ‘core knowledge’ that all human infants possess. |
Neha Upadhyay · Sashank Varma
- Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models (Poster)
Large language models (LLMs) have achieved remarkable progress in solving various natural language processing tasks due to emergent reasoning abilities. However, LLMs have inherent limitations as they are incapable of accessing up-to-date information (stored on the Web or in task-specific knowledge bases), using external tools, and performing precise mathematical and logical reasoning. In this paper, we present Chameleon, an AI system that mitigates these limitations by augmenting LLMs with plug-and-play modules for compositional reasoning. Chameleon synthesizes programs by composing various tools (e.g., LLMs, off-the-shelf vision models, web search engines, Python functions, and heuristic-based modules) for accomplishing complex reasoning tasks. At the heart of Chameleon is an LLM-based planner that assembles a sequence of tools to execute to generate the final response. We showcase the effectiveness of Chameleon on two multi-modal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. Chameleon, powered by GPT-4, achieves an 86.54% overall accuracy on ScienceQA, improving the best published few-shot result by 11.37%. On TabMWP, GPT-4-powered Chameleon improves the accuracy by 17.0%, lifting the state of the art to 98.78%. Our analysis also shows that the GPT-4-powered planner exhibits more consistent and rational tool selection via inferring potential constraints from instructions, compared to a ChatGPT-powered planner. |
Pan Lu · Baolin Peng · Hao Cheng · Michel Galley · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Jianfeng Gao
- SCIBENCH: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models (Poster)
Recent advances in Large Language Models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks only contain problems grounded in junior and senior high school subjects, contain only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations. To address these issues, this paper introduces an expansive benchmark suite Scibench that aims to systematically examine the reasoning capabilities required for solving complex scientific problems. Scibench contains two datasets: an open set featuring a range of collegiate-level scientific problems, and a closed set comprising problems from undergraduate-level exams. Based on the two datasets, we conduct an in-depth benchmarking study of five representative LLMs with various prompting strategies. Furthermore, through a detailed user study, we show that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills.
Xiaoxuan Wang · Ziniu Hu · Pan Lu · Yanqiao Zhu · Jieyu Zhang · Satyen Subramaniam · Arjun Loomba · Shichang Zhang · Yizhou Sun · Wei Wang
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting (Poster)
Prior work has combined chain-of-thought prompting in large language models (LLMs) with programmatic representations to perform reasoning. While such an approach works well for tasks that only require forward reasoning (e.g., straightforward arithmetic), it is less effective for problems that require more sophisticated planning and search. In this paper, we propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of LLMs. We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer. By offloading the actual reasoning task to an automated theorem prover, our approach can guarantee the correctness of the answer with respect to the parsed specification and avoid planning errors in the solving process. We evaluate SatLM on 6 datasets and show that it consistently outperforms program-aided LMs in an imperative paradigm. In particular, SatLM outperforms program-aided LMs by more than 20% on a challenging subset of the GSM arithmetic reasoning dataset; SatLM also achieves a new SoTA on LSAT and BoardgameQA.
Xi Ye · Qiaochu Chen · Isil Dillig · Greg Durrett
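A minimal sketch of the declarative paradigm: instead of an imperative program, the model would emit constraints that an off-the-shelf solver discharges. The toy specification below is hand-written in Z3's Python API; the problem and variable names are ours, not from the paper.

import z3

# A specification an LLM might emit for: "One number plus twice another is 14;
# their difference is 2. Find both numbers."
x, y = z3.Ints("x y")
solver = z3.Solver()
solver.add(x + 2 * y == 14, x - y == 2)

if solver.check() == z3.sat:
    model = solver.model()
    print(model[x], model[y])  # 6 4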
- ARB: Advanced Reasoning Benchmark for Large Language Models (Poster)
Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks. However, many of these benchmarks are losing utility as LLMs get increasingly high scores, despite not yet reaching expert performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems in multiple fields. ARB presents a more challenging test than prior benchmarks, featuring problems in mathematics, physics, biology, chemistry, and law. As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. We evaluate recent models such as GPT-4 and Claude on ARB and demonstrate that current models score well below 50% on more demanding tasks. In order to improve both automatic and assisted evaluation capabilities, we introduce a rubric-based evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps. We find promising agreement between annotators and GPT-4 rubric evaluation scores. |
Tom Sawada · Daniel Paleka · Alexander Havrilla · Pranav Tadepalli · Paula Vidas · Alexander Kranias · John Nay · Kshitij Gupta · Aran Komatsuzaki
- Learning the greatest divisor - Explainable predictions in transformers (Oral)
We train small transformers to calculate the greatest common divisor (GCD) of two positive integers, and show that their predictions are fully explainable. During training, models learn a list $\mathcal D$ of divisors, and predict the largest element of $\mathcal D$ that divides both inputs. We also show that training distributions have a large impact on performance. Models trained from uniform operands only learn a handful of GCD (up to $38$ out of $100$). Training from log-uniform operands boosts performance to $73$ correct GCD, and training from a log-uniform distribution of GCD to $91$.
Francois Charton
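A small sketch mirroring the explanation above: given a learned divisor list D, the prediction is the largest element of D dividing both operands. The particular list below is ours, purely for illustration.

def predicted_gcd(a, b, learned_divisors):
    """Largest element of the learned divisor list dividing both inputs (1 if none)."""
    valid = [d for d in learned_divisors if a % d == 0 and b % d == 0]
    return max(valid, default=1)

D = [1, 2, 4, 5, 8, 10, 20, 25, 40, 50, 100]  # illustrative learned list
print(predicted_gcd(60, 140, D))  # 20, which is also the true GCD
print(predicted_gcd(21, 35, D))   # 1, although the true GCD is 7 (7 is not in D)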
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text - Poster (Poster)
There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, open-sourced and released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.
Keiran Paster · Marco Dos Santos · Zhangir Azerbayev · Jimmy Ba
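A tiny sketch of one step mentioned above, exact deduplication by hashing normalized text; the real OpenWebMath pipeline is considerably more involved, and the normalization rule here is ours.

import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies collide."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(documents):
    """Keep the first document seen for each normalized-content hash."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept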
- Solving Math Word Problems with Reexamination (Poster)
Math word problem (MWP) solving aims to understand a descriptive math problem and calculate the result; previous efforts have mostly been devoted to upgrading different technical modules. This paper brings a different perspective by introducing a reexamination process during training, using a pseudo-dual task to enhance MWP solving. We propose a pseudo-dual (PseDual) learning scheme to model this process; the scheme is model-agnostic and can thus be adapted to any existing MWP solvers. The pseudo-dual task is specifically defined as filling the numbers in the expression back into the original word problem with its numbers masked. To facilitate the effective joint learning of the two tasks, we further design a scheduled fusion strategy for the number infilling task, which smoothly switches the input from the ground-truth math expressions to the predicted ones. Our pseudo-dual learning scheme has been tested and proven effective when equipped in several representative MWP solvers through empirical studies. The codes and trained models are available at: https://github.com/xxx/xxx.
Yi Bin · WENHAO SHI · Yujuan Ding · Yang Yang · See-Kiong Ng
- Discovering Lyapunov functions with transformers (Poster)
We consider a long-standing open problem in mathematics: discovering the Lyapunov functions that control the global stability of dynamical systems. We propose a method for generating training data, and train sequence-to-sequence transformers to predict the Lyapunov functions of polynomial and non-polynomial systems with high accuracy. We also introduce a new baseline for this problem, and show that our models achieve state-of-the-art results, and outperform approximation based techniques and sum-of-square algorithmic routines. |
Alberto Alfarano · Francois Charton · Amaury Hayat
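For readers outside control theory, the object being searched for here is standard: a Lyapunov function for a system $\dot{x} = f(x)$ with equilibrium $x^*$ is a function $V$ satisfying $V(x^*) = 0$, $V(x) > 0$ for $x \neq x^*$, and $\dot{V}(x) = \nabla V(x) \cdot f(x) \le 0$ along trajectories; when $V$ is also radially unbounded and $\dot{V} < 0$ away from $x^*$, it certifies global asymptotic stability. The models described above are trained to propose such a $V$ directly from the system description.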
- Llemma: An Open Language Model For Mathematics (Poster)
We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known openly released models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments. |
Zhangir Azerbayev · Hailey Schoelkopf · Keiran Paster · Marco Dos Santos · Stephen McAleer · Albert Q. Jiang · Jia Deng · Stella Biderman · Sean Welleck
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text - Oral (Oral)
There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, open-sourced and released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.
- Understanding Length Generalization by Thinking Like Transformers - Oral (Oral)
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. In this work, we focus on length generalization, and we propose a unifying framework to understand when and how Transformers can be expected to length generalize on a given task. First, we show that there exist algorithmic tasks for which standard decoder-only Transformers trained from scratch naturally exhibit strong length generalization. For these tasks, we leverage the RASP programming language (Weiss et al., 2021) to show that the correct algorithmic solution which solves the task can be represented by a simple Transformer. We thus propose and give evidence for the RASP-Generalization Conjecture: Transformers tend to learn a length-generalizing solution if there exists a short RASP-L program that works for all input lengths. We then leverage our insights to give new scratchpad formats which yield strong length generalization on traditionally hard tasks (such as parity and addition). Overall, our work provides a novel perspective on the mechanisms of length generalization and the algorithmic capabilities of Transformers. |
- Lemur: Integrating Large Language Models in Automated Program Verification - Poster (Poster)
The demonstrated code-understanding capability of LLMs raises the question of whether they can be used for automated program verification, a task that typically demands high-level abstract reasoning about program properties that is challenging for verification tools. We propose a general methodology to combine the power of LLMs and automated reasoners for automated program verification. We formally describe this methodology as a set of derivation rules and prove its soundness. We instantiate the calculus as a sound automated verification procedure, which led to practical improvements on a set of synthetic and competition benchmarks. |
- Learning the greatest divisor - Explainable predictions in transformers - Poster (Poster)
We train small transformers to calculate the greatest common divisor (GCD) of two positive integers, and show that their predictions are fully explainable. During training, models learn a list $\mathcal D$ of divisors, and predict the largest element of $\mathcal D$ that divides both inputs. We also show that training distributions have a large impact on performance. Models trained from uniform operands only learn a handful of GCD (up to $38$ out of $100$). Training from log-uniform operands boosts performance to $73$ correct GCD, and training from a log-uniform distribution of GCD to $91$.
Author Information
Zhenwen Liang (University of Notre Dame)
Albert Q. Jiang (University of Cambridge; Mistral AI)
Katie Collins (University of Cambridge)
Pan Lu (UCLA)
Kaiyu Yang (California Institute of Technology)

Kaiyu Yang is a postdoctoral researcher at Caltech in the Computing + Mathematical Sciences (CMS) Department, working with Prof. Anima Anandkumar. His research aims to make machine learning capable of symbolic reasoning. It includes (1) applying machine learning to symbolic reasoning tasks, such as automated theorem proving; and (2) introducing symbolic components into machine learning models to make them more interpretable, verifiable, and data-efficient. In addition, he has also worked on constructing and analyzing machine learning datasets, especially focusing on fairness, privacy, and mitigating dataset bias. His research is recognized with a Siebel Scholar award. Before joining Caltech, he received his Ph.D. from the Department of Computer Science at Princeton University, advised by Prof. Jia Deng.
Sean Welleck (University of Washington)
James McClelland (Stanford University)
More from the Same Authors
-
2021 : NaturalProofs: Mathematical Theorem Proving in Natural Language »
Sean Welleck · Jiacheng Liu · Ronan Le Bras · Hanna Hajishirzi · Yejin Choi · Kyunghyun Cho -
2021 : IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning »
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu -
2021 : Theorem-Aware Geometry Problem Solving with Symbolic Reasoning and Theorem Prediction »
Pan Lu · Ran Gong · Shibiao Jiang · Liang Qiu · Siyuan Huang · Xiaodan Liang · Song-Chun Zhu · Ran Gong -
2021 : Towards Diagram Understanding and Cognitive Reasoning in Icon Question Answering »
Pan Lu · Liang Qiu · Jiaqi Chen · Tanglin Xia · Yizhou Zhao · Wei Zhang · Zhou Yu · Xiaodan Liang · Song-Chun Zhu -
2021 : Can Network Flatness Explain the Training Speed-Generalisation Connection? »
Albert Q. Jiang · Clare Lyle · Lisa Schut · Yarin Gal -
2022 : Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks »
Yuxuan (Effie) Li · James McClelland -
2022 : MWP-BERT: A Numeracy-augmented Pre-trained Encoder for Math Word Problems »
Zhenwen Liang · Jipeng ZHANG · Lei Wang · Wei QIN · Jie Shao · Xiangliang Zhang -
2022 : Learn to Select Good Examples with Reinforcement Learning for Semi-structured Mathematical Reasoning »
Pan Lu · Liang Qiu · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Tanmay Rajpurohit · Peter Clark · Ashwin Kalyan -
2022 : Learning to Reason With Relational Abstractions »
Andrew Nam · James McClelland · Mengye Ren · Chelsea Finn -
2022 : Out-of-Distribution Generalization in Algorithmic Reasoning Through Curriculum Learning »
Andrew Nam · Mustafa Abdool · Trevor Maxfield · James McClelland -
2022 : LILA: A Unified Benchmark for Mathematical Reasoning »
Swaroop Mishra · Matthew Finlayson · Pan Lu · Leonard Tang · Sean Welleck · Chitta Baral · Tanmay Rajpurohit · Oyvind Tafjord · Ashish Sabharwal · Peter Clark · Ashwin Kalyan -
2023 : AI for Mathematics: A Cognitive Science Perspective »
Cedegao (Ced) Zhang · Katie Collins · Adrian Weller · Josh Tenenbaum -
2023 : MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts »
Pan Lu · Hritik Bansal · Tanglin Xia · Jiacheng Liu · Chunyuan Li · Hannaneh Hajishirzi · Hao Cheng · Kai-Wei Chang · Michel Galley · Jianfeng Gao -
2023 : Magnushammer: A Transformer-Based Approach to Premise Selection »
Maciej Mikuła · Szymon Antoniak · Szymon Tworkowski · Bartosz Piotrowski · Albert Q. Jiang · Jin Zhou · Christian Szegedy · Łukasz Kuciński · Piotr Miłoś · Yuhuai Wu -
2023 : MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning »
Zhenwen Liang · Dian Yu · Xiaoman Pan · Wenlin Yao · Qingkai Zeng · Xiangliang Zhang · Dong Yu -
2023 : llmstep: LLM proofstep suggestions in Lean »
Sean Welleck · Rahul Saha -
2023 : Continual Learning and Out of Distribution Generalization in a Systematic Reasoning Task »
Mustafa Abdool · Andrew Nam · James McClelland -
2023 : Towards Large Language Models as Copilots for Theorem Proving in Lean »
Peiyang Song · Kaiyu Yang · Animashree Anandkumar -
2023 : Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models »
Pan Lu · Baolin Peng · Hao Cheng · Michel Galley · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Jianfeng Gao -
2023 : SCIBENCH: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models »
Xiaoxuan Wang · Ziniu Hu · Pan Lu · Yanqiao Zhu · Jieyu Zhang · Satyen Subramaniam · Arjun Loomba · Shichang Zhang · Yizhou Sun · Wei Wang -
2023 : Llemma: An Open Language Model For Mathematics »
Zhangir Azerbayev · Hailey Schoelkopf · Keiran Paster · Marco Dos Santos · Stephen McAleer · Albert Q. Jiang · Jia Deng · Stella Biderman · Sean Welleck -
2023 : Sequential Learning and Retrieval in a Sparse Distributed Memory: The K-winner Modern Hopfield Network »
Shaunak Bhandarkar · James McClelland -
2023 Poster: What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks »
Taicheng Guo · kehan Guo · Bozhao Nan · Zhenwen Liang · Zhichun Guo · Nitesh Chawla · Olaf Wiest · Xiangliang Zhang -
2023 Poster: Self-Refine: Iterative Refinement with Self-Feedback »
Aman Madaan · Niket Tandon · Prakhar Gupta · Skyler Hallinan · Luyu Gao · Sarah Wiegreffe · Uri Alon · Nouha Dziri · Shrimai Prabhumoye · Yiming Yang · Shashank Gupta · Bodhisattwa Prasad Majumder · Katherine Hermann · Sean Welleck · Amir Yazdanbakhsh · Peter Clark -
2023 Poster: Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models »
Pan Lu · Baolin Peng · Hao Cheng · Michel Galley · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Jianfeng Gao -
2023 Poster: Learning to Receive Help: Intervention-Aware Concept Embedding Models »
Mateo Espinosa Zarlenga · Katie Collins · Krishnamurthy Dvijotham · Adrian Weller · Zohreh Shams · Mateja Jamnik -
2023 Poster: Faith and Fate: Limits of Transformers on Compositionality »
Nouha Dziri · Ximing Lu · Melanie Sclar · Xiang (Lorraine) Li · Liwei Jiang · Bill Yuchen Lin · Sean Welleck · Peter West · Chandra Bhagavatula · Ronan Le Bras · Jena Hwang · Soumya Sanyal · Xiang Ren · Allyson Ettinger · Zaid Harchaoui · Yejin Choi -
2023 Poster: LeanDojo: Theorem Proving with Retrieval-Augmented Language Models »
Kaiyu Yang · Aidan Swope · Alex Gu · Rahul Chalamala · Peiyang Song · Shixing Yu · Saad Godil · Ryan J Prenger · Animashree Anandkumar -
2023 Oral: LeanDojo: Theorem Proving with Retrieval-Augmented Language Models »
Kaiyu Yang · Aidan Swope · Alex Gu · Rahul Chalamala · Peiyang Song · Shixing Yu · Saad Godil · Ryan J Prenger · Animashree Anandkumar -
2023 Tutorial: Machine Learning for Theorem Proving »
Emily First · Albert Q. Jiang · Kaiyu Yang -
2022 Workshop: MATH-AI: Toward Human-Level Mathematical Reasoning »
Pan Lu · Swaroop Mishra · Sean Welleck · Yuhuai Wu · Hannaneh Hajishirzi · Percy Liang -
2022 : Invited Talk: James McClelland »
James McClelland -
2022 Poster: Autoformalization with Large Language Models »
Yuhuai Wu · Albert Q. Jiang · Wenda Li · Markus Rabe · Charles Staats · Mateja Jamnik · Christian Szegedy -
2022 Poster: COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics »
Lianhui Qin · Sean Welleck · Daniel Khashabi · Yejin Choi -
2022 Poster: Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers »
Albert Q. Jiang · Wenda Li · Szymon Tworkowski · Konrad Czechowski · Tomasz Odrzygóźdź · Piotr Miłoś · Yuhuai Wu · Mateja Jamnik -
2022 Poster: Data Distributional Properties Drive Emergent In-Context Learning in Transformers »
Stephanie Chan · Adam Santoro · Andrew Lampinen · Jane Wang · Aaditya Singh · Pierre Richemond · James McClelland · Felix Hill -
2022 Poster: QUARK: Controllable Text Generation with Reinforced Unlearning »
Ximing Lu · Sean Welleck · Jack Hessel · Liwei Jiang · Lianhui Qin · Peter West · Prithviraj Ammanabrolu · Yejin Choi -
2022 Poster: NaturalProver: Grounded Mathematical Proof Generation with Language Models »
Sean Welleck · Jiacheng Liu · Ximing Lu · Hannaneh Hajishirzi · Yejin Choi -
2022 Poster: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering »
Pan Lu · Swaroop Mishra · Tanglin Xia · Liang Qiu · Kai-Wei Chang · Song-Chun Zhu · Oyvind Tafjord · Peter Clark · Ashwin Kalyan -
2021 Workshop: Math AI for Education (MATHAI4ED): Bridging the Gap Between Research and Smart Education »
Pan Lu · Yuhuai Wu · Sean Welleck · Xiaodan Liang · Eric Xing · James McClelland -
2021 Poster: Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals »
Lang Liu · Krishna Pillutla · Sean Welleck · Sewoong Oh · Yejin Choi · Zaid Harchaoui -
2021 Poster: Learning Signal-Agnostic Manifolds of Neural Fields »
Yilun Du · Katie Collins · Josh Tenenbaum · Vincent Sitzmann -
2021 Poster: MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers »
Krishna Pillutla · Swabha Swayamdipta · Rowan Zellers · John Thickstun · Sean Welleck · Yejin Choi · Zaid Harchaoui -
2021 Oral: MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers »
Krishna Pillutla · Swabha Swayamdipta · Rowan Zellers · John Thickstun · Sean Welleck · Yejin Choi · Zaid Harchaoui -
2018 Poster: Loss Functions for Multiset Prediction »
Sean Welleck · Zixin Yao · Yu Gai · Jialin Mao · Zheng Zhang · Kyunghyun Cho -
2017 Poster: Saliency-based Sequential Image Attention with Multiset Prediction »
Sean Welleck · Jialin Mao · Kyunghyun Cho · Zheng Zhang