Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework
Abstract
This paper studies the logical reasoning capabilities of language models. In this setting, the model takes assumptions and a goal as input and generates a formal proof deriving the goal from the assumptions; incorrect proofs are detected by automated validation. We focus on the conceptually simple but technically challenging task of constructing proofs in Boolean logic, where it is feasible to train and evaluate at scale on synthetic data. We use randomized generation to produce a large, diverse corpus of proofs for training and testing. We introduce Template Transformation, a proof augmentation technique that produces syntactically varied proofs sharing a common logical structure. We propose natural black-box tests to evaluate reasoning ability. By these measures, experiments show that LoRA-based fine-tuning combined with Template Transformation significantly improves reasoning ability; however, reasoning accuracy declines sharply as proof complexity increases. These findings highlight the need for new approaches to teaching language models to reason.
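To make the augmentation idea concrete, the following is a minimal illustrative sketch, assuming that Template Transformation works by consistently renaming the propositional variables of a seed proof so that each variant is syntactically different yet structurally identical; the paper's actual procedure may be more general, and the helper name and proof format below are hypothetical.

```python
import random
import re

def template_transform(proof_template: str, num_variants: int = 3, seed: int = 0):
    """Generate syntactically varied proofs sharing the template's logical
    structure by applying a consistent, fresh renaming of its variables.
    Illustrative sketch only; not the paper's exact algorithm."""
    rng = random.Random(seed)
    # Treat single lowercase letters p..z appearing as whole words as variables.
    variables = sorted(set(re.findall(r"\b[p-z]\b", proof_template)))
    variants = []
    for _ in range(num_variants):
        # Draw a collision-free fresh name for every variable in the template.
        fresh = rng.sample([f"v{i}" for i in range(100)], k=len(variables))
        mapping = dict(zip(variables, fresh))
        variant = re.sub(r"\b[p-z]\b", lambda m: mapping[m.group(0)], proof_template)
        variants.append(variant)
    return variants

# Example: three renamings of a two-step modus ponens proof.
template = ("1. p -> q  [assumption]\n"
            "2. p       [assumption]\n"
            "3. q       [modus ponens 1, 2]")
for v in template_transform(template):
    print(v, end="\n\n")
```

Each variant shares the same inference steps as the seed proof, so a model trained on the augmented corpus is encouraged to learn the underlying logical structure rather than memorize surface forms.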