Nested Depth Generalization in Transformers
Abstract
Solving math problems harder than any previously solved problem requires generalization of reasoning beyond the training domain. This paper investigates generalization on tasks rooted in propositional logic that become increasingly challenging as expressions contain more deeply nested clauses. We first evaluate general-purpose language models on expressions of varying depth. The models struggle with highly nested expressions, despite supervised fine-tuning, chain-of-thought reasoning with a large inference budget, or self-reflection. We observe that small errors propagate through the nested chain of reasoning, leading to incorrect outputs. To disentangle the role of the training set from implicit bias, we also train specialized models from scratch on expressions of bounded depth. Our evaluations on test sets of varying depths suggest that large models trained on large training sets exhibit mild signs of depth extrapolation, but miss compact shortcut solutions that would solve the problem perfectly. For both specialized and general-purpose models, we demonstrate that recursive test-time inference traversing the parsed abstract syntax tree of the input unlocks depth generalization, yet remains imperfect.
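To make the recursive test-time procedure concrete, the following is a minimal sketch (not the paper's implementation), assuming propositional expressions are written with Python's `and`, `or`, and `not` operators over the literals True/False. The hypothetical `evaluate_leaf` hook stands in for a model query on an atomic clause; here it simply returns the literal value, so only the recursive traversal of the parsed abstract syntax tree is illustrated.

```python
# Sketch: recursive evaluation over a parsed abstract syntax tree (AST),
# so the evaluator (or a model standing behind `evaluate_leaf`) only ever
# sees shallow sub-problems, regardless of the total nesting depth.

import ast


def evaluate_leaf(value: bool) -> bool:
    """Hypothetical hook for a model call on an atomic clause."""
    return value


def evaluate(node: ast.AST) -> bool:
    """Recursively evaluate one AST node of a propositional expression."""
    if isinstance(node, ast.Constant):  # literal True / False
        return evaluate_leaf(bool(node.value))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not evaluate(node.operand)
    if isinstance(node, ast.BoolOp):  # `and` / `or` over two or more operands
        results = [evaluate(v) for v in node.values]
        return all(results) if isinstance(node.op, ast.And) else any(results)
    raise ValueError(f"Unsupported node: {ast.dump(node)}")


def solve(expression: str) -> bool:
    """Parse the expression once, then traverse its AST recursively."""
    tree = ast.parse(expression, mode="eval")
    return evaluate(tree.body)


if __name__ == "__main__":
    # A nested example: the answer is assembled clause by clause.
    print(solve("not (True and (False or not (True and False)))"))  # -> False
```

Because each recursive call handles a single operator over already-resolved operands, errors cannot compound across nesting levels in the way the abstract describes for end-to-end generation.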