Skip to yearly menu bar Skip to main content

Workshop: AI for Science: from Theory to Practice

Expression Sampler as a Dynamic Benchmark for Symbolic Regression

Ioana Marinescu · Younes Strittmatter · Chad Williams · Sebastian Musslick


Equation discovery, the problem of identifying mathematical expressions from data, has witnessed the emergence of symbolic regression (SR) techniques aided by benchmarking systems like SRbench. However, these systems are limited by their reliance on static expressions and datasets, which, in turn, provides limited insight into the circumstances under which SR algorithms perform well versus fail. To address this issue, we introduce an open-source method for generating comprehensive SR datasets via random sampling of mathematical expressions. This method enables dynamic expression sampling while controlling for various expression characteristics pertaining to expression complexity. The method also allows for using prior information about expression distributions, for example, to simulate expression distributions for a specific scientific domain. Using this dynamic benchmark, we demonstrate that the overall performance of established SR algorithms decreases with expression complexity and provide insight into which equation features are best recovered. Our results suggest that most SR algorithms overestimate the number of expression tree nodes and trigonometric functions and underestimate the number of input variables present in the ground truth.

Chat is not available.