Poster
SETBENCH: Assessing the Analytical and Semantic Robustness of Language Models
Nicholas Dronen · Bardiya Akhbari · Manish Digambar Gawali
West Ballroom A-D #5203
Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets can be composed of arbitrary symbols (e.g., numbers, words), they provide an opportunity to test systematically whether LLMs' algorithmic abilities are invariant under simple lexical or semantic variations. To this end, we present SetBench, a synthetic benchmark that evaluates the performance of LLMs on set operations. SetBench assesses the robustness of LLMs' instruction-following abilities under various conditions, focusing on the set operations and on the nature and construction of the set members. Evaluating five LLMs with SetBench, we find that they exhibit poor robustness to variation in both operation and operands. We also find that these evaluations are susceptible to confounding, and we report results in which the confounding variables are measured independently. Upon publication, we will release the SetBench dataset and code repository, contributing to the advancement of research in this domain.
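As a rough illustration of the kind of evaluation the abstract describes, the following Python sketch builds a set-operation prompt from a given vocabulary and checks a model's answer against the ground truth. It is not the SetBench implementation: the names `OPERATIONS`, `make_item`, `parse_response`, and `is_correct`, the prompt wording, and the exact-match scoring rule are all assumptions made for this example.

```python
import random

# Hypothetical operation table; the operations SetBench covers may differ.
OPERATIONS = {
    "union": lambda a, b: a | b,
    "intersection": lambda a, b: a & b,
    "difference": lambda a, b: a - b,
    "symmetric difference": lambda a, b: a ^ b,
}


def make_item(operation, vocabulary, size, seed):
    """Build one benchmark item: a prompt plus its ground-truth answer.

    `vocabulary` is a list of string tokens (numbers, words, ...), so the
    same operation can be posed over different kinds of operands.
    """
    rng = random.Random(seed)
    a = set(rng.sample(vocabulary, size))
    b = set(rng.sample(vocabulary, size))
    prompt = (
        f"Compute the {operation} of "
        f"A = {{{', '.join(sorted(a))}}} and B = {{{', '.join(sorted(b))}}}. "
        "Answer with a comma-separated list of members."
    )
    return {"prompt": prompt, "a": a, "b": b,
            "expected": OPERATIONS[operation](a, b)}


def parse_response(response):
    """Parse a comma-separated answer, with optional braces, into a set."""
    tokens = response.strip().strip("{}").split(",")
    return {t.strip() for t in tokens if t.strip()}


def is_correct(response, expected):
    """Exact-match scoring (an assumption): order and duplicates are ignored."""
    return parse_response(response) == expected
```

Calling `make_item` with the same operation, size, and seed but a different vocabulary (for instance, digits versus English words) yields structurally identical items whose operands differ only in kind, which is the sort of variation under which a robust model's accuracy should stay constant.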