Credit Cards, Confusion, Computation, and Consequences: How Well Do LLMs Reason About Financial Literacy?
Arnav Hiray · Agam Shah · Caleb Lu · Meghaj Tarte · Sreya Tummalapelly · Harsit Mittal · Sudheer Chava
Abstract
We introduce $\textbf{FinLitQA}$, the first benchmark of long-context financial numerical reasoning questions derived directly from real credit card agreements. The dataset contains over 4,300 questions, including first-person variants that reflect how consumers naturally ask about fees and payments. Evaluating multiple large language models (LLMs) under Chain-of-Thought (CoT) and Program-of-Thought (PoT) prompting, we find that first-person CoT improves accuracy, while PoT increases answer coverage but reduces accuracy. Our error analysis highlights weaknesses in compound-interest calculations involving the exponential operator, interest comprehension, timeline ordering, categorical policy logic, and minimum-payment rules. We find that these errors often arise in edge cases, such as late-payment penalties or small-balance scenarios, *that are more likely to affect lower-income or financially vulnerable individuals*.
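To illustrate the compounding error mode the abstract describes, the following is a minimal sketch of the kind of computation a PoT-style answer must get right: daily compounding with the exponential operator versus the common simple-interest shortcut. The APR, balance, and day count below are hypothetical values, not figures from FinLitQA or any agreement in the dataset.

```python
# Hypothetical example: interest on a revolving balance over one billing cycle.
# All figures are illustrative, not drawn from FinLitQA.
balance = 1200.00  # starting balance in dollars
apr = 0.2499       # 24.99% annual percentage rate
days = 30          # days in the billing cycle

daily_rate = apr / 365

# Correct daily compounding uses the exponential operator ...
compounded = balance * (1 + daily_rate) ** days
# ... whereas a common model error is to apply simple interest instead.
simple = balance * (1 + daily_rate * days)

print(f"compounded: ${compounded:.2f}")  # ~$1224.89
print(f"simple:     ${simple:.2f}")      # ~$1224.65
```

The two results differ by only a few cents here, which is part of why such errors are easy to miss; over longer horizons or larger balances the gap widens, disproportionately in the small-balance and penalty scenarios the error analysis flags.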