Evaluation and Benchmarking Suite for Financial Large Language Models and Agents
Abstract
The financial industry has seen Large Language Models (LLMs) evolve rapidly, from initial exploration to current readiness and, looking ahead, governance. Financial large language models (FinLLMs), such as FinGPT \cite{liu2023fingpt} and BloombergGPT \cite{wu2023bloomberggpt}, show great potential in financial applications, including sentiment analysis, trading, and the analysis of SEC filings. However, both general-purpose LLMs and FinLLMs still lack professional financial knowledge and struggle with complex financial scenarios. This paper presents a framework for evaluating the evolving LLM lifecycle in finance and introduces the Open FinLLM Leaderboard as a standardized benchmarking suite. We trace this development through three stages: Financial LLM Exploration (2023), Financial LLM Readiness (2024), and Financial LLM Governance (2025). The Open FinLLM Leaderboard serves as an evaluation platform on which researchers and practitioners can systematically benchmark and compare financial LLMs, fostering the development of more robust and reliable financial AI systems.