Towards Real-World Evaluation of Agentic Work in Freelance Marketplaces
Mattie Terzolo · Darvin Yi · Teng Liu · Lance Hasson · Ayan Sinha · Pablo Mendes · Andrew Rabinovich
Abstract
Evaluating large language models (LLMs) on complex end-to-end digital work remains an open challenge. Many existing benchmarks are synthetic, static, or single-domain, limiting their real-world applicability and economic relevance. We present LaborMarketplaceBenchmark, a dataset and evaluation pipeline derived from real tasks on LaborMarketplace. Starting from the marketplace corpus, we construct LaborMarketplaceBenchmark Qualified via heuristics-based filtering of fixed-price, single-milestone tasks and an automated feasibility assessment (Qualification Agent). We then derive LaborMarketplaceBenchmark Verified, a manually validated, PII-safe subset suitable for community research use. LaborMarketplaceBenchmark spans nine work categories and 572 unique task types; every task resulted in an accepted deliverable, with average payouts ranging from \$35 to \$250 per job, enabling economically grounded and dynamically refreshable evaluation. We report initial results for several leading LLMs on real-world Writing tasks, including human-in-the-loop experiments in which agents iterate on their work based on human feedback. LaborMarketplaceBenchmark provides a practical, reproducible path to measuring real-world progress while illuminating where current systems fall short.
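The construction of the Qualified subset can be illustrated with a minimal sketch of the two-stage selection described above: a heuristic filter over task metadata followed by an automated feasibility check. This is only an assumed approximation; the field names (`pricing_model`, `num_milestones`, `deliverable_accepted`) and the `feasibility_agent` callable are hypothetical stand-ins, since the abstract does not specify the corpus schema or the Qualification Agent's interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """Hypothetical task record; field names are illustrative, not the paper's schema."""
    task_id: str
    pricing_model: str        # e.g. "fixed_price" or "hourly"
    num_milestones: int
    deliverable_accepted: bool
    description: str


def passes_heuristic_filter(task: Task) -> bool:
    """Heuristic filter sketch: keep fixed-price, single-milestone tasks
    whose deliverable was accepted by the client."""
    return (
        task.pricing_model == "fixed_price"
        and task.num_milestones == 1
        and task.deliverable_accepted
    )


def build_qualified_subset(
    corpus: List[Task],
    feasibility_agent: Callable[[Task], bool],
) -> List[Task]:
    """Two-stage selection: heuristic filtering, then an automated feasibility
    assessment (stand-in for the paper's Qualification Agent)."""
    filtered = [t for t in corpus if passes_heuristic_filter(t)]
    return [t for t in filtered if feasibility_agent(t)]
```

The Verified subset would then be produced downstream of this step via manual validation and PII screening, which are not modeled here.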