OpenCityCorpus: A Large-Scale, Harmonized, and LLM-Ready Corpus of Urban Data for Scientific Research
Junfeng Jiao · Sean Hardesty Lewis · Yiming Xu · Jihyung Park · Connor Phillips
Abstract
We propose $\textit{OpenCityCorpus}$, an openly shareable, large-scale corpus that harmonizes public urban data from 200+ cities across Socrata, ArcGIS, and CKAN portals into a unified schema and an LLM-ready text representation. Fragmentation across municipal platforms has long impeded rigorous, cross-city science on climate, mobility, governance, and public health. Our dataset resolves schema heterogeneity, standardizes types and coordinate systems, and converts rows into semantically consistent factual statements, enabling retrieval-augmented generation, hypothesis testing, and transfer learning. The resource targets three AI-for-Science tasks: cross-domain scientific reasoning over coupled urban systems, surrogate modeling that complements physics-based simulators, and robust evaluation of tool-augmented LLM agents. We detail a feasible, privacy-preserving data-creation pathway, outline cost- and scale-aware operations for continuous refresh, and describe benchmarks designed to expose both the reach and the limits of current AI methods. By turning fragmented open portals into a single scientific substrate, $\textit{OpenCityCorpus}$ lowers barriers to high-impact, reproducible discovery.
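As an illustration of the harmonization and row-to-statement steps described in the abstract, the following is a minimal sketch and not the authors' pipeline. The alias map `FIELD_ALIASES`, the unified field names, the helper functions `harmonize` and `to_statement`, and the statement template are all hypothetical and chosen only for this example. The sketch assumes coordinates are already in latitude/longitude degrees and performs no coordinate reprojection.

```python
from datetime import datetime

# Hypothetical alias map: portal-specific column names -> unified field names.
# A real corpus would need a far larger, curated mapping.
FIELD_ALIASES = {
    "lat": "latitude", "y": "latitude", "latitude": "latitude",
    "lon": "longitude", "lng": "longitude", "x": "longitude",
    "longitude": "longitude",
    "created_date": "timestamp", "date": "timestamp", "timestamp": "timestamp",
    "complaint_type": "category", "type": "category", "category": "category",
}


def harmonize(row: dict) -> dict:
    """Map one raw portal row onto the (assumed) unified schema."""
    out = {}
    for key, value in row.items():
        unified = FIELD_ALIASES.get(key.strip().lower())
        # Drop unknown columns and empty values.
        if unified is None or value in (None, ""):
            continue
        out[unified] = value
    # Standardize types: coordinates become floats, timestamps ISO dates.
    for coord in ("latitude", "longitude"):
        if coord in out:
            out[coord] = round(float(out[coord]), 5)
    if "timestamp" in out:
        parsed = datetime.fromisoformat(str(out["timestamp"]))
        out["timestamp"] = parsed.date().isoformat()
    return out


def to_statement(city: str, record: dict) -> str:
    """Render a harmonized record as one factual sentence."""
    category = record.get("category", "municipal")
    parts = [f"In {city}, a {category} record was logged"]
    if "timestamp" in record:
        parts.append(f"on {record['timestamp']}")
    if "latitude" in record and "longitude" in record:
        parts.append(f"at ({record['latitude']}, {record['longitude']})")
    return " ".join(parts) + "."


raw = {
    "Lat": "30.2672",
    "LNG": "-97.7431",
    "created_date": "2023-05-01T12:30:00",
    "Complaint_Type": "Pothole",
    "internal_id": "abc",  # not in the alias map, so it is dropped
}
record = harmonize(raw)
print(to_statement("Austin", record))
```

Keeping the alias map as plain data separates schema decisions from code, so extending the sketch to another portal would mean adding entries rather than changing logic.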