From Walled Gardens to Open Streets: A Pipeline for Cross-City Data Harmonization
Sean Hardesty Lewis · Junfeng Jiao
Abstract
We present $\textit{OpenCityPipeline}$, a compact, end-to-end workflow that turns fragmented municipal open data into a unified, semantically enriched resource suitable for efficient model training. Urban data is scattered across incompatible platforms (e.g., Socrata, ArcGIS, CKAN), which hinders cross-city analysis and large-scale research. Our pipeline implements platform-aware ingestion, schema harmonization, targeted cleaning, redundancy control, and an optional data-to-text layer that renders structured records in a form modern retrieval and language models can consume directly. We describe how the workflow curates what cities already publish into higher-value training material and an indexable evidence base. The design aligns with work on curated data for efficient learning: it reduces integration overhead, removes redundancy, and surfaces representative, auditable samples for downstream tasks.
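To make the stages named in the abstract concrete, the following is a minimal Python sketch of schema harmonization, redundancy control, and the data-to-text layer. It is an illustration under stated assumptions, not the authors' implementation: the field mappings in `FIELD_MAPS`, the shared schema (`opened_at`, `category`, `description`), and the helper names `harmonize`, `deduplicate`, and `to_text` are all hypothetical, and platform-aware ingestion is assumed to have already produced plain dictionaries.

```python
import hashlib
import json

# Hypothetical per-platform field mappings onto a shared schema.
# These names are illustrative assumptions, not taken from the paper.
FIELD_MAPS = {
    "socrata": {"created_date": "opened_at", "complaint_type": "category",
                "descriptor": "description"},
    "arcgis": {"CreationDate": "opened_at", "ReqType": "category",
               "Comments": "description"},
}


def harmonize(record, platform):
    """Rename platform-specific fields to the shared schema and tidy strings."""
    out = {}
    for src, dst in FIELD_MAPS[platform].items():
        value = record.get(src)
        if isinstance(value, str):
            value = " ".join(value.split())  # collapse runs of whitespace
            if value == "":
                value = None  # treat empty strings as missing
        out[dst] = value
    return out


def fingerprint(record):
    """Hash a case-insensitive canonical JSON form of the record."""
    canonical = json.dumps(record, sort_keys=True, default=str).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first occurrence of each exact (harmonized) duplicate."""
    seen, unique = set(), []
    for rec in records:
        key = fingerprint(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def to_text(record, city):
    """Render one harmonized record as a sentence for retrieval or LM use."""
    return (f"In {city}, a {record['category']} request was opened on "
            f"{record['opened_at']}: {record['description']}.")


raw = [
    ("socrata", {"created_date": "2024-01-05", "complaint_type": "Pothole",
                 "descriptor": "  Large  hole "}),
    ("arcgis", {"CreationDate": "2024-01-05", "ReqType": "Pothole",
                "Comments": "Large hole"}),
]
records = deduplicate([harmonize(rec, platform) for platform, rec in raw])
sentences = [to_text(rec, "Austin") for rec in records]
```

Here the two source records differ only in field names and whitespace, so they collapse to a single harmonized record after deduplication. A real pipeline would likely need fuzzier matching than an exact hash, along with type and unit normalization per field.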