Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
Abstract
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, create a semantic disconnect between original metadata and transformed data. This "semantic drift" compromises data governance and impairs retrieval-augmented generation (RAG) and text-to-SQL systems. We propose a novel framework for automated schema lineage extraction from multilingual enterprise scripts, capturing four essential components: source schemas, source tables, transformation logic, and aggregation operations. We introduce Schema Lineage Composite Evaluation (SLiCE), a comprehensive metric assessing structural correctness and semantic fidelity, and present a benchmark of 1,700 manually annotated lineages from real-world industrial scripts. Evaluating 12 language models (1.3B-32B parameters, including GPT-4o/4.1), we demonstrate that extraction performance scales with model size and prompting sophistication. Notably, a 32B open-source model with chain-of-thought prompting achieves performance comparable to GPT-series models, enabling cost-effective deployment of schema-aware agents while maintaining rigorous data governance and enhancing downstream AI applications.