Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks
Abstract
Enterprise data pipelines, characterized by complex transformations across multiple programming languages, create a semantic disconnect between original metadata and transformed data. This "semantic drift" compromises data governance and impairs retrieval-augmented generation (RAG) and text-to-SQL systems. We propose a novel framework for automated schema lineage extraction from multilingual enterprise scripts, capturing four essential components: source schemas, source tables, transformation logic, and aggregation operations. We introduce Schema Lineage Composite Evaluation (SLiCE), a comprehensive metric assessing structural correctness and semantic fidelity, and present a benchmark of 1,700 manually annotated lineages from real-world industrial scripts. Evaluating 12 language models (1.3B-32B parameters, including GPT-4o/4.1), we demonstrate that extraction performance scales with model size and prompting sophistication. Notably, a 32B open-source model with chain-of-thought prompting achieves performance comparable to GPT-series models, enabling cost-effective deployment of schema-aware agents while maintaining rigorous data governance and enhancing downstream AI applications.