
Workshop: Table Representation Learning Workshop

IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models

Scott Yak · Yihe Dong · Javier Gonzalvo · Sercan Arik

Keywords: [ structured data ] [ Foundation Model ] [ tabular ] [ LLM ]

Fri 15 Dec 12:16 p.m. PST — 12:23 p.m. PST
presentation: Table Representation Learning Workshop
Fri 15 Dec 6:30 a.m. PST — 3:30 p.m. PST


There is a massive amount of tabular data that can be leveraged via `foundation models' to improve prediction performance on downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning the semantic relevance between tables and features, mismatched schemas, arbitrarily high cardinality of categorical values, and scalability to many tables, rows, and features. We propose \texttt{IngesTables}, a novel canonical framework for building tabular foundation models, designed to address the aforementioned challenges. \texttt{IngesTables} employs LLMs to encode representations of table/feature semantics and their relationships, which are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, \texttt{IngesTables} is much cheaper to train and faster at inference, because of how LLM-generated embeddings are defined and cached. We show that \texttt{IngesTables} yields significant improvements over commonly used models such as XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models specialized for clinical trial datasets, without incurring LLM-level cost and latency.
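The abstract's key efficiency idea — encode each unique table/feature string with an LLM once, cache the embedding, and feed the cached vectors to a downstream attention model — can be sketched as follows. This is a minimal illustration, not the authors' implementation: `embed_text` is a hypothetical stand-in for a frozen LLM encoder (here a deterministic random projection keyed on the string), and the name/value fusion by addition is an assumption for illustration.

```python
from functools import lru_cache
import numpy as np

EMBED_DIM = 16  # toy embedding width; a real LLM encoder would be much wider

@lru_cache(maxsize=None)  # cache: each unique string is "encoded" exactly once
def embed_text(text: str) -> tuple:
    # Hypothetical stand-in for a frozen LLM text encoder. In IngesTables-style
    # training the expensive LLM call happens here and the result is cached,
    # so repeated feature names/values across rows and tables cost nothing extra.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return tuple(rng.standard_normal(EMBED_DIM))

def encode_features(column_names, values):
    """Turn one row's (feature name, value) pairs into a token matrix that an
    attention-based tabular model could consume, independent of schema order."""
    tokens = []
    for name, value in zip(column_names, values):
        name_vec = np.asarray(embed_text(name))
        value_vec = np.asarray(embed_text(str(value)))
        tokens.append(name_vec + value_vec)  # simple fusion of name + value semantics
    return np.stack(tokens)  # shape: (num_features, EMBED_DIM)

# Two rows sharing a schema: the second row reuses every cached name embedding.
row1 = encode_features(["age", "diagnosis"], [63, "diabetes"])
row2 = encode_features(["age", "diagnosis"], [57, "hypertension"])
```

Because tokens are built from string semantics rather than fixed column positions, tables with mismatched schemas can still be mapped into the same embedding space, and the cache keeps LLM cost proportional to the number of *unique* strings rather than the number of cells.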
