
Workshop: Table Representation Learning Workshop

Modeling string entries for tabular data prediction: do we need big large language models?

Leo Grinsztajn · Myung Jun Kim · Edouard Oyallon · Gael Varoquaux

Keywords: [ embeddings ] [ tabular data ] [ language models ]


Tabular data are often characterized by numerical and categorical features, but these features co-exist with features made of text entries, such as names or descriptions. Here, we investigate whether language models can extract information from these text entries. Studying 19 datasets and varying training sizes, we find that using language models to encode text features improves predictions over both no encoding and character-level approaches based on substrings. Furthermore, we find that larger, more advanced language models translate to more significant improvements.
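As a rough illustration of the kind of character-level substring baseline mentioned in the abstract, the sketch below hashes character n-grams of each text entry into a fixed-size vector and appends it to the numerical features. It is not the authors' implementation: the function names (`char_ngrams`, `ngram_hash_encode`, `encode_table`), the vector size, the n-gram length, and the toy rows are all assumptions made for this example. A language-model encoder could be substituted by passing any function that maps a string to a list of floats.

```python
import zlib
from typing import Callable, Dict, List, Sequence


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Return the overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_hash_encode(text: str, dim: int = 16, n: int = 3) -> List[float]:
    """Hash character n-grams into a fixed-size, L2-normalized count vector."""
    vec = [0.0] * dim
    for gram in char_ngrams(text, n):
        # zlib.crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec] if norm > 0 else vec


def encode_table(
    rows: Sequence[Dict[str, object]],
    text_column: str,
    encoder: Callable[[str], List[float]],
) -> List[List[float]]:
    """Build feature vectors: numerical columns first, then the encoded text column."""
    features = []
    for row in rows:
        numeric = [float(v) for k, v in row.items() if k != text_column]
        features.append(numeric + encoder(str(row[text_column])))
    return features


# Hypothetical toy table with one numerical column and one text column.
rows = [
    {"price": 12.5, "description": "red wine from Bordeaux"},
    {"price": 8.0, "description": "white wine from Alsace"},
]
X = encode_table(rows, "description", ngram_hash_encode)
```

Each row of `X` can then be passed to any standard tabular learner. Swapping `ngram_hash_encode` for a function that returns language-model embeddings leaves the rest of the pipeline unchanged, which is the comparison the abstract describes.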
