Table Representation Learning Workshop
Madelon Hulsebos · Bojan Karlaš · Haoyu Dong · Gael Varoquaux · Laurel Orr · Pengcheng Yin
Room 235 - 236
Tables are a promising modality for representation learning, with application potential too large to ignore. Yet tables have long been overlooked despite their dominant presence in the data landscape, e.g. in data management and analysis pipelines. The majority of datasets in Google Dataset Search, for example, resemble typical tabular file formats like CSVs. Similarly, the three most-used database management systems are all relational (RDBMSs). Representation learning over tables (TRL), possibly combined with other modalities such as text or SQL, has shown impressive performance for tasks like table-based question answering, table understanding, and data preparation. More recently, TRL has also proven effective for tabular ML, while researchers have started exploring the impressive capabilities of LLMs for table encoding and data manipulation. Follow our Twitter feed for updates: https://twitter.com/TrlWorkshop.
The first edition of the Table Representation Learning (TRL) workshop at NeurIPS 2022 gathered an enthusiastic community and stimulated new research and collaborations, which we aim to continue in 2023. The TRL workshop has three main goals:
(1) Motivate tables as a primary modality for representation and generative learning and advance the area further.
(2) Showcase impactful applications of pretrained table models and discuss future opportunities.
(3) Foster discussion and collaboration across the ML, NLP and DB communities.
Schedule
Fri 6:30 a.m. - 6:45 a.m.
|
Opening notes
(
Talk
)
>
SlidesLive Video |
Fri 6:45 a.m. - 7:15 a.m.
|
Invited talk: Co-Designing LLMs and LLM-Powered Data Management Tools
(
Talk
)
>
SlidesLive Video Large Language Models (LLMs) are now widely used for data management. We recently proposed Evaporate [ICLR Spotlight 2023, VLDB 2024], a system that uses LLMs to help users efficiently query semi-structured documents. We also showed how off-the-shelf LLMs perform data-wrangling tasks with state-of-the-art quality and no specialized training [VLDB 2023]. This talk discusses some of my lessons from working on these early LLM-for-data-management projects and subsequent research to improve the reach of these systems; in particular, there is still a way to go in extending LLMs to datatypes such as private, semi-structured, and long-sequence data. Towards extending our capabilities on these datatypes, I'll discuss MQAR and Monarch Mixer [NeurIPS Oral 2023], new LM architectures that can match the quality of attention-based LMs while remaining asymptotically more efficient at training and inference time. We'll finally discuss how these fundamental breakthroughs can power next-generation data management tools. |
Simran Arora 🔗 |
Fri 7:15 a.m. - 7:22 a.m.
|
High-Performance Transformers for Table Structure Recognition Need Early Convolutions
(
Spotlight
)
>
link
SlidesLive Video Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://anonymous.4open.science/r/NeurIPS23-TRL-2 to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning. |
ShengYun Peng · Seongmin Lee · Xiaojing Wang · Rajarajeswari Balasubramaniyan · Duen Horng Chau 🔗 |
Fri 7:23 a.m. - 7:30 a.m.
|
Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples
(
Spotlight
)
>
link
SlidesLive Video Data-wrangling is a process that transforms raw data for further analysis and for use in downstream tasks. Recently, it has been shown that foundation models can be successfully used for data-wrangling tasks (Narayan et al., 2022). An important aspect of data wrangling with LMs is to properly construct prompts for the given task. Within these prompts, a crucial component is the choice of in-context examples. In the previous study of Narayan et al., demonstration examples are chosen manually by the authors, which may not scale to new datasets. In this work, we propose a simple demonstration strategy that individualizes demonstration examples for each input by selecting them from a pool based on their distance in the embedding space. Additionally, we propose a postprocessing method that exploits the embedding of labels under a closed-world assumption. Empirically, our embedding-based example retrieval and postprocessing improve foundation models' performance by up to 84% over randomly selected examples and 49% over manually selected examples in the demonstration. Ablation tests reveal the effect of class embeddings, and of various factors in demonstration such as quantity, quality, and diversity. |
Joon Suk Huh · Changho Shin · Elina Choi 🔗 |
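The abstract above selects in-context examples for each input by their distance in an embedding space. A minimal sketch of that selection step, using tiny hand-made vectors as stand-ins for a real text or row encoder (the pool contents and `k` are illustrative, not from the paper):

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; assumes non-zero vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def select_demonstrations(query_emb, pool, k=2):
    """pool: list of (embedding, example) pairs; returns the k nearest examples."""
    ranked = sorted(pool, key=lambda item: cosine_distance(query_emb, item[0]))
    return [example for _, example in ranked[:k]]

pool = [
    ([1.0, 0.0], "example A"),
    ([0.9, 0.1], "example B"),
    ([0.0, 1.0], "example C"),
]
print(select_demonstrations([1.0, 0.05], pool, k=2))
```

In practice the embeddings would come from a pretrained encoder and the pool from a labeled demonstration set; the retrieval logic stays the same.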
Fri 7:31 a.m. - 7:38 a.m.
|
TabPFGen – Tabular Data Generation with TabPFN
(
Spotlight
)
>
link
SlidesLive Video Advances in deep generative modelling have not translated well to tabular data. We argue that this is caused by a mismatch in structure between popular generative models and discriminative models of tabular data. We thus devise a technique to turn TabPFN -- a highly performant transformer initially designed for in-context discriminative tabular tasks -- into an energy-based generative model, which we dub TabPFGen. This novel framework leverages the pre-trained TabPFN as part of the energy function and does not require any additional training or hyperparameter tuning, thus inheriting TabPFN's in-context learning capability. We can sample from TabPFGen analogously to other energy-based models. We demonstrate strong results on standard generative modelling tasks, including data augmentation, class-balancing, and imputation, unlocking a new frontier of tabular data generation. |
Jeremy (Junwei) Ma · Apoorv Dankar · George Stein · Guangwei Yu · Anthony Caterini 🔗 |
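To give a flavour of the energy-based view, the toy sketch below defines an energy from a stand-in classifier's logits and draws a sample with a simple Metropolis random walk. This is only an analogy: the quadratic "scorer" replaces the pretrained TabPFN, and the paper's actual sampling procedure is not reproduced here.

```python
import math
import random

def logits(x):
    # hypothetical 2-class scorer with modes near x = -1 and x = +1
    return [-(x + 1.0) ** 2, -(x - 1.0) ** 2]

def energy(x):
    # E(x) = -log sum_y exp(logit_y(x)), computed stably
    m = max(logits(x))
    return -(m + math.log(sum(math.exp(l - m) for l in logits(x))))

def metropolis_sample(steps=2000, step_size=0.5, seed=0):
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        # accept with probability exp(E(x) - E(proposal)), capped at 1
        if rng.random() < math.exp(min(0.0, energy(x) - energy(proposal))):
            x = proposal
    return x

sample = metropolis_sample()  # samples concentrate near the low-energy modes
```

Lower energy means higher density, so the chain spends most of its time near the scorer's modes at ±1.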
Fri 7:38 a.m. - 7:45 a.m.
|
Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL
(
Spotlight
)
>
link
SlidesLive Video Text-to-SQL allows experts to use databases without in-depth knowledge of them. However, real-world tasks involve both query and data ambiguities. Most work on Text-to-SQL has focused on query ambiguities, designing chat interfaces through which experts can provide clarifications. In contrast, the data management community has long studied data ambiguities, but mainly addresses error detection and correction, rather than documenting them for disambiguation in data tasks. This work delves into these data ambiguities in real-world datasets. We have identified prevalent data ambiguities of value consistency, data coverage, and data granularity that affect tasks. We examine how documentation, originally made to help humans disambiguate data, can help GPT-4 with Text-to-SQL tasks. By offering documentation on these ambiguities, we found GPT-4's performance improved by 28.9%. |
Zachary Huang · Pavan Kalyan Damalapati · Eugene Wu 🔗 |
Fri 7:46 a.m. - 7:53 a.m.
|
MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering
(
Spotlight
)
>
link
SlidesLive Video Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often spanning multiple tables in a relational database or web page. Single-table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities in tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness, assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single-table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery. |
Vaishali Pal · Andrew Yates · Evangelos Kanoulas · Maarten Rijke 🔗 |
Fri 8:00 a.m. - 8:20 a.m.
|
Coffee break + poster setup
(
Break
)
>
|
Fri 8:20 a.m. - 9:00 a.m.
|
Poster Session 1
(
Poster session
)
>
|
Fri 9:00 a.m. - 9:30 a.m.
|
Invited talk: Advances in In-Context Learning for Tabular Datasets
(
Talk
)
>
A year ago, we introduced TabPFN, the first in-context learning method for tabular data. In this talk, I will discuss what has happened since. I will start by briefly discussing CAAFE, a system that uses LLMs for automated feature engineering on tabular data and makes effective use of TabPFN's speed. Then, I will situate prior-data fitted networks (PFNs) in the in-context learning literature, review various applications of PFNs, explain TabPFN in more detail, and discuss our ongoing work on removing TabPFN's remaining limitations. |
Frank Hutter 🔗 |
Fri 9:30 a.m. - 9:37 a.m.
|
Self-supervised Representation Learning from Random Data Projectors
(
Spotlight
)
>
link
SlidesLive Video Self-supervised representation learning (SSRL) has advanced considerably by exploiting the transformation-invariance assumption under artificially designed data augmentations. While augmentation-based SSRL algorithms push the boundaries of performance in computer vision and natural language processing, they are often not directly applicable to other data modalities such as tabular and time-series data. This paper presents an SSRL approach that can be applied to these data modalities because it does not rely on augmentations or masking. Specifically, we show that high-quality data representations can be learned by reconstructing random data projections. We evaluate the proposed approach on real-world applications with tabular and time-series data. We show that it outperforms multiple state-of-the-art SSRL baselines and is competitive with methods built on domain-specific knowledge. Due to its wide applicability and strong empirical results, we argue that learning from randomness is a fruitful research direction worthy of attention and further study. |
Yi Sui · Tongzi Wu · Jesse Cresswell · Ga Wu · George Stein · Xiao Shi Huang · Xiaochen Zhang · Maksims Volkovs 🔗 |
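The pretext task above trains a representation network to regress fixed random projections of the input. A minimal sketch of how such targets could be constructed (the projector count and dimensions are illustrative; the encoder and training loop are omitted):

```python
import random

def random_projector(n_features, n_outputs, seed):
    # a fixed Gaussian projection matrix, frozen for the whole run
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(n_outputs)] for _ in range(n_features)]

def project(batch, W):
    """Regression targets y = x @ W for each row x in the batch."""
    return [[sum(x[i] * W[i][j] for i in range(len(x)))
             for j in range(len(W[0]))] for x in batch]

batch = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
projectors = [random_projector(3, 4, seed=s) for s in range(8)]  # 8 pretext tasks
targets = [project(batch, W) for W in projectors]
print(len(targets), len(targets[0]), len(targets[0][0]))  # 8 tasks x 2 rows x 4 dims
```

The representation is then whatever intermediate features a network learns while predicting these targets; no augmentation or masking of the raw rows is needed.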
Fri 9:38 a.m. - 9:45 a.m.
|
GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data
(
Spotlight
)
>
link
SlidesLive Video Neural network models often struggle with high-dimensional but small sample-size tabular datasets. One reason is that current weight initialisation methods assume independence between weights, which can be problematic when there are insufficient samples to estimate the model's parameters accurately. In such small-data scenarios, leveraging additional structure can improve the model's performance and training stability. To address this, we propose GCondNet, a general approach to enhance neural networks by leveraging implicit structures present in tabular data. We create a graph between samples for each data dimension, and utilise Graph Neural Networks (GNNs) to extract this implicit structure and to condition the parameters of the first layer of an underlying predictor network. By creating many small graphs, GCondNet exploits the data's high dimensionality, and thus improves the performance of an underlying predictor network. We demonstrate the effectiveness of our method on 9 real-world datasets, where GCondNet outperforms 15 standard and state-of-the-art methods. The results show that GCondNet is a versatile framework for injecting graph-regularisation into various types of neural networks, including MLPs and tabular Transformers. |
Andrei Margeloiu · Nikola Simidjievski · Pietro Lió · Mateja Jamnik 🔗 |
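The abstract above builds one graph between samples for every data dimension. A toy sketch of one plausible construction, connecting consecutive samples along each feature's sorted order (the exact graph-building rule here is an assumption for illustration, not the paper's definition):

```python
def per_dimension_edges(X):
    """For each feature, connect samples that are adjacent when sorted by that feature."""
    graphs = []
    for j in range(len(X[0])):
        order = sorted(range(len(X)), key=lambda i: X[i][j])
        edges = [(order[i], order[i + 1]) for i in range(len(order) - 1)]
        graphs.append(edges)
    return graphs

X = [[0.1, 5.0], [0.2, 1.0], [0.9, 1.1]]
print(per_dimension_edges(X))  # one small edge list per feature
```

With d features this yields d small graphs over the n samples, which is where the "many small graphs" in the abstract come from; a GNN run over them can then condition the first-layer weights of the predictor.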
Fri 9:46 a.m. - 9:53 a.m.
|
HyperFast: Instant Classification for Tabular Data
(
Spotlight
)
>
link
SlidesLive Video Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only on toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need to train a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at www.url-hidden-for-submission. |
David Bonet · Daniel Mas Montserrat · Xavier Giró-i-Nieto · Alexander Ioannidis 🔗 |
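To illustrate the hypernetwork idea in miniature: a function maps a dataset summary to the weights of a small classifier, so an unseen dataset gets a ready-to-use model without a training loop. Everything below is a toy stand-in for the meta-trained HyperFast model, not its architecture.

```python
def dataset_summary(X, y):
    # crude summary: per-class feature means
    classes = sorted(set(y))
    return [[sum(x[j] for x, t in zip(X, y) if t == c) / sum(1 for t in y if t == c)
             for j in range(len(X[0]))] for c in classes]

def hypernetwork(summary):
    # toy "weight generator": the generated classifier is nearest-class-mean
    return summary

def predict(weights, x):
    # score each class by negative squared distance to its generated weights
    scores = [sum(-(a - b) ** 2 for a, b in zip(w, x)) for w in weights]
    return max(range(len(scores)), key=lambda c: scores[c])

X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]
weights = hypernetwork(dataset_summary(X, y))  # one forward pass, no training
print(predict(weights, [0.05, 0.1]), predict(weights, [0.95, 1.0]))
```

The real system replaces both the summary and the generator with learned networks, but the workflow (summarize dataset, emit weights, classify immediately) is the same shape.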
Fri 9:54 a.m. - 10:01 a.m.
|
Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation
(
Spotlight
)
>
link
SlidesLive Video Tabular data is prevalent across various machine learning domains. Yet, the inherent heterogeneity of attribute and class spaces across different tabular datasets hinders the effective sharing of knowledge, limiting a tabular model's ability to benefit from other datasets. In this paper, we propose Tabular data Pre-Training via Meta-representation (TabPTM), which allows one tabular model to be pre-trained on a set of heterogeneous datasets. This pre-trained model can then be directly applied to unseen datasets that have diverse attributes and classes, without additional training. Specifically, TabPTM represents an instance through its distance to a fixed number of prototypes, thereby standardizing heterogeneous tabular datasets. A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences, endowing TabPTM with the ability of training-free generalization. Experiments validate that TabPTM achieves promising performance on new datasets, even under few-shot scenarios. |
Han-Jia Ye · Qile Zhou · De-Chuan Zhan 🔗 |
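The key standardization step above, re-encoding an instance as its distances to a fixed number of prototypes, can be sketched in a few lines. The prototype values are illustrative placeholders; how prototypes are chosen is the paper's business, not shown here.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def meta_representation(x, prototypes):
    """Distances to a fixed set of prototypes: a same-size vector for any dataset."""
    return [euclidean(x, p) for p in prototypes]

prototypes = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
print(meta_representation([1.0, 0.0], prototypes))
```

Because the output length equals the number of prototypes regardless of how many raw attributes a dataset has, a single downstream network can consume instances from heterogeneous tables.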
Fri 10:00 a.m. - 11:30 a.m.
|
Lunch Break
(
Break
)
>
|
Fri 11:30 a.m. - 12:00 p.m.
|
Invited talk: Next-Generation Data Management with Large Language Models
(
Talk
)
>
SlidesLive Video The past years have been marked by several breakthrough results in the domain of generative AI, culminating in the rise of tools like ChatGPT, able to solve a variety of language-related tasks without specialized training. In this talk, I discuss several recent research projects at Cornell, exploiting large language models to enhance relational database management systems. These projects cover applications of language models in the database interface, enabling users to specify high-level analysis goals for fully automated end-to-end analysis, as well as applications in the backend, using language models to extract useful information for data profiling and database tuning from text documents. |
Immanuel Trummer 🔗 |
Fri 12:00 p.m. - 12:07 p.m.
|
Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
(
Spotlight
)
>
link
SlidesLive Video Large language models (LLMs) are increasingly applied to tabular tasks using in-context learning. The prompt representation of a table may play a role in the LLM's ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks. |
Ananya Singha · José Cambronero · Sumit Gulwani · Vu Le · Chris Parnin 🔗 |
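To make the "formats" concrete, here is a toy serializer producing two common prompt representations of the same table, a markdown grid and a CSV dump. These two are generic illustrations, not necessarily among the eight formats the paper evaluates.

```python
header = ["city", "population"]
rows = [["Berlin", "3.7M"], ["Paris", "2.1M"]]

def to_markdown(header, rows):
    # pipe-delimited grid with a separator row, as commonly used in prompts
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)

def to_csv(header, rows):
    # flat comma-separated dump (assumes no commas inside cells)
    return "\n".join(",".join(r) for r in [header] + rows)

print(to_markdown(header, rows))
print(to_csv(header, rows))
```

The paper's point is that the choice between such representations, and noise applied to them, measurably shifts an LLM's structural understanding of the same underlying table.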
Fri 12:08 p.m. - 12:15 p.m.
|
How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
(
Spotlight
)
>
link
SlidesLive Video Large language models (LLMs) with in-context learning have demonstrated remarkable capability in the text-to-SQL task. Previous research has prompted LLMs with various demonstration-retrieval strategies and intermediate reasoning steps to enhance the performance of LLMs. However, those works often employ varied strategies when constructing the prompt text for text-to-SQL inputs, such as databases and demonstration examples. This leads to a lack of comparability in both the prompt constructions and their primary contributions. Furthermore, selecting an effective prompt construction has emerged as a persistent problem for future research. To address this limitation, we comprehensively investigate the impact of prompt constructions across various settings and provide insights into prompt constructions for future text-to-SQL studies. |
Shuaichen Chang · Eric Fosler-Lussier 🔗 |
Fri 12:16 p.m. - 12:23 p.m.
|
IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
(
Spotlight
)
>
link
SlidesLive Video There is a massive amount of tabular data that can be leveraged via 'foundation models' to improve prediction performance for downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning semantic relevance between tables and features, mismatched schemas, arbitrarily high cardinality for categorical values, and scalability to many tables, rows, and features. We propose IngesTables, a novel canonical tabular foundation model building framework designed to address the aforementioned challenges. IngesTables employs LLMs to encode representations of table/feature semantics and their relationships, which are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, IngesTables is much cheaper to train and faster to run inference, because of how LLM-generated embeddings are defined and cached. We show that IngesTables demonstrates significant improvements over commonly-used models like XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models that are specialized for clinical trial datasets, without incurring LLM-level cost and latency. |
Scott Yak · Yihe Dong · Javier Gonzalvo · Sercan Arik 🔗 |
Fri 12:30 p.m. - 1:00 p.m.
|
Invited talk: Advancing Natural Language Interfaces to Data with Language Models as Agents
(
Talk
)
>
SlidesLive Video Traditional Natural Language Interfaces (NLIs) to data often necessitate users to provide detailed, step-by-step instructions, reflecting an assumption of user familiarity with the underlying data and systems, which can limit accessibility. The emergence of Large Language Models (LLMs) has, however, revolutionized NLIs, enabling them to perform sophisticated reasoning, decision-making, and planning multi-step actions in diverse environments autonomously. In this talk, I will discuss how these language models as agents facilitate a paradigm shift towards moving beyond traditional code generation to more autonomous and user-friendly NLIs, capable of understanding high-level objectives without requiring intricate directives. I will also present our latest work in this direction, including instruction-finetuned retrievers for diverse environment adaptation, the enhancement of LLM capabilities with tool integration, and the development of open, state-of-the-art LLMs and platforms for constructing such language agents. The talk will conclude with an exploration of the current and future research prospects in this rapidly evolving domain. |
Tao Yu 🔗 |
Fri 1:00 p.m. - 1:20 p.m.
|
Coffee Break + poster setup
(
Break
)
>
|
Fri 1:20 p.m. - 2:00 p.m.
|
Poster Session 2
(
Poster Session
)
>
|
Fri 2:00 p.m. - 2:30 p.m.
|
Invited talk: Enabling Large Language Models to Reason with Tables
(
Talk
)
>
SlidesLive Video Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, there is still much to learn about how well LLMs understand structured data, such as tables. While it is true that tables can be used as inputs to LLMs with serialization, there lack comprehensive studies examining whether LLMs can truly comprehend such data. In this talk, I will cover different ways to utilize LLMs to interface with tables. One approach is to feed the whole table as a sequence to LLMs for reasoning. In this direction, we will talk about the recent paper GPT4Table to summarize the lessons learned in different table linearization strategies, including table input format, content order, role prompting, and partition marks. The other approach is to use tools like SQL or other language to interface with a table for data access without feeding the entire table. LLMs will work as a reasoner to derive the answer based on the interfaced results from the table. |
Wenhu Chen 🔗 |
Fri 2:30 p.m. - 3:15 p.m.
|
Panel - TBA
(
Panel
)
>
SlidesLive Video |
Fri 3:15 p.m. - 3:30 p.m.
|
Closing notes
(
Talk
)
>
SlidesLive Video |
-
|
MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering
(
Poster
)
>
link
Recent advances in tabular question answering (QA) with large language models are constrained in their coverage and only answer questions over a single table. However, real-world queries are complex in nature, often spanning multiple tables in a relational database or web page. Single-table questions do not involve common table operations such as set operations, Cartesian products (joins), or nested queries. Furthermore, multi-table operations often result in a tabular output, which necessitates table generation capabilities in tabular QA models. To fill this gap, we propose a new task of answering questions over multiple tables. Our model, MultiTabQA, not only answers questions over multiple tables, but also generalizes to generate tabular answers. To enable effective training, we build a pre-training dataset comprising 132,645 SQL queries and tabular answers. Further, we evaluate the generated tables by introducing table-specific metrics of varying strictness, assessing various levels of granularity of the table structure. MultiTabQA outperforms state-of-the-art single-table QA models adapted to a multi-table QA setting by finetuning on three datasets: Spider, Atis and GeoQuery. |
Vaishali Pal · Andrew Yates · Evangelos Kanoulas · Maarten Rijke 🔗 |
-
|
Generating Data Augmentation Queries Using Large Language Models
(
Poster
)
>
link
Users often want to augment entities in their datasets with relevant information. As many external sources are accessible only via keyword-search interfaces, a user usually has to manually formulate a keyword query that extracts relevant information for each entity. This is challenging, as many data sources contain numerous tuples, only a small fraction of which may be relevant. Moreover, different datasets may represent the same information in distinct forms and under different terms. In such cases, it is difficult to formulate a query that precisely retrieves information relevant to a specific entity. Current methods for information enrichment mainly rely on resource-intensive manual effort to formulate queries to discover relevant information. However, it is often important for users to get initial answers quickly and without substantial investment in resources (such as human attention). We propose a progressive approach to discovering entity-relevant information from external sources with minimal expert intervention. It leverages end users' feedback to progressively learn how to retrieve information relevant to each entity in a dataset from external data sources. To bootstrap performance, we use a pre-trained large language model (LLM) to produce rich representations of entities. We evaluate the use of parameter-efficient techniques for aligning the LLM's representations with our downstream task of online query policy learning. |
Christopher Buss · Jasmin Mousavi · Mikhail Tokarev · Arash Termehchy · David Maier · Stefan Lee 🔗 |
-
|
ReConTab: Regularized Contrastive Representation Learning for Tabular Data
(
Poster
)
>
link
Representation learning stands as one of the critical machine learning techniques across various domains. Through the acquisition of high-quality features, pre-trained embeddings significantly reduce input space redundancy, benefiting downstream pattern recognition tasks such as classification, regression, or detection. Nonetheless, in the domain of tabular data, feature engineering and selection still heavily rely on manual intervention, leading to time-consuming processes and necessitating domain expertise. In response to this challenge, we introduce ReConTab, a deep automatic representation learning framework with regularized contrastive learning. Agnostic to any type of modeling task, ReConTab constructs an asymmetric autoencoder based on the same raw features from model inputs, producing low-dimensional representative embeddings. Specifically, regularization techniques are applied for raw feature selection. Meanwhile, ReConTab leverages contrastive learning to distill the most pertinent information for downstream tasks. Experiments conducted on extensive real-world datasets substantiate the framework's capacity to yield substantial and robust performance improvements. Furthermore, we empirically demonstrate that pre-trained embeddings can seamlessly integrate as easily adaptable features, enhancing the performance of various traditional methods such as XGBoost and Random Forest. |
Suiyao Chen · Jing Wu · NAIRA HOVAKIMYAN · Handong Yao 🔗 |
-
|
Unlocking the Transferability of Tokens in Deep Models for Tabular Data
(
Poster
)
>
link
Fine-tuning a pre-trained deep neural network has become a successful paradigm in various machine learning tasks. However, such a paradigm becomes particularly challenging with tabular data when there are discrepancies between the feature sets of pre-trained models and the target tasks. In this paper, we propose TabToken, a method that aims at enhancing the quality of feature tokens (i.e., embeddings of tabular features). TabToken allows for the utilization of pre-trained models when the upstream and downstream tasks share overlapping features, facilitating model fine-tuning even with limited training examples. Specifically, we introduce a contrastive objective that regularizes the tokens, capturing the semantics within and across features. During the pre-training stage, the tokens are learned jointly with top-layer deep models such as transformers. In the downstream task, tokens of the shared features are kept fixed while TabToken efficiently fine-tunes the remaining parts of the model. TabToken not only enables knowledge transfer from a pre-trained model to tasks with heterogeneous features, but also enhances the discriminative ability of deep tabular models in standard classification and regression tasks. |
Qile Zhou · Han-Jia Ye · Leye Wang · De-Chuan Zhan 🔗 |
-
|
Augmentation for Context in Financial Numerical Reasoning over Textual and Tabular Data with Large-Scale Language Model
(
Poster
)
>
link
Constructing large-scale datasets for numerical reasoning over tabular and textual data in the financial domain is particularly challenging. Moreover, even the commonly used augmentation techniques for dataset construction prove ineffective at augmenting financial datasets. To address this challenge, this paper proposes a context augmentation methodology for enhancing financial datasets, which generates new contexts for the original questions. To do this, we leverage the hallucination capability of large-scale generative language models. Specifically, by prompting the language model with generation instructions and constraints, together with the original dataset's questions and arithmetic programs, we create plausible contexts that provide evidence for the given questions. The experimental results showed that reasoning performance improved when we augmented the FinQA dataset using our methodology and trained the model with it. |
Yechan Hwang · Jinsu Lim · Young-Jun Lee · Ho-Jin Choi 🔗 |
-
|
TabContrast: A Local-Global Level Method for Tabular Contrastive Learning
(
Poster
)
>
link
Representation learning is a cornerstone of contemporary artificial intelligence, significantly boosting performance across diverse downstream tasks. Notably, domains like computer vision and NLP have witnessed transformative advancements owing to self-supervised contrastive learning techniques. Yet, the translation of these techniques to tabular data remains an intricate challenge. Traditional approaches, especially within the tabular arena, tend to explore model architecture and loss function design, often overlooking the nuanced creation of positive and negative sample pairs. These pairs are vital, shaping the quality of the learned representations and the overall model efficacy. Recognizing this imperative, our paper probes the specificities of tabular data and the unique challenges it presents. As a solution, we introduce "TabContrast". This method adopts a local-global contrast approach, segmenting features into subsets and subsequently performing tailored clustering to unveil inherent data patterns. By aligning samples with cluster centroids and emphasizing clear semantic distinctions, TabContrast promises enhanced representation efficacy. Preliminary evaluations highlight its potential, particularly in tabular datasets with more features available. |
Hao Liu · Yixin Chen · Bradley A Fritz · Christopher King 🔗 |
-
|
Explaining Explainers: Necessity and Sufficiency in Tabular Data
(
Poster
)
>
link
In recent years, ML classifiers trained on tabular data have been used to make efficient and fast decisions for various decision-making tasks. The lack of transparency in the decision-making processes of these models has led to the emergence of EXplainable AI (XAI). However, discrepancies exist among XAI programs, raising concerns about their accuracy. The notion of what an "important" and "relevant" feature is differs across explanation strategies. Thus, grounding them using theoretically backed ideas of necessity and sufficiency can prove to be a reliable way to increase their trustworthiness. We propose a novel approach to quantify these two concepts in order to provide a means to explore which explanation method might be suitable for tasks involving sparse, high-dimensional tabular datasets. Moreover, our global necessity and sufficiency scores aim to help experts correlate their domain knowledge with our findings, and also provide an extra basis for evaluating the results of popular local explanation methods like LIME and SHAP. |
Prithwijit Chowdhury · Mohit Prabhushankar · Ghassan AlRegib 🔗 |
-
|
Beyond Individual Input for Deep Anomaly Detection on Tabular Data
(
Poster
)
>
link
Anomaly detection is vital in many domains, such as finance, healthcare, and cybersecurity. In this paper, we propose a novel deep anomaly detection method for tabular data that leverages Non-Parametric Transformers (NPTs), a model initially proposed for supervised tasks, to capture both feature-feature and sample-sample dependencies. In a reconstruction-based framework, we train the NPT to reconstruct masked features of normal samples. In a non-parametric fashion, we leverage the whole training set during inference and use the model's ability to reconstruct the masked features to generate an anomaly score. To the best of our knowledge, this is the first work to successfully combine feature-feature and sample-sample dependencies for anomaly detection on tabular datasets. Through extensive experiments on 31 benchmark tabular datasets, we demonstrate that our method achieves state-of-the-art performance, outperforming existing methods by 2.4% and 1.2% in terms of F1-score and AUROC, respectively. Our ablation study provides evidence that modeling both types of dependencies is crucial for anomaly detection on tabular data. |
Hugo Thimonier · Fabrice Popineau · Arpad Rimmel · Bich-Liên DOAN 🔗 |
-
|
GradTree: Learning Axis-Aligned Decision Trees with Gradient Descent
(
Poster
)
>
link
Decision Trees (DTs) are commonly used for many machine learning tasks due to their high degree of interpretability. However, learning a DT from data is a difficult optimization problem, as it is non-convex and non-differentiable. Therefore, common approaches learn DTs using a greedy growth algorithm that minimizes the impurity locally at each internal node. Unfortunately, this greedy procedure can lead to inaccurate trees. In this paper, we present a novel approach for learning hard, axis-aligned DTs with gradient descent. The proposed method uses backpropagation with a straight-through operator on a dense DT representation to jointly optimize all tree parameters. Our approach outperforms existing methods on a wide range of binary classification benchmarks and is available at: https://github.com/s-marton/GradTree |
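The straight-through trick the abstract refers to can be sketched generically as follows (an illustrative reading, not the authors' implementation; the sigmoid surrogate and `temperature` are assumptions):

```python
import numpy as np

def soft_split(x, threshold, temperature=0.1):
    # differentiable surrogate: probability of routing a sample right
    return 1.0 / (1.0 + np.exp(-(x - threshold) / temperature))

def straight_through_split(x, threshold, temperature=0.1):
    """Hard axis-aligned split in the forward pass. In an autograd
    framework one would return soft + stop_gradient(hard - soft), so
    the forward value is hard but gradients flow through the sigmoid."""
    hard = (x > threshold).astype(float)
    soft = soft_split(x, threshold, temperature)
    return hard, soft

x = np.array([0.2, 0.8, 1.5])
hard, soft = straight_through_split(x, threshold=1.0)
# hard routing: only the last sample crosses the threshold
```

This lets all split thresholds of a dense tree representation be updated jointly by gradient descent instead of grown greedily.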
Sascha Marton · Stefan Lüdtke · Christian Bartelt · Heiner Stuckenschmidt 🔗 |
-
|
Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
(
Poster
)
>
link
While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset verbatim. On these datasets, although seen during training, good performance on downstream tasks might not be due to overfitting. Our findings underscore the need for ensuring data integrity in machine learning tasks with LLMs. To facilitate future research, we release an open-source tool that can perform various tests for memorization: https://github.com/tabmem/tool. |
Sebastian Bordt · Harsha Nori · Rich Caruana 🔗 |
-
|
InterpreTabNet: Enhancing Interpretability of Tabular Data Using Deep Generative Models and Large Language Models
(
Poster
)
>
link
Tabular data are omnipresent in various sectors of industry. Neural networks for tabular data such as TabNet have been proposed to make predictions while leveraging the attention mechanism for interpretability. We find that the inferred attention masks on high-dimensional data are often dense, hindering interpretability. To remedy this, we propose InterpreTabNet, a variant of the TabNet model that models the attention mechanism as a latent variable sampled from a Gumbel-Softmax distribution. This enables us to regularize the model to learn distinct concepts in the attention masks via a KL divergence regularizer. It prevents overlapping feature selection, which maximizes the model's efficacy and improves interpretability. To automate the interpretation of the features from our model, we employ GPT-4 and use prompt engineering to map the learned feature mask onto natural language text describing the learned signal. Through comprehensive experiments on real-world datasets, we demonstrate that InterpreTabNet outperforms previous methods for interpreting tabular data while attaining competitive accuracy. |
Jacob Yoke Hong Si · Rahul Krishnan · Michael Cooper · Wendy Yusi Cheng 🔗 |
-
|
On Incorporating new Variables during Evaluation
(
Poster
)
>
link
Any classification or regression model needs access at inference time to the same features that were used to train it. In real-world scenarios, however, models often remain in operation for years, and new variables/features may become available only at the inference stage. To use such features conventionally, their values would have to be captured in the dataset that was used for training. We propose a model-agnostic approach in which a model trained without access to these features can still benefit from the additional features available during testing. We show that, without any access to the extra features during the training phase, the proposed approach improves model performance on four real-world tabular datasets. We provide extensive analysis of how and which variables drive the improvement over the model trained without the extra feature. |
Harsimran Bhasin · Soumyadeep Ghosh 🔗 |
-
|
GCondNet: A Novel Method for Improving Neural Networks on Small High-Dimensional Tabular Data
(
Poster
)
>
link
Neural network models often struggle with high-dimensional but small sample-size tabular datasets. One reason is that current weight initialisation methods assume independence between weights, which can be problematic when there are insufficient samples to estimate the model's parameters accurately. In such small data scenarios, leveraging additional structures can improve the model's performance and training stability. To address this, we propose GCondNet, a general approach to enhance neural networks by leveraging implicit structures present in tabular data. We create a graph between samples for each data dimension and utilise Graph Neural Networks (GNNs) to extract this implicit structure and to condition the parameters of the first layer of an underlying predictor network. By creating many small graphs, GCondNet exploits the data's high-dimensionality and thus improves the performance of the underlying predictor network. We demonstrate the effectiveness of our method on 9 real-world datasets, where GCondNet outperforms 15 standard and state-of-the-art methods. The results show that GCondNet is a versatile framework for injecting graph-regularisation into various types of neural networks, including MLPs and tabular Transformers. |
Andrei Margeloiu · Nikola Simidjievski · Pietro Lió · Mateja Jamnik 🔗 |
-
|
High-Performance Transformers for Table Structure Recognition Need Early Convolutions
(
Poster
)
>
link
Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://anonymous.4open.science/r/NeurIPS23-TRL-2 to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning. |
ShengYun Peng · Seongmin Lee · Xiaojing Wang · Rajarajeswari Balasubramaniyan · Duen Horng Chau 🔗 |
-
|
Unnormalized Density Estimation with Root Sobolev Norm Regularization
(
Poster
)
>
link
Density estimation is one of the central problems in non-parametric statistical learning. While parametric neural network-based methods have achieved notable success in fields such as image and text, their non-parametric counterparts lag, particularly in higher dimensions. Non-parametric methods, known for their conceptual simplicity and explicit model bias, can offer enhanced interpretability and more effective regularization control in smaller data regimes or other data modalities. We propose a new approach to non-parametric density estimation that is based on regularizing a Sobolev norm of the density. This method is statistically consistent, is different from Kernel Density Estimation, and makes the inductive bias of the model clear and interpretable. Our method is assessed against the comprehensive ADBench suite for tabular anomaly detection, ranking second among over 15 algorithms, all of which are specifically tailored for anomaly detection in tabular data. The contributions of this paper are as follows: 1. While there is no closed analytic form for the associated kernel, we show that one can approximate it using sampling. 2. The optimization problem needed to determine the density is non-convex, and standard gradient methods do not perform well. However, we show that with an appropriate initialization and using natural gradients, one can obtain well-performing solutions. 3. While the approach provides unnormalized densities, which prevents the use of log-likelihood for cross-validation, we show that one can instead adapt Fisher Divergence-based score matching methods for this task. |
Mark Kozdoba · Binyamin Perets · Shie Mannor 🔗 |
-
|
Self-supervised Representation Learning from Random Data Projectors
(
Poster
)
>
link
Self-supervised representation learning (SSRL) has advanced considerably by exploiting the transformation-invariance assumption under artificially designed data augmentations. While augmentation-based SSRL algorithms push the boundaries of performance in computer vision and natural language processing, they are often not directly applicable to other data modalities such as tabular and time-series data. This paper presents an SSRL approach that can be applied to these data modalities because it does not rely on augmentations or masking. Specifically, we show that high-quality data representations can be learned by reconstructing random data projections. We evaluate the proposed approach on real-world applications with tabular and time-series data. We show that it outperforms multiple state-of-the-art SSRL baselines and is competitive with methods built on domain-specific knowledge. Due to its wide applicability and strong empirical results, we argue that learning from randomness is a fruitful research direction worthy of attention and further study. |
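The core recipe, learning by reconstructing fixed random projections instead of using augmentations, can be sketched as below (a minimal illustration; the dimensions and the MSE objective are assumptions, and the encoder itself is elided):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))            # raw tabular samples

# Fixed random projectors define the self-supervised targets:
# no augmentations or masking are required, only matrix products.
n_projectors, proj_dim = 4, 8
projectors = [rng.normal(size=(16, proj_dim)) for _ in range(n_projectors)]
targets = [X @ P for P in projectors]     # one regression target per projector

def reconstruction_loss(pred, target):
    # an encoder plus one small head per projector would minimize this
    return float(np.mean((pred - target) ** 2))
```

Because the targets are plain linear maps of the input, the same setup applies unchanged to tabular or time-series rows.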
Yi Sui · Tongzi Wu · Jesse Cresswell · Ga Wu · George Stein · Xiao Shi Huang · Xiaochen Zhang · Maksims Volkovs 🔗 |
-
|
Tree-Regularized Tabular Embeddings
(
Poster
)
>
link
Tabular neural networks (NNs) have attracted remarkable attention, and their recent advances have gradually narrowed the performance gap with respect to tree-based models on many public datasets. While mainstream work focuses on calibrating NNs to fit tabular data, we emphasize the importance of homogeneous embeddings and instead concentrate on regularizing tabular inputs through supervised pretraining. Specifically, we extend a recent work named DeepTLF and utilize the structure of pretrained tree ensembles to transform raw variables into a single vector (T2V) or an array of tokens (T2T). Without loss of space efficiency, these binarized embeddings can be directly consumed by canonical tabular NNs with fully-connected or attention-based building blocks. Through quantitative experiments on 88 OpenML datasets with binary classification tasks, we validate that the proposed tree-regularized representations not only narrow the gap with respect to tree-based models, but also achieve on-par or better performance compared with advanced NN models. Most importantly, they possess better robustness and can easily be scaled and generalized as standalone encoders for the tabular modality. |
Xuan Li · Yun Wang · Bo Li 🔗 |
-
|
Binning as a Pretext Task: Improving Self-Supervised Learning in Tabular Domains
(
Poster
)
>
link
The ability of deep networks to learn superior representations hinges on leveraging the proper inductive biases, considering the inherent properties of datasets. In tabular domains, it is critical to effectively handle heterogeneous features (both categorical and numerical) in a unified manner and to grasp irregular functions like piecewise constant functions. To address these challenges in the self-supervised learning framework, we propose a novel pretext task based on the classical binning method. The idea is straightforward: reconstructing the bin indices (either orders or classes) rather than the original values. This pretext task provides the encoder with an inductive bias to capture the irregular dependencies, mapping from continuous inputs to discretized bins, and mitigates the feature heterogeneity by setting all features to have category-type targets. Our empirical investigations ascertain several advantages of binning: compatibility with encoder architecture and additional modifications, standardizing all features into equal sets, grouping similar values within a feature, and providing ordering information. Comprehensive evaluations across diverse tabular datasets corroborate that our method consistently improves tabular representation learning performance for a wide range of downstream tasks. |
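The binning pretext target can be computed in a few lines (a sketch assuming quantile bins; the encoder and the reconstruction head are elided):

```python
import numpy as np

def bin_targets(X, n_bins=10):
    """Replace each numerical feature by its quantile-bin index; the
    pretext task reconstructs these indices instead of raw values, so
    every feature becomes a category-type target."""
    X = np.asarray(X, dtype=float)
    targets = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        # interior quantile edges group similar values within a feature
        edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        targets[:, j] = np.searchsorted(edges, X[:, j], side="right")
    return targets
```

Quantile edges equalize bin occupancy while preserving the ordering information the abstract highlights.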
Kyungeun Lee · Ye Seul Sim · Hyeseung Cho · Suhee Yoon · Sanghyu Yoon · Woohyung Lim 🔗 |
-
|
A Deep Learning Blueprint for Relational Databases
(
Poster
)
>
link
We introduce a modular neural message-passing scheme that closely follows the formal model of relational databases, effectively enabling end-to-end deep learning directly from database storages. We experiment with several instantiations of the scheme, including notably the use of cross-attention modules to capture the referential constraints of the relational model. We address the issues of efficient learning data representation and loading, salient to the database setting, and compare against representative models from a number of related fields, demonstrating favorable initial results. |
Lukáš Zahradník · Jan Neumann · Gustav Šír 🔗 |
-
|
Scaling TabPFN: Sketching and Feature Selection for Tabular Prior-Data Fitted Networks
(
Poster
)
>
link
Tabular classification has traditionally relied on supervised algorithms, which estimate the parameters of a prediction model using its training data. Recently, Prior-Data Fitted Networks such as TabPFN have successfully learned to classify tabular data in-context: the model parameters are designed to classify new samples based on labelled training samples given after the model training. While such models show great promise, their applicability to real-world data remains limited due to the computational scale needed. We conduct an initial investigation of sketching and feature-selection methods for TabPFN, and note certain key differences between it and conventionally fitted tabular models. |
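The kind of preprocessing under study can be illustrated as follows (a generic sketch combining uniform row sketching with a variance-based feature filter; the actual methods compared in the paper may differ):

```python
import numpy as np

def sketch_for_icl(X, y, max_rows=1000, max_feats=100, rng=None):
    # Shrink a training table to fit an in-context learner's budget
    # (models like TabPFN have hard limits on rows and features).
    rng = rng or np.random.default_rng(0)
    if X.shape[0] > max_rows:
        idx = rng.choice(X.shape[0], size=max_rows, replace=False)
        X, y = X[idx], y[idx]
    if X.shape[1] > max_feats:
        # keep the highest-variance columns (one simple selection rule)
        keep = np.argsort(X.var(axis=0))[::-1][:max_feats]
        X = X[:, keep]
    return X, y
```

The reduced table is then passed as in-context "training" examples alongside the query points, with no parameter updates.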
Benjamin Feuer · Niv Cohen · Chinmay Hegde 🔗 |
-
|
Modeling string entries for tabular data prediction: do we need big large language models?
(
Poster
)
>
link
Tabular data are often characterized by numerical and categorical features, but these co-exist with features made of text entries, such as names or descriptions. Here, we investigate whether language models can extract information from these text entries. Studying 19 datasets and varying training sizes, we find that using language models to encode text features improves predictions over no encoding and over character-level approaches based on substrings. Furthermore, we find that larger, more advanced language models yield more significant improvements. |
Leo Grinsztajn · Myung Jun Kim · Edouard Oyallon · Gael Varoquaux 🔗 |
-
|
HyperFast: Instant Classification for Tabular Data
(
Poster
)
>
link
Training deep learning models and performing hyperparameter tuning can be computationally demanding and time-consuming. Meanwhile, traditional machine learning methods like gradient-boosting algorithms remain the preferred choice for most tabular data applications, while neural network alternatives require extensive hyperparameter tuning or work only on toy datasets under limited settings. In this paper, we introduce HyperFast, a meta-trained hypernetwork designed for instant classification of tabular data in a single forward pass. HyperFast generates a task-specific neural network tailored to an unseen dataset that can be directly used for classification inference, removing the need to train a model. We report extensive experiments with OpenML and genomic data, comparing HyperFast to competing tabular data neural networks, traditional ML methods, AutoML systems, and boosting machines. HyperFast shows highly competitive results, while being significantly faster. Additionally, our approach demonstrates robust adaptability across a variety of classification tasks with little to no fine-tuning, positioning HyperFast as a strong solution for numerous applications and rapid model deployment. HyperFast introduces a promising paradigm for fast classification, with the potential to substantially decrease the computational burden of deep learning. Our code, which offers a scikit-learn-like interface, along with the trained HyperFast model, can be found at www.url-hidden-for-submission. |
David Bonet · Daniel Mas Montserrat · Xavier Giró-i-Nieto · Alexander Ioannidis 🔗 |
-
|
Hopular: Modern Hopfield Networks for Tabular Data
(
Poster
)
>
link
While Deep Learning excels in structured data as encountered in vision and natural language processing, it has failed to meet expectations on tabular data. For tabular data, Support Vector Machines (SVMs), Random Forests, and Gradient Boosting are the best performing techniques, with Gradient Boosting in the lead. Recently, we saw a surge of Deep Learning methods that were tailored to tabular data but still underperform compared to Gradient Boosting on small-sized datasets. We suggest "Hopular", a novel Deep Learning architecture for medium- and small-sized datasets, where each layer is equipped with continuous modern Hopfield networks. The modern Hopfield networks use stored data to identify feature-feature, feature-target, and sample-sample dependencies. Hopular's novelty is that every layer can directly access the original input as well as the whole training set via stored data in the Hopfield networks. Therefore, Hopular can step-wise update its current model and the resulting prediction at every layer like standard iterative learning algorithms. In experiments on small-sized tabular datasets with less than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods. In experiments on medium-sized tabular data with about 10,000 samples, Hopular outperforms XGBoost, CatBoost, LightGBM, and a state-of-the-art Deep Learning method designed for tabular data. Thus, Hopular is a strong alternative to these methods on tabular data. |
Bernhard Schäfl · Lukas Gruber · Angela Bitto · Sepp Hochreiter 🔗 |
-
|
Training-Free Generalization on Heterogeneous Tabular Data via Meta-Representation
(
Poster
)
>
link
Tabular data is prevalent across various machine learning domains. Yet, the inherent heterogeneities in attribute and class spaces across different tabular datasets hinder the effective sharing of knowledge, limiting a tabular model's ability to benefit from other datasets. In this paper, we propose Tabular data Pre-Training via Meta-representation (TabPTM), which pre-trains one tabular model on a set of heterogeneous datasets. This pre-trained model can then be directly applied to unseen datasets that have diverse attributes and classes without additional training. Specifically, TabPTM represents an instance through its distance to a fixed number of prototypes, thereby standardizing heterogeneous tabular datasets. A deep neural network is then trained to associate these meta-representations with dataset-specific classification confidences, endowing TabPTM with the ability of training-free generalization. Experiments validate that TabPTM achieves promising performance on new datasets, even under few-shot scenarios. |
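The meta-representation described above amounts to a plain distance computation (an illustrative reading of the abstract; how the prototypes themselves are chosen is not specified here):

```python
import numpy as np

def meta_representation(X, prototypes):
    """Represent each instance by its Euclidean distance to a fixed set
    of prototypes, so tables with different numbers of attributes all
    map to vectors of the same length, len(prototypes)."""
    diffs = X[:, None, :] - prototypes[None, :, :]   # (n, k, d)
    return np.sqrt((diffs ** 2).sum(axis=-1))        # (n, k)
```

Because every dataset is reduced to the same k-dimensional distance vector, a single downstream network can consume instances from heterogeneous tables.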
Han-Jia Ye · Qile Zhou · De-Chuan Zhan 🔗 |
-
|
NeuroDB: Efficient, Privacy-Preserving and Robust Query Answering with Neural Networks
(
Poster
)
>
link
The Neural Database framework, or NeuroDB for short, is a novel means of query answering using neural networks. It uses neural networks as a form of data storage by training them to directly answer queries: networks are trained to take queries as input and output query answer estimates. In doing so, relational tables are represented by neural network weights and are queried through a model forward pass. NeuroDB has shown significant practical advantages in (1) approximate query processing, (2) privacy-preserving query answering, and (3) querying incomplete datasets. The success of the NeuroDB framework can be attributed to the approach learning patterns present in the query answers, which it uses to build a compact representation of the dataset with respect to the queries. This allows learning small neural networks that accurately and efficiently represent query answers. Meanwhile, learning such patterns improves accuracy in the presence of error, with this robustness to noise allowing for improved accuracy in private query answering and query answering on incomplete datasets. This paper presents an overview of the NeuroDB framework and its applications to the three aforementioned scenarios. |
Sepanta Zeighami · Cyrus Shahabi 🔗 |
-
|
A DB-First approach to query factual information in LLMs
(
Poster
)
>
link
In many use-cases, information is stored in text but not available in structured data. However, extracting data from natural language (NL) text to precisely fit a schema, and thus enable querying, is a challenging task. With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of text documents. Thus, we envision the use of SQL queries to cover a broad range of data that is not captured by traditional databases (DBs) by tapping the information in LLMs. This ability enables querying the factual information in LLMs with the SQL interface, which is more precise than NL prompts. We present a traditional DB architecture using physical operators for querying the underlying LLM. The key idea is to execute some operators of the query plan with prompts that retrieve data from the LLM. For a large class of SQL queries, querying LLMs returns well structured relations, with encouraging qualitative results. |
Mohammed SAEED · Nicola De Cao · Paolo Papotti 🔗 |
-
|
A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning
(
Poster
)
>
link
Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent over-fitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of LASSO for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features. |
Valeriia Cherepanova · Roman Levin · Gowthami Somepalli · Jonas Geiping · C. Bayan Bruss · Andrew Wilson · Tom Goldstein · Micah Goldblum 🔗 |
-
|
Incorporating LLM Priors into Tabular Learners
(
Poster
)
>
link
We present a method to integrate Large Language Models (LLMs) and traditional tabular data classification techniques, addressing LLMs’ challenges like data serialization sensitivity and biases. We introduce two strategies utilizing LLMs for ranking categorical variables and generating priors on correlations between continuous variables and targets, enhancing performance in few-shot scenarios. We focus on Logistic Regression, introducing MonotonicLR that employs a non-linear monotonic function for mapping ordinals to cardinals while preserving LLM-determined orders. Validation against baseline models reveals the superior performance of our approach, especially in low-data scenarios, while remaining interpretable. |
Max Zhu · Siniša Stanivuk · Andrija Petrovic · Mladen Nikolic · Pietro Lió 🔗 |
-
|
CHORUS: Foundation Models for Unified Data Discovery and Exploration
(
Poster
)
>
link
We apply foundation models to data discovery and exploration tasks. Foundation models are large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation, and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms the task-specific models and thus the state of the art. Further, our approach often surpasses human-expert task performance. We investigate the fundamental characteristics of this approach, including generalizability across several foundation models and dataset contamination. All in all, this suggests a future direction in which disparate data management tasks can be unified under foundation models. |
Moe Kayali · Anton Lykov · Ilias Fountalis · Nikolaos Vasiloglou · Dan Olteanu · Dan Suciu 🔗 |
-
|
Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs
(
Poster
)
>
link
Large language models (LLMs) are increasingly applied to tabular tasks using in-context learning. The prompt representation of a table may play a role in the LLM's ability to process it. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks. |
Ananya Singha · José Cambronero · Sumit Gulwani · Vu Le · Chris Parnin 🔗 |
-
|
Introducing the Observatory Library for End-to-End Table Embedding Inference
(
Poster
)
>
link
Transformer-based tabular language models have become prevalent for a wide range of applications involving tabular data. Such models require the serialization of a table as a sequence of tokens for model ingestion and embedding inference. Different downstream tasks require different kinds or levels of embeddings such as column or entity embeddings. Hence, various serialization and encoding methods have been proposed and implemented. Surprisingly, this conceptually simple process of creating table embeddings is not straightforward in practice for a few reasons: 1) a model may not natively expose a certain level of embedding; 2) choosing the correct table serialization and input preprocessing methods is difficult because there are many available; and 3) tables with a massive number of rows and columns cannot fit the input limit of models. In this work, we extend Observatory, a framework for characterizing embeddings of relational tables, by streamlining end-to-end inference of table embeddings, which eases the use of tabular language models in practice. The codebase of Observatory is publicly available at https://github.com/superctj/observatory. |
Tianji Cong · Zhenjie Sun · Paul Groth · H. V. Jagadish · Madelon Hulsebos 🔗 |
-
|
Scaling Experiments in Self-Supervised Cross-Table Representation Learning
(
Poster
)
>
link
To analyze the scaling potential of deep tabular representation learning models, we introduce a novel Transformer-based architecture specifically tailored to tabular data and cross-table representation learning, utilizing table-specific tokenizers and a shared Transformer backbone. Our training approach encompasses both single-table and cross-table models, trained via missing value imputation through a self-supervised masked cell recovery objective. To understand the scaling behavior of our method, we train models of varying sizes, ranging from approximately $10^4$ to $10^7$ parameters. These models are trained on a carefully curated pretraining dataset consisting of 135M training tokens sourced from 76 diverse datasets. We assess the scaling of our architecture in both single-table and cross-table pretraining setups by evaluating the pretrained models using linear probing on a curated set of benchmark datasets and comparing the results with conventional baselines.
|
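The masked-cell-recovery objective can be set up as below (a minimal sketch; the mask rate and the NaN placeholder stand in for whatever mask token the table-specific tokenizer uses):

```python
import numpy as np

def mask_cells(X, mask_rate=0.15, rng=None):
    # Hide a random fraction of cells; the model is trained to impute
    # the hidden values from the remaining, visible cells.
    rng = rng or np.random.default_rng(0)
    mask = rng.random(X.shape) < mask_rate
    X_masked = X.copy()
    X_masked[mask] = np.nan
    return X_masked, mask
```

The same objective works for single-table and cross-table pretraining, since only the tokenizer is table-specific.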
Maximilian Schambach · Dominique Paul · Johannes Otterbach 🔗 |
-
|
Benchmarking Tabular Representation Models in Transfer Learning Settings
(
Poster
)
>
link
Deep learning has revolutionized the transfer of knowledge between similar tasks in data modalities such as images, text, and graphs. However, the same level of success has not been attained for tabular data. This disparity can be attributed to the inherent absence of structural characteristics, such as spatial and temporal correlations, within common tabular datasets. Moreover, classic methods such as logistic regression and decision trees have been shown to perform competitively with deep learning methods. In this work, we benchmark classic and deep learning methods specifically in the transfer learning setting. We offer new benchmarking results for the EHR phenotyping task on the MetaMIMIC dataset and propose a new transfer learning setting of transferring mortality prediction from common to rare cancers with The Cancer Genome Atlas (TCGA). |
Qixuan Jin · Talip Ucar 🔗 |
-
|
Exploring the Retrieval Mechanism for Tabular Deep Learning
(
Poster
)
>
link
While interest in tabular deep learning has grown significantly, conventional tree-based models still outperform deep learning methods. To narrow this performance gap, we explore the retrieval mechanism, a methodology that allows neural networks to refer to other data points while making predictions. Our experiments reveal that retrieval-based training, especially when fine-tuning the pretrained TabPFN model, notably surpasses existing methods. Moreover, extensive pretraining plays a crucial role in enhancing model performance. These insights imply that blending the retrieval mechanism with pretraining and transfer learning schemes offers considerable potential for advancing the field of tabular deep learning. |
Felix den Breejen · Sangmin Bae · Stephen Cha · Tae-Young Kim · Seoung Hyun Koh · Se-Young Yun 🔗 |
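At its simplest, the retrieval mechanism above lets a model condition its prediction for a query on nearby training rows. A minimal nearest-neighbour stand-in conveys the intuition (data and function names are toy illustrations; the paper fine-tunes TabPFN to attend over retrieved rows rather than averaging labels):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_and_predict(query, train_X, train_y, k=3):
    """Predict for `query` by retrieving the k nearest training rows
    and averaging their labels, a minimal stand-in for letting a
    network attend over retrieved neighbours."""
    neighbours = sorted(range(len(train_X)),
                        key=lambda i: euclidean(query, train_X[i]))[:k]
    return sum(train_y[i] for i in neighbours) / k

train_X = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]]
train_y = [0.0, 0.0, 1.0, 1.0]
print(retrieve_and_predict([0.05, 0.05], train_X, train_y, k=2))  # → 0.0
```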
-
|
In Defense of Zero Imputation for Tabular Deep Learning
(
Poster
)
>
link
Missing values are a common problem in many supervised learning contexts. While a wealth of literature exists on missing value imputation, far less has focused on the impact of imputation on downstream supervised learning. Recently, impute-then-predict neural networks have been proposed as a powerful solution to this problem, allowing for joint optimization of imputations and predictions. In this paper, we illustrate a somewhat surprising result: multi-layer perceptrons (MLPs) paired with zero imputation perform as well as more powerful deep impute-then-predict models on real-world data. To support this finding, we analyze the results of various deep impute-then-predict models to better understand why they fail to outperform zero imputation. Our analysis sheds light on the difficulties of imputation in real-world contexts and highlights the utility of zero imputation for tabular deep learning. |
John Van Ness · Madeleine Udell 🔗 |
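The zero-imputation baseline defended above is simple to state in code. A minimal sketch, assuming missing entries are represented as `None`; the helper name and the optional missingness-indicator columns are illustrative, not the paper's implementation:

```python
def zero_impute(rows, add_mask=False):
    """Replace missing entries (None) with 0.0; optionally append
    per-feature missingness indicators, a common companion trick."""
    out = []
    for row in rows:
        filled = [0.0 if v is None else float(v) for v in row]
        if add_mask:
            filled += [1.0 if v is None else 0.0 for v in row]
        out.append(filled)
    return out

rows = [[1.0, None], [None, 2.0]]
print(zero_impute(rows))                 # [[1.0, 0.0], [0.0, 2.0]]
print(zero_impute(rows, add_mask=True))  # [[1.0, 0.0, 0.0, 1.0], [0.0, 2.0, 1.0, 0.0]]
```

The imputed matrix can then be fed directly to an MLP; the paper's point is that this simple pairing is a surprisingly strong baseline.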
-
|
Data Ambiguity Strikes Back: How Documentation Improves GPT's Text-to-SQL
(
Poster
)
>
link
Text-to-SQL allows experts to use databases without in-depth knowledge of them. However, real-world tasks involve both query and data ambiguities. Most work on Text-to-SQL has focused on query ambiguities and designed chat interfaces for experts to provide clarifications. In contrast, the data management community has long studied data ambiguities, but mainly to detect and correct errors rather than to document ambiguities for disambiguation in data tasks. This work delves into data ambiguities in real-world datasets. We identify prevalent data ambiguities (value consistency, data coverage, and data granularity) that affect downstream tasks. We examine how documentation, originally created to help humans disambiguate data, can help GPT-4 with Text-to-SQL tasks. By offering documentation on these ambiguities, we find that GPT-4's performance improves by 28.9%.
|
Zachary Huang · Pavan Kalyan Damalapati · Eugene Wu 🔗 |
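The intervention studied above amounts to prepending data documentation to the model's prompt. A sketch of such prompt construction, with a hypothetical format, schema, and notes (not the paper's exact prompt template):

```python
def build_prompt(schema, docs, question):
    """Assemble a text-to-SQL prompt that prepends data documentation
    (value conventions, coverage, granularity notes) to the schema."""
    doc_block = "\n".join(f"-- NOTE: {d}" for d in docs)
    return f"{schema}\n{doc_block}\n-- Question: {question}\nSELECT"

prompt = build_prompt(
    "CREATE TABLE sales(region TEXT, amount REAL);",
    ["region uses ISO codes, e.g. 'US', not 'USA'",
     "amount is recorded monthly, not daily"],
    "Total sales for the US?",
)
print(prompt)
```

Without the notes, a model might generate `WHERE region = 'USA'`; the documentation resolves exactly this kind of value-consistency ambiguity.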
-
|
IngesTables: Scalable and Efficient Training of LLM-Enabled Tabular Foundation Models
(
Poster
)
>
link
There is a massive amount of tabular data that can be taken advantage of via `foundation models' to improve prediction performance for downstream tabular prediction tasks. However, numerous challenges constitute bottlenecks in building tabular foundation models, including learning semantic relevance between tables and features, mismatched schemas, arbitrarily high cardinality for categorical values, and scalability to many tables, rows, and features. We propose \texttt{IngesTables}, a novel canonical tabular foundation model building framework designed to address the aforementioned challenges. \texttt{IngesTables} employs LLMs to encode representations of table/feature semantics and their relationships, which are then modeled via an attention-based tabular architecture. Unlike other LLM-based approaches, \texttt{IngesTables} is much cheaper to train and faster at inference because of how LLM-generated embeddings are defined and cached. We show that \texttt{IngesTables} demonstrates significant improvements over commonly-used models like XGBoost on clinical trial datasets in standard supervised learning settings, and is competitive with tabular prediction models that are specialized for clinical trial datasets, without incurring LLM-level cost and latency. |
Scott Yak · Yihe Dong · Javier Gonzalvo · Sercan Arik 🔗 |
-
|
Pool-Search-Demonstrate: Improving Data-wrangling LLMs via better in-context examples
(
Poster
)
>
link
Data-wrangling is a process that transforms raw data for further analysis and for use in downstream tasks. Recently, it has been shown that foundation models can be successfully used for data-wrangling tasks (Narayan et al., 2022). An important aspect of data wrangling with LMs is properly constructing prompts for the given task. Within these prompts, a crucial component is the choice of in-context examples. In the previous study of Narayan et al., demonstration examples were chosen manually by the authors, which may not scale to new datasets. In this work, we propose a simple demonstration strategy that individualizes demonstration examples for each input by selecting them from a pool based on their distance in the embedding space. Additionally, we propose a postprocessing method that exploits the embedding of labels under a closed-world assumption. Empirically, our embedding-based example retrieval and postprocessing improve foundation models' performance by up to 84\% over randomly selected examples and 49\% over manually selected examples in the demonstration. Ablation tests reveal the effect of class embeddings, and of various factors in demonstration such as quantity, quality, and diversity. |
Joon Suk Huh · Changho Shin · Elina Choi 🔗 |
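The pool-based selection described above can be sketched as ranking a demonstration pool by embedding similarity to the query. Toy 2-d vectors stand in for a real embedding model, and all names here are illustrative:

```python
import math

def nearest_demonstrations(query_emb, pool, k=2):
    """Pick the k pool examples whose embeddings lie closest to the
    query embedding (cosine similarity); these become the in-context
    demonstrations for that particular input."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    ranked = sorted(pool, key=lambda ex: cos(query_emb, ex["emb"]), reverse=True)
    return [ex["text"] for ex in ranked[:k]]

pool = [
    {"text": "Acme Corp -> Acme", "emb": [1.0, 0.0]},
    {"text": "12/31/21 -> 2021-12-31", "emb": [0.0, 1.0]},
    {"text": "ACME Inc. -> Acme", "emb": [0.9, 0.1]},
]
print(nearest_demonstrations([1.0, 0.05], pool, k=2))
```

A query about company-name normalization thus retrieves the two name-cleaning demonstrations rather than the unrelated date-formatting one.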
-
|
How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
(
Poster
)
>
link
Large language models (LLMs) with in-context learning have demonstrated remarkable capability in the text-to-SQL task. Previous research has prompted LLMs with various demonstration-retrieval strategies and intermediate reasoning steps to enhance the performance of LLMs. However, those works often employ varied strategies when constructing the prompt text for text-to-SQL inputs, such as databases and demonstration examples. This leads to a lack of comparability in both the prompt constructions and their primary contributions. Furthermore, selecting an effective prompt construction has emerged as a persistent problem for future research. To address this limitation, we comprehensively investigate the impact of prompt constructions across various settings and provide insights into prompt constructions for future text-to-SQL studies. |
Shuaichen Chang · Eric Fosler-Lussier 🔗 |
-
|
TabPFGen – Tabular Data Generation with TabPFN
(
Poster
)
>
link
Advances in deep generative modelling have not translated well to tabular data. We argue that this is caused by a mismatch in structure between popular generative models and discriminative models of tabular data. We thus devise a technique to turn TabPFN -- a highly performant transformer initially designed for in-context discriminative tabular tasks -- into an energy-based generative model, which we dub TabPFGen. This novel framework leverages the pre-trained TabPFN as part of the energy function and does not require any additional training or hyperparameter tuning, thus inheriting TabPFN's in-context learning capability. We can sample from TabPFGen analogously to other energy-based models. We demonstrate strong results on standard generative modelling tasks, including data augmentation, class-balancing, and imputation, unlocking a new frontier of tabular data generation. |
Jeremy (Junwei) Ma · Apoorv Dankar · George Stein · Guangwei Yu · Anthony Caterini 🔗 |
-
|
Multitask-Guided Self-Supervised Tabular Learning for Patient-Specific Survival Prediction
(
Poster
)
>
link
Survival prediction, central to the analysis of clinical trials, has the potential to be transformed by the availability of RNA-seq data, which reveals the underlying molecular and genetic mechanisms of disease and outcomes. However, the number of RNA-seq samples available for understudied or rare diseases is often limited. To address this, leveraging data across different cancer types can be a viable solution, necessitating the application of self-supervised learning techniques. Yet this wealth of data often comes in a tabular format without a known structure, hindering the development of a generally effective augmentation method for survival prediction. While traditional methods have been constrained by a one-cancer-one-model philosophy or have relied solely on a single modality, our approach, Guided-STab, instead offers a comprehensive alternative: pretraining on all available RNA-seq data from various cancer types while guiding the representation by incorporating sparse clinical features as auxiliary tasks. With a multitask-guided self-supervised representation learning framework, we maximize the potential of vast unlabeled datasets from various cancer types, leading to genomics-driven survival predictions. The auxiliary clinical tasks then guide the learned representations to emphasize critical survival factors. Extensive experiments reinforce the promise of our approach, as Guided-STab consistently outperforms established benchmarks on the TCGA dataset. |
You Wu · Omid Bazgir · Yongju Lee · Tommaso Biancalani · James Lu · Ehsan Hajiramezanali 🔗 |
-
|
Testing the Limits of Unified Sequence to Sequence LLM Pretraining on Diverse Table Data Tasks
(
Poster
)
>
link
Tables stored in databases and tables present in web pages and articles account for a large part of the semi-structured data available on the internet. This motivates the need to develop a modeling approach with large language models (LLMs) that can be used to solve diverse table tasks such as semantic parsing, question answering, and classification. Traditionally, there existed separate sequence-to-sequence models specialized for each table task individually. This raises the question of how far we can go in building a unified model that works well on some table tasks without significant degradation on others. To that end, we attempt to create a shared modeling approach in the pretraining stage with encoder-decoder-style LLMs that can cater to diverse tasks. We evaluate our approach, which continually pretrains and finetunes different T5 model families with data from tables and their surrounding context, on these downstream tasks at different model scales. Through multiple ablation studies, we observe that our pretraining with self-supervised objectives can significantly boost the performance of the models on these tasks. Our work is the first attempt at studying the advantages of a unified approach to table-specific pretraining when scaling from 770M to 11B sequence-to-sequence models, while also comparing the instruction-finetuned variants of the models. |
Soumajyoti Sarkar · Leonard Lausen 🔗 |