We develop large models that “understand” images, videos, and natural language, fueling many intelligent applications from text completion to self-driving cars. But tabular data has long been overlooked despite its dominant presence in data-intensive systems. By learning latent representations from (semi-)structured tabular data, pretrained table models have shown preliminary but impressive performance for semantic parsing, question answering, table understanding, and data preparation. Considering that such tasks share fundamental properties inherent to tables, representation learning for tabular data is an important direction to explore further. These works also surfaced many open challenges such as finding effective data encodings, pretraining objectives and downstream tasks.
Key questions that we aim to address in this workshop are:
- How should tabular data be encoded to make learned Table Models generalize across tasks?
- Which pre-training objectives, architectures, fine-tuning and prompting strategies work for tabular data?
- How should the varying formats, data types, and sizes of tables be handled?
- To what extent can Language Models be adapted towards tabular data tasks, and what are their limits?
- What tasks can existing Table Models accomplish well and what opportunities lie ahead?
- How do existing Table Models perform, what do they learn, where and how do they fall short?
- When and how should Table Models be updated in contexts where the underlying data source continuously evolves?
The Table Representation Learning workshop is the first workshop in this emerging research area and is centered around three main goals:
1) Motivate tabular data as a primary modality for representation learning and further shape this area.
2) Showcase impactful applications of pretrained table models and discuss future opportunities.
3) Foster discussion and collaboration across the machine learning, natural language processing, and data management communities.
Speakers
Alon Halevy (keynote), Meta AI
Graham Neubig (keynote), Carnegie Mellon University
Carsten Binnig, TU Darmstadt
Çağatay Demiralp, Sigma Computing
Huan Sun, Ohio State University
Xinyun Chen, Google Brain
Panelists
Huan Sun (chair), Frank Hutter, Heng Ji, Julian Eisenschlos, Gaël Varoquaux, Graham Neubig
Scope
We invite submissions that address, but are not limited to, any of the following topics on machine learning for tabular data:
Representation Learning: Representation learning techniques for structured (e.g., relational databases) or semi-structured (Web tables, spreadsheet tables) tabular data and interfaces to it. This includes developing specialized data encodings or adapting general-purpose ones (e.g., GPT-3) for tabular data, multimodal learning across tables and other modalities (e.g., natural language, images, code), and relevant fine-tuning and prompting strategies.
Downstream Applications: Machine learning applications involving tabular data, such as data preparation (e.g., data cleaning, integration, cataloging, anomaly detection), retrieval (e.g., semantic parsing, question answering, fact-checking), information extraction, and generation (e.g., table-to-text).
Upstream Applications: Applications that use representation learning to optimize tabular data processing systems, such as table parsers (extracting tables from documents, spreadsheets, presentations, images), storage (e.g., compression, indexing), and querying (e.g., query plan optimization, cost estimation).
Industry Papers: Applications of tabular representation models in production, and challenges of maintaining and managing table representation models in a fast-evolving context, e.g., data updating, error correction, and monitoring.
New Resources: Survey papers, analyses, benchmarks, and datasets for tabular representation models and their applications, as well as visions and reflections to structure and guide future research.
Important dates
Submission open: 20 August 2022
Submission deadline: 26 September 2022
Notifications: 20 October 2022
Camera-ready, slides and recording upload: 3 November 2022
Workshop: 2 December 2022
Submission formats
Abstract: 1 page + references.
Extended abstract: at most 4 pages + references.
Regular paper: at least 6 pages + references.
Questions:
table-representation-learning-workshop@googlegroups.com (public)
m.hulsebos@uva.nl (private)
Fri 6:30 a.m. - 6:45 a.m. | Opening Remarks (Notes)
Fri 6:45 a.m. - 7:30 a.m. | Alon Halevy - "Structured Data Inside and Out" (Keynote)
WebTables contain high-quality data that is relevant to many queries on search engines. Since they are embedded inside web pages, understanding the semantics of tables requires analyzing the text surrounding them on the page. This talk will begin by recalling some of the early challenges we faced with the WebTables Project at Google. I will then turn to a different kind of challenge at the intersection of structured and unstructured data, where the structured data is outside and the unstructured data is inside. For example, when modeling a set of events in a person’s life (or the history of an enterprise or a culture), each event is described in text and other media, but is also associated with structured data such as time and location. Answering questions over such collections of data requires leveraging the structure in the data appropriately. In the second half of the talk, I will discuss the motivations, challenges, and partial solutions to dealing with structured data that is on the outside.
Alon Halevy
Fri 7:30 a.m. - 7:45 a.m. | Analysis of the Attention in Tabular Language Models (Talk)
Recent transformer-based models for learning table representations have reported state-of-the-art results for different tasks such as table understanding, question answering, and semantic parsing. The various proposed models use different architectures, specifically different attention mechanisms. In this paper, we analyze and compare the attention mechanisms used by two different tabular language models. By visualizing the attention maps of the models, we shed light on the different patterns that the models exhibit. With our analysis of the aggregate attention over two tabular datasets, we provide insights that may help towards building more efficient models tailored for table representation learning.
Aneta Koleva · Martin Ringsquandl · Volker Tresp
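To make the analysis concrete, here is a minimal sketch (not the authors' code) of extracting and aggregating attention maps from a HuggingFace transformer over a serialized table row; the model name and serialization format are illustrative stand-ins, assuming the standard `output_attentions` interface.

```python
# Hypothetical sketch: extract and aggregate attention maps from a transformer encoder.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"  # stand-in; a tabular language model would be used instead
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

# A table row serialized as text, a common encoding for tabular language models.
inputs = tokenizer("name: alice | age: 34 | city: Berlin", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer of shape (batch, heads, seq_len, seq_len).
attn = torch.stack(outputs.attentions)        # (layers, batch, heads, seq, seq)
aggregate = attn.mean(dim=(0, 2)).squeeze(0)  # average over layers and heads

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(aggregate.shape, len(tokens))           # (seq, seq) map, ready for plotting
```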
Fri 7:45 a.m. - 8:15 a.m. | Huan Sun - "Self-supervised Pre-training on Tables" (Talk)
Pre-training/fine-tuning paradigms have transformed the natural language processing field. For table-based tasks, however, their potential has been far less explored. In this talk, I will discuss the recent efforts led by my Ph.D. student Xiang Deng: (1) TURL, a pre-training/fine-tuning paradigm for relational Web tables, which benefits a wide range of table understanding tasks (e.g., row population, relation extraction, entity linking). This work won the ACM SIGMOD Research Highlight Award in 2022. (2) StruG, a weakly supervised Structure-Grounded pretraining framework for text-to-SQL, which effectively learns to capture the text-table alignment essential for the task. At the time we tested our model on the Spider leaderboard in 2020, it ranked 6th in the setting using DB content and 1st in the setting without DB content. (3) ReasonBERT, a pre-training method that augments language models for multi-step reasoning over hybrid contexts (textual and tabular). Among them, I will cover TURL in greater detail. Finally, I will conclude the talk with my thoughts on promising future directions.
Huan Sun
Fri 8:15 a.m. - 8:30 a.m. | Coffee/Tea Break
Fri 8:30 a.m. - 9:15 a.m. | Poster Session 1
Fri 9:15 a.m. - 9:45 a.m. | Carsten Binnig - "Pre-trained Models for Learned DBMS Components" (Talk)
Database management systems (DBMSs) are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. To provide high performance, many of the most complex DBMS components, such as query optimizers or schedulers, involve solving non-trivial problems such as query cost estimation. To tackle such problems, recent work has outlined a new direction of so-called learned DBMS components, where core parts of DBMSs are replaced by machine learning (ML) models. While this line of work has been shown to provide significant performance benefits, a major drawback of current workload-driven learning approaches for learned DBMS components is that they incur a high and repeated overhead for training data collection. In this talk, I will therefore discuss a new direction of so-called zero-shot DBMS models: pre-trained models that avoid this repeated training data collection overhead. As a concrete first step, we have realized a zero-shot cost model that can predict query execution cost, a core DBMS task, on an unseen database (i.e., a new set of tables with data) out of the box. Furthermore, I will discuss more recent results on how the general idea of zero-shot DBMS models can be applied to other DBMS components, and even beyond DBMSs to other data systems.
Carsten Binnig
Fri 9:45 a.m. - 10:00 a.m. | STable: Table Generation Framework for Encoder-Decoder Models (Talk)
The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this observation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood of a table's content across all random permutations of the factorization order. During content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and to avoid the substantial error accumulation that other sequential models are prone to. Experiments demonstrate the high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%.
Michał Pietruszka · Michał Turski · Łukasz Borchmann · Tomasz Dwojak · Gabriela Pałka · Karolina Szyndler · Dawid Jurkiewicz · Łukasz Garncarek
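As a rough schematic of the permutation-based training objective described in the abstract above (notation ours, not necessarily the paper's): with target cells $c_1, \ldots, c_N$, model input $x$, and $\sigma$ a random permutation of $\{1, \ldots, N\}$, training maximizes
$$\mathbb{E}_{\sigma}\Big[\sum_{t=1}^{N} \log p_\theta\big(c_{\sigma(t)} \mid c_{\sigma(1)}, \ldots, c_{\sigma(t-1)}, x\big)\Big],$$
so that at inference the decoder can emit cells in whatever order it is most confident about.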
Fri 10:00 a.m. - 10:15 a.m. | Transfer Learning with Deep Tabular Models (Talk)
Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they are easily fine-tuned in new domains and learn reusable features. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we explore the benefits that representation learning provides for knowledge transfer in the tabular domain. We conduct experiments in a realistic medical diagnosis test bed with limited amounts of downstream data and find that transfer learning with deep tabular models provides a definitive advantage over gradient boosted decision tree methods. We further compare the supervised and self-supervised pretraining strategies and provide practical advice on transfer learning with tabular models. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications.
Roman Levin · Valeriia Cherepanova · Avi Schwarzschild · Arpit Bansal · C. Bayan Bruss · Tom Goldstein · Andrew Wilson · Micah Goldblum
Fri 10:15 a.m. - 11:30 a.m. | Lunch Break
Fri 11:30 a.m. - 12:15 p.m. | Graham Neubig - "Unsupervised Methods for Table and Schema Understanding" (Keynote)
In this talk I will discuss two methods that we have recently developed that allow for better understanding of tables. First, I will discuss OmniTab, a method for learning to represent tables using text- and table-based pre-training. Second, I will discuss a method for data augmentation that makes it possible to create pseudo-supervised training data for new database schemas.
Graham Neubig
Fri 12:15 p.m. - 12:30 p.m. | Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning (Talk)
Data wrangling tasks for data integration and cleaning arise in virtually every data-driven application scenario nowadays. Recent research indicated the astounding potential of Large Language Models (LLMs) for such tasks. The automation of data wrangling with LLMs poses additional challenges, however, as hand-tuning task- and data-specific prompts for LLMs requires high expertise and manual effort. On the other hand, finetuning a whole LLM is more amenable to automation, but incurs high storage costs, as a copy of the LLM has to be maintained. In this work, we explore the potential of a lightweight alternative to finetuning an LLM, which automatically learns a continuous prompt. This approach, called prefix-tuning, does not require updating the original LLM parameters, and can therefore re-use a single LLM instance across tasks. At the same time, it is amenable to automation, as continuous prompts can be learned automatically with standard techniques. We evaluate prefix-tuning on common data wrangling tasks for tabular data, such as entity matching, error detection, and data imputation, with promising results. We find that in six out of ten cases, prefix-tuning is within 2.3% of the performance of finetuning, even though it leverages only 0.39% of the parameter updates required for finetuning the full model. These results highlight the potential of prefix-tuning as a parameter-efficient alternative to finetuning for data integration and data cleaning with LLMs.
David Vos · Till Döhmen · Sebastian Schelter
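For illustration, a minimal sketch (not the authors' code) of prefix-tuning a frozen seq2seq LLM on an entity-matching-style prompt, assuming the Hugging Face `peft` library; the model name, prompt format, and hyperparameters are illustrative stand-ins.

```python
# Hypothetical sketch: parameter-efficient prefix-tuning of a frozen seq2seq LM.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PrefixTuningConfig, TaskType, get_peft_model

base = "t5-base"  # stand-in for the LLM used in the paper
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Learn a continuous prefix; the original LLM weights stay frozen.
config = PrefixTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM, num_virtual_tokens=20)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable

prompt = ("Record A: title=iphone 12, price=799. "
          "Record B: title=apple iphone 12 64gb, price=805. Same product?")
inputs = tokenizer(prompt, return_tensors="pt")
labels = tokenizer("yes", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # optimize this in a standard training loop
print(float(loss))
```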
Fri 12:30 p.m. - 12:45 p.m. | Byung-Hak Kim - "RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild" (Talk)
Fri 12:45 p.m. - 1:00 p.m. | TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (Talk)
We present TabPFN, a trained Transformer model that can do tabular supervised classification for small datasets in less than a second, needs no hyperparameter tuning, and is competitive with state-of-the-art classification methods. TabPFN is entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. Our prior incorporates ideas from causal learning: it entails a large space of structural causal models with a preference for simple structures. Afterwards, the trained TabPFN approximates Bayesian prediction on any unseen tabular dataset, without any hyperparameter tuning or gradient-based learning. On 30 datasets from the OpenML-CC18 suite, we show that our method outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with a 70× speedup. This increases to a 3,200× speedup when a GPU is available. We provide all our code and the trained TabPFN at https://anonymous.4open.science/r/TabPFN-2AEE. We also provide an online demo at https://huggingface.co/spaces/TabPFN/TabPFNPrediction.
Noah Hollmann · Samuel Müller · Katharina Eggensperger · Frank Hutter
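A hypothetical usage sketch, assuming the publicly released `tabpfn` package and its scikit-learn-style TabPFNClassifier interface (see the links in the abstract for the authors' code and demo):

```python
# Hypothetical sketch: single-forward-pass classification on a small tabular dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No gradient-based training and no hyperparameter tuning on the target dataset.
clf = TabPFNClassifier(device="cpu")
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
```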
Fri 1:15 p.m. - 1:30 p.m. | Coffee/Tea Break
Fri 1:30 p.m. - 2:00 p.m. | Poster Session 2
Fri 2:00 p.m. - 2:30 p.m. | Xinyun Chen - "Program Synthesis from Semi-Structured Context" (Talk)
With the advancement of modern technologies, programming becomes ubiquitous not only among professional software developers, but also for general computer users. However, gaining programming expertise is time-consuming and challenging. Therefore, program synthesis, where the computer automatically synthesizes programs from user-written descriptions, has many applications. In this talk, I will discuss my research on neural program synthesis from semi-structured context, where the synthesized program is executed on structured input for data processing and analysis. In particular, I will present my work on SpreadsheetCoder for spreadsheet formula prediction, which was integrated into Google Sheets. Our work demonstrates that modeling the tabular structure and learning from multi-modal input is important for inferring user intent, especially when the program specifications are implicit and ambiguous.
Xinyun Chen
Fri 2:30 p.m. - 3:30 p.m. | Panel [Huan Sun (chair), Frank Hutter, Heng Ji, Julian Eisenschlos, Gaël Varoquaux, Graham Neubig]
Fri 3:30 p.m. - 3:45 p.m. | Closing Remarks (Notes)
The best paper award will be announced during this slot as well.
The Need for Tabular Representation Learning: An Industry Perspective (Poster)
The total addressable market for data and intelligence applications has been estimated at $70B. This includes the $11B market for data integration, which is estimated to grow at 25% in the coming year; the $35B market for analytics, growing at 11%; and the $19B market for business intelligence, growing at 8%. Given this data-driven future and the scale at which Microsoft operates (serving over 300K organizations with 50M+ end users), we leverage telemetry across our external and internal cloud and platform services (e.g., Azure, Microsoft 365, Visual Studio, etc.) to gain an understanding of our customer workloads and their constraints at play.
Joyce Cahoon · Alexandra Savelieva · Andreas Mueller · Avrilia Floratou · Carlo Curino · Hiren Patel · Jordan Henkel · Markus Weimer · Roman Batoukov · Shaleen Deep · Venkatesh Emani · Richard Wydrowski · Nellie Gustafsson
SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training (Poster)
Tabular data underpins numerous high-impact applications of machine learning, from fraud detection to genomics and healthcare. Classical approaches to solving tabular problems, such as gradient boosting and random forests, are widely used by practitioners. However, recent deep learning methods have achieved a degree of performance competitive with popular techniques. We devise a hybrid deep learning approach to solving tabular data problems. Our method, SAINT, performs attention over both rows and columns, and it includes an enhanced embedding method. We also study a new contrastive self-supervised pre-training method for use when labels are scarce. SAINT consistently improves performance over previous deep learning methods, and it even performs competitively with gradient boosting methods, including XGBoost, CatBoost, and LightGBM, on average over 30 benchmark datasets in regression, binary classification, and multi-class classification tasks.
Gowthami Somepalli · Avi Schwarzschild · Micah Goldblum · C. Bayan Bruss · Tom Goldstein
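As an illustration of the "attention over rows" idea (intersample attention), here is a minimal PyTorch sketch that is not the authors' implementation: each row's column embeddings are flattened into one token, and self-attention is applied across the rows of a batch so every sample can attend to the others.

```python
# Hypothetical sketch: intersample (row) attention across a batch of table rows.
import torch
import torch.nn as nn

class IntersampleAttention(nn.Module):
    def __init__(self, n_features, d_model, n_heads=4):
        super().__init__()
        # Each row is flattened into one token of size n_features * d_model.
        self.attn = nn.MultiheadAttention(n_features * d_model, n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, n_features, d_model)
        b, f, d = x.shape
        rows = x.reshape(1, b, f * d)          # rows of the batch become the sequence
        out, _ = self.attn(rows, rows, rows)   # every row attends to every other row
        return out.reshape(b, f, d)

x = torch.randn(32, 10, 16)                    # 32 rows, 10 columns, 16-dim embeddings
print(IntersampleAttention(10, 16)(x).shape)   # torch.Size([32, 10, 16])
```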
Generic Entity Resolution Models (Poster)
Entity resolution (ER) -- which decides whether two data records refer to the same real-world object -- is a long-standing data integration problem. The state-of-the-art results on ER are achieved by deep learning based methods, which typically convert each pair of records into a distributed representation, followed by a binary classifier that decides whether the two records are a match or a non-match. However, these methods are dataset specific; that is, one deep learning based model needs to be trained or fine-tuned for each new dataset, which does not generalize, and thus we call them specific ER models. In this paper, we investigate generic ER models, which use a single model to serve multiple ER datasets from various domains. In particular, we study two types of generic ER models: ones that employ foundation models (e.g., GPT-3) and ones that train a generic ER model. Our results show that although GPT-3 can perform ER with zero-shot or few-shot learning, its performance is worse than that of specific ER models. Our trained generic ER model achieves comparable performance to specific ER models, but with much less training data and much smaller storage overhead.
Jiawei Tang · Yifei Zuo · Lei Cao · Samuel Madden
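For illustration, a hypothetical sketch of few-shot prompting a foundation model for entity resolution; the record serialization and prompt template below are illustrative, not the paper's exact format.

```python
# Hypothetical sketch: build a few-shot entity-resolution prompt for a foundation model.
def serialize(record: dict) -> str:
    return ", ".join(f"{k}: {v}" for k, v in record.items())

demos = [
    ({"name": "Sony WH-1000XM4", "price": "348"},
     {"name": "Sony WH1000XM4 Wireless Headphones", "price": "349"}, "yes"),
    ({"name": "Logitech MX Master 3", "price": "99"},
     {"name": "Logitech G502 Hero", "price": "49"}, "no"),
]
query = ({"name": "Apple iPhone 12 64GB", "price": "799"},
         {"name": "iPhone 12, 64 GB, black", "price": "805"})

prompt = "Decide whether the two records refer to the same real-world entity.\n\n"
for a, b, label in demos:
    prompt += f"Record A: {serialize(a)}\nRecord B: {serialize(b)}\nMatch: {label}\n\n"
prompt += f"Record A: {serialize(query[0])}\nRecord B: {serialize(query[1])}\nMatch:"
print(prompt)  # send this prompt to the LLM and read back "yes"/"no"
```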
RoTaR: Efficient Row-Based Table Representation Learning via Teacher-Student Training (Short Paper) (Poster)
We propose RoTaR, a row-based table representation learning method, to address the efficiency and scalability issues faced by existing table representation learning methods. The key idea of RoTaR is to generate query-agnostic row representations that can be re-used via query-specific aggregation. In addition to the row-based architecture, we introduce several techniques -- cell-aware position embedding, an autoencoder objective in transformer models, a teacher-student training paradigm, and selective backward passes -- to improve the performance of the RoTaR model.
Zui Chen · Lei Cao · Samuel Madden
SiMa: Federating Data Silos using GNNs (Poster)
Virtually every sizable organization nowadays is building a form of a data lake. In theory, every department or team in the organization would enrich their datasets with metadata, and store them in a central data lake. Those datasets can then be combined in different ways and produce added value to the organization. In practice, though, the situation is vastly different: each department has its own privacy policies, data release procedures, and goals. As a result, each department maintains its own data lake, leading to data silos. For such data silos to be of any use, they need to be integrated. This paper presents SiMa, a method for federating data silos that consistently finds more correct relationships than the state-of-the-art matching methods, while minimizing wrong predictions and requiring 20x to 1000x less time to execute. SiMa leverages Graph Neural Networks (GNNs) to learn from the existing column relationships and automated data profiles found in data silos. Our method makes use of the trained GNN to perform link prediction and find new column relationships across data silos. Most importantly, SiMa can be trained incrementally on the column relationships within each silo individually, and does not require consolidating all datasets into one place.
Christos Koutras · Rihan Hai · Kyriakos Psarakis · Marios Fragkoulis · Asterios Katsifodimos
STUNT: Few-shot Tabular Learning with Self-generated Tasks from Unlabeled Tables (Poster)
Learning with few labeled tabular samples is an essential requirement for industrial machine learning applications, as many varieties of tabular data suffer from high annotation costs or difficulties in collecting new samples for novel tasks. Despite its importance, this problem is quite under-explored in the field of tabular learning, and existing few-shot learning schemes from other domains are not straightforward to apply, mainly due to the heterogeneous characteristics of tabular data. In this paper, we propose a simple yet effective framework for few-shot tabular learning, coined Self-generated Tasks from UNlabeled Tables (STUNT). Our key idea is to self-generate diverse few-shot tasks by treating randomly chosen columns as a target label. We then employ a meta-learning scheme to learn generalizable knowledge over the constructed tasks. Moreover, we introduce an unsupervised validation scheme for hyperparameter search (and early stopping) by generating a pseudo-validation set using STUNT from unlabeled data. Our experimental results demonstrate that our simple framework brings significant performance gains on various tabular few-shot learning benchmarks, compared to prior semi- and self-supervised baselines.
Jaehyun Nam · Jihoon Tack · Kyungmin Lee · Hankook Lee · Jinwoo Shin
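A minimal sketch (not the authors' code) of the self-generated task idea: pick a random column of an unlabeled table, discretize it into pseudo-classes, and sample a few-shot support set; the quantile binning and sampling details here are assumptions for illustration.

```python
# Hypothetical sketch: self-generate a few-shot task from an unlabeled feature matrix.
import numpy as np

def generate_task(X, n_way=3, k_shot=5, rng=None):
    """Build one pseudo few-shot task by treating a random column as the label."""
    rng = rng or np.random.default_rng()
    target_col = rng.integers(X.shape[1])          # random column becomes the label
    values = X[:, target_col]
    # Discretize the chosen column into n_way pseudo-classes via quantile binning.
    bins = np.quantile(values, np.linspace(0, 1, n_way + 1)[1:-1])
    pseudo_labels = np.digitize(values, bins)
    features = np.delete(X, target_col, axis=1)    # remaining columns are the inputs
    # Sample a k-shot support set per pseudo-class; the rest can serve as queries.
    support_idx = np.concatenate([
        rng.choice(np.where(pseudo_labels == c)[0], size=k_shot, replace=True)
        for c in range(n_way)
    ])
    return features[support_idx], pseudo_labels[support_idx]

X_unlabeled = np.random.rand(1000, 12)             # stand-in for an unlabeled table
support_X, support_y = generate_task(X_unlabeled)  # feed such tasks to a meta-learner
print(support_X.shape, np.bincount(support_y))
```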
Analysis of the Attention in Tabular Language Models (Poster)
Recent transformer-based models for learning table representations have reported state-of-the-art results for different tasks such as table understanding, question answering, and semantic parsing. The various proposed models use different architectures, specifically different attention mechanisms. In this paper, we analyze and compare the attention mechanisms used by two different tabular language models. By visualizing the attention maps of the models, we shed light on the different patterns that the models exhibit. With our analysis of the aggregate attention over two tabular datasets, we provide insights that may help towards building more efficient models tailored for table representation learning.
Aneta Koleva · Martin Ringsquandl · Volker Tresp
Towards Foundation Models for Relational Databases [Vision Paper] (Poster)
Tabular representation learning has recently gained a lot of attention. However, existing approaches only learn a representation from a single table, and thus ignore the potential to learn from the full structure of relational databases, including neighboring tables that can contain important information for a contextualized representation. Moreover, current models are significantly limited in scale, which prevents them from learning from large databases. In this paper, we thus introduce our vision of relational representation learning, which can not only learn from the full relational structure, but can also scale to the larger database sizes commonly found in the real world. Moreover, we discuss opportunities and challenges we see along the way to enable this vision and present initial, very promising results. Overall, we argue that this direction can lead to foundation models for relational databases, which today exist only for text and images.
Liane Vogel · Benjamin Hilprecht · Carsten Binnig
Transfer Learning with Deep Tabular Models (Poster)
Recent work on deep learning for tabular data demonstrates the strong performance of deep tabular models, often bridging the gap between gradient boosted decision trees and neural networks. Accuracy aside, a major advantage of neural models is that they are easily fine-tuned in new domains and learn reusable features. This property is often exploited in computer vision and natural language applications, where transfer learning is indispensable when task-specific training data is scarce. In this work, we explore the benefits that representation learning provides for knowledge transfer in the tabular domain. We conduct experiments in a realistic medical diagnosis test bed with limited amounts of downstream data and find that transfer learning with deep tabular models provides a definitive advantage over gradient boosted decision tree methods. We further compare the supervised and self-supervised pretraining strategies and provide practical advice on transfer learning with tabular models. Finally, we propose a pseudo-feature method for cases where the upstream and downstream feature sets differ, a tabular-specific problem widespread in real-world applications.
Roman Levin · Valeriia Cherepanova · Avi Schwarzschild · Arpit Bansal · C. Bayan Bruss · Tom Goldstein · Andrew Wilson · Micah Goldblum
RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild (Poster)
Recent advances in self-supervised learning (SSL) using large models to learn visual representations from natural images are rapidly closing the gap between the results produced by fully supervised learning and those produced by self-supervised learning on downstream vision tasks. Inspired by this advancement, and primarily motivated by the emergence of tabular and structured document image applications, we question which pretraining objectives without supervision, architectures, and fine-tuning strategies are most effective. To address these questions, we introduce RegCLR, a new self-supervised framework that combines contrastive and regularized methods and is compatible with the standard Vision Transformer (ViT) architecture (Dosovitskiy et al., 2021). RegCLR is instantiated by integrating masked autoencoders (MAE) (He et al., 2022) as a representative example of a contrastive method and enhanced Barlow Twins (eBT) as a representative example of a regularized method, with configurable input image augmentations in both branches. Several real-world table recognition scenarios (e.g., extracting tables from document images), ranging from standard Word and Latex documents to even more challenging electronic health records (EHR) computer screen images, have been shown to benefit greatly from the representations learned with this new framework, with detection AP improving relatively by 4.8% for tables, 11.8% for table columns, and 11.1% for GUI objects over a previous fully supervised baseline on real-world EHR screen images.
Weiyao Wang · Byung-Hak Kim · Varun Ganapathi
Diffusion models for missing value imputation in tabular data (Poster)
Missing value imputation in machine learning is the task of estimating the missing values in a dataset reasonably using the available information. For this task, several deep generative modeling methods have been proposed and demonstrated their usefulness, e.g., generative adversarial imputation networks. Recently, diffusion models have gained popularity because of their effectiveness in generative modeling of images, text, audio, etc. To our knowledge, less attention has been paid to investigating the effectiveness of diffusion models for missing value imputation in tabular data. Building on a recent development of diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (CSDIT). To effectively handle categorical and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bit encoding, and feature tokenization. Experimental results on benchmark datasets demonstrate the effectiveness of CSDIT compared with well-known existing methods, and also emphasize the importance of the categorical embedding techniques.
Shuhan Zheng · Nontawat Charoenphakdee
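As an illustration of one of the three categorical encodings mentioned above, here is a small sketch of analog bit encoding for a categorical column; the shift/scale conventions are assumptions for illustration, not necessarily the paper's.

```python
# Hypothetical sketch: analog bit encoding of a categorical column. The category index
# is written in binary and the bits are mapped to {-1, +1}, so a continuous diffusion
# model can operate on them and the result can be rounded back to a category.
import numpy as np

def analog_bits(indices, n_bits):
    bits = (indices[:, None] >> np.arange(n_bits)) & 1   # binary expansion per category
    return bits.astype(np.float32) * 2.0 - 1.0            # {0,1} -> {-1,+1}

def decode(bits):
    hard = (bits > 0).astype(int)
    return (hard * (1 << np.arange(hard.shape[1]))).sum(axis=1)

cats = np.array([0, 3, 5, 7])          # category indices of one column
enc = analog_bits(cats, n_bits=3)
print(enc)
print(decode(enc))                      # recovers [0 3 5 7]
```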
STab: Self-supervised Learning for Tabular Data (Poster)
Self-supervised learning has drawn recent interest for learning generalizable, transferable and robust representations from unlabeled tabular data. Unfortunately, unlike its image and language counterparts, which have unique spatial or semantic structure, tabular data lacks common structure and is highly diverse, making it difficult to design augmentation methods that are generically beneficial to downstream tasks. Moreover, most existing augmentation methods are domain-specific (such as rotation in vision, token masking in NLP, and edge dropping for graphs), making them less effective for real-world tabular data. This significantly limits tabular self-supervised learning and hinders progress in this domain. Aiming to fill this crucial gap, we propose STab, an augmentation-free self-supervised representation learning method based on stochastic regularization techniques that does not rely on negative pairs, to capture highly heterogeneous and non-structured information in tabular data. Our experiments show that STab achieves state-of-the-art performance compared to existing contrastive and pretext-task self-supervised methods.
Ehsan Hajiramezanali · Max Shen · Gabriele Scalia · Nathaniel Diamant
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (Poster)
We present TabPFN, a trained Transformer model that can do tabular supervised classification for small datasets in less than a second, needs no hyperparameter tuning, and is competitive with state-of-the-art classification methods. TabPFN is entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. Our prior incorporates ideas from causal learning: it entails a large space of structural causal models with a preference for simple structures. Afterwards, the trained TabPFN approximates Bayesian prediction on any unseen tabular dataset, without any hyperparameter tuning or gradient-based learning. On 30 datasets from the OpenML-CC18 suite, we show that our method outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with a 70× speedup. This increases to a 3,200× speedup when a GPU is available. We provide all our code and the trained TabPFN at https://anonymous.4open.science/r/TabPFN-2AEE. We also provide an online demo at https://huggingface.co/spaces/TabPFN/TabPFNPrediction.
Noah Hollmann · Samuel Müller · Katharina Eggensperger · Frank Hutter
MapQA: A Dataset for Question Answering on Choropleth Maps (Poster)
Choropleth maps are a common visual representation for region-specific tabular data and are used in a number of different venues (newspapers, articles, etc.). These maps are human-readable but are often challenging to deal with when trying to extract data for screen readers, analyses, or other related tasks. Recent research into Visual Question Answering (VQA) has studied question answering on human-generated charts (ChartQA), such as bar, line, and pie charts. However, little work has paid attention to understanding maps; general VQA models and ChartQA models suffer when asked to perform this task. To facilitate and encourage research in this area, we present MapQA, a large-scale dataset of ~800,000 question-answer pairs over ~60,000 map images. Our task tests various levels of map understanding, from surface questions about map styles to complex questions that require reasoning on the underlying data. We present the unique challenges of MapQA that frustrate most strong baseline algorithms designed for ChartQA and general VQA tasks. We also present a novel algorithm, Visual Multi-Output Data Extraction based QA (V-MODEQA), for MapQA. V-MODEQA extracts the underlying structured data from a map image with a multi-output model and then performs reasoning on the extracted data. Our experimental results show that V-MODEQA has better overall performance and robustness on MapQA than state-of-the-art ChartQA and VQA algorithms by capturing the unique properties of map question answering.
Shuaichen Chang · David Palzer · Jialin Li · Eric Fosler-Lussier · Ningchuan Xiao
Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning (Poster)
Data wrangling tasks for data integration and cleaning arise in virtually every data-driven application scenario nowadays. Recent research indicated the astounding potential of Large Language Models (LLMs) for such tasks. The automation of data wrangling with LLMs poses additional challenges, however, as hand-tuning task- and data-specific prompts for LLMs requires high expertise and manual effort. On the other hand, finetuning a whole LLM is more amenable to automation, but incurs high storage costs, as a copy of the LLM has to be maintained. In this work, we explore the potential of a lightweight alternative to finetuning an LLM, which automatically learns a continuous prompt. This approach, called prefix-tuning, does not require updating the original LLM parameters, and can therefore re-use a single LLM instance across tasks. At the same time, it is amenable to automation, as continuous prompts can be learned automatically with standard techniques. We evaluate prefix-tuning on common data wrangling tasks for tabular data, such as entity matching, error detection, and data imputation, with promising results. We find that in six out of ten cases, prefix-tuning is within 2.3% of the performance of finetuning, even though it leverages only 0.39% of the parameter updates required for finetuning the full model. These results highlight the potential of prefix-tuning as a parameter-efficient alternative to finetuning for data integration and data cleaning with LLMs.
David Vos · Till Döhmen · Sebastian Schelter
CASPR: Customer Activity Sequence based Prediction and Representation (Poster)
Applications critical to enterprise profitability, such as customer churn prediction, fraudulent account detection, and customer lifetime value estimation, are typically addressed by training dedicated supervised models using features engineered from tabular data containing customer information. Creating custom feature sets tuned to each application has the overhead of development, operationalization, and maintenance over time. Recent advances in representation learning have the potential to simplify the feature engineering process across various applications. However, it is challenging to apply these methods to tabular data due to issues such as data heterogeneity, variations in engagement history across customers, and the large size of enterprise data. In this paper, we propose a novel approach to encode tabular data containing customer transactions, purchase history, and other interactions into a generic representation of a customer's association with the business, and use these embeddings as features to train multiple models spanning a variety of applications. CASPR, Customer Activity Sequence based Prediction and Representation, extends the Transformer architecture to encode activity sequences to improve model performance and avoid bespoke feature engineering across applications. Our experiments with running CASPR at scale show it is suitable for both small and large enterprise data.
Damian Kowalczyk · Pin-Jung Chen · Sahil Bhatnagar
MET: Masked Encoding for Tabular Data (Poster)
This paper proposes Masked Encoding for Tabular Data (MET) for learning self-supervised representations from tabular data. Tabular self-supervised learning (tabular-SSL) -- unlike in structured domains like images, audio, and text -- is more challenging, since each tabular dataset can have a completely different structure among its features (or coordinates), which is hard to identify a priori. MET attempts to circumvent this problem by assuming the following hypothesis: the observed tabular data features come from a latent graphical model, and the downstream tasks are significantly easier to solve in the latent space. Based on this hypothesis, MET uses random-masking-based encoders to learn a positional embedding for each coordinate, which in turn captures the latent structure between coordinates. Extensive experiments on multiple standard benchmarks for tabular data demonstrate that MET significantly outperforms all current baselines. For example, on the Criteo dataset -- a large-scale click prediction dataset -- MET achieves as much as a 5% improvement over the current state-of-the-art (SOTA), while purely supervised learning based approaches have been able to advance SOTA by at most 1% in the last few years. Furthermore, MET can be >20% more accurate than gradient-boosted decision trees -- considered a SOTA method for the tabular setting -- on multiple benchmarks.
Kushal Majmundar · Sachin Goyal · Praneeth Netrapalli · Prateek Jain
Conditional Contrastive Networks (Poster)
A vast amount of structured information associated with unstructured data, such as images or text, is stored online. This structured information implies different similarity relationships among unstructured data. Recently, contrastive learned embeddings trained on web-scraped unstructured data have been shown to have state-of-the-art performance across computer vision tasks. However, contrastive learning methods are currently able to leverage only a single metric of similarity. In this paper, we propose conditional contrastive networks (CCNs) as a way of using multiple notions of similarity in structured data. Our novel conditional contrastive loss is able to learn multiple disjoint similarity notions by projecting each similarity notion into a different subspace. We show empirically that our CCNs perform better than single-label trained cross-entropy networks, single-label trained supervised-contrastive networks, multi-task trained cross-entropy networks, and previously proposed conditional similarity networks on both the attributes on which it was trained and on unseen attributes.
Emily Mu · John Guttag
Structural Embedding of Data Files with MAGRITTE (Poster)
Large amounts of tabular data are encoded in plain-text files, e.g., CSV, TSV and TXT. Plain-text formats allow freedom of expression and encoding, fostering the use of non-standard syntaxes and dialects. Before analyzing the content of such files, it is necessary to understand their structure, e.g., recognize their dialect, extract metadata, or detect tables. Previous work on table representation focused on learning the semantics of data cells, with the assumption that the syntactical properties of a file are known to end users. We propose MAGRiTTE, an approach to synthetically represent the structural features of a data file. MAGRiTTE is a self-supervised machine learning model trained to learn structural embeddings from data files. The architecture of MAGRiTTE is composed of two components. The first is a transformer-encoder architecture, based on BERT and pre-trained to learn row embeddings. The second is a DCGAN-autoencoder trained to produce file-level embeddings. To pre-train the transformer architecture on structural features, we propose two core adaptations: a novel tokenization stage and specialized training objectives. To abstract the data content of a file, and train the transformer architecture on structural features, we introduce "pattern tokenization": assuming that structural properties are identifiable through special characters, we reduce all alphanumeric characters to a set of few general patterns. After tokenization, the rows of the input files are split on newline characters and a percentage of the special character tokens is masked before feeding them to the row encoder model. The row-transformer model is then trained on two objectives: reconstructing the masked tokens, and identifying whether pairs of rows belong to the same file. The row embeddings produced by this model are then used as the input for the file embedding stage of MAGRiTTE. In this stage, the generator and discriminator models are trained in an adversarial fashion on the row embedding feature maps. To obtain a file-wise embedding vector, we concatenate the output features produced by all convolutional stages of the discriminator. We shall evaluate the effectiveness of our learned structural representations on three tasks to analyze unseen data files: (1) fine-grained dialect detection, i.e., identifying the structural role of characters within rows; (2) line and cell classification, i.e., identifying metadata, comments, and data within a file; (3) table extraction, i.e., identifying the boundaries of tabular regions. We compare the use of MAGRiTTE encodings with state-of-the-art approaches that were specifically designed for these tasks. In future work, we aim at using MAGRiTTE embeddings to automatically perform structural data preparation, e.g., extracting tables, removing unwanted rows, or changing file dialects.
Gerardo Vitagliano · Mazhar Hameed · Felix Naumann
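A minimal sketch of the pattern tokenization idea described above (the exact pattern alphabet used by MAGRiTTE may differ): alphanumeric runs are collapsed into generic symbols while structural characters such as delimiters and quotes are preserved.

```python
# Hypothetical sketch: abstract away cell content, keep structural characters.
import re

def pattern_tokenize(row: str) -> str:
    row = re.sub(r"[A-Za-z]+", "A", row)   # any run of letters -> A
    row = re.sub(r"[0-9]+", "9", row)      # any run of digits  -> 9
    return row                             # delimiters and quotes are preserved

print(pattern_tokenize('id;"name";price'))        # A;"A";A
print(pattern_tokenize('12;"Alice Smith";3.99'))  # 9;"A A";9.9
```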
Active Learning with Table Language Models (Poster)
Despite recent advancements in table language model research, their real-world application is still challenging. In industry, there is an abundance of tables found in spreadsheets, but acquiring substantial amounts of labels is expensive, since only experts can annotate the often highly technical and domain-specific tables. Active learning could potentially reduce labeling costs; however, so far there is no work on active learning in conjunction with table language models. In this paper, we investigate different query strategies in a real-world industrial table language model use case. Our results show that there is potential for improvement and some fundamental questions to be addressed.
Martin Ringsquandl · Aneta Koleva
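For illustration, a minimal sketch of one common query strategy that such a study might include (uncertainty sampling via prediction entropy); this is a generic example, not the paper's setup.

```python
# Hypothetical sketch: pick the most uncertain unlabeled examples for annotation.
import numpy as np

def entropy_sampling(probas: np.ndarray, budget: int) -> np.ndarray:
    """probas: (n_unlabeled, n_classes) predicted probabilities from the table LM."""
    entropy = -(probas * np.log(probas + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]       # indices of the most uncertain examples

probas = np.random.dirichlet(np.ones(4), size=100)   # stand-in model predictions
to_label = entropy_sampling(probas, budget=10)        # send these to the expert annotator
print(to_label)
```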
Self-supervised Representation Learning Across Sequential and Tabular Features Using Transformers (Poster)
Machine learning models used for predictive modeling tasks spanning personalization, recommender systems, ad response prediction, fraud detection, etc. typically require a variety of tabular as well as sequential activity features about the user. For tasks like click-through or conversion (purchase) rate prediction, where labeled data is available at scale, popular methods use deep sequence models (sometimes pre-trained) to encode sequential inputs, followed by concatenation with tabular features and optimization of a supervised training objective. For tasks like bot and fraud detection, where labeled data is sparse and incomplete, the typical approach is to use self-supervision to learn user embeddings from their historical activity sequence. However, these models are not equipped to handle tabular input features during self-supervised learning. In this paper, we propose a novel Transformer architecture that can jointly learn embeddings of both sequential and tabular input features. Our model learns self-supervised user embeddings using a masked token prediction objective on a rich variety of features without relying on any labeled data. We demonstrate that user embeddings generated by the proposed technique successfully encode information from a combination of sequential and tabular features, improving AUC-ROC for linear separability on a downstream task label by 5% over embeddings generated using sequential features only. We also benchmark the efficacy of the embeddings on the bot detection task for a large-scale digital advertising program, where the proposed model improves recall over known bots by 10% over the sequential-only baseline at the same False Positive Rate (FPR).
Rajat Agarwal · Anand Muralidhar · Agniva Som · Hemant Kowshik
Self Supervised Pre-training for Large Scale Tabular Data (Poster)
In this paper, we tackle the problem of self-supervised pre-training of deep neural networks for large-scale tabular data in online advertising. Self-supervised learning has recently been very effective for pre-training representations in domains such as vision and natural language processing. But unlike these, designing self-supervised learning tasks for tabular data is inherently challenging. Tabular data can consist of various types of data with high cardinality and wide ranges of feature values, especially in a large-scale real-world setting. To that end, we propose a self-supervised pre-training strategy that utilizes Manifold Mixup to produce data augmentations for tabular data and performs reconstruction on these augmentations using noise contrastive estimation and mean absolute error losses, both of which are particularly suitable for large-scale tabular data. We demonstrate its efficacy by evaluating it on the problem of click fraud detection on ads, obtaining an improvement of 9% over a supervised learning baseline and 4% over a contrastive learning experiment.
Sharad Chitlangia · Anand Muralidhar · Rajat Agarwal
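A minimal sketch (not the authors' code) of using Manifold Mixup to create augmentations in representation space: hidden states of two rows are mixed with a Beta-distributed coefficient. The encoder, mixing layer, and coefficients here are illustrative simplifications (Manifold Mixup more generally mixes at a randomly chosen intermediate layer).

```python
# Hypothetical sketch: mixup in hidden space as an augmentation for tabular SSL.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 64))

def manifold_mixup(x, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    h = encoder(x)                        # hidden representations of the batch
    perm = torch.randperm(x.size(0))
    return lam * h + (1 - lam) * h[perm]  # mixed hidden states used as augmentations

x = torch.randn(32, 20)                   # a batch of 32 rows with 20 features
aug = manifold_mixup(x)
print(aug.shape)                           # reconstruct / contrast against encoder(x)
```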
STable: Table Generation Framework for Encoder-Decoder Models (Poster)
The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this observation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood of a table's content across all random permutations of the factorization order. During content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and to avoid the substantial error accumulation that other sequential models are prone to. Experiments demonstrate the high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%.
Michał Pietruszka · Michał Turski · Łukasz Borchmann · Tomasz Dwojak · Gabriela Pałka · Karolina Szyndler · Dawid Jurkiewicz · Łukasz Garncarek
Tabular Data Generation: Can We Fool XGBoost? (Poster)
If by 'realistic' we mean indistinguishable from (fresh) real data, generating realistic synthetic tabular data is far from a trivial task. We present here a series of experiments showing that strong classifiers like XGBoost are able to distinguish state-of-the-art synthetic data from fresh real data almost perfectly on several tabular datasets. By studying the important features of these classifiers, we observe that mixed-type (continuous/discrete) and ill-distributed numerical columns are the ones that are least faithfully reproduced. We hence propose and experiment with a series of automated reversible column-wise encoders which improve the realism of the generators.
EL Hacen Zein · Tanguy Urvoy
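As an illustration of the real-vs-synthetic detection test described above, a small sketch with synthetic stand-in data (not the paper's datasets or generators): an XGBoost classifier is trained to separate fresh real rows from generated rows, and an AUC near 0.5 would indicate the generator is hard to distinguish from real data.

```python
# Hypothetical sketch: discriminative test of synthetic tabular data realism.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

real = np.random.normal(size=(5000, 8))                  # stand-in for fresh real rows
synthetic = np.random.normal(loc=0.05, size=(5000, 8))   # stand-in for generated rows

X = np.vstack([real, synthetic])
y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = xgb.XGBClassifier(n_estimators=200, max_depth=6)
clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"real-vs-synthetic AUC: {auc:.3f}")               # ~1.0 means easily detectable
# clf.feature_importances_ points at the columns that give the synthetic data away.
```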