Timezone: »
We consider the use of automated supervised learning systems for data tables that not only contain numeric/categorical columns, but one or more text fields as well. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly-available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in: sample size, problem types (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), as well as how the predictive signal is decomposed between text vs. numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. Compared with human data science teams, the fully automated methodology that performed best on our benchmark also manages to rank 1st place when fit to the raw text/tabular data in two MachineHack prediction competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.
Author Information
Xingjian Shi (Amazon Web Services)
Jonas Mueller (Amazon Web Services)
Nick Erickson (Amazon Web Services)
I am the lead developer and co-author of AutoGluon: https://github.com/awslabs/autogluon/
Mu Li (Amazon)
Alexander Smola (Amazon)
**AWS Machine Learning**
More from the Same Authors
-
2021 : Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks »
Curtis Northcutt · Anish Athalye · Jonas Mueller -
2021 Spotlight: Mixture Proportion Estimation and PU Learning:A Modern Approach »
Saurabh Garg · Yifan Wu · Alexander Smola · Sivaraman Balakrishnan · Zachary Lipton -
2021 : Robust Reinforcement Learning for Shifting Dynamics During Deployment »
Samuel Stanton · Rasool Fakoor · Jonas Mueller · Andrew Gordon Wilson · Alexander Smola -
2022 : RLSBench: A Large-Scale Empirical Study of Domain Adaptation Under Relaxed Label Shift »
Saurabh Garg · Nick Erickson · James Sharpnack · Alexander Smola · Sivaraman Balakrishnan · Zachary Lipton -
2022 : Benchmarking Robustness under Distribution Shift of Multimodal Image-Text Models »
Jielin Qiu · Yi Zhu · Xingjian Shi · Zhiqiang Tang · DING ZHAO · Bo Li · Mu Li -
2022 : Utilizing supervised models to infer consensus labels and their quality from data with multiple annotators »
Hui Wen Goh · Ulyana Tkachenko · Jonas Mueller -
2022 Spotlight: Lightning Talks 4A-3 »
Zhihan Gao · Yabin Wang · Xingyu Qu · Luziwei Leng · Mingqing Xiao · Bohan Wang · Yu Shen · Zhiwu Huang · Xingjian Shi · Qi Meng · Yupeng Lu · Diyang Li · Qingyan Meng · Kaiwei Che · Yang Li · Hao Wang · Huishuai Zhang · Zongpeng Zhang · Kaixuan Zhang · Xiaopeng Hong · Xiaohan Zhao · Di He · Jianguo Zhang · Yaofeng Tu · Bin Gu · Yi Zhu · Ruoyu Sun · Yuyang (Bernie) Wang · Zhouchen Lin · Qinghu Meng · Wei Chen · Wentao Zhang · Bin CUI · Jie Cheng · Zhi-Ming Ma · Mu Li · Qinghai Guo · Dit-Yan Yeung · Tie-Yan Liu · Jianxing Liao -
2022 Spotlight: Earthformer: Exploring Space-Time Transformers for Earth System Forecasting »
Zhihan Gao · Xingjian Shi · Hao Wang · Yi Zhu · Yuyang (Bernie) Wang · Mu Li · Dit-Yan Yeung -
2022 Poster: Adaptive Interest for Emphatic Reinforcement Learning »
Martin Klissarov · Rasool Fakoor · Jonas Mueller · Kavosh Asadi · Taesup Kim · Alexander Smola -
2022 Poster: Faster Deep Reinforcement Learning with Slower Online Network »
Kavosh Asadi · Rasool Fakoor · Omer Gottesman · Taesup Kim · Michael Littman · Alexander Smola -
2022 Poster: Graph Reordering for Cache-Efficient Near Neighbor Search »
Benjamin Coleman · Santiago Segarra · Alexander Smola · Anshumali Shrivastava -
2022 Poster: Earthformer: Exploring Space-Time Transformers for Earth System Forecasting »
Zhihan Gao · Xingjian Shi · Hao Wang · Yi Zhu · Yuyang (Bernie) Wang · Mu Li · Dit-Yan Yeung -
2022 Expo Workshop: AutoGluon: Empowering (MultiModal) AutoML for the next 10 Million users »
Xingjian Shi · Nick Erickson · Caner Turkmen · Yi Zhu -
2021 Poster: Mixture Proportion Estimation and PU Learning:A Modern Approach »
Saurabh Garg · Yifan Wu · Alexander Smola · Sivaraman Balakrishnan · Zachary Lipton -
2021 Poster: Deep Explicit Duration Switching Models for Time Series »
Abdul Fatir Ansari · Konstantinos Benidis · Richard Kurle · Ali Caner Turkmen · Harold Soh · Alexander Smola · Bernie Wang · Tim Januschowski -
2021 Poster: Blending Anti-Aliasing into Vision Transformer »
Shengju Qian · Hao Shao · Yi Zhu · Mu Li · Jiaya Jia -
2021 Poster: Continuous Doubly Constrained Batch Reinforcement Learning »
Rasool Fakoor · Jonas Mueller · Kavosh Asadi · Pratik Chaudhari · Alexander Smola -
2021 Poster: Progressive Coordinate Transforms for Monocular 3D Object Detection »
Li Wang · Li Zhang · Yi Zhu · Zhi Zhang · Tong He · Mu Li · Xiangyang Xue -
2021 Poster: Deep Extended Hazard Models for Survival Analysis »
Qixian Zhong · Jonas Mueller · Jane-Ling Wang -
2021 Poster: Overinterpretation reveals image classification model pathologies »
Brandon Carter · Siddhartha Jain · Jonas Mueller · David Gifford -
2021 : Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks »
Curtis Northcutt · Anish Athalye · Jonas Mueller -
2020 Poster: Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation »
Rasool Fakoor · Jonas Mueller · Nick Erickson · Pratik Chaudhari · Alexander Smola -
2020 Poster: CSER: Communication-efficient SGD with Error Reset »
Cong Xie · Shuai Zheng · Sanmi Koyejo · Indranil Gupta · Mu Li · Haibin Lin -
2019 : Invited Talk - Alexander J. Smola - Sets and symmetries »
Alexander Smola -
2017 : TBA11 »
Alexander Smola -
2017 Oral: Deep Sets »
Manzil Zaheer · Satwik Kottur · Siamak Ravanbakhsh · Barnabas Poczos · Ruslan Salakhutdinov · Alexander Smola -
2017 Poster: Deep Sets »
Manzil Zaheer · Satwik Kottur · Siamak Ravanbakhsh · Barnabas Poczos · Ruslan Salakhutdinov · Alexander Smola -
2016 : Contributed Talk 1: Learning Optimal Interventions »
Jonas Mueller -
2016 Poster: Variance Reduction in Stochastic Gradient Langevin Dynamics »
Kumar Avinava Dubey · Sashank J. Reddi · Sinead Williamson · Barnabas Poczos · Alexander Smola · Eric Xing -
2016 Poster: Proximal Stochastic Methods for Nonsmooth Nonconvex Finite-Sum Optimization »
Sashank J. Reddi · Suvrit Sra · Barnabas Poczos · Alexander Smola -
2015 : Scaling Machine Learning »
Alexander Smola -
2015 Workshop: Nonparametric Methods for Large Scale Representation Learning »
Andrew G Wilson · Alexander Smola · Eric Xing -
2015 Poster: Fast and Guaranteed Tensor Decomposition via Sketching »
Yining Wang · Hsiao-Yu Tung · Alexander Smola · Anima Anandkumar -
2015 Spotlight: Fast and Guaranteed Tensor Decomposition via Sketching »
Yining Wang · Hsiao-Yu Tung · Alexander Smola · Anima Anandkumar -
2015 Poster: Principal Differences Analysis: Interpretable Characterization of Differences between Distributions »
Jonas Mueller · Tommi Jaakkola -
2015 Poster: On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants »
Sashank J. Reddi · Ahmed Hefny · Suvrit Sra · Barnabas Poczos · Alexander Smola -
2014 Poster: Communication Efficient Distributed Machine Learning with the Parameter Server »
Mu Li · David G Andersen · Alexander Smola · Kai Yu -
2014 Poster: Spectral Methods for Indian Buffet Process Inference »
Hsiao-Yu Tung · Alexander Smola -
2013 Workshop: Topic Models: Computation, Application, and Evaluation »
David Mimno · Amr Ahmed · Jordan Boyd-Graber · Ankur Moitra · Hanna Wallach · Alexander Smola · David Blei · Anima Anandkumar -
2013 Workshop: Randomized Methods for Machine Learning »
David Lopez-Paz · Quoc V Le · Alexander Smola -
2013 Workshop: Modern Nonparametric Methods in Machine Learning »
Arthur Gretton · Mladen Kolar · Samory Kpotufe · John Lafferty · Han Liu · Bernhard Schölkopf · Alexander Smola · Rob Nowak · Mikhail Belkin · Lorenzo Rosasco · peter bickel · Yue Zhao -
2013 Poster: Variance Reduction for Stochastic Gradient Optimization »
Chong Wang · Xi Chen · Alexander Smola · Eric Xing -
2012 Workshop: Confluence between Kernel Methods and Graphical Models »
Le Song · Arthur Gretton · Alexander Smola -
2012 Session: Oral Session 10 »
Alexander Smola -
2012 Poster: Learning Networks of Heterogeneous Influence »
Nan Du · Le Song · Alexander Smola · Ming Yuan -
2012 Poster: FastEx: Fast Clustering with Exponential Families »
Amr Ahmed · Sujith Ravi · Shravan M Narayanamurthy · Alexander Smola -
2012 Spotlight: Learning Networks of Heterogeneous Influence »
Nan Du · Le Song · Alexander Smola · Ming Yuan -
2011 Workshop: Big Learning: Algorithms, Systems, and Tools for Learning at Scale »
Joseph E Gonzalez · Sameer Singh · Graham Taylor · James Bergstra · Alice Zheng · Misha Bilenko · Yucheng Low · Yoshua Bengio · Michael Franklin · Carlos Guestrin · Andrew McCallum · Alexander Smola · Michael Jordan · Sugato Basu -
2011 Tutorial: Graphical Models for the Internet »
Amr Ahmed · Alexander Smola -
2010 Workshop: Challenges of Data Visualization »
Barbara Hammer · Laurens van der Maaten · Fei Sha · Alexander Smola -
2010 Poster: Word Features for Latent Dirichlet Allocation »
James Petterson · Alexander Smola · Tiberio Caetano · Wray L Buntine · Shravan M Narayanamurthy -
2010 Poster: Optimal Web-Scale Tiering as a Flow Problem »
Gilbert Leung · Novi Quadrianto · Alexander Smola · Kostas Tsioutsiouliklis -
2010 Poster: Multitask Learning without Label Correspondences »
Novi Quadrianto · Alexander Smola · Tiberio Caetano · S.V.N. Vishwanathan · James Petterson -
2010 Poster: Parallelized Stochastic Gradient Descent »
Martin A Zinkevich · Markus Weimer · Alexander Smola · Lihong Li -
2009 Workshop: Large-Scale Machine Learning: Parallelism and Massive Datasets »
Alexander Gray · Arthur Gretton · Alexander Smola · Joseph E Gonzalez · Carlos Guestrin -
2009 Poster: Slow Learners are Fast »
Martin A Zinkevich · Alexander Smola · John Langford -
2009 Poster: Distribution Matching for Transduction »
Novi Quadrianto · James Petterson · Alexander Smola -
2008 Poster: Kernelized Sorting »
Novi Quadrianto · Le Song · Alexander Smola -
2008 Poster: Kernel Measures of Independence for non-iid Data »
Xinhua Zhang · Le Song · Arthur Gretton · Alexander Smola -
2008 Spotlight: Kernelized Sorting »
Novi Quadrianto · Le Song · Alexander Smola -
2008 Spotlight: Kernel Measures of Independence for non-iid Data »
Xinhua Zhang · Le Song · Arthur Gretton · Alexander Smola -
2008 Poster: Tighter Bounds for Structured Estimation »
Olivier Chapelle · Chuong B Do · Quoc V Le · Alexander Smola · Choon Hui Teo -
2008 Poster: Robust Near-Isometric Matching via Structured Learning of Graphical Models »
Julian J McAuley · Tiberio Caetano · Alexander Smola -
2007 Workshop: Representations and Inference on Probability Distributions »
Kenji Fukumizu · Arthur Gretton · Alexander Smola -
2007 Poster: Convex Learning with Invariances »
Choon Hui Teo · Amir Globerson · Sam T Roweis · Alexander Smola -
2007 Spotlight: A Kernel Statistical Test of Independence »
Arthur Gretton · Kenji Fukumizu · Choon Hui Teo · Le Song · Bernhard Schölkopf · Alexander Smola -
2007 Spotlight: Bundle Methods for Machine Learning »
Alexander Smola · Vishwanathan S V N · Quoc V Le -
2007 Poster: COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking »
Markus Weimer · Alexandros Karatzoglou · Quoc V Le · Alexander Smola -
2007 Oral: Colored Maximum Variance Unfolding »
Le Song · Alexander Smola · Karsten Borgwardt · Arthur Gretton -
2007 Poster: Colored Maximum Variance Unfolding »
Le Song · Alexander Smola · Karsten Borgwardt · Arthur Gretton -
2007 Poster: A Kernel Statistical Test of Independence »
Arthur Gretton · Kenji Fukumizu · Choon Hui Teo · Le Song · Bernhard Schölkopf · Alexander Smola -
2007 Poster: Bundle Methods for Machine Learning »
Alexander Smola · Vishwanathan S V N · Quoc V Le -
2007 Spotlight: COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking »
Markus Weimer · Alexandros Karatzoglou · Quoc V Le · Alexander Smola -
2007 Demonstration: Elefant »
Kishor Gawande · Alexander Smola · Vishwanathan S V N · Li Cheng · Simon A Guenter -
2007 Spotlight: Convex Learning with Invariances »
Choon Hui Teo · Amir Globerson · Sam T Roweis · Alexander Smola -
2006 Poster: A Kernel Method for the Two-Sample-Problem »
Arthur Gretton · Karsten Borgwardt · Malte J Rasch · Bernhard Schölkopf · Alexander Smola -
2006 Poster: Correcting Sample Selection Bias by Unlabeled Data »
Jiayuan Huang · Alexander Smola · Arthur Gretton · Karsten Borgwardt · Bernhard Schölkopf -
2006 Spotlight: Correcting Sample Selection Bias by Unlabeled Data »
Jiayuan Huang · Alexander Smola · Arthur Gretton · Karsten Borgwardt · Bernhard Schölkopf -
2006 Talk: A Kernel Method for the Two-Sample-Problem »
Arthur Gretton · Karsten Borgwardt · Malte J Rasch · Bernhard Schölkopf · Alexander Smola