Data subset selection is the problem of finding a small subset of the training data such that a classifier trained on it performs nearly as well as one trained on the full training set. A variety of sophisticated algorithms have been proposed specifically for this problem. A closely related problem is active learning, developed for semi-supervised learning. The key step of active learning is to identify an important subset of unlabeled data by making use of the currently available labeled data. This paper starts with a simple observation: any off-the-shelf active learning algorithm can be applied to data subset selection. We pick a small random subset of the data and pretend that it is the only labeled data and that the rest is unlabeled. Under this pretense, any off-the-shelf active learning algorithm can be run as is. After each selection step, we reveal the labels of the selected samples (as if they had been labeled in the original active learning scenario) and continue running the algorithm until the subset reaches the desired size. Surprisingly, we observe that this active learning-based algorithm outperforms all current data subset selection algorithms on the benchmark tasks. We also perform a simple controlled experiment to better understand why this approach works well. We find that it is crucial to balance easy-to-classify and hard-to-classify examples when selecting a subset.
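The selection loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract names neither the active learning algorithm nor the classifier, so margin-based uncertainty sampling and a nearest-centroid classifier are used here as stand-ins. The function names and the parameters `budget`, `init_size`, and `batch_size` are hypothetical.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Stand-in classifier (an assumption): one mean vector per class."""
    # Classes absent from the labeled set keep an infinite centroid,
    # so they are never the nearest class.
    centroids = np.full((n_classes, X.shape[1]), np.inf)
    for c in range(n_classes):
        mask = y == c
        if mask.any():
            centroids[c] = X[mask].mean(axis=0)
    return centroids


def margins(X, centroids):
    """Gap between the two smallest squared distances to class centroids.

    A small gap means the sample is hard to classify.
    """
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    dists.sort(axis=1)
    return dists[:, 1] - dists[:, 0]


def select_subset(X, y, budget, init_size=10, batch_size=5, seed=0):
    """Pick `budget` training indices by running active learning on labeled data."""
    rng = np.random.default_rng(seed)
    n_classes = int(y.max()) + 1

    # Step 1: pretend a small random subset is the only labeled data.
    is_selected = np.zeros(len(X), dtype=bool)
    is_selected[rng.choice(len(X), size=init_size, replace=False)] = True

    while is_selected.sum() < budget:
        # Step 2: train on the currently "labeled" samples only.
        centroids = fit_centroids(X[is_selected], y[is_selected], n_classes)

        # Step 3: score the "unlabeled" pool and take the most uncertain samples.
        pool = np.flatnonzero(~is_selected)
        k = min(batch_size, budget - int(is_selected.sum()))
        picked = pool[np.argsort(margins(X[pool], centroids))[:k]]

        # Step 4: reveal the labels of the picked samples by adding them
        # to the labeled set; their true labels in `y` are used next round.
        is_selected[picked] = True

    return np.flatnonzero(is_selected)
```

Calling `select_subset(X, y, budget=30)` on a feature matrix `X` and integer labels `y` (assumed to run from 0 to the number of classes minus one) returns 30 indices into the training set. Pure margin sampling favors hard-to-classify examples, so it does not by itself reproduce the balance between easy and hard examples that the abstract identifies as crucial; any other off-the-shelf active learning criterion could be substituted in step 3.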
Author Information
Dongmin Park (Korea Advanced Institute of Science and Technology)
Dimitris Papailiopoulos (University of Wisconsin-Madison)
Kangwook Lee (UW Madison, Krafton)
More from the Same Authors
- 2022: A Better Way to Decay: Proximal Gradient Training Algorithms for Neural Nets
  Liu Yang · Jifan Zhang · Joseph Shenouda · Dimitris Papailiopoulos · Kangwook Lee · Robert Nowak
- 2023 Poster: Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy
  Dongmin Park · Seola Choi · Doyoung Kim · Hwanjun Song · Jae-Gil Lee
- 2023 Poster: Dissecting Chain-of-Thought: A Study on Compositional In-Context Learning of MLPs
  Yingcong Li · Kartik Sreenivasan · Angeliki Giannou · Dimitris Papailiopoulos · Samet Oymak
- 2023 Poster: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
  Ying Fan · Olivia Watkins · Yuqing Du · Hao Liu · Moonkyung Ryu · Craig Boutilier · Pieter Abbeel · Mohammad Ghavamzadeh · Kangwook Lee · Kimin Lee
- 2022 Poster: Meta-Query-Net: Resolving Purity-Informativeness Dilemma in Open-set Active Learning
  Dongmin Park · Yooju Shin · Jihwan Bang · Youngjun Lee · Hwanjun Song · Jae-Gil Lee
- 2022 Poster: LIFT: Language-Interfaced Fine-Tuning for Non-language Machine Learning Tasks
  Tuan Dinh · Yuchen Zeng · Ruisu Zhang · Ziqian Lin · Michael Gira · Shashank Rajput · Jy-yong Sohn · Dimitris Papailiopoulos · Kangwook Lee
- 2022 Poster: Score-based Generative Modeling Secretly Minimizes the Wasserstein Distance
  Dohyun Kwon · Ying Fan · Kangwook Lee
- 2022 Poster: Rare Gems: Finding Lottery Tickets at Initialization
  Kartik Sreenivasan · Jy-yong Sohn · Liu Yang · Matthew Grinde · Alliot Nagle · Hongyi Wang · Eric Xing · Kangwook Lee · Dimitris Papailiopoulos
- 2021 Poster: Task-Agnostic Undesirable Feature Deactivation Using Out-of-Distribution Data
  Dongmin Park · Hwanjun Song · Minseok Kim · Jae-Gil Lee
- 2021 Poster: An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks
  Shashank Rajput · Kartik Sreenivasan · Dimitris Papailiopoulos · Amin Karbasi
- 2020 Poster: Bad Global Minima Exist and SGD Can Reach Them
  Shengchao Liu · Dimitris Papailiopoulos · Dimitris Achlioptas
- 2020 Poster: Attack of the Tails: Yes, You Really Can Backdoor Federated Learning
  Hongyi Wang · Kartik Sreenivasan · Shashank Rajput · Harit Vishwakarma · Saurabh Agarwal · Jy-yong Sohn · Kangwook Lee · Dimitris Papailiopoulos
- 2020 Poster: Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient
  Ankit Pensia · Shashank Rajput · Alliot Nagle · Harit Vishwakarma · Dimitris Papailiopoulos
- 2020 Spotlight: Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient
  Ankit Pensia · Shashank Rajput · Alliot Nagle · Harit Vishwakarma · Dimitris Papailiopoulos
- 2019 Poster: DETOX: A Redundancy-based Framework for Faster and More Robust Gradient Aggregation
  Shashank Rajput · Hongyi Wang · Zachary Charles · Dimitris Papailiopoulos
- 2018 Poster: The Effect of Network Width on the Performance of Large-batch Training
  Lingjiao Chen · Hongyi Wang · Jinman Zhao · Dimitris Papailiopoulos · Paraschos Koutris
- 2018 Poster: ATOMO: Communication-efficient Learning via Atomic Sparsification
  Hongyi Wang · Scott Sievert · Shengchao Liu · Zachary Charles · Dimitris Papailiopoulos · Stephen Wright
- 2016 Poster: Cyclades: Conflict-free Asynchronous Machine Learning
  Xinghao Pan · Maximilian Lam · Stephen Tu · Dimitris Papailiopoulos · Ce Zhang · Michael Jordan · Kannan Ramchandran · Christopher Ré · Benjamin Recht
- 2015 Poster: Orthogonal NMF through Subspace Exploration
  Megasthenis Asteris · Dimitris Papailiopoulos · Alex Dimakis
- 2015 Poster: Sparse PCA via Bipartite Matchings
  Megasthenis Asteris · Dimitris Papailiopoulos · Anastasios Kyrillidis · Alex Dimakis
- 2015 Poster: Parallel Correlation Clustering on Big Graphs
  Xinghao Pan · Dimitris Papailiopoulos · Samet Oymak · Benjamin Recht · Kannan Ramchandran · Michael Jordan