

Poster

Data Cleansing for Models Trained with SGD

Satoshi Hara · Atsushi Nitanda · Takanori Maehara

East Exhibition Hall B + C #48

Keywords: [ Algorithms -> AutoML; Applications -> Fairness, Accountability, and Transparency; Optimization ] [ Stochastic Optimization ] [ Algorithms ] [ Classification ]


Abstract:

Data cleansing is a common approach to improving the accuracy of machine learning models; however, it requires extensive domain knowledge to identify the influential instances that affect the models. In this paper, we propose an algorithm that identifies influential instances without any domain knowledge. The proposed algorithm cleanses the data automatically, so even non-experts can improve their models. Existing methods require the loss function to be convex and an optimal model to be available, which is not always the case in modern machine learning. To overcome these limitations, we propose a novel approach designed specifically for models trained with stochastic gradient descent (SGD). The proposed method infers the influential instances by retracing the steps of the SGD while incorporating the intermediate models computed in each step. Through experiments, we demonstrate that the proposed method accurately infers the influential instances. Moreover, using MNIST and CIFAR10, we show that the models can be effectively improved by removing the influential instances suggested by the proposed method.
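The core idea above, retracing SGD steps and scoring each instance using the intermediate models, can be sketched as follows. This is a hypothetical, simplified illustration, not the paper's exact estimator: it uses a first-order approximation that ignores Hessian propagation between steps, and all names (`trace`, `grad_logistic`, the toy data) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (labels in {0, 1}).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_logistic(w, Xb, yb):
    # Gradient of the mean logistic loss over a batch.
    p = sigmoid(Xb @ w)
    return Xb.T @ (p - yb) / len(yb)

# --- Train with SGD, recording the intermediate model and batch at each step. ---
w = np.zeros(d)
eta, batch_size, steps = 0.1, 20, 100
trace = []  # (intermediate model, batch indices) per step
for _ in range(steps):
    idx = rng.choice(n, size=batch_size, replace=False)
    trace.append((w.copy(), idx))
    w -= eta * grad_logistic(w, X[idx], y[idx])

# --- Retrace the SGD steps to score each instance's influence. ---
# First-order assumption: the influence of instance i on the final loss is
# approximated by summing, over the steps whose batch contained i, the
# alignment between i's gradient at that step's intermediate model and the
# full-data gradient at the final model.
g_final = grad_logistic(w, X, y)
influence = np.zeros(n)
for w_t, idx in trace:
    p = sigmoid(X[idx] @ w_t)
    per_grad = X[idx] * (p - y[idx])[:, None]  # per-instance gradients
    influence[idx] += (eta / batch_size) * per_grad @ g_final

# Instances with large positive scores pushed the final loss up; removing
# them is a candidate data-cleansing step.
harmful = np.argsort(influence)[::-1][:10]
print(influence.shape, harmful[:3])
```

Recording the intermediate models during training is what makes this estimator applicable to non-convex models: no optimality or convexity of the final model is assumed, only the realized SGD trajectory.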
