Contributed Talk in Workshop: Machine Learning with Guarantees
Vatsal Sharan, "Sample Amplification: Increasing Dataset Size even when Learning is Impossible"
Abstract:
Given data drawn from an unknown distribution $D$, to what extent is it possible to "amplify" this dataset and faithfully output a larger set of samples that appear to have been drawn from $D$? We formalize this question as follows: an $(n, m)$ amplification procedure takes as input $n$ independent draws from an unknown distribution $D$ and outputs a set of $m > n$ "samples" that must be indistinguishable from $m$ samples drawn i.i.d. from $D$. We consider this sample amplification problem in two fundamental settings: the case where $D$ is an arbitrary discrete distribution supported on $k$ elements, and the case where $D$ is a $d$-dimensional Gaussian with unknown mean and fixed covariance matrix. Perhaps surprisingly, we show that a valid amplification procedure exists in both settings, even in the regime where the size of the input dataset, $n$, is significantly smaller than what would be necessary to learn $D$ to non-trivial accuracy. We also show that our procedures are optimal up to constant factors. Beyond these results, we formalize a number of intriguing directions for future research along these lines.
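To make the setup concrete, here is a minimal Python sketch of a naive amplifier for the Gaussian-with-unknown-mean case: it simply returns the $n$ input samples together with $m - n$ fresh draws centered at the empirical mean. The function name `amplify_gaussian` and the choice of identity covariance are assumptions made for illustration; this is not the procedure from the talk, and such a naive approach would not, in general, meet the indistinguishability requirement.

```python
import numpy as np

def amplify_gaussian(samples: np.ndarray, m: int, rng=None) -> np.ndarray:
    """Naive (n, m) 'amplification' sketch for N(mu, I_d) with unknown mean.

    Returns the n input samples together with m - n fresh draws centered
    at the empirical mean. Illustrates the problem setup only; it is not
    the procedure from the talk.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = samples.shape
    assert m > n, "amplification requires m > n"
    mu_hat = samples.mean(axis=0)                       # empirical mean of the n draws
    extra = rng.standard_normal((m - n, d)) + mu_hat    # m - n new "samples"
    return np.vstack([samples, extra])

# Example: amplify n = 50 draws from a 10-dimensional Gaussian to m = 60.
rng = np.random.default_rng(0)
true_mu = rng.standard_normal(10)
data = rng.standard_normal((50, 10)) + true_mu
amplified = amplify_gaussian(data, m=60, rng=rng)
print(amplified.shape)  # (60, 10)
```

An amplification procedure is judged against a distinguisher that sees all $m$ output points jointly, which is why naive resampling around $\hat{\mu}$ falls short: the new points are correlated with the originals through the shared empirical mean.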