
Workshop: 4th Workshop on Self-Supervised Learning: Theory and Practice

Can semi-supervised learning use all the data effectively? A lower bound perspective

Alexandru Tifrea · Gizem Yüce · Amartya Sanyal · Fanny Yang


In the semi-supervised learning (SSL) setting, both labeled and unlabeled datasets are available to the learning algorithm. While prior theoretical and empirical work has established that unlabeled data can improve over the error of supervised learning algorithms, existing theoretical examinations of SSL suggest a limitation: these algorithms might not efficiently leverage labeled data beyond a certain threshold. In this study, we derive a tight lower bound for 2-Gaussian mixture model distributions that exhibits an explicit dependence on the sizes of both the labeled and the unlabeled dataset. Surprisingly, our lower bound indicates that no SSL algorithm can surpass the sample complexities of minimax optimal supervised learning (SL) or unsupervised learning (UL) algorithms, which use exclusively the labeled or the unlabeled dataset, respectively. Nevertheless, even though an improvement in the statistical error rate is unattainable, SSL can still outperform both SL and UL (up to permutation) in terms of absolute error. To this end, we provide evidence that there exist algorithms that provably achieve lower error than both SL and UL algorithms. We validate our theoretical findings through linear classification experiments on synthetic and real-world data.
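The setting described above can be illustrated with a minimal sketch (not the paper's algorithm, and all concrete choices below are assumptions): a symmetric spherical 2-Gaussian mixture where `x ~ N(y*mu, sigma^2 I)` for labels `y in {-1, +1}`. An SL baseline averages `y_i * x_i` over the small labeled set; a UL baseline recovers the mean direction from many unlabeled points via the leading eigenvector of the second-moment matrix, which leaves a sign (permutation) ambiguity; an SSL-style combination uses the few labels only to resolve that sign.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 20, 1.0
mu = np.zeros(d)
mu[0] = 1.0  # assumed true class mean direction

def sample(n):
    """Draw n points from the symmetric 2-GMM: x = y*mu + noise."""
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mu + sigma * rng.standard_normal((n, d))
    return x, y

def error(w, n_test=20000):
    """Test error of the linear classifier sign(w . x)."""
    x, y = sample(n_test)
    return float(np.mean(np.sign(x @ w) != y))

n_lab, n_unlab = 20, 5000          # few labels, many unlabeled points
xl, yl = sample(n_lab)
xu, _ = sample(n_unlab)

# SL: plug-in mean estimate from the labeled set only.
w_sl = (yl[:, None] * xl).mean(axis=0)

# UL: leading eigenvector of the unlabeled second-moment matrix.
# Its sign is arbitrary -- the mixture components are only identified
# up to permutation without labels.
_, vecs = np.linalg.eigh(xu.T @ xu / n_unlab)
w_ul = vecs[:, -1]

# SSL-style combination: unlabeled direction, labels fix the sign.
w_ssl = w_ul * np.sign(w_ul @ w_sl)

err_sl, err_ul, err_ssl = error(w_sl), error(w_ul), error(w_ssl)
print(f"SL: {err_sl:.3f}  UL (sign unresolved): {err_ul:.3f}  SSL: {err_ssl:.3f}")
```

With these (assumed) sample sizes, the SSL classifier typically approaches the accuracy of the well-estimated unlabeled direction while the labels merely break the permutation ambiguity, in line with the abstract's claim that SSL can lower the absolute error even though it cannot beat the minimax SL/UL rates.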
