One-Sided Matrix Completion from Ultra-Sparse Samples
Hongyang Zhang · Zhenshuo Zhang · Huy Nguyen · Guanghui Lan
Abstract
Matrix completion is a classical problem that has received recurring interest from a wide range of areas. In this paper, we revisit this problem in an ultra-sparse sampling setting, where each entry of an (unknown) $n$ by $d$ matrix $M$ is observed with probability $p = \frac{C}{d}$, for an arbitrary constant $C \ge 2$ (assuming $n \ge d$ without loss of generality). This setting is motivated by applications involving large-scale panel datasets with high sparsity, where the number of rows (users) can be much larger than the number of columns (items). With only a constant number of samples per row, recovering $M$ accurately is impossible; instead, we estimate the row space of $M$, given by the second moment of the row vectors, $T = \frac{1}{n} M^{\top} M$. The empirical second moment computed from the observed matrix suffers from non-random missingness and high sparsity. We design an unbiased estimator by normalizing the nonzero entries of the empirical second moment by their observed frequency, followed by gradient descent to impute the missing entries. The normalization divides a weighted sum of $n$ binomial random variables by the total number of ones among them, which makes the analysis challenging due to nonlinearity and sparsity. We provide estimation and recovery guarantees for our algorithm, showing that it is unbiased for any sampling probability $p$ and incurs low variance. Assuming the row vectors of $M$ are sampled from a rank-$r$ factor model satisfying a standard incoherence condition, we prove that when $n \ge O(\frac{d r^5 \log d}{C^2 \epsilon^2})$, our algorithm recovers $T$ with Frobenius norm error less than $\epsilon^2$. We also use the recovered row space as a sub-procedure to impute the missing entries of $M$. Experiments on both synthetic and real-world data evaluate this approach. When tested on three MovieLens datasets, our approach reduces bias by 88% relative to alternative methods. We also validate the linear sample complexity of $n$ relative to $d$ on synthetic data. On an Amazon reviews dataset with sparsity $10^{-7}$, our approach reduces the recovery error of $T$ by 59% and of $M$ by 38% compared to existing matrix completion methods.
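A minimal NumPy sketch of the two-step estimator described above. The function names, the masked squared-loss objective, and the gradient-descent hyperparameters are our own assumptions for illustration, not the paper's exact algorithm. Here `X` holds the observed entries of $M$ (zeros elsewhere) and `mask` is the 0/1 observation pattern.

```python
import numpy as np

def debiased_second_moment(X, mask):
    """Estimate T = (1/n) M^T M from sparsely observed entries.

    X:    n x d array of observed values (zero where unobserved).
    mask: n x d 0/1 array indicating which entries were observed.
    Entry (j, k) of X^T X sums products over rows where both columns
    were observed; dividing by that co-observation count normalizes
    each entry by its observed frequency, removing the bias from
    non-random missingness.
    """
    counts = mask.T @ mask                 # (j, k): # rows observing both columns
    sums = X.T @ X                         # sum of products over those rows
    T_hat = np.zeros_like(sums, dtype=float)
    seen = counts > 0
    T_hat[seen] = sums[seen] / counts[seen]
    return T_hat, seen

def impute_by_gradient_descent(T_hat, seen, r, steps=2000, lr=1e-2, seed=0):
    """Fill the never-co-observed entries of T_hat with a rank-r fit U U^T,
    found by gradient descent on 0.5 * ||P_seen(U U^T - T_hat)||_F^2
    (an assumed illustrative objective).
    """
    d = T_hat.shape[0]
    U = 0.1 * np.random.default_rng(seed).standard_normal((d, r))
    for _ in range(steps):
        R = (U @ U.T - T_hat) * seen       # residual on observed entries only
        U -= lr * (R + R.T) @ U            # gradient of the masked squared loss
    return np.where(seen, T_hat, U @ U.T)  # keep estimates, impute the rest
```

Note that `counts[j, j]` is simply the number of rows observing column $j$, so the same normalization covers diagonal and off-diagonal entries; gradient descent is needed only for pairs of columns that were never co-observed, which becomes common at sampling rates $p = \frac{C}{d}$.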