Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression
Tingkai Yan · Haodong Wen · Binghui Li · Kairong Luo · Wenguang Chen · Kaifeng Lyu
Abstract
Large Language Model (LLM) training often processes vast text corpora in a single pass, leaving much available data underutilized. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws. Concretely, given $K$-epoch training on $N$ samples, how many fresh samples would one-pass training require to match the same performance? We quantify this using the \textit{effective reuse rate} of the data, $E(K, N)$, which we define as the factor by which the dataset must grow under one-pass training to match the test loss of multi-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in two linear regression settings: (1) when the problem is strongly convex, we show that $E(K, N)$ grows proportionally to $\log N$ and saturates to $K$ when $\log N \gg K$; (2) for a class of data distributions with power-law Hessian spectrum, $E(K, N)$ exhibits similar behavior but grows at a different rate before saturation. These theoretical findings complement a recent empirical study by [Muennighoff et al. (2023)](https://arxiv.org/abs/2305.16264), which found that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, \textit{i.e.}, $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal how this behavior depends on dataset size and data distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
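To make the definition of the effective reuse rate concrete, below is a minimal simulation sketch (not from the paper; the data model, step size, and search grid are illustrative assumptions): it runs SGD for $K$ epochs on $N$ samples of a synthetic strongly convex linear-regression problem, then searches for the smallest one-pass fresh-data budget $N'$ that matches the multi-epoch test loss, giving $E(K, N) \approx N'/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20                       # problem dimension (illustrative)
w_star = rng.normal(size=d)  # ground-truth regressor

def sample(n):
    """Gaussian covariates, noisy linear labels (assumed data model)."""
    X = rng.normal(size=(n, d))
    y = X @ w_star + 0.1 * rng.normal(size=n)
    return X, y

def sgd(X, y, epochs, lr=0.01):
    """Single-sample SGD on squared loss, reusing the data `epochs` times."""
    w = np.zeros(d)
    n = len(y)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x^T w - y)^2
            w -= lr * g
    return w

def test_loss(w, n_test=100_000):
    """Monte-Carlo estimate of the population squared loss."""
    X, y = sample(n_test)
    return np.mean((X @ w - y) ** 2)

# Multi-epoch run: K epochs over N samples.
K, N = 4, 2_000
X, y = sample(N)
loss_multi = test_loss(sgd(X, y, epochs=K))

# One-pass runs with growing fresh-data budgets; the smallest budget that
# matches the multi-epoch loss yields the effective reuse rate E(K, N) = N'/N.
for N_fresh in range(N, K * N + 1, N // 2):
    Xf, yf = sample(N_fresh)
    if test_loss(sgd(Xf, yf, epochs=1)) <= loss_multi:
        print(f"E({K}, {N}) ~ {N_fresh / N:.2f}")
        break
```

The printed ratio is a noisy Monte-Carlo estimate; averaging over several random seeds and refining the search grid would tighten it, but the structure mirrors the quantity $E(K, N)$ defined in the abstract.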