Contributed Talks 3: Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression & Can SGD Handle Heavy-Tailed Noise?
Abstract
Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression, Speaker: Haodong Wen
Abstract: Large Language Model (LLM) training often processes vast text corpora in a single pass, leaving much available data underutilized. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws. Concretely, given K-epoch training on N samples, how many fresh samples would one-pass training require to match the same performance? We quantify this using the effective reuse rate of the data, E(K,N), which we define as the factor by which the dataset must grow under one-pass training to match the test loss of multi-epoch training. Our analysis precisely characterizes the scaling behavior of E(K,N) for SGD in two linear regression settings: (1) when the problem is strongly convex, we show that E(K,N) grows proportionally to log N and saturates to K when log N >> K; (2) for a class of data distributions with power-law Hessian spectrum, E(K,N) exhibits similar behavior but grows at a different rate before saturation. These theoretical findings complement a recent empirical study by Muennighoff et al. (2023), which found that training LLMs for up to 4 epochs results in negligible loss differences compared to using fresh data at each step, i.e., E(K,N) approx K for K <= 4 in our notation. Supported by further empirical validation with LLMs, our results reveal how this behavior depends on dataset size and data distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
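The multi-epoch-versus-fresh-data comparison at the heart of this abstract is easy to reproduce in miniature. The sketch below is a toy simulation, not the paper's actual setup: it runs plain SGD on a strongly convex linear regression problem (identity input covariance) for 1 and 4 epochs over the same N samples and compares excess risk, illustrating the regime where repeating data keeps helping (E(K,N) > 1). All function names and hyperparameters here are our own illustrative choices.

```python
import numpy as np

def sgd_linear_regression(X, y, epochs, lr=1e-3, seed=0):
    """Run plain SGD on the squared loss for `epochs` passes over (X, y),
    reshuffling the sample order each epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            grad = (X[i] @ w - y[i]) * X[i]  # per-sample gradient of 0.5*(x.w - y)^2
            w -= lr * grad
    return w

rng = np.random.default_rng(42)
d, n = 5, 2000
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))                 # identity covariance -> strongly convex risk
y = X @ w_star + 0.1 * rng.normal(size=n)   # linear model with label noise

# With identity input covariance, the excess risk is ||w - w*||^2.
loss_1 = np.sum((sgd_linear_regression(X, y, epochs=1) - w_star) ** 2)
loss_4 = np.sum((sgd_linear_regression(X, y, epochs=4) - w_star) ** 2)
print(f"excess risk after 1 epoch: {loss_1:.2e}, after 4 epochs: {loss_4:.2e}")
```

With this small learning rate, one pass over the data is far from convergence, so three additional epochs over the same samples reduce the excess risk substantially, the regime in which the effective reuse rate is well above 1.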
Can SGD Handle Heavy-Tailed Noise?, Speaker: Florian Hübler
Abstract: Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise (common in modern machine learning and reinforcement learning) remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded p-th moments for some p in (1,2], we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: O(epsilon^(-p/(p-1))) and O(epsilon^(-p/(2(p-1)))), respectively. For non-convex objectives under Hölder smoothness, we prove convergence to a stationary point with rate O(epsilon^(-2p/(p-1))), and complement this with a matching lower bound specific to SGD.
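The setting described here can be illustrated with a toy experiment: projected SGD on a one-dimensional strongly convex quadratic, where the gradient noise is symmetric Pareto with tail index 1.8, so its variance is infinite but its p-th moment is finite for any p < 1.8, as in the abstract's assumption. This is our own minimal sketch, not the paper's construction; the step-size schedule and all constants are illustrative choices, not the schedule analyzed in the paper.

```python
import numpy as np

def projected_sgd_heavy_tail(T=50_000, w0=8.0, radius=10.0, alpha=1.8, seed=0):
    """Minimize f(w) = w^2 / 2 with projected SGD under heavy-tailed noise.

    Stochastic gradients are g_t = w_t + n_t, where n_t is symmetric Pareto
    noise with tail index alpha = 1.8: E|n_t|^q is finite only for q < alpha,
    so the variance is infinite. Iterates are projected onto [-radius, radius],
    and the Polyak-style running average of the iterates is returned.
    """
    rng = np.random.default_rng(seed)
    w, avg = w0, 0.0
    for t in range(1, T + 1):
        noise = rng.pareto(alpha) * rng.choice([-1.0, 1.0])  # infinite-variance noise
        grad = w + noise                  # unbiased gradient of w^2 / 2
        eta = 0.5 / t ** (2.0 / 3.0)      # polynomially decaying step size (our choice)
        w = float(np.clip(w - eta * grad, -radius, radius))  # projection step
        avg += (w - avg) / t              # running average of iterates
    return avg

w_avg = projected_sgd_heavy_tail()
print(f"averaged iterate: {w_avg:.3f} (minimizer is 0)")
```

Despite individual gradients having infinite variance, the averaged iterate of plain projected SGD ends up far closer to the minimizer than the starting point, consistent with the abstract's claim that vanilla SGD needs no adaptive modifications to handle such noise.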