NeurIPS Poster Taming Fat-Tailed (“Heavier-Tailed” with Potentially Infinite Variance) Noise in Federated Learning

Poster

Taming Fat-Tailed (“Heavier-Tailed” with Potentially Infinite Variance) Noise in Federated Learning

Haibo Yang · Peiwen Qiu · Jia Liu

Hall J (level 1) #821

Keywords: [ Stochastic Gradient Descent ] [ federated learning ] [ heavy-tail ] [ Optimization ] [ clipping ]

[ Abstract ]

[ Paper] [ Poster] [ OpenReview]

Abstract: In recent years, federated learning (FL) has emerged as an important distributed machine learning paradigm to collaboratively learn a global model with multiple clients, while keeping data local and private. However, a key assumption in most existing works on FL algorithms' convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e.,

heavier-tailed'' with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap in this paper by proposing an algorithmic framework called

$\mathsf{FAT}$ -

$\mathsf{Clipping}~$ (\ul{f}ederated \ul{a}veraging with \ul{t}wo-sided learning rates and \ul{clipping}), which contains two variants:

$\mathsf{FAT}$ -

$\mathsf{Clipping}~$ per-round (

$\mathsf{FAT}$ -

$\mathsf{Clipping}$ -

$\mathsf{PR}$ ) and

$\mathsf{FAT}$ -

$\mathsf{Clipping}~$ per-iteration (

$\mathsf{FAT}$ -

$\mathsf{Clipping}$ -

$\mathsf{PI}$ ). Specifically, for the largest

$\alpha \in (1,2]$ such that the fat-tailed noise in FL still has a bounded

$\alpha$ -moment, we show that both variants achieve

$\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and

$\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where

$m$ and

$T$ are the numbers of clients and communication rounds. Moreover, at the expense of more clipping operations compared to

$\mathsf{FAT}$ -

$\mathsf{Clipping}$ -

$\mathsf{PR}$ ,

$\mathsf{FAT}$ -

$\mathsf{Clipping}$ -

$\mathsf{PI}~$ further enjoys a linear speedup effect with respect to the number of local updates at each client and being lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.

Chat is not available.