NeurIPS Poster Stochastic Approximation Approaches to Group Distributionally Robust Optimization

Lijun Zhang · Peng Zhao · Zhen-Hua Zhuang · Tianbao Yang · Zhi-Hua Zhou

Great Hall & Hall B1+B2 (level 1) #1208

Abstract: This paper investigates group distributionally robust optimization (GDRO), with the goal of learning a model that performs well over $m$ different distributions. First, we formulate GDRO as a stochastic convex-concave saddle-point problem, and demonstrate that stochastic mirror descent (SMD), using $m$ samples in each iteration, achieves an $O(m (\log m)/\epsilon^2)$ sample complexity for finding an $\epsilon$-optimal solution, which matches the $\Omega(m/\epsilon^2)$ lower bound up to a logarithmic factor. Then, we make use of techniques from online learning to reduce the number of samples required in each round from $m$ to $1$, keeping the same sample complexity. Specifically, we cast GDRO as a two-player game where one player simply performs SMD and the other executes an online algorithm for non-oblivious multi-armed bandits. Next, we consider a more practical scenario where the number of samples that can be drawn from each distribution is different, and propose a novel formulation of weighted GDRO, which allows us to derive distribution-dependent convergence rates. Denote by $n_i$ the sample budget for the $i$-th distribution, and assume $n_1 \geq n_2 \geq \cdots \geq n_m$. In the first approach, we incorporate non-uniform sampling into SMD such that the sample budget is satisfied in expectation, and prove that the excess risk of the $i$-th distribution decreases at an $O(\sqrt{n_1 \log m}/n_i)$ rate. In the second approach, we use mini-batches to meet the budget exactly and also reduce the variance in stochastic gradients, and then leverage the stochastic mirror-prox algorithm, which can exploit small variances, to optimize a carefully designed weighted GDRO problem. Under appropriate conditions, it attains an $O((\log m)/\sqrt{n_i})$ convergence rate, which almost matches the optimal $O(\sqrt{1/n_i})$ rate of only learning from the $i$-th distribution with $n_i$ samples.
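The first result in the abstract (SMD drawing $m$ samples per iteration) can be illustrated with a minimal sketch. It assumes the usual saddle-point form $\min_w \max_{q \in \Delta_m} \sum_{i=1}^m q_i R_i(w)$, where $R_i$ is the risk on the $i$-th distribution and $\Delta_m$ is the probability simplex. Everything concrete below is a hypothetical toy setup and not taken from the paper: the function names, the scalar model $w$, the squared loss, the uniform noise, and the constant step sizes. The model player takes a projected gradient step (Euclidean mirror map) and the weight player takes an exponentiated-gradient step (entropy mirror map).

```python
import math
import random


def project_interval(w, radius):
    """Euclidean projection of a scalar onto [-radius, radius]."""
    return max(-radius, min(radius, w))


def smd_gdro(means, num_iters=20000, eta_w=0.02, eta_q=0.02, radius=3.0, seed=0):
    """SMD sketch for min_w max_q sum_i q_i * R_i(w) on a toy problem.

    Toy assumption (not from the paper): group i emits
    z = means[i] + Uniform(-0.5, 0.5), and the loss is 0.5 * (w - z)**2.
    Returns the averaged iterates (w_bar, q_bar).
    """
    rng = random.Random(seed)
    m = len(means)
    w = 0.0               # model parameter, kept inside [-radius, radius]
    q = [1.0 / m] * m     # group weights on the probability simplex
    w_sum = 0.0
    q_sum = [0.0] * m

    for _ in range(num_iters):
        # Accumulate the current iterates for the final average.
        w_sum += w
        for i in range(m):
            q_sum[i] += q[i]

        # Draw one fresh sample from each of the m distributions.
        samples = [mu + rng.uniform(-0.5, 0.5) for mu in means]

        # Stochastic gradients of sum_i q_i * loss_i with respect to w and q.
        losses = [0.5 * (w - z) ** 2 for z in samples]
        grad_w = sum(q_i * (w - z) for q_i, z in zip(q, samples))

        # Descent step for w (Euclidean mirror map = projected gradient step).
        w = project_interval(w - eta_w * grad_w, radius)

        # Ascent step for q (entropy mirror map = exponentiated gradient).
        scaled = [q_i * math.exp(eta_q * loss) for q_i, loss in zip(q, losses)]
        total = sum(scaled)
        q = [s / total for s in scaled]

    return w_sum / num_iters, [s / num_iters for s in q_sum]


if __name__ == "__main__":
    w_bar, q_bar = smd_gdro([-1.0, 0.0, 2.0])
    print("averaged w:", w_bar)
    print("averaged q:", q_bar)
```

In this toy problem every group has the same noise level, so the worst-case risk is minimized at the midpoint of the two extreme means, $w = 0.5$, and the averaged iterate should land close to that value. Each iteration consumes $m$ samples, one per distribution, which is the per-round cost that the bandit-based variant described in the abstract reduces to a single sample.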
