
The Effect of Network Width on the Performance of Large-batch Training
Lingjiao Chen · Hongyi Wang · Jinman Zhao · Dimitris Papailiopoulos · Paraschos Koutris

Wed Dec 05 02:00 PM -- 04:00 PM (PST) @ Room 210 #30

Distributed implementations of mini-batch stochastic gradient descent (SGD) suffer from communication overheads, due to the high frequency of gradient updates inherent in small-batch training. Training with large batches can reduce these overheads; however, it can slow the algorithm's convergence and hurt generalization performance.
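The communication saving from large batches can be seen with a back-of-the-envelope count: in data-parallel SGD, each global mini-batch triggers one gradient synchronization (e.g., an all-reduce), so the number of synchronization rounds per epoch scales inversely with the batch size. A minimal sketch, with hypothetical dataset and batch sizes chosen purely for illustration:

```python
import math

def sync_rounds_per_epoch(num_examples, batch_size):
    """Gradient synchronizations per epoch of data-parallel mini-batch SGD:
    one all-reduce per global mini-batch."""
    return math.ceil(num_examples / batch_size)

# Hypothetical numbers: 1M training examples.
small_batch = sync_rounds_per_epoch(1_000_000, 128)    # -> 7813 rounds
large_batch = sync_rounds_per_epoch(1_000_000, 8192)   # -> 123 rounds
```

Growing the batch size by 64x cuts the number of communication rounds per epoch by roughly the same factor, which is why large-batch training is attractive in distributed settings despite its optimization drawbacks.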

In this work, we take a first step towards analyzing how the structure (width and depth) of a neural network affects the performance of large-batch training. We present new theoretical results which suggest that, for a fixed number of parameters, wider networks are more amenable to fast large-batch training compared to deeper ones. We provide extensive experiments on residual and fully-connected neural networks which suggest that wider networks can be trained using larger batches without incurring a convergence slow-down, unlike their deeper variants.
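The "fixed number of parameters" comparison above can be made concrete with a parameter-counting helper for fully-connected networks. The architectures below are hypothetical (not taken from the paper's experiments): a deep narrow MLP and a shallow wide one with roughly matched parameter budgets.

```python
def mlp_params(widths):
    """Parameter count of a fully-connected network with the given layer
    widths, including biases: sum of (fan_in + 1) * fan_out per layer."""
    return sum((w_in + 1) * w_out for w_in, w_out in zip(widths, widths[1:]))

# Hypothetical architectures, 784-dim input, 10 outputs:
deep = [784] + [128] * 6 + [10]  # six narrow hidden layers
wide = [784, 232, 10]            # one wide hidden layer

mlp_params(deep)  # -> 184330
mlp_params(wide)  # -> 184450 (within ~0.1% of the deep network)
```

With the budget held (approximately) fixed, width and depth become the free design axes, which is the trade-off the theoretical results address.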

Author Information

Lingjiao Chen (University of Wisconsin-Madison)
Hongyi Wang (University of Wisconsin-Madison)
Jinman Zhao (University of Wisconsin-Madison)
Dimitris Papailiopoulos (University of Wisconsin-Madison)
Paraschos Koutris (University of Wisconsin-Madison)
