Important paper from Google on large batch optimization. They do impressively careful experiments measuring # iterations needed to achieve target validation error at various batch sizes. The main "surprise" is the lack of surprises. [thread]

The paper is a good example of lots of elements of good experimental design. They validate their metric by showing lots of variants give consistent results. They tune hyperparameters separately for each condition, check that the optimum isn't at the endpoints, and measure sensitivity.
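For concreteness, here is a minimal sketch of the "optimum isn't at the endpoints" check for a one-dimensional sweep. The function name `best_is_interior` and the example numbers are made up for illustration; this is not the paper's tuning code.

```python
def best_is_interior(grid, errors):
    """Return (best_value, is_interior) for a 1-D hyperparameter sweep.

    `grid` holds the values tried, `errors` the validation error of each.
    If the best value sits on the edge of the grid, the true optimum may
    lie outside the searched range, so the sweep should be widened.
    """
    order = sorted(range(len(grid)), key=lambda i: grid[i])
    values = [grid[i] for i in order]
    errs = [errors[i] for i in order]
    best = min(range(len(values)), key=lambda i: errs[i])
    return values[best], 0 < best < len(values) - 1


# Hypothetical learning-rate sweep: the best value is strictly inside the grid.
print(best_is_interior([1e-3, 1e-2, 1e-1, 1.0], [0.30, 0.10, 0.20, 0.90]))
```

Running this once per batch size (rather than reusing one tuned value everywhere) is the "separately for each condition" part.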
They have separate experiments where they hold fixed # iterations and # epochs, which (as they explain) measure very different things. They avoid confounds, such as batch norm's artificial dependence between batch size and regularization strength.
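A quick way to see why the two budgets differ is the bookkeeping below. It is only a sketch: the helper names and the dataset size of 50,000 are arbitrary choices for illustration.

```python
import math


def steps_for_epochs(num_examples, batch_size, epochs):
    """Number of updates taken under a fixed epoch budget."""
    return epochs * math.ceil(num_examples / batch_size)


def epochs_for_steps(num_examples, batch_size, steps):
    """Passes over the data made under a fixed step budget."""
    return steps * batch_size / num_examples


for batch_size in (100, 1000):
    print(batch_size,
          steps_for_epochs(50_000, batch_size, epochs=10),
          epochs_for_steps(50_000, batch_size, steps=5_000))
```

With epochs fixed, a 10x larger batch gets 10x fewer updates; with steps fixed, it gets to see 10x more data. So the first budget penalizes large batches and the second favors them.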
When the experiments are done carefully enough, the results are remarkably consistent between different datasets and architectures. Qualitatively, MNIST behaves just like ImageNet.
Importantly, they don't find any evidence for a "sharp/flat optima" effect whereby better optimization leads to worse final results. They have a good discussion of experimental artifacts/confounds in past papers where such effects were reported.
The time-to-target-validation is explained purely by optimization considerations. There's a regime where variance dominates, and you get linear speedups w/ batch size. Then there's a regime where curvature dominates and larger batches don't help. As theory would predict.
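The two regimes show up even in a toy model. The sketch below is not from the paper; it assumes a noisy quadratic loss with three curvature directions, gradient noise whose variance scales as 1/batch size, plain SGD, and a learning rate re-tuned for every batch size. All constants (curvatures, noise levels, target, learning-rate grid) are arbitrary illustrative choices.

```python
def steps_to_target(batch_size, lr, curvatures, noise, target, max_steps):
    """Steps of SGD until the expected loss of a noisy quadratic hits `target`.

    Tracks E[theta_i^2] exactly per dimension via
        m <- (1 - lr*h)^2 * m + lr^2 * c / batch_size.
    Returns None if the target is not reached within `max_steps`.
    """
    decay = [(1.0 - lr * h) ** 2 for h in curvatures]
    if max(decay) >= 1.0:
        return None  # learning rate too large: some direction never contracts
    # Steady-state loss: the noise floor this (lr, batch size) pair cannot beat.
    floor = 0.5 * sum(h * lr ** 2 * c / batch_size / (1.0 - d)
                      for h, c, d in zip(curvatures, noise, decay))
    if floor >= target:
        return None
    moments = [1.0] * len(curvatures)  # start every coordinate at theta_i = 1
    for step in range(1, max_steps + 1):
        moments = [d * m + lr ** 2 * c / batch_size
                   for d, c, m in zip(decay, noise, moments)]
        loss = 0.5 * sum(h * m for h, m in zip(curvatures, moments))
        if loss <= target:
            return step
    return None


def best_steps(batch_size, curvatures=(1.0, 0.1, 0.01),
               noise=(1.0, 0.1, 0.01), target=0.01):
    """Fewest steps to target over a log-spaced learning-rate grid."""
    best = None
    for k in range(50):
        lr = 1.9 * 0.8 ** k
        cap = 100_000 if best is None else best  # no need to run past the best so far
        steps = steps_to_target(batch_size, lr, curvatures, noise, target, cap)
        if steps is not None and (best is None or steps < best):
            best = steps
    return best


for b in (1, 4, 16, 64, 256, 1024, 4096):
    print(b, best_steps(b))
```

At small batch sizes the noise floor forces a small learning rate, so doubling the batch should roughly halve the step count. At large batch sizes the learning rate is capped by the sharpest curvature direction, and extra examples per step buy almost nothing.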
Incidentally, this paper must have been absurdly expensive, even by Google's standards. Doing careful empirical work on optimizers requires many, many runs of the algorithm. (I think surprising phenomena on ImageNet are often due to the difficulty of running proper experiments.)

Most Liked Replies

JP Raymond:
Nice, but training steps for different batch sizes don’t take the same time, right?
Daniel Roy:
I'm very curious to look at this paper, because, from the sounds of it, they studied the problem of overfitting on the validation set. I wouldn't expect conclusions on that to relate to generalization. Also: poor optimization can induce generalization.
Saleh Elmohamed:
Really nice work to go over. Thanks Roger for the interesting review. Their use of data parallelism + the SGD variants particularly those without/with Nesterov momentum (optimization-wise) caught my interest so reviewed it shortly after it appeared at the arXiv.
David Page:
Since variance comes largely from choice of minibatch in a finite dataset, the size of the variance dominated regime should grow with dataset size, all things equal. They see this in one ImageNet experiment but puzzlingly not the other.
brendan o'connor:
interesting, thanks for writing up
David Page:
I remain surprised by their fig 17b which seems to show training half of ImageNet classes to 30% error takes as long as training all classes to 25% error. Any insight?
