Roger Grosse
@RogerGrosse 5 months, 3 weeks ago 1039 views

Important paper from Google on large batch optimization. They do impressively careful experiments measuring # iterations needed to achieve target validation error at various batch sizes. The main "surprise" is the lack of surprises. [thread]


The paper is a good example of lots of elements of good experimental design. They validate their metric by showing lots of variants give consistent results. They tune hyperparamters separately for each condition, check that optimum isn't at the endpoints, and measure sensitivity.
They have separate experiments where the hold fixed # iterations and # epochs, which (as they explain) measure very different things. They avoid confounds, such as batch norm's artificial dependence between batch size and regularization strength.
When the experiments are done carefully enough, the results are remarkably consistent between different datasets and architectures. Qualitatively, MNIST behaves just like ImageNet.
Importantly, they don't find any evidence for a "sharp/flat optima" effect whereby better optimization leads to worse final results. They have a good discussion of experimental artifacts/confounds in past papers where such effects were reported.
The time-to-target-validation is explained purely by optimization considerations. There's a regime where variance dominates, and you get linear speedups w/ batch size. Then there's a regime where curvature dominates and larger batches don't help. As theory would predict.
Incidentally, this paper must have been absurdly expensive, even by Google's standards. Doing careful empirical work on optimizers requires many, many runs of the algorithm. (I think surprising phenomena on ImageNet are often due to the difficulty of running proper experiments.)

You May Also Like

So the cryptocurrency industry has basically two products, one which is relatively benign and doesn't have product market fit, and one which is malignant and does. The industry has a weird superposition of understanding this fact and (strategically?) not understanding it.

The benign product is sovereign programmable money, which is historically a niche interest of folks with a relatively clustered set of beliefs about the state, the literary merit of Snow Crash, and the utility of gold to the modern economy.

This product has narrow appeal and, accordingly, is worth about as much as everything else on a 486 sitting in someone's basement is worth.

The other product is investment scams, which have approximately the best product market fit of anything produced by humans. In no age, in no country, in no city, at no level of sophistication do people consistently say "Actually I would prefer not to get money for nothing."

This product needs the exchanges like they need oxygen, because the value of it is directly tied to having payment rails to move real currency into the ecosystem and some jurisdictional and regulatory legerdemain to stay one step ahead of the banhammer.