This is what Karpathy tweeted earlier today:
“lots of exciting recent work in large-scale distributed training of neural nets: (very) large-batch SGD, KFAC, ES, population-based training / ENAS, (online) distillation, ... ”
Maybe @jimmy_d can explain that to mere mortals like me.
From Twitter comments:
----
If you have a batch of 64, the most (trivial) parallelism you can get is to break the batch up across 64 processors. If you have a batch of 12k, you can now use 12k nodes. Coarse parallelism tends to scale best too, so you get closer to a linear speed-up with more nodes.
----
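To make that comment concrete, here is a minimal sketch (my own, not from the thread) of the data-parallel idea in plain NumPy: split one batch into shards, have each "worker" compute a gradient on its shard, then average the shard gradients. With equal-sized shards the averaged gradient is identical to the full-batch gradient, which is why splitting a batch of 64 across 64 processors gives the same update, just computed in parallel.

```python
import numpy as np

def shard_gradients(X, y, w, n_workers):
    """Per-shard gradients of MSE loss for a toy linear model y ~ X @ w."""
    grads = []
    for X_s, y_s in zip(np.array_split(X, n_workers), np.array_split(y, n_workers)):
        err = X_s @ w - y_s
        grads.append(2.0 * X_s.T @ err / len(y_s))  # gradient on this shard only
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))   # batch of 64 examples, as in the comment
y = rng.normal(size=64)
w = np.zeros(8)

# "64 processors" case: one example per worker, simulated here in a loop
avg_grad = np.mean(shard_gradients(X, y, w, n_workers=64), axis=0)
full_grad = 2.0 * X.T @ (X @ w - y) / len(y)
print(np.allclose(avg_grad, full_grad))  # True: averaged shard gradients == full-batch gradient

w -= 0.01 * avg_grad  # one SGD step using the combined gradient
```

In a real distributed setup the loop is replaced by 64 machines each holding one shard, plus an all-reduce (or parameter server) step to average the gradients before the weight update.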
Basically, more parallel processing, which in theory results in faster network training.