Yeah, seems really doubtful and non-optimal, but who knows. My understanding is that training is a totally inappropriate task for (widely) distributed compute. You can read about the "advantages" of Dojo with massive memory bandwidth and fast interprocessor communication. Maybe they'll work miracles, but it seems like a potentially fundamental limitation. I know nothing.
Just weird that they are building a data center with a bunch of HW4 in it, followed by AI5, as Elon claims.
If they're using copies of the on-board hardware, then that's probably an inference-only phase. E.g. they research and train on large nVidia-based supercomputers with strong, very-low-latency interprocessor hardware and all the tricks nVidia has brought to bear over 20 years. Then there's an inevitable sparsification and quantization compression of the models to fit on car hardware, and then they test/score the compressed models on realistic hardware fed with real and simulated sensor recordings.
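As a rough illustration of that compression step, here's a minimal sketch of symmetric int8 post-training quantization of a single weight matrix (this is a generic technique, not Tesla's actual pipeline; the function names are made up for the example):

```python
import numpy as np

# Hypothetical sketch: symmetric per-tensor int8 quantization, the kind
# of compression used to shrink fp32 training weights down to something
# embedded inference hardware can run.

def quantize_int8(w: np.ndarray):
    """Map fp32 weights to int8 plus a single scale factor."""
    scale = np.abs(w).max() / 127.0            # largest magnitude maps to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

np.random.seed(0)
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Real pipelines are fancier (per-channel scales, calibration data, sparsity masks), but the principle is the same: trade a little accuracy for a model that fits the car's memory and compute budget.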
The evaluation is intrinsically highly parallel---keep the model the same, feed in video clips, and record the output---and doesn't need strong interprocessor communication, unlike training, so the limitations of the on-board hardware are much less of a problem.
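To see why that's embarrassingly parallel, here's a toy sketch: a frozen model is scored against many clips independently, with no gradient exchange between workers. `score_clip` is a made-up stand-in for running inference on one recorded clip and scoring the result:

```python
from concurrent.futures import ProcessPoolExecutor

def score_clip(clip_id: int) -> float:
    # Placeholder: in reality, run the frozen model on the clip's
    # sensor recording and score its driving decisions.
    return float(clip_id % 10) / 10.0

def evaluate(clip_ids) -> float:
    # Each clip is scored in isolation; workers never talk to each
    # other, unlike distributed training, which must sync gradients.
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(score_clip, clip_ids))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(evaluate(range(100)))
```

Because there's no cross-worker communication, you can scale this across as many boxes of car-grade hardware as you like, which is exactly the workload a rack of HW4 would be good for.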
Of course there's a test set inside the main nVidia loop, but that's a test set on the ML loss function, which has to be gradient-differentiable, and that isn't the ultimate desired goal ("does the model drive the way we want it to, and safely?").
The idea seems sound---train many models and vary architectures, then evaluate whether those human-scale changes (experiments the humans decide upon) are improvements or degradations in more realistic situations before pushing to actual cars, where the feedback is more stochastic and less precise.
The nVidia computers will be under higher demand and are more expensive, so pushing less critical work off them in favor of more training cycles is a win. At large scale there are always more potential experiments the modelers want to run than there are compute resources for.