FSD tweets

Yeah, seems really doubtful and non-optimal, but who knows. My understanding is that training is a totally inappropriate task for (widely) distributed compute. You can read about the “advantages” of Dojo, with its massive memory bandwidth and fast IPC. Maybe they’ll work miracles, but this seems like a potentially fundamental limitation. I know nothing.

Just weird that they're building a data center with a bunch of HW4 in it, followed by AI5, as Elon claims.

If they're using copies of the on-board hardware, then that's probably an inference-only phase. E.g., they research and train on large Nvidia-based supercomputers with very low-latency interprocessor hardware and all the tricks Nvidia has brought to bear over 20 years. Then there's an inevitable sparsification and quantization of the models to compress them onto car hardware, and then they test/score them on realistic hardware fed with real and simulated sensor recordings.
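A minimal sketch of that compression step using stock PyTorch pruning and dynamic quantization (the model here is a tiny stand-in, not anything like Tesla's actual networks):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Tiny stand-in for a trained model; the real networks are vastly larger.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Sparsification: zero out the 50% smallest-magnitude weights in each layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the sparsity into the weights

# Quantization: store Linear weights as int8, roughly a 4x memory reduction.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```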

The evaluation is intrinsically highly parallel---keep the model the same, feed in video clips, and record the output---and doesn't need strong interprocessor communication, unlike training, so the limitations of the on-board hardware are not as substantial.
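To make the "intrinsically parallel" point concrete, here's a toy sketch: a frozen model is scored over many clips with a process pool and no cross-worker communication. score_clip and the clip paths are placeholders I made up:

```python
from concurrent.futures import ProcessPoolExecutor

CLIP_PATHS = [f"clip_{i:05d}.bin" for i in range(10_000)]  # recorded sensor clips

def score_clip(path: str) -> float:
    """Run the frozen model over one clip and return a scalar score.
    Placeholder: the real version would load the clip and run inference."""
    return hash(path) % 100 / 100.0

if __name__ == "__main__":
    # Each clip is independent, so throughput scales with worker count.
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(score_clip, CLIP_PATHS, chunksize=64))
    print(f"mean score over {len(scores)} clips: {sum(scores) / len(scores):.3f}")
```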

Of course there's a test set inside the main Nvidia loop, but that's a test set on the gradient-differentiable ML loss function, which isn't the ultimate desired goal ("does the model drive the way we want it to, and safely?").
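A small illustration of that distinction, with hypothetical names:

```python
import torch
import torch.nn.functional as F

def surrogate_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Gradients flow through this, so it can steer training directly.
    return F.mse_loss(pred, target)

def behavioral_metric(interventions: int, miles: float) -> float:
    # Interventions per mile is closer to "does it drive safely?", but it's
    # not differentiable, so it can only rank or gate models, not train them.
    return interventions / miles
```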

The idea seems sound---be able to train many models and change architectures, and then evaluate whether those human-scale changes (experiments the humans decide upon) are improvements or degradations in more realistic situations before pushing to actual cars and getting more stochastic and less precise feedback.

The Nvidia computers will be under higher demand and will be more expensive, so pushing less critical work off them in favor of more training cycles is a win. At large scale there are always more potential experiments the modelers want to run than there are compute resources for.
 
The evaluation is intrinsically highly parallel---keep the model the same, feed in video clips, and record the output---and doesn't need strong interprocessor communication, unlike training, so the limitations of the on-board hardware are not as substantial.
Sure.

I can understand the need to evaluate models in a data center, but it seems pointlessly complex to distribute it, especially if you… have a bunch of HW4/AI5 in a data center.

So distributed computing seems to make little sense. (You weren’t suggesting that but it is another topic here.)

What you point out does make some sense of Elon’s Tweet though. Seems like a very small part of the power footprint of that datacenter!

Sounds like they may be finally getting around to simulating! About time!
 
Sure.

I can understand the need to evaluate models in a data center, but it seems pointlessly complex to distribute it, especially if you… have a bunch of HW4/AI5 in a data center.

So distributed computing seems to make little sense. (You weren’t suggesting that but it is another topic here.)
I agree entirely. Using customers' own car hardware for this task is a lot of work for little gain, with plenty of opportunity for public backlash. Like Oscar Mayer using customers' blenders to make hot dogs. The labor of high-end developers needed to make that work is much better spent making their in-house training IT reliable and available. Even if the inference hardware doesn't need strong interprocess communication at the ML-model level, it does need high-throughput video streamed to it for scoring. Ideally it would run videos with zero wait, i.e., a bit faster than real time (the next video frame is always available when computation on the current one ends). The IO system would have to be substantial to feed thousands of processors real-time video, then gather and collate the results. Not easily done with end users on a weak wi-fi signal.
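For a sense of scale, a back-of-envelope sketch; every figure here (camera count, resolution, frame rate, encoding, node count) is my assumption, not a published Tesla number:

```python
# Rough aggregate bandwidth needed to keep inference nodes fed with video.
CAMERAS = 8                 # cameras per vehicle recording (assumed)
WIDTH, HEIGHT = 1280, 960   # pixels per camera frame (assumed)
FPS = 36                    # frames per second (assumed)
BYTES_PER_PIXEL = 1.5       # YUV420-style lightly packed frames (assumed)
NODES = 5_000               # inference nodes to keep busy (assumed)

per_node_bps = CAMERAS * WIDTH * HEIGHT * FPS * BYTES_PER_PIXEL * 8
total_gbps = per_node_bps * NODES / 1e9
print(f"per node: {per_node_bps / 1e9:.2f} Gbit/s; "
      f"{NODES} nodes: {total_gbps:,.0f} Gbit/s aggregate")
# ~4 Gbit/s per node and ~21 Tbit/s aggregate: feasible inside a data
# center's network fabric, hopeless over residential wi-fi uplinks.
```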

On the training side, it may be that the video-perception part (the one with the highest quantity of bits to move) is only retrained occasionally (every few months) given its computational burden, but there's probably a desire to retrain and adapt the neural policy models much more frequently, as this is where the problems seem to lie today (though possibly some policy errors were in fact perception errors we don't know about). But the results of a driving-policy model are probably more expensive to evaluate than perception, where they already have enough annotated videos and bounding boxes of objects.
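A minimal sketch of that split-retraining idea in PyTorch: freeze a stand-in perception backbone and update only a policy head. The module names and shapes are invented for illustration:

```python
import torch
import torch.nn as nn

# Stand-ins: a "perception" feature extractor and a small "policy" head.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten())
policy_head = nn.Linear(16 * 62 * 62, 4)

# Perception is retrained only occasionally, so freeze it this cycle.
for p in backbone.parameters():
    p.requires_grad = False

# Only the policy head's parameters receive gradient updates.
optimizer = torch.optim.Adam(policy_head.parameters(), lr=1e-4)

frames = torch.randn(2, 3, 64, 64)      # fake camera frames
actions = policy_head(backbone(frames))
loss = actions.pow(2).mean()            # placeholder loss
loss.backward()
optimizer.step()
```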

What you point out does make some sense of Elon’s Tweet though. Seems like a very small part of the power footprint of that datacenter!

Sounds like they may be finally getting around to simulating! About time!
They've been making simulations for a long time. This is where the new vision transformers and other recent ML discoveries are likely to be helpful, making more realistic simulation data than traditional 3D-animation CGI would.
 
Tesla: "FSD does not make your car autonomous."

Also Tesla: "Use FSD while intoxicated."
 
Shouldn't they drop 12.4 and just go with 12.5?
My impression is that they're experimenting with different formulations of the neural networks in each minor version. They throw a 12.4 at us, collect data on it, then gain understanding of which bits of that formulation work and which don't. Then they move on to 12.5, which will have different strengths and weaknesses. There may be a 12.6 that they've thought up as well. Ideally, they accumulate enough knowledge to make an acceptable product.

If true, these versions aren't sequential improvements to a basic set of functionality so much as sequential experiments on the same basic architecture.
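A toy sketch of what that bookkeeping might look like; the versions, scenarios, and scores below are all invented, purely to show the shape of the comparison:

```python
from statistics import mean

# Per-scenario scores for each build over the same fixed clip set.
results = {
    "12.4": {"unprotected_left": [0.7, 0.8], "roundabout": [0.4, 0.5]},
    "12.5": {"unprotected_left": [0.5, 0.6], "roundabout": [0.8, 0.9]},
}

# Keep scenario-level detail so each build's strengths and weaknesses show.
for version, scenarios in results.items():
    for scenario, scores in scenarios.items():
        print(f"v{version} {scenario:>16}: mean={mean(scores):.2f}")
```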
 