Thanks Jimmy. Gives more context to this leaked email from a few months ago: Any wild guesses what impact that 10x processing + everything resolution might have?
Well, as of right now the neural networks are not trained for signage recognition. Elon said in the Q2 investor call that that will require HW3. So maybe they will run different/additional NNs for added functionality rather than just using bigger and bigger training sets. Like, path planning could be the same for Europe and the US, but with different training sets for the different signage?
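One way to picture that idea (purely a toy sketch, not anything known about Tesla's actual architecture) is a shared "trunk" whose weights are common to all markets, with small region-specific "heads" trained on each region's signage. All names and shapes here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "trunk" weights used for every market (e.g. path-planning features).
trunk_w = rng.standard_normal((16, 8))

# Region-specific "head" weights, hypothetically trained on different signage datasets.
heads = {
    "US": rng.standard_normal((8, 4)),
    "EU": rng.standard_normal((8, 4)),
}

def forward(features, region):
    """Run the shared trunk, then the head for the given region."""
    hidden = np.maximum(features @ trunk_w, 0.0)  # ReLU on the shared features
    return hidden @ heads[region]

x = rng.standard_normal(16)
us_out = forward(x, "US")
eu_out = forward(x, "EU")
print(us_out.shape)  # (4,) -- same architecture, different signage weights
```

The point of the sketch: the expensive shared part is trained once, and only the lightweight heads differ per region.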
Fantastic writeup, thank you for doing this! I'd also add that this is probably more than just a "fringe" benefit: swapping to/from the limited 4-8 GB of GPU memory can be very expensive: easily an order of magnitude slower than keeping the data within GPU RAM.

Switching NN weights on the GPU is roughly equivalent to a 3D rendering pipeline "switching state" or "swapping textures": it involves not just a lot of data transfer over the comparatively narrow, single-channel memory bus to host RAM, but also a "batch flush": the thousands of parallel compute threads have to be waited on, and the switch cannot happen until the slowest one has finished. With a unified, camera-agnostic NN weight file, all the compute units can use the same parameters all the time, and the processing of frames from different cameras can be freely intermixed to maximize the utilization of the vector CPUs.

Note that beyond residency and good batching, a unified weight file has a third benefit: it increases the hit rate (efficiency) of the Nvidia L1/texture hardware caches. If all frames use the same weights, then frames from different cameras might still share the same high-level weights in the compute unit's hardware cache.

(A fourth, minor benefit is that if the GPU uses the system bus much less, then high-level computing on the host CPU will be a bit faster as well, because main RAM has lower utilization. This should improve the performance and determinism of the vehicle control algorithms running on the host CPUs.)

So a fully resident, camera-agnostic, unified NN weight state in GPU RAM is a Big Falcon Deal, IMHO ...
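A toy way to see why residency matters (my own illustration, not anything from the binaries): if each camera had its own network, a round-robin frame schedule would force a weight upload plus pipeline flush on nearly every frame, whereas a unified network pays that cost once:

```python
# Toy model of GPU weight residency: per-camera networks force a weight
# swap (and batch flush) every time the scheduler switches cameras,
# while a unified network keeps one weight set resident the whole time.

def weight_swaps(frame_schedule, per_camera):
    """Count weight uploads needed to process a schedule of camera frames."""
    swaps, resident = 0, None
    for camera in frame_schedule:
        needed = camera if per_camera else "unified"
        if needed != resident:
            swaps += 1          # host->GPU transfer + pipeline flush
            resident = needed
    return swaps

cameras = ["main", "narrow", "fisheye", "pillar_l", "pillar_r",
           "repeater_l", "repeater_r", "rear"]
schedule = cameras * 30        # 30 rounds of 8 cameras = 240 frames

print(weight_swaps(schedule, per_camera=True))   # 240 swaps
print(weight_swaps(schedule, per_camera=False))  # 1 swap
```

Each swap in the per-camera case is the "state switch" described above, so the unified scheme wins by a couple orders of magnitude on this axis alone, before even counting the cache benefits.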
Wouldn't it be the case that the display gets input at some slower rate to reduce the latency of important segments, and the cars are drawn with some "predictor-corrector" algorithm, i.e. it is just a "simulation": a model guided by the sensor system with some arbitrary precision? In any case, it is hard to imagine that Tesla doesn't use fuzzy control arithmetic, and its use means that even if the detection scheme had significant ghosting, that wouldn't have to translate into car sensitivity and unstable driving.
Wow, nice writeup. I assume the pictures can be transformed before being labelled to look (to the labeller) as if they were taken with the same camera, and then the labels can be transformed back. Also, pictures can be transformed to augment the dataset. Maybe they have some simulation environment with a GAN that makes pictures look very real; then they can easily get a huge amount of very correct labels. And a few GANs to change daytime into night time, etc. Just brainstorming how you could get a few orders of magnitude more data.
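For the non-GAN part of that idea, classic augmentation really is this cheap: each labelled frame can be turned into several training samples with simple array transforms. A minimal sketch (tiny grayscale array standing in for a camera frame; the specific transforms are just common examples, not anything known about Tesla's pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, rng):
    """Return a list of cheap variants of one labelled frame."""
    variants = [image]
    variants.append(image[:, ::-1])                   # horizontal flip
    variants.append(np.clip(image * 0.4, 0, 255))     # darkened, "night time"-ish
    noisy = image + rng.normal(0, 8, image.shape)     # simulated sensor noise
    variants.append(np.clip(noisy, 0, 255))
    return variants

frame = rng.integers(0, 256, (4, 6)).astype(float)  # tiny grayscale stand-in
out = augment(frame, rng)
print(len(out))  # 4 samples from one original label
```

Note that geometric transforms like the flip also require flipping any positional labels (bounding boxes etc.), which is the "labels can be transformed back" step.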
Thanks for the very informative analysis. During our recent 900 mile trip, having V9 only recently installed, there seemed to me to be a lot of welcome changes. The road we travel is not smooth and has some pavement-to-bridge-section elevation irregularities. In the past the sensor data acquisition signal would get lost. This did not happen at all during our trip!!! Migration of the car toward the exit ramp or entrance transition is better: it doesn't swerve as drastically in order to acquire the rightmost pavement line. States seem to be marking entrance ramp openings to the highways with broken white lines, and the car doesn't swerve when these are present. State departments of transportation seem to be marking these more reliably; now they need to begin marking entrance-to-highway transitions. Thanks for the NN analysis.
I've told several friends that AP on V9 just "feels" enormously more sophisticated than V8, this explains it.
Warning: I’m highly unqualified to even be attempting to discuss this, but here goes: I was thinking the other day while driving what a tremendous undertaking it would be to have different NNs for each of the “newly enabled” cameras, given the change in perspective and shape that each (minus the 2 pairs) is collecting and having to work with. But then it occurred to me that one of my hobbies that I DO know a little bit about is photography, and in that sense, wouldn’t they just be able to do the equivalent of the “lens correction” I do in post-process (Lightroom/Photoshop) to account for the various tendencies of different lenses (like barrel/pincushion distortion)? In other words, I understand that a “car” looks wildly different from the perspective of the fisheye vs front vs side repeater vs rear camera, but if you could “normalize” the captured frame across all of them by mapping and correcting the lens distortion, wouldn’t that allow them all to use the same NN? (Again, this might be exactly what you’re describing, or I’m completely misunderstanding! Also, huge thanks for writing these - it obviously takes a lot of time and effort, but the community is hugely appreciative!)
Thanks for this @jimmy_d. I'm wondering if it's possible Tesla is leveraging the processing power deployed in each car to do this training? I mean, they are building and deploying on the order of 10k units a week into the field... I'm a NN newb though, so I'm not sure if this makes sense at all, or whether it's possible to do this training in a decentralized way. Appreciate any insights!
Tesla’s fleet is like ‘a large, distributed, mobile data center’, says Tesla’s new AI director Just think of it....
Looks like someone cross-posted to Reddit a few hours ago. So they're talking about it over there now too. Amazing post on big Neural Network Changes in V9 - Potentially an order of magnitude or more better : teslamotors Thanks for the writeup @jimmy_d!
Thanks for the amazing write up!!!! Would you say that this monster NN in V9 is the NN for "FSD"? I am not saying that the cars have FSD, but it would seem like this new monster NN would be the one that FSD eventually uses. The fact that it is so big and uses all 8 cameras seems like a good indication that FSD will use the V9 NN. If so, that is pretty exciting: with V9, we would have switched from a "strictly EAP" NN to a "FSD capable" NN. Or do you think this is just a "pre-FSD" NN, and we will get yet another new NN when V10 comes out that will be the "FSD" NN?
I’ve tried in the past. Not enough karma so it won’t let me post. Feel free to copy it over if you think there’s a community that would find it interesting.
Yes, this is possible and widely done. V8 had an undistort function in the binary and indications that stereo processing was happening with main and narrow. This can be done with non NN techniques if undistort and raster alignment are used and I think that might well be going on in V8. But there are limits to how much undistort can help NNs because it introduces problems of its own. There’s an argument that just letting the NN learn the distortion along with everything else is the best solution if you have enough training data. So I don’t know if your conjecture applies but it’s certainly worth investigating. I’ll be looking for signs of it in the binaries when I have some time.
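For readers unfamiliar with "undistort": the idea is to remap each output pixel back to where it came from in the distorted image, using a lens model. A deliberately minimal sketch with a one-term radial model (production pipelines would use a full calibrated camera model, e.g. OpenCV's, and proper interpolation; this is just to show the mechanism):

```python
import numpy as np

def undistort(image, k1):
    """Nearest-neighbour undistort for a simple one-term radial model.

    For each output pixel we compute where it came from in the distorted
    image (r_src = r * (1 + k1 * r^2), radii normalised to the half-size)
    and copy that sample over. k1 > 0 corrects barrel distortion.
    """
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            dx, dy = (x - cx) / cx, (y - cy) / cy   # normalised coordinates
            r2 = dx * dx + dy * dy
            sx = int(round(cx + dx * (1 + k1 * r2) * cx))
            sy = int(round(cy + dy * (1 + k1 * r2) * cy))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out

img = np.arange(25.0).reshape(5, 5)
flat = undistort(img, k1=0.0)
print(np.array_equal(flat, img))  # True: k1 = 0 is the identity mapping
```

The resampling step is exactly where undistort "introduces problems of its own": pixels near the edges get stretched or dropped, which is part of why letting the NN learn the distortion can be the better trade with enough data.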
To me, this is the most intriguing thing you said. Can you roughly quantify for us how much training data you think is needed?
I was surprised to read that Tesla is not using 4K cameras, but then again, there's a trade-off between the amount of detail that can be seen and the processing power required...