I ended up staying up pretty late watching the whole thing last night. I work in a similar field of research (not self-driving, but computer-vision/object-detection/segmentation/tracking, etc) and really enjoyed the talk. Here are just some general thoughts and notes I had as I watched it:
1. I've always admired Andrej Karpathy as a very pragmatic, no-BS ML researcher, and I continue to be glad that he is leading this effort. Not to take anything away from the rest of the team, but the overall direction they have taken is quite something: not just where they are going with their ML approaches, but also the spectacular amount of engineering they have done to build the right tooling and infrastructure to help them iterate fast.
2. The evolution of their networks, architecture, and training approaches is all quite sensible, but these systems are extremely complex and require ridiculous amounts of data to train, which is why Tesla has built out all the amazing infrastructure on the labeling/reconstruction side of things, as well as with simulations.
3. Their approach to spatial and temporal memory is a nice and strongly needed upgrade, and the performance improvements in the demos were very nice to see.
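To make point 3 a bit more concrete, here's a minimal sketch of the "feature queue" idea as I understood it; the shapes and names are my own guesses, not Tesla's actual implementation. The gist: cache recent per-frame features together with ego kinematics, and let a temporal module consume the whole cache instead of a single frame:

```python
# Rough sketch of a feature queue; shapes/names are hypothetical, not Tesla's.
from collections import deque

import torch


class FeatureQueue:
    """FIFO cache of recent per-frame features plus ego kinematics."""

    def __init__(self, maxlen=16):
        self.buf = deque(maxlen=maxlen)  # oldest entry drops off when full

    def push(self, feats, ego_pose):
        # feats: (C, H, W) per-frame features; ego_pose: (6,) kinematics
        self.buf.append((feats, ego_pose))

    def as_tensors(self):
        # Stack the cache so a temporal module (RNN / attention) can attend
        # over the recent history instead of a single frame.
        feats = torch.stack([f for f, _ in self.buf])  # (T, C, H, W)
        poses = torch.stack([p for _, p in self.buf])  # (T, 6)
        return feats, poses
```

The interesting design choice is what triggers a push, e.g. elapsed time (useful when stuck waiting at a light) vs. distance traveled (useful for sparse road markings).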
4. The policy-learning section was really neat, and I'm sure there is a lot more interesting stuff under the hood.
5. Very excited to see where they have gone with simulations. The imagery looks quite impressive. It is very tricky because if your visuals don't look near-identical to the real world, the networks are so powerful that they can just learn to do well on synthetic imagery and then fall apart on real-world imagery. This is a common challenge with folding in synthetic data from simulations, so it was nice to see the extent of work they have done and continue to do with exploring neural rendering to try and bridge the gap between simulation and reality (a quick sketch of one way to measure that gap is below).
- I did note that the vehicle dynamics of the cars/trucks, etc. in the simulations were fairly nonexistent in terms of body roll and the like. But that's a second-order concern at the moment and unlikely to matter much for what they are trying to do.
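On that sim-to-real gap in point 5: one cheap sanity check (my own sketch, nothing from the talk, and the inputs here are just assumed to be `(N, 3, H, W)` float tensors) is to train a tiny "domain classifier" to tell real frames from synthetic ones. If it scores near 100% on held-out data, the renderer is leaving fingerprints the perception networks can latch onto:

```python
# My own sketch of a domain-gap sanity check; not from the presentation.
import torch
import torch.nn as nn


class DomainClassifier(nn.Module):
    """Tiny convnet: is this image real (label 0) or synthetic (label 1)?"""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),  # single real-vs-synthetic logit
        )

    def forward(self, x):
        return self.net(x)


def domain_gap_score(real, synth, steps=200):
    """Train on half of each set, report accuracy on the other half.
    ~0.5 => domains look alike; ~1.0 => the gap is glaring."""
    half = lambda x: (x[: len(x) // 2], x[len(x) // 2 :])
    (r_tr, r_te), (s_tr, s_te) = half(real), half(synth)
    x_tr = torch.cat([r_tr, s_tr])
    y_tr = torch.cat([torch.zeros(len(r_tr), 1), torch.ones(len(s_tr), 1)])
    x_te = torch.cat([r_te, s_te])
    y_te = torch.cat([torch.zeros(len(r_te), 1), torch.ones(len(s_te), 1)])

    model = DomainClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        return ((model(x_te) > 0).float() == y_te).float().mean().item()
```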
6. It's clear that they are starting to hit the limits of being able to fit all the inference for these complex networks onto their current FSD hardware. In the diagram they showed on one of the slides, it's clear that they are now leveraging both compute units to squeeze out the needed performance (rather than the original vision of keeping the second unit for redundancy). I wonder how they are going to handle the inevitable fragmentation when they release new hardware and have to support two separate sets of FSD/AP software and associated networks, because they certainly aren't going to foot the bill for upgrading their entire fleet.
7. The crazy level of custom engineering that has gone into every aspect of their pipeline... from labeling tools and infrastructure to training networks, simulations, and regression testing of their models... is quite spectacular.
8. Their offline processing of videos to generate ground-truth 3D point clouds and other assets, plus autolabeling, is extremely impressive. Anyone who works in this kind of field probably appreciated just how neat it is.
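Roughly, the core idea behind that kind of offline fusion (my simplified sketch with hypothetical names and shapes; Tesla's actual pipeline is far more involved) is that once you've recovered camera poses offline, you can lift per-frame depth into 3D, fuse everything into one world-frame cloud, and then project labels back into every frame essentially for free:

```python
# Simplified sketch of offline multi-frame 3D fusion; names are hypothetical.
import numpy as np


def backproject(depth, K):
    """Lift a depth map (H, W) into camera-frame 3D points via intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T  # unit-depth rays in the camera frame
    return rays * depth.reshape(-1, 1)  # scale each ray by its depth


def fuse_frames(depths, intrinsics, cam_to_world):
    """Accumulate every frame's points into a single world-frame cloud."""
    cloud = []
    for depth, K, T in zip(depths, intrinsics, cam_to_world):
        pts = backproject(depth, K)
        pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        cloud.append((pts_h @ T.T)[:, :3])  # apply 4x4 camera-to-world pose
    return np.concatenate(cloud)
```

Because this all happens offline, you get to be non-causal and throw arbitrary compute at it, which is exactly why it makes such good ground truth for training the in-car networks.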
9. They've definitely made the right decisions early on in how they set up the closed loop between the experiments they want to run (or data they want to collect) and their vehicle fleet. Tying hard/weird examples in with simulations to generate a large number of variants of a single real-world "weird" scenario is a logical extension and really neat to see, because it helps address the long tail of weird scenarios (something like the toy sketch below). It doesn't fundamentally solve the problem, because the long tail in the real world is really long, but over time it will still improve FSD's ability to handle weird scenarios.
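As a toy illustration of the variant-generation idea (the parameters here are entirely made up, not from the talk): take the parameters of one logged edge case and jitter/resample them to mint many training scenarios around the original failure:

```python
# Toy illustration of scenario variant generation; parameters are made up.
import random

base_scenario = {
    "ego_speed_mps": 12.0,   # speed when the edge case occurred
    "cut_in_gap_m": 8.0,     # gap the other car cut into
    "weather": "clear",
    "time_of_day": "dusk",
}


def variants(base, n=100, seed=0):
    """Yield n perturbed copies of one logged edge case."""
    rng = random.Random(seed)
    for _ in range(n):
        v = dict(base)
        v["ego_speed_mps"] *= rng.uniform(0.7, 1.3)  # jitter continuous params
        v["cut_in_gap_m"] *= rng.uniform(0.5, 1.5)
        v["weather"] = rng.choice(["clear", "rain", "fog", "snow"])
        v["time_of_day"] = rng.choice(["dawn", "noon", "dusk", "night"])
        yield v
```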
10. I feel like they were a bit disingenuous with their Dojo presentation. There was very little distinction made between what they have built, tested, and benchmarked to date vs. what was aspirational. They only just got one working D1 on a benchtop that they managed to train a small GPT model on, but the slides would have you believe they have a huge room-sized cluster almost built up and ready to go. During the Q&A session, a researcher working on compilers for distributed computing systems asked whether Tesla had managed to solve a very hard problem with such systems that is still an active area of research in academia, and the reply from the Tesla counterpart was very wishy-washy: basically no, it's hard, but we think we can solve it. How Dojo eventually shakes out is still a big unknown at this point.
11. While the inner geek in me loved the Dojo stuff, and I'm sure anyone working on computing hardware would absolutely love to see someone spending time and a lot of resources on developing a whole new architecture, I have to say that taking a step back and looking past all the hype, it still isn't clear to me that investing all of this effort into Dojo is really all that beneficial to Tesla. It feels more like someone said "this would be cool", and they just ran with it. At the end of the day, even if they meet all their performance and efficiency goals, it's not like Dojo will let them train fundamentally new types of networks that are impossible to train on any other hardware. The Googles, Facebooks, and OpenAIs of the world are doing just fine training extraordinarily complex ML systems on conventional clusters of NVIDIA GPUs (or TPUs), and Tesla itself has managed everything it has achieved so far on such clusters. All Dojo will let them do is train things somewhat faster, at a lower cost. It isn't a game-changer that opens up fundamentally new directions and opportunities for Tesla. I wonder if the company wouldn't have been better served building out a regular GPU cluster and allocating all those resources elsewhere. Still curious to see how it shakes out, but it seems like a bit of a gamble with very questionable long-term benefit.
All in all, it was a neat presentation, and everything looks to be going in a better direction; it certainly doesn't look like Tesla is just stuck spinning its wheels. I'm still very skeptical about them getting anywhere near L5 anytime soon, but I do feel fairly confident that they will end up with an extremely solid AP, even in city conditions, given the progress they are making. I still wish they had just taken the plunge and augmented with automotive lidar, but I guess that's not going to happen, both because of the optics of switching up their sensor suite so late in the game and because they have already sold FSD to so many customers with the current camera suite.