Elon has just replied "Summarized well" to Lex Fridman's tweet sharing his thoughts on AI Day. Here is the text of Lex's AI Day reaction video:
Lex Fridman
Tesla AI Day presented the most amazing real-world AI and engineering effort I have ever seen in my life. I wrote this, and I meant it.
Why was it amazing to me? No, not primarily because of the Tesla Bot.
It was amazing because:
- I believe the autonomous driving task, and the general, real-world, robotics perception and planning task, is a lot harder than people generally think, and I also believed that
- The scale of effort in algorithm, data, annotation, simulation, inference compute and training compute required to solve these problems was something no one would be able to do in the near-term.
- Yesterday was the first time I saw in one place just the kind and the scale of effort that has a chance to solve this: the autonomous driving problem, and the general, real-world, robotics perception and planning problem.
- This includes:
- The neural network architecture and pipeline,
- The autopilot compute hardware in the car,
- Dojo compute hardware for training,
- The data, and the annotation,
- The simulation for rare edge-cases, and, yes,
- The generalised application of all of the above, beyond the car robot, to the humanoid form.
Let’s go through the big innovations:
The neural network:
- Each of these is a difficult, and I would say brilliant, design idea that is either a step or a leap forward from the state of the art in machine learning.
- First is to predict in vector-space, not in image-space. This alone is a big leap beyond what is usually done in computer vision, which usually operates in image-space, on the 2-dimensional image.
- The thing about reality is that it happens out there in the 3-dimensional world, and it doesn’t make sense to be doing all the machine learning on the 2-d projections of it on to images. Like many good ideas, this is an obvious one, but a very difficult one.
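The point about 2-d projections losing the 3-d world can be made concrete with a toy pinhole-camera sketch (the focal length and points below are made up for illustration, not anything from the presentation):

```python
# Minimal sketch: why image-space loses information. A pinhole camera
# maps 3-d points to 2-d pixels, and the projection is many-to-one, so
# distinct 3-d points can land on the same pixel. Predicting directly
# in vector (3-d) space avoids reasoning about this lost dimension.
# The focal length `f` and the points are illustrative assumptions.

def project(point3d, f=1000.0):
    """Project a 3-d camera-frame point (x, y, z) to a pixel (u, v)."""
    x, y, z = point3d
    return (f * x / z, f * y / z)

near = (1.0, 0.5, 10.0)   # an object 10 m away
far = (2.0, 1.0, 20.0)    # a different object, twice as far and offset

# Both project to the same pixel: depth is gone in image-space.
print(project(near))  # (100.0, 50.0)
print(project(far))   # (100.0, 50.0)
```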
- Second is a fusion of camera sensor data before the detections (the detections performed by the different heads of the multi-task neural network). For now the fusion is at the multi-scale feature level.
- Again, in retrospect, an obvious but a very difficult engineering step, of doing the detection and the machine learning on all of the sensors combined, as opposed to doing them individually and combining all the decisions.
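A toy sketch of why fusing features before detection beats fusing decisions after it (the "features" and threshold here are stand-ins, not Tesla's architecture):

```python
# Hypothetical sketch of feature-level fusion vs decision-level fusion.
# Per-camera "features" here are just lists of floats; a real system
# would fuse learned multi-scale CNN features.

def fuse_features(per_camera_feats):
    """Early fusion: combine per-camera features into one joint vector."""
    return [x for feats in per_camera_feats for x in feats]

def detect(feats, threshold=1.0):
    """Toy detection 'head': fire if total evidence crosses a threshold."""
    return sum(feats) > threshold

# An object half-visible in two adjacent cameras: neither view alone
# carries enough evidence, but the fused features do.
cam_left, cam_right = [0.6], [0.7]
assert not detect(cam_left) and not detect(cam_right)  # per-camera misses
assert detect(fuse_features([cam_left, cam_right]))    # fused head detects
```

Combining decisions would have missed this object, because each camera's individual detection fell below threshold; combining features first lets the evidence add up.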
- Third is using video context to model not just vector-space, but time. At each frame, concatenating positional encodings, multi-cam features, and ego kinematics, using a pretty cool spatial recurrent neural network architecture that forms a 2-d grid around the car, where each cell of the grid is an RNN (recurrent neural network).
- The other cool aspect of this is that you can then build a map in the space of RNN features, and then do planning in that space, which is a fascinating concept.
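The grid-of-recurrent-cells idea can be sketched very loosely like this; the update rule below is a placeholder (an exponential moving average), not the trained RNN from the talk:

```python
# Illustrative sketch of a spatial grid of recurrent cells: a 2-d grid
# around the ego vehicle where each cell keeps its own hidden state,
# updated only when new sensor features cover that cell. Real cells
# would be learned RNNs; this EMA update is a stand-in.

class SpatialRNNGrid:
    def __init__(self, size, alpha=0.5):
        self.alpha = alpha
        self.state = [[0.0] * size for _ in range(size)]

    def update(self, row, col, feature):
        """Fold a new observation into one cell's hidden state."""
        h = self.state[row][col]
        self.state[row][col] = (1 - self.alpha) * h + self.alpha * feature

grid = SpatialRNNGrid(size=3)
grid.update(1, 1, feature=1.0)   # first pass over this patch of road
grid.update(1, 1, feature=1.0)   # second observation refines the state
print(grid.state[1][1])          # 0.75: the cell accumulates evidence
```

The "map in the space of RNN features" idea is then just this grid of hidden states, which a planner could read from directly.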
- Andrej Karpathy, I think, also mentioned some future improvements, performing the fusion earlier in the neural network. Currently the fusion of space and time are late in the network. Moving the fusion earlier on takes us further toward full, end-to-end driving with multiple modalities, seamlessly fusing – integrating – the multiple sources of sensory data.
- Finally, the place where there is currently – from my understanding – the least amount of utilisation of neural networks is planning. Obviously, optimal planning in action space is intractable, so you have to come up with a bunch of heuristics. You can do those manually, or you can do those through learning. So the idea that was presented was to use neural networks as heuristics, in a similar way that neural networks were used as heuristics in the Monte Carlo tree search of MuZero and AlphaZero to play different games – to play Go, to play chess. This allows you to significantly improve on the search through action space, for a plan that doesn’t get stuck in the local optima and gets pretty close to the global optimum.
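The "neural networks as planning heuristics" idea can be sketched with a toy best-first search, where a plain distance function stands in for a learned value network scoring candidate states (the 1-d problem and all names here are illustrative):

```python
# Hedged sketch of heuristic-guided search: expand whichever candidate
# state the heuristic scores best, instead of exhaustively enumerating
# the action space. In MuZero/AlphaZero-style planning, a learned
# value network plays the role of `heuristic`; here it is a stand-in
# straight-line distance to the goal.
import heapq

def plan(start, goal, actions=(-1, 1), heuristic=None):
    """Best-first search over a toy 1-d state space."""
    heuristic = heuristic or (lambda s: abs(goal - s))
    frontier = [(heuristic(start), start, [])]
    seen = {start}
    while frontier:
        _, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        for a in actions:
            nxt = state + a
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), nxt, path + [a]))
    return None

print(plan(0, 3))  # [1, 1, 1]
```

The heuristic prunes the search: states the "network" scores badly are simply never expanded, which is what makes searching an otherwise intractable action space feasible.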
- I really appreciated that the presentation didn’t dumb anything down, but maybe in all the technical details it was easy to miss just how much brilliant innovation there was here.
- The move to predicting in vector-space is truly brilliant. Of course you can only do that if you have the data, and you have the annotation for it, but just to take that step is already taking a step outside the box of the way things are currently done in computer vision. Then fusing seamlessly across many camera sensors. Incorporating time into the whole thing in a way that’s differentiable with these spatial RNNs. And then of course using that beautiful mess of features, both on the individual image side, and the RNN side, to make plans, using neural network architecture as a heuristic, I mean all of that is just brilliant.
- The other critical part of making all of this work is the data and the data annotation.
- First is the manual labelling. So to make the neural networks that predict in vector-space work, you have to label in vector-space. So you have to create in-house tools, and as it turned out, Tesla hired an in-house team of annotators to use those tools, to perform the labelling in vector-space, and then project it out into image-space. First of all, that saves a lot of work; second of all, that means you’re directly performing the annotation in the space in which you are doing the prediction.
- Obviously, as is always the case with self-supervised learning, auto-labelling is the key to this whole thing. One of the interesting things that was presented was the use of clips of data – that includes video, IMU, GPS, odometry and so on – from multiple vehicles in the same location and time, to generate labels of both the static world and the moving objects and their kinematics. That’s really cool: you have these little clips, these buckets of data from different vehicles, and they’re kind of annotating each other. You’re registering them together to then combine a solid annotation of that particular part of road at a particular time. That’s amazing because the more the fleet grows, the stronger that kind of auto-labelling becomes, and the more edge-cases you are able to catch that way.
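The clips-annotating-each-other idea can be sketched as simple robust aggregation: several vehicles each produce a noisy estimate of the same static landmark, and registering the clips together lets the fleet vote out the noise. The median here is a toy stand-in for the real multi-trip reconstruction, and all the numbers are made up:

```python
# Illustrative sketch of multi-clip auto-labelling: fuse per-clip
# estimates of one static landmark's (x, y) position into one label.
# A real pipeline would jointly register full trajectories; a robust
# per-coordinate median is a minimal stand-in.
from statistics import median

def auto_label(clip_estimates):
    """Fuse noisy per-clip (x, y) estimates into a single label."""
    xs = [p[0] for p in clip_estimates]
    ys = [p[1] for p in clip_estimates]
    return (median(xs), median(ys))

# Four vehicles passed the same spot; the last estimate is an outlier.
clips = [(10.5, 5.0), (9.5, 5.0), (10.0, 5.0), (13.0, 5.0)]
print(auto_label(clips))  # (10.25, 5.0)
```

This is also why the scheme strengthens with fleet size: more clips per location means a tighter, more outlier-resistant consensus label.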
Speaking of edge-cases, that’s what Tesla is using simulation for: to simulate rare edge-cases that are not going to appear often in the data, even when that data set grows incredibly large.
And also, they are using it for annotation of ultra-complex scenes where accurate labelling of real-world data is basically impossible, like a scene with a hundred pedestrians, which is I think the example they used. So I honestly think the innovations on the neural network architecture and the data annotation are really just a big leap.
Then there’s the continued innovation on the autopilot computer side.
- The neural network compiler that optimises latency, and so on.
- There were, I think I remember, really nice testing and debugging tools for variants of candidate trained neural networks to be deployed in the future, where you can compare different neural networks against each other. That’s almost like developer tools for to-be-deployed neural networks.
- And it was mentioned that almost ten thousand GPUs are currently being used to continually retrain the network. I forget what the number was but I think every week or every two weeks the network is fully retrained, end-to-end.
The other really big innovation – though unlike the neural network and the data annotation, this one is in the future, to-be-deployed, still under development – is the Dojo computer, which is used for training.
- So the Autopilot computer is the computer on the car that is doing the inference, and the Dojo computer is the thing that you would have in the data centre, that performs the training of the neural network.
- There’s a – what they’re calling a single training tile – that is nine petaflops (laughing). It’s made up of D1 chips that are built in-house by Tesla. Each chip with super-fast I/O, each tile also with super-fast I/O, so you can basically connect an arbitrary number of these together, each with a power supply and cooling.
- And then I think they connected a million nodes to have a compute centre. I forget what the name is, but it’s 1.1 exaflops. So combined with the fact that this can arbitrarily scale, this is basically contending to be the world’s most powerful neural network computer.
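As a quick sanity check, the two figures quoted above (9 petaflops per training tile, 1.1 exaflops for the full machine) imply a tile count on the order of 120:

```python
# Arithmetic check on the quoted Dojo figures.
tile_flops = 9e15        # 9 petaflops per training tile
cluster_flops = 1.1e18   # 1.1 exaflops for the full compute centre
print(round(cluster_flops / tile_flops))  # 122
```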
- Again, the entire picture that was presented on AI Day was amazing, because the – what would you call it? – the Tesla AI Machine can improve arbitrarily through the iterative data engine process of auto-labelling plus manual labelling of edge-cases – so the labelling stage, plus data collection, re-training, deploying. And again you go back to the data collection, the labelling, re-training, deploying. And you can go through this loop as many times as you want to arbitrarily improve the performance of the network.
I still think nobody knows how difficult the autonomous driving problem is, but I also think this loop does not have a ceiling. I still think there’s a big place for driver sensing, I still think you have to solve the human-robot interaction problem to make the experience more pleasant, but dammit (laughing) this loop of manual and auto-labelling that leads to re-training, that leads to deployment, that goes back to the data collection and the auto-labelling and the manual labelling is incredible.
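The data-engine loop described above can be written schematically; every function body here is a toy stand-in for an entire subsystem (fleet data collection, auto plus manual labelling, end-to-end retraining), and the accuracy numbers are invented for illustration:

```python
# Schematic of the data-engine loop: collect -> label -> retrain ->
# deploy, repeated. The "retrain" rule below (close half the remaining
# gap to perfect each round) is a made-up placeholder dynamic.

def collect_and_label(model_acc):
    """Fleet collects clips; auto-labelling plus manual edge-case labels."""
    return ["labelled clip"] * 100   # placeholder dataset

def retrain(model_acc, labelled_data):
    """Toy end-to-end retrain: close half the remaining gap each round."""
    return model_acc + 0.5 * (1.0 - model_acc)

def data_engine(model_acc, rounds):
    """Run the loop: each deployment feeds the next collection round."""
    for _ in range(rounds):
        data = collect_and_label(model_acc)
        model_acc = retrain(model_acc, data)   # then deploy back to fleet
    return model_acc

print(data_engine(0.5, 3))  # 0.9375
```

The point the loop structure makes is the one in the text: each pass through collection, labelling, retraining, and deployment can be repeated as many times as you want, so improvement is limited by the loop's dynamics rather than by any fixed dataset.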
- Second reason this whole effort is amazing is that Dojo can essentially become an AI training as a service, directly taking on Amazon Web Services and Google Cloud. There’s no reason it needs to be utilised specifically for the Autopilot computer. The simplicity (laughing) of the way they described the deployment of PyTorch across these nodes – you could basically use it for any kind of machine learning problem. Especially one that requires scale.
- Finally the third reason all of this was amazing is that the neural network architecture and data engine pipeline is applicable to much more than just roads and driving. It can be used in the home, in the factory, and by robots of basically any form, as long as it has cameras and actuators, including, yes, the humanoid form.
As someone who loves robotics, the presentation of a humanoid Tesla Bot was truly exciting. Of course, for me personally, the lifelong dream has been to build the mind, the robot, that becomes a friend and companion to humans, not just a servant that performs boring and dangerous tasks. But to me these two problems should, and I think, will be solved in parallel.
The Tesla Bot, if successful, just might solve the latter problem, of perception and movement and object manipulation. And I hope to play a small part in solving the former problem, of human-robot interaction, and yes, friendship. I’m not going to mention love when talking about robots. Either way, all this to me paints an exciting future. Thanks for watching. Hope to see you next time.