Elon tweeted that it will take another year before DOJO V1.0 is ready. I thought that project was essential for Software 2.0 and FSD. Software 2.0 will be released in about 10 weeks (per Elon's tweet), so before DOJO is operational. I thought I had a grasp of everything, but apparently not. Can someone please explain what DOJO has to do with FSD, and why FSD will be released before DOJO is operational?
Supervised Learning
The "conventional" machine learning technique for prediction is
supervised learning. For the perception task, this would be:
1) Feed in a whole bunch of video/image sequences from many sources.
2) Manually label all the objects in the sequences you want to track.
3) Build an architecture that takes in all the sequences and makes a prediction on all the objects.
4) Compare those predictions with the actual labels. Use the error between prediction and truth to adjust the weights in the neural net (backpropagation), with the goal of reducing the error on the next round.
5) When the error rate stops decreasing, you stop training, and you have your model.
6) Take the model and stick it into HW 3.0, where it does the inference step only (making predictions).
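To make that loop concrete, here is a minimal toy sketch in PyTorch (tiny model, random data; purely illustrative, not Tesla's actual stack):

```python
import torch
import torch.nn as nn

# Toy supervised loop mirroring steps 1-6 above.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))  # 10 object classes
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(32, 3, 64, 64)     # step 1: input data (toy stand-in)
labels = torch.randint(0, 10, (32,))    # step 2: the manual labels
for epoch in range(100):                # steps 3-5: train until error plateaus
    preds = model(images)               # step 3: predictions
    loss = loss_fn(preds, labels)       # step 4: error vs. truth
    opt.zero_grad()
    loss.backward()                     # step 4: backpropagation
    opt.step()
model.eval()                            # step 6: ship frozen weights, inference only
```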
This process can easily be rate-limited. If your inference hardware relatively sucks (ahem, HW 2.0/2.5), then you can't even run a model big enough to take in image sequences, only single images at a time (and at downsampled resolution, at that). So the model architecture you can use is limited.
So now Tesla has HW 3.0 running. Great, so what's the next rate-limiter? It could be training compute capacity, but I haven't read anywhere that this is the case. The big limiter is manually labeling all the data. This literally means hiring people to go through data sequences and label objects (e.g. drawing the ROI (region of interest) around each one and selecting what type of object it is). Imagine how much data Tesla can pull in, but then it all has to be labeled! That just does not scale.
It sounds like Tesla was labeling each image individually. For a key event spanning a few seconds, they had to keep labeling each new image. Obviously this is quite monotonous and takes a lot of time.
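To make the tedium concrete, here is a hypothetical per-frame annotation structure (the field names are my invention, not any real labeling schema):

```python
# Hypothetical per-frame labeling schema (my invention, not Tesla's format).
# Every frame needs its own hand-drawn box, even for the same car.
per_frame_labels = [
    {"frame": 0, "boxes": [{"cls": "car", "roi": (412, 230, 96, 54)}]},
    {"frame": 1, "boxes": [{"cls": "car", "roi": (415, 231, 96, 54)}]},
    {"frame": 2, "boxes": [{"cls": "car", "roi": (418, 232, 96, 54)}]},
    # ...and so on for every frame of the clip
]
```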
With "4D" psuedo lidar, I am guessing they are using one NNet to output some processed version of the video sequence (psuedo lidar) so that objects have
permanence (it is known that an objects in one frame is the same object that was there earlier). So now labelers can label objects just once per video sequence.
This will probably 10x the speed of labeling data. So now Tesla has updated the architecture (#3) to input video sequences instead of single images, and they can have more labeled data more quickly to train it. So the accuracy of the models should improve a good mount. Great!
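Here is a rough sketch of what "label once per clip" could look like, assuming the tracker already assigns stable track IDs across frames (the function and data shapes are my assumptions):

```python
# Hypothetical sketch: with object permanence, a human labels each track
# once, and the tracker's stable IDs propagate that label to every frame.
def propagate_labels(track_labels, tracks_per_frame):
    """track_labels: {track_id: class_name}, one human label per clip.
    tracks_per_frame: {frame_idx: {track_id: roi}} from the tracker."""
    per_frame = {}
    for frame_idx, tracks in tracks_per_frame.items():
        per_frame[frame_idx] = [
            {"cls": track_labels[tid], "roi": roi}
            for tid, roi in tracks.items()
            if tid in track_labels
        ]
    return per_frame

# One label ("car" on track 7) now covers every frame the tracker saw it in.
labeled = propagate_labels(
    {7: "car"},
    {0: {7: (412, 230, 96, 54)}, 1: {7: (415, 231, 96, 54)}},
)
```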
But that is not a long-term solution. With more and more cars on the road, Tesla's labeling needs are not going to be met by people labeling manually. It does not scale.
Semi-supervised Learning
Instead, Tesla wants to attempt semi-supervised / self-supervised learning. This means Tesla uses the data itself to create the labels.
Want to avoid having people label data? Just use drivers' actual driving behavior. You could have the same video sequences as input, and the output of the entire model architecture is what acceleration and steering the car should apply. The labels are the actual driver's acceleration / velocity and steering, and Tesla has those. There is no manual labeling needed!
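A minimal sketch of that idea in PyTorch, where the logged driver controls serve as the training targets (the architecture and shapes are placeholders I made up):

```python
import torch
import torch.nn as nn

# Hypothetical end-to-end model: video clip in, control commands out.
# The "labels" are just what the human driver actually did.
class EndToEnd(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a real video net
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(16, 2)    # -> [acceleration, steering]

    def forward(self, clip):            # clip: (batch, 3, frames, H, W)
        return self.head(self.backbone(clip))

model = EndToEnd()
opt = torch.optim.Adam(model.parameters())
clip = torch.randn(4, 3, 8, 64, 96)     # toy batch of video clips
driver_controls = torch.randn(4, 2)     # logged accel/steering = the labels
loss = nn.functional.mse_loss(model(clip), driver_controls)
opt.zero_grad()
loss.backward()                         # backprop, no human labels needed
opt.step()
```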
This is what is known as end-to-end deep learning. Note I speculated that Tesla would eventually go down this road three years ago:
[Speculation] Tesla going to use End-to-End Deep Learning to Control Vehicles
The issue is that the model is essentially a combination of modules (perception, planning, etc...) and there is no direct training on the specific modules. There are no image labels, no curb / intersection labels, just the final output.
So not only are the models big, but the training is difficult, because learning all the essential components indirectly from driving behavior is less efficient. On the other hand, there is much more data for it.
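A toy illustration of that point: two separate modules, but the loss only touches the final control output, so the perception module is trained purely indirectly (again, entirely made-up shapes):

```python
import torch
import torch.nn as nn

# Separate perception and planning modules, but the only training signal
# is the final control output. No labels ever touch the perception output;
# its gradient arrives indirectly by flowing back through planning.
perception = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
planning = nn.Linear(128, 2)            # -> [acceleration, steering]

frames = torch.randn(8, 3, 64, 64)
driver_controls = torch.randn(8, 2)     # self-labels from driving logs
features = perception(frames)           # intermediate output: never labeled
controls = planning(features)           # only this output gets a loss
loss = nn.functional.mse_loss(controls, driver_controls)
loss.backward()                         # gradients reach perception indirectly
```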
So Dojo is needed. Training such a complex system on all that self-labeled data (which Tesla has in spades) requires massive compute, and that is what Dojo provides.
I don't know of a bigger training system in the world, or one that would need to ingest so much data.
Opinions
Will it work? I think so, long term. If Tesla had Dojo running today, I don't think they would be ready to make it work well. I really think Tesla needs to have all the modules of the eventual end-to-end system in a very good state first (which they can do via supervised learning). Then Dojo can leverage those "good" models and make them great. If you literally started running the self-supervised model from scratch (literally random weights for all the neural nets), I think its ability to converge on a great solution would be compromised.
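Continuing the toy example above, that warm start could look like this (hypothetical; it just illustrates initializing from supervised-trained modules instead of random weights):

```python
import torch
import torch.nn as nn

# Hypothetical warm start: initialize the end-to-end system from modules
# already trained with supervised learning, instead of random weights.
pretrained_perception = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
pretrained_planning = nn.Linear(128, 2)
# (pretend these were trained on the manually labeled data first)

perception = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
planning = nn.Linear(128, 2)
perception.load_state_dict(pretrained_perception.state_dict())  # warm start
planning.load_state_dict(pretrained_planning.state_dict())

opt = torch.optim.Adam(
    list(perception.parameters()) + list(planning.parameters()),
    lr=1e-5,  # gentle fine-tuning so the "good" modules aren't destroyed
)
```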
I think this is a great approach long term. I think Tesla can train a model that works as well as the cameras will allow.
Now, I have no idea what the final size of such a model will be, or what hardware will be required to run inference on it. It may be more than HW 3.0 can handle. I'm not sure Tesla knows either.