jimmy_d described his theory about Dojo
in this post:
“This is an opinion, but I’m going to state it emphatically because I’m pretty confident in it.
Dojo isn’t going to be a training computer deployed into car, it’s going to be training infrastructure that is optimized to perform unsupervised learning from video at scale. Tesla is probably going to produce custom silicon to enable this because available commercial hardware is inadequate to the task, but it should be doable with a level of silicon development effort comparable to what it took to create Tesla’s FSD chip.”
This theory lines up with what Elon and Karpathy have said, and it's the explanation that makes the most sense to me. The purpose of Dojo is likely to speed up
self-supervised learning (a.k.a. unsupervised learning) for computer vision tasks. Whereas
supervised learning uses deliberate human labels as the supervisory signal or training signal, in self-supervised learning, the supervisory signal comes from the data itself. This
blog post has a bunch of examples.
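To make the distinction concrete, here is a toy sketch of my own (not Tesla's pipeline, and with fake data) showing where the labels come from in self-supervised learning: each video frame's "label" is simply the frame that follows it, so no human annotation is involved.

```python
import numpy as np

# Toy illustration: in self-supervised learning, the training pairs are
# derived from the data itself. Here each frame's "label" is the frame
# that follows it -- no human annotator in the loop.
rng = np.random.default_rng(0)
video = rng.random((10, 64, 64, 3))  # 10 fake frames of 64x64 RGB

inputs = video[:-1]    # frames 0..8 are the model inputs
targets = video[1:]    # frames 1..9 are the supervisory signal
```

The same construction scales to any amount of video: the labels are free, which is exactly why "stupid large" datasets become feasible.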
This is what Elon said at Autonomy Day (at
2:26:42):
“The car is an inference-optimized computer. We do have a major program at Tesla — which we don’t have enough time to talk about today — called Dojo. That’s a super powerful training computer. The goal of Dojo will be to be able to take in vast amounts of data — at a video level — and do unsupervised massive training of vast amounts of video with the Dojo computer. But that’s for another day.”
In July, Karpathy had
a tweet — not directly related to Tesla, but still suggestive — on self-supervised learning:
“(The “correct” area of research to watch closely is stupid large self-supervised learning or anything that finetunes on/distills from that. Other “shortcut” solutions prevalent today, while useful, are evolutionary dead ends)”
Tesla already uses self-supervised learning to
predict the behaviour of road users. On Autonomy Day, Karpathy said Tesla has explored a self-supervised technique for at least one computer vision task:
depth mapping.
For many tasks, I would imagine the goal with self-supervised learning would be to supplement or bootstrap supervised learning rather than replace it entirely. In other words, the goal would be to allow supervised learning to get better results with the same amount of manually labelled training data.
Here's one way I think that could work. (I'm new to this topic so I could be getting it wrong.) The neural network trains on a proxy task or “pretext task” like predicting/generating future frames of video from previous frames of video. In so doing, the network learns latent or implicit representations or concepts of objects like vehicles, pedestrians, cyclists, lane lines, road edges, curbs, traffic lights, stop signs, and so on. When it comes time to do supervised learning with manually labelled video frames, the neural network learns these object categories faster and better because it already has rich concepts of them from its self-supervised training.
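That pretrain-then-finetune idea can be sketched numerically. This is my own toy linear stand-in (tiny dimensions, a random-walk sequence standing in for "video"), not Tesla's architecture: an encoder is first trained on the next-frame pretext task, then its learned representations are reused by a small supervised head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in: flatten frames to vectors and learn a linear encoder.
# Frames form a smooth random walk so the next frame really is
# predictable from the current one.
d, h, T = 16, 4, 300
frames = np.cumsum(rng.normal(0, 0.1, (T, d)), axis=0)

W_enc = rng.normal(0, 0.1, (d, h))   # shared encoder
W_dec = rng.normal(0, 0.1, (h, d))   # pretext head: predict the next frame

def pretext_mse():
    z = frames[:-1] @ W_enc
    return float(np.mean((z @ W_dec - frames[1:]) ** 2))

loss_before = pretext_mse()

# --- Self-supervised pretraining: the labels come from the video itself ---
lr = 0.01
for _ in range(2000):
    x, y = frames[:-1], frames[1:]
    z = x @ W_enc
    err = z @ W_dec - y
    W_dec -= lr * z.T @ err / len(x)
    W_enc -= lr * x.T @ (err @ W_dec.T) / len(x)

loss_after = pretext_mse()

# --- Supervised fine-tuning: a small labelled task reuses the encoder ---
labels = (frames[:, 0] > np.median(frames[:, 0])).astype(float)  # toy labels
w_head = np.zeros(h)
for _ in range(2000):
    z = frames @ W_enc                 # features learned without any labels
    p = 1.0 / (1.0 + np.exp(-(z @ w_head)))
    w_head -= lr * z.T @ (p - labels) / len(frames)
```

The supervised stage touches only the small head `w_head`; the encoder's "concepts" were paid for by unlabelled data, which is the bootstrapping effect described above.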
Here's a great talk from deep learning pioneer and Turing Award winner
Yann LeCun on self-supervised learning:
Weakly supervised learning is another approach that would allow Tesla to train neural networks on computer vision tasks without manually labelling data. (I recently wrote about weakly supervised learning
in this article.) In an autonomous driving context, weakly supervised learning uses human driving behaviour as a source of automatic labels for camera data. This approach has been shown to work well for
semantic segmentation of free space. Weakly supervised learning is also what Tesla has been using to
predict the curvature and gradient of roadways.
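As a concrete sketch of this labelling scheme (entirely hypothetical on my part, not Tesla's implementation): the cells of a bird's-eye-view grid that the car physically drove through can be marked as free space automatically, with the driven path, which is recorded anyway, serving as the label source.

```python
import numpy as np

# Hypothetical weak supervision from driving behaviour: cells the car
# drove through are automatically labelled free space (1); everything
# else stays unlabelled (0). No human draws a single polygon.
grid = np.zeros((20, 20), dtype=np.int8)

# A made-up driven trajectory in grid coordinates (row, col).
path = [(19, 10), (17, 10), (15, 11), (13, 11), (11, 12), (9, 12), (7, 13)]
for r, c in path:
    # Mark a 3x3 "car footprint" around each trajectory point as free.
    grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2] = 1
```

These automatic labels are sparse and one-sided (they never say where free space is *not*), which is part of why this counts as weak rather than full supervision.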
I would venture to speculate that, as with self-supervised learning, fully exploiting weakly supervised learning requires Tesla to train neural networks on “stupid large” video datasets, and that's hardware-intensive.
I believe self-supervised learning and weakly supervised learning for computer vision are two pillars of Tesla's
large-scale fleet learning approach.