At Autonomy Day, Elon Musk said: “The car is an inference-optimized computer. We do have a major program at Tesla — which we don’t have enough time to talk about today — called Dojo. That’s a super powerful training computer. The goal of Dojo will be to be able to take in vast amounts of data — at a video level — and do unsupervised [a.k.a. self-supervised] massive training of vast amounts of video with the Dojo computer. But that’s for another day.”

Dojo is for self-supervised learning on video and, therefore, presumably for computer vision. DeepMind recently demonstrated that you can get better performance on image recognition with 2x to 5x fewer hand-labelled training images when you do self-supervised pre-training beforehand.

Yann LeCun also recently shared his prediction for self-supervised learning on video in 2020. LeCun helped pioneer the field of deep learning and won a Turing Award for it. He's also a computer science professor at NYU and Chief AI Scientist at Facebook. LeCun wrote:

“This suggests that the way forward in AI is what I call self-supervised learning. It’s similar to supervised learning, but instead of training the system to map data examples to a classification, we mask some examples and ask the machine to predict the missing pieces. For instance, we might mask some frames of a video and train the machine to fill in the blanks based on the remaining frames.”

Here's a short clip of LeCun explaining this idea; the full talk is worth watching. LeCun continues:

“This approach has been extremely successful lately in natural language understanding. Models such as BERT, RoBERTa, XLNet, and XLM are trained in a self-supervised manner to predict words missing from a text. Such systems hold records in all the major natural language benchmarks. In 2020, I expect self-supervised methods to learn features of video and images. Could there be a similar revolution in high-dimensional continuous data like video?
One critical challenge is dealing with uncertainty. Models like BERT can’t tell if a missing word in a sentence is “cat” or “dog,” but they can produce a probability distribution vector. We don’t have a good model of probability distributions for images or video frames. But recent research is coming so close that we’re likely to find it soon. Suddenly we’ll get really good performance predicting actions in videos with very few training samples, where it wasn’t possible before. That would make the coming year a very exciting time in AI.”

Now I feel like I understand the purpose of Dojo. There is a research hurdle to clear that Dojo won't solve: representing uncertainty in video prediction. But if it is solved in 2020 as LeCun predicts, then the main constraints on self-supervised pre-training will be data and compute. Tesla has access to plenty of video data, which is cheap to record, upload, and store. It can also use active learning to select which video clips to upload. Dojo, then, is intended to provide 10x more useful training compute at a lower cost.

There's also the constraint of inference compute. Can HW3 run big enough neural networks in real time? I guess we'll see. DeepScale is supposed to squeeze neural networks down to fit on HW3, and HW4, said to be 3x more powerful, is already in the works.

Maybe with the breakthrough in self-supervised video prediction that LeCun anticipates, adopted by Tesla and accelerated by Dojo, we'll see an order of magnitude increase in Tesla's computer vision performance.
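LeCun's fill-in-the-blanks idea can be sketched in a few lines. Everything below is illustrative, not Tesla's or LeCun's actual setup: a toy "video" of feature vectors stands in for real frames, and a least-squares linear model stands in for a deep network. The pretext task is the same shape, though: mask a frame, train a model to reconstruct it from its neighbors, and no hand labels are needed because the video itself supplies the targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": T frames, each a D-dimensional feature vector (a stand-in
# for real pixels). A cumulative sum of small steps makes it smooth in time.
T, D = 100, 8
video = np.cumsum(rng.normal(size=(T, D)) * 0.1, axis=0)

# Self-supervised pretext task: mask frame t, predict it from frames
# t-1 and t+1. The masked frame itself is the training target, so the
# labels come for free from the raw video.
X = np.hstack([video[:-2], video[2:]])   # (T-2, 2D) surrounding context
y = video[1:-1]                          # (T-2, D)  the masked frames
W, *_ = np.linalg.lstsq(X, y, rcond=None)

pred = X @ W
mse = np.mean((pred - y) ** 2)

# Naive baseline: just copy the previous frame forward.
baseline = np.mean((video[:-2] - y) ** 2)
```

On this toy data the learned predictor beats the copy-the-previous-frame baseline because averaging the two neighboring frames cancels part of the per-step noise; the real research hurdle LeCun describes is that a deterministic predictor like this one cannot represent a *distribution* over plausible next frames.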
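Tesla hasn't published how its fleet decides which clips to upload, but a common active-learning heuristic is uncertainty sampling: prioritize the clips the on-board network is least confident about. A minimal sketch, with hypothetical per-clip class probabilities standing in for the car's real outputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical softmax outputs of an on-board network: 1000 candidate
# clips, 5 classes each. Dirichlet samples are just stand-in data.
probs = rng.dirichlet(alpha=np.ones(5), size=1000)

# Uncertainty sampling via predictive entropy: a uniform distribution
# (model has no idea) scores high, a confident spike scores low.
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

# Upload only the k most uncertain clips instead of everything.
k = 10
upload_idx = np.argsort(entropy)[-k:]
```

The design point is bandwidth: instead of uploading all fleet video, the car spends its upload budget on the clips most likely to improve the model, which is what makes "plenty of video data" cheap in practice.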