Fully supervised learning for computer vision tasks: rare semantic classes

Just wanted to jot this down while I still remember it. Here’s an important clarification about fully supervised learning for computer vision tasks (one of the five pillars of Tesla’s large-scale fleet learning approach). With object detection, for example, fully supervised learning means that human annotators label images or videos by hand.
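To make that concrete, here's a rough illustration of what a single hand-labelled training example for object detection might look like. The field names and values are invented for illustration only, not any company's real annotation schema:

```python
# Invented illustration of one hand-labelled object-detection example.
# Field names and values are placeholders, not any company's real schema.
example = {
    "image_id": "cam_front_000123",
    "boxes": [
        # (x_min, y_min, x_max, y_max) in pixels, plus the semantic class
        {"bbox": (412, 188, 655, 340), "class": "car"},
        {"bbox": (710, 205, 790, 320), "class": "deer"},  # a rarer class
    ],
}
```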

The clarification: when we’re talking about training data in this context, I think we should distinguish between total training examples (i.e. hand-labelled images or video clips) and unique training examples per semantic class. (A semantic class is a category like “car”, “truck”, “deer”, “tree”, etc.)

No company is going to get 1,000x more hand-labelled examples of cars than any other company; that would require spending 1,000x more money on labelling. But a company like Tesla, whose fleet drives ~1,000x more miles than its competitors', might get 1,000x more hand-labelled examples of rare semantic classes like bears, moose, overturned cars, fires, and so on. At the very least, we can say: the more miles a fleet drives, the higher the probability that it encounters a member of a rare semantic class, and the more encounters it has, the higher the probability that it captures a camera snapshot of that rare object. So more miles yield more unique training examples per rare semantic class, without spending proportionately more money on labelling.
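As a back-of-envelope sketch of that scaling argument (with completely made-up encounter rates), the expected number of rare-class sightings grows roughly in proportion to miles driven:

```python
# Back-of-envelope sketch: encounters with a rare semantic class scale roughly
# linearly with fleet miles. The encounter rate below is invented for illustration.

def expected_encounters(fleet_miles: float, encounters_per_million_miles: float) -> float:
    """Expected sightings of a rare object class, assuming sightings are
    roughly proportional to miles driven."""
    return fleet_miles / 1_000_000 * encounters_per_million_miles

# Hypothetical rate: a rare class (say, a moose on the road) is seen once
# per 10 million miles on average.
rate = 0.1  # encounters per million miles (assumption)

print(expected_encounters(10_000_000, rate))      # small fleet: ~1 encounter
print(expected_encounters(10_000_000_000, rate))  # 1,000x the miles: ~1,000 encounters
```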

This isn’t about obtaining a much higher number of total training examples. It’s about curating a much better dataset.

In the words of Kyle Vogt, the CTO of Cruise: “The reason we want lots of data and lots of driving is to try to maximize the entropy and diversity of the datasets we have.” (Entropy explained.)
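For what it's worth, the "entropy" Vogt refers to can be measured directly on a dataset's class distribution. A minimal sketch with toy labels (not real data): a dataset dominated by one class has low entropy, while a more evenly spread one scores higher.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of a dataset's class distribution.
    Higher entropy = labels spread more evenly across classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(label_entropy(["car"] * 98 + ["deer", "moose"]))               # ~0.16 bits
print(label_entropy(["car"] * 60 + ["truck"] * 30 + ["deer"] * 10))  # ~1.3 bits
```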

In the words of Andrej Karpathy:

[image: quoted remarks from Andrej Karpathy]
 
The other part of this equation is the problem of being able to ask the fleet for such examples in the first place. If the neural net can't yet make accurate predictions about whether a frame contains a bear, how can you ask the fleet to send you examples of bears? Tesla doesn't collect every single image captured by every car, so tooling that distinguishes what is noise from what is important to train on is extremely important.

This is presumably where Tesla also has a huge advantage in unsupervised learning. Karpathy only briefly mentions it, but he basically says that they are able to build clustering models of similar images and then send a request out to the fleet for matching examples to be returned. This is not an insignificant amount of effort, and it's something other automakers haven't even started.
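Here's a minimal sketch of that idea as I understand it. This is not Tesla's actual pipeline; the encoder is just a toy stand-in (a colour histogram) for a real learned image embedding, and the threshold is made up:

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Toy stand-in for a learned image encoder: a normalised colour histogram.
    A real system would use features from a trained network instead."""
    hist, _ = np.histogram(image, bins=32, range=(0, 256))
    vec = hist.astype(float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def make_fleet_query(seed_images, threshold=0.9):
    """Build an upload trigger from a few hand-picked seed images:
    a car uploads a frame if its embedding is close to the seed centroid."""
    centroid = np.mean([embed(img) for img in seed_images], axis=0)
    centroid /= np.linalg.norm(centroid)

    def should_upload(frame: np.ndarray) -> bool:
        # Cosine similarity, since both vectors are unit length.
        return float(embed(frame) @ centroid) > threshold

    return should_upload

# Usage with synthetic arrays standing in for camera frames:
rng = np.random.default_rng(0)
seeds = [rng.integers(0, 256, size=(64, 64, 3)) for _ in range(5)]
should_upload = make_fleet_query(seeds)
print(should_upload(rng.integers(0, 256, size=(64, 64, 3))))
```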

This will also be critical to the Dojo program and "operation vacation". That virtuous cycle would let Tesla surgically pick out training examples from the fleet and then automate the entire process of training, evaluation, deployment, and iteration.
 

Great points. Some other ideas for when camera snapshots might be uploaded:
  • Human interventions (e.g. Autopilot disengagements, when a human stops Smart Summon)
  • When the planner in Autopilot is running passively and outputs a low probability for the trajectory the human is taking (i.e. it is “surprised by” or “disagrees with” the human's driving)
  • Automatic object discovery
  • Novelty detection
  • Uncertainty estimation
  • Manually designed triggers, e.g. upload a snapshot when the steering wheel turns by more than X degrees within Y seconds (which might capture a human swerving for an obstacle in the road); a rough sketch of this kind of trigger follows below
Anything I didn't think of?
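As a concrete illustration of the last idea on that list, here's a rough sketch of a manually designed steering trigger. The thresholds and sample values are invented:

```python
# Rough sketch of a manually designed snapshot trigger: fire when the steering
# wheel turns more than X degrees within Y seconds. Thresholds are invented.
from collections import deque

class SwerveTrigger:
    def __init__(self, max_delta_deg: float = 90.0, window_s: float = 1.0):
        self.max_delta_deg = max_delta_deg
        self.window_s = window_s
        self.history = deque()  # (timestamp, steering_angle_deg)

    def update(self, t: float, angle_deg: float) -> bool:
        """Feed one steering-angle sample; return True if a snapshot should be taken."""
        self.history.append((t, angle_deg))
        # Drop samples older than the time window.
        while self.history and t - self.history[0][0] > self.window_s:
            self.history.popleft()
        angles = [a for _, a in self.history]
        return max(angles) - min(angles) > self.max_delta_deg

# Usage with made-up samples: a sharp swerve at t = 2.0 s trips the trigger.
trigger = SwerveTrigger()
samples = [(0.0, 0.0), (0.5, 2.0), (1.0, -1.0), (1.5, 0.0), (2.0, 120.0)]
for t, angle in samples:
    if trigger.update(t, angle):
        print(f"snapshot at t={t:.1f}s")
```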

Everyone's probably already familiar with this from Autonomy Day, but here's the clip of Karpathy talking about fleet learning for object detection: