Just wanted to jot this down while I still remember it. Here’s an important clarification about fully supervised learning for computer vision tasks (one of the five pillars of Tesla’s large-scale fleet learning approach). With object detection, for example, fully supervised learning means that human annotators label images or videos by hand.
The clarification: when we’re talking about training data in this context, I think we should distinguish between total training examples (i.e. hand-labelled images or video clips) and unique training examples per semantic class. (A semantic class is a category like “car”, “truck”, “deer”, “tree”, etc.)
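To make the distinction concrete, here is a minimal sketch in Python of the two counts: total hand-labelled examples versus examples per semantic class. The annotation format and the data are hypothetical, just for illustration, not any particular company's pipeline.

```python
from collections import Counter

# Hypothetical labelled dataset: one (image_id, semantic_class) pair per annotation.
annotations = [
    ("img_001", "car"), ("img_002", "car"), ("img_003", "truck"),
    ("img_004", "car"), ("img_005", "deer"), ("img_006", "moose"),
]

total_examples = len(annotations)                   # total hand-labelled examples
per_class = Counter(cls for _, cls in annotations)  # examples per semantic class

print(f"total training examples: {total_examples}")
for cls, n in per_class.most_common():
    print(f"{cls:>6}: {n}")
# Common classes ("car") dominate the total count; rare classes ("moose")
# may have only a handful of examples, and that per-class count is the
# quantity that matters here.
```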
No company is going to get 1,000x more hand-labelled examples of cars than any other company. That would require spending 1,000x more money on labelling. But a company like Tesla that drives ~1,000x more miles than its competitors might get 1,000x more hand-labelled examples for rare semantic classes like bears, moose, overturned cars, fires, and so on. At least we can say: the more miles a fleet drives, the higher the probability that it will encounter a member of a rare semantic class. The more encounters it has, the higher the probability that it will take a camera snapshot of that rare object. So, more miles will result in more unique training examples for rare semantic classes, without spending proportionately more money on labelling.
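One way to make that hedged claim quantitative: if encounters with a rare object are modelled as a Poisson process with, on average, one encounter per r miles, then a fleet driving m miles expects m/r encounters and sees at least one with probability 1 - e^(-m/r), which grows with miles driven. Here is a minimal sketch under that assumption; the rate of one sighting per 10 million miles is made up for illustration.

```python
import math

def p_at_least_one_encounter(miles: float, miles_per_encounter: float) -> float:
    """Probability of at least one encounter with a rare object,
    modelling encounters as a Poisson process (an illustrative assumption)."""
    return 1 - math.exp(-miles / miles_per_encounter)

# Illustrative rate only: suppose an overturned car is seen once per
# 10 million miles of driving, on average.
for fleet_miles in (1e6, 1e7, 1e9):
    p = p_at_least_one_encounter(fleet_miles, 1e7)
    print(f"{fleet_miles:>15,.0f} miles -> P(>=1 encounter) = {p:.3f}")
```

A fleet driving 1,000x more miles doesn't just raise the probability of one sighting; its expected number of sightings (m/r) is 1,000x higher, which is what feeds the per-class example counts above.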
This isn’t about obtaining a much higher number of total training examples. It’s about curating a much better dataset.
In the words of Kyle Vogt, the CTO of Cruise: “The reason we want lots of data and lots of driving is to try to maximize the entropy and diversity of the datasets we have.” (Entropy explained.)
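Entropy here can be read literally: the Shannon entropy of a dataset's class distribution is higher when the dataset is more diverse. A minimal sketch, with hypothetical class counts:

```python
import math
from collections import Counter

def entropy_bits(labels: list[str]) -> float:
    """Shannon entropy (in bits) of a dataset's class distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A dataset dominated by one common class vs. a more diverse one.
skewed  = ["car"] * 98 + ["deer", "moose"]
diverse = ["car"] * 60 + ["truck"] * 20 + ["deer"] * 10 + ["moose"] * 10

print(f"skewed : {entropy_bits(skewed):.2f} bits")   # ~0.16 bits
print(f"diverse: {entropy_bits(diverse):.2f} bits")  # ~1.57 bits
```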
In the words of Andrej Karpathy: