
Tesla “can gather 1 billion miles of data per year”

That quote comes from Worm Capital:

According to Tesla, the company believes it can gather 1 billion miles of data per year from current drivers.

I sent them a message asking if Tesla elaborated on how much data is actually uploaded, and what kind of data is uploaded (e.g. camera vs. GPS).

If Tesla is going to be uploading 1 billion miles’ worth of driving videos per year, that is truly momentous from a computer vision perspective.
 
How many employees would Tesla need to manually annotate 1 billion miles of video per year?

-1 billion miles per year

-0.0088888 miles travelled on average per second of video (32 mph)

-112.501 billion seconds of video

-5 seconds on average to annotate one second of video

-One employee works for 1,645 work hours per year (35 hours per week for 47 weeks), or 98,700 minutes per year

-112.501 billion seconds of video × 5 seconds of labour per second = 562.5 billion seconds, or 9.375 billion minutes of labour

-9.375 billion minutes / 98,700 minutes per year = 94,985 employees

So, using these assumptions, Tesla would need 95,000 employees (triple its current workforce) to annotate 1 billion miles of video annually.

If Tesla were to outsource the annotating to developing world workers earning $2/hour, it would cost $313 million per year for 156 million hours (9.375 billion minutes) of labour. Hmm. That is actually not that much money. Tesla could probably afford to spend several times that amount.

Alternatively, Tesla could have much more efficient ways of annotating video, such that it takes less than 5 seconds on average to annotate each second of footage. For example, a labeller might watch automatically annotated video at 4x speed and only stop to manually annotate something when they spot an error. Depending on the software’s error rate and the time it takes to fix errors, that could bring the time down to less than 1 second of labour per second of video.

At 0.5 seconds of labour per second of video, the number of labelling employees needed drops to around 9,500. At $15/hour, their salaries would cost roughly $234 million per year. Again, an affordable amount.
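For anyone who wants to play with the assumptions, here is a minimal Python sketch of the same back-of-the-envelope calculation. The numbers are just the ones above; the function name and structure are mine.

```python
# Back-of-the-envelope annotation estimate, using the assumptions above:
# 1 billion miles/year, 32 mph average speed, 1,645 work hours per employee per year.
MILES_PER_YEAR = 1e9
AVG_MPH = 32
WORK_HOURS_PER_YEAR = 35 * 47  # 1,645 hours

video_hours = MILES_PER_YEAR / AVG_MPH  # ~31.25 million hours of footage

def annotation_estimate(labour_sec_per_video_sec, wage_per_hour):
    """Employees needed and annual labour cost for a given annotation speed and wage."""
    labour_hours = video_hours * labour_sec_per_video_sec
    employees = labour_hours / WORK_HOURS_PER_YEAR
    return round(employees), labour_hours * wage_per_hour

print(annotation_estimate(5.0, 2))    # (94985, ~$312.5M): full manual annotation at $2/hr
print(annotation_estimate(0.5, 15))   # (9498, ~$234.4M): assisted annotation at $15/hr
```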
 
So, using these assumptions, Tesla would need 95,000 employees (triple its current workforce) to annotate 1 billion miles of video annually.
Something's way off with your numbers: since there are 200,000 Teslas in the wild, you're saying it would take half a person to track each one, even going at 5x real time.

Let's go at it differently: a billion miles at 32 mph is about 31 million hours of footage. If one employee can go at 5x real time and works 2,000 hours a year, that's 10,000 hours of footage per employee. That means roughly 3,000 employees.

My question is: why have humans annotate all the footage? Why not just look for "incidents" (close encounters, swerves, emergency braking, etc.) and work backwards from there? You can, for instance, look for spots along the road that have had repeated incidents. There are lots of scenarios that don't require humans to analyze the footage.
 
Another interesting tidbit from a recent Electrek tweet: Teslas are averaging almost 20 million miles per day. That’s over 7 billion a year, which puts the 1 billion miles per year of road data into a little bit of perspective.

In terms of just Hardware 2 cars: there are roughly 200,000. Each one drives 32 miles per day on average, and 8 miles per day on Autopilot. So that’s 6.4 million miles per day total (an annual run rate of 2.3 billion), and 1.6 million miles per day on Autopilot (a run rate of 584 million).

The number of HW2 cars should be updated quarterly, because Model 3 deliveries are still ramping and that is going to make a huge difference to the annual run rates. You can also toy around with spreadsheets and extrapolate into the future.
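Here is a rough Python sketch of that run-rate math using the figures above; the "next quarter" fleet size is a made-up placeholder to show the extrapolation, not actual delivery data.

```python
# Rough HW2 fleet run-rate math, using the figures quoted above.
hw2_cars = 200_000
miles_per_car_per_day = 32
autopilot_miles_per_car_per_day = 8

total_annual = hw2_cars * miles_per_car_per_day * 365               # ~2.3 billion miles/year
autopilot_annual = hw2_cars * autopilot_miles_per_car_per_day * 365  # ~584 million miles/year
print(total_annual / 1e9, autopilot_annual / 1e6)

# Hypothetical extrapolation as the Model 3 ramp adds cars next quarter:
hw2_cars_next_quarter = hw2_cars + 75_000  # assumed, for illustration only
print(hw2_cars_next_quarter * miles_per_car_per_day * 365 / 1e9)     # ~3.2 billion miles/year
```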
 
If one employee can go at 5x real time and works 2,000 hours a year, that's 10,000 hours of footage per employee. That means roughly 3,000 employees.

I was actually assuming 0.2x real time. So 31.25 million hours of footage would create 156.25 million hours of work divided amongst employees who work 1,645 hours per year. That’s 95,000 employees.

But of course if you assume they can go 2x real time, or 3x, or 5x, or 10x, or 20x, then the amount of work hours and employees needed is commensurately less.

My question is: why have humans annotate all the footage? Why not just look for "incidents" (close encounters, swerves, emergency braking, etc.) and work backwards from there? You can, for instance, look for spots along the road that have had repeated incidents. There are lots of scenarios that don't require humans to analyze the footage.

I would bet they pay close attention to any incidents or noteworthy events like that.

But they also have to get the computer vision neural network(s) up to the level where they can go 200 million miles without making an error that would cause a fatal accident. So I bet there is a lot of labelling of cars, cyclists, pedestrians, barriers, sidewalks, trees, etc. trying to get that error rate down to superhuman levels.
 
You don't have to annotate 1 billion miles for the data to be useful. You just have to annotate disengagements.

For instance, I disengaged about 6 times over 250 miles, so only annotating around disengagements already reduces the human workload by a factor of about 50x. Next, the cameras usually are correct; it's the path-finding logic that's off. As an example: I would disengage to steer toward the center of the lane if the car was drifting toward the outer edge next to a truck. You wouldn't have to annotate "Lane", "Truck", "Sign", etc. in that situation; you might just use an arrow key on the keyboard to shift the optimal lane position to the other side. Then, with enough examples, it will learn to give vehicles extra room when you overtake them if there is nothing on the other side.

I wish, for instance, that while driving we could adjust "trim": we could drive in an "Autopilot training" mode in which Tesla allows small amounts of steering force to nudge the car's path. This could automatically capture a billion miles of extra training data with zero human hours. The neural net could then integrate the average of a billion miles of 'trim' data, which would be nothing more than a -1.0 to +1.0 float value biasing the lane position one way or the other.

Similarly, Tesla gets free data whenever the driver adjusts the set speed. I turn down TACC when approaching a corner so that the car navigates it at the correct speed. That's free data again for Tesla: "When did the user deliberately reduce speed below the speed limit?" It might notice that when the windshield wipers are at setting "4", the average speed is LIMIT - 5 mph.
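Purely as an illustration of how that kind of "free" signal could be aggregated fleet-side, here's a small sketch. The sample fields, the idea of keying by a road-segment ID, and the wiper-setting bucketing are my assumptions, not anything Tesla has described.

```python
# Toy aggregation of hypothetical driver-adjustment signals: average lane "trim"
# per road segment, and average set-speed offset per wiper setting.
from collections import defaultdict
from statistics import mean

# Each sample: (road_segment_id, lane_trim in [-1.0, +1.0], speed_offset_mph, wiper_setting)
samples = [
    ("segment_42", -0.15,  0.0, 0),
    ("segment_42", -0.20,  0.0, 0),
    ("segment_99",  0.00, -5.0, 4),   # driver slowed 5 mph below the limit in heavy rain
    ("segment_99",  0.05, -4.0, 4),
]

trim_by_segment = defaultdict(list)
offset_by_wipers = defaultdict(list)
for segment, trim, offset, wipers in samples:
    trim_by_segment[segment].append(trim)
    offset_by_wipers[wipers].append(offset)

print({s: mean(v) for s, v in trim_by_segment.items()})   # average lane bias per segment
print({w: mean(v) for w, v in offset_by_wipers.items()})  # e.g. wipers at 4 -> roughly -4.5 mph
```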

Supervised learning is essential for training machine-learning models, but driving behavior can be learned unsupervised to a large extent, since the dataset is so massive. Waymo has a few dozen cars on the road; Tesla has a few hundred thousand.

Once you refine the vision algorithms, you can just upload metadata. First person shooters are a good example of this. You don't upload 16 players' 1080p rendered views to the central server, you just stream a few KB of metadata per second. An object's x, y position and category could each fit into 16 bits (48 bits total). Even with 200 'interesting' objects in view (pretty high for an average, I would wager), that's a grand total of 9,600 bits, or 1.2 KB, per frame. Assume 10 Hz: that's about 12 KB/second, or roughly 43 MB per hour of driving. At an hour of driving per day, that's around 1.3 GB per car per month. Even fleet-wide, across a few hundred thousand cars, that's only a few petabytes per year. That's still a sufficiently large, interesting dataset for machine learning, while simultaneously being well within industry storage norms (a handful of Backblaze storage pods), and it would require zero human intervention to collect.
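Here is the same bandwidth estimate spelled out with explicit units; the 48-bit object record, 200 objects, 10 Hz, one driving hour per day, and the ~200,000-car fleet are all assumptions carried over from the discussion above.

```python
# Metadata bandwidth estimate with explicit units.
BITS_PER_OBJECT = 48
OBJECTS_PER_FRAME = 200
FRAME_RATE_HZ = 10
DRIVING_HOURS_PER_DAY = 1
FLEET_SIZE = 200_000

bytes_per_frame = BITS_PER_OBJECT * OBJECTS_PER_FRAME / 8              # 1,200 bytes
bytes_per_second = bytes_per_frame * FRAME_RATE_HZ                     # 12 KB/s
mb_per_hour = bytes_per_second * 3600 / 1e6                            # ~43 MB per driving hour
gb_per_car_per_month = mb_per_hour * DRIVING_HOURS_PER_DAY * 30 / 1e3  # ~1.3 GB
pb_fleet_per_year = gb_per_car_per_month * 12 * FLEET_SIZE / 1e6       # ~3 PB fleet-wide
print(mb_per_hour, gb_per_car_per_month, pb_fleet_per_year)
```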
 
Next, the cameras usually are correct
That's a BIG assumption you have right there, don't you think?

First person shooters are a good example of this. You don't upload 16 players' 1080p rendered views to the central server, you just stream a few KB of metadata per second.
The important difference here is that in an FPS the game is 100% sure about what it sees. Other than that, you are right: you can get by on just the objects if you accept that this data is sometimes garbage (both ways: objects that are fake, and real objects that are not represented). I suspect you can never have 100% vision in the foreseeable future; even humans are fallible.
Additionally, you need more than 48 bits. You need things like speed, like how deep the objects are, and such. And not all objects are created the same: a traffic-signal object, for example, has extra states compared to a "vehicle" object.

Edit: And I want to add that there's a way to relatively cheaply increase accuracy a great deal here - just include the actual camera picture every once in a while.
 
That's a BIG assumption you have right there, don't you think?

No assumption, I'm looking at the AP display of what it thinks it's seeing. It's driving badly with good information in nearly every disengagement I've encountered. ;) If it needs hand annotation, it's probably going to be caught by the in-house team's data collection group, or specifically flagged for collection from the fleet ("Show us things that this NN thinks are buses") in shadow mode.

you need things like speed, like how deep the objects are and such.
Fair, although it wouldn't take much for that data. You're right that the bounding box would be critical for everything, so I'll say 16 bits for the BB. Also, maybe 32 bits for categorization instead of 16 should give you enough GUID space for every object type in the universe. We could add another 16 bits for two 8-bit velocity components, but I assumed in my high-end 200-object tracking example that 90% of those objects aren't moving (e.g. parked) and therefore aren't worth tracking velocity for. A 1-bit flag could signify whether to expect those 16 bits or not, so that's about 3 bits per object on average.

Even with all of those changes, we go from 48 bits per object (roughly 1 GB per vehicle per month) to only about 84 bits per object on average. And that's uncompressed (except for our static/moving flag); we could probably losslessly compress those 84 bits to well under 48. And there probably aren't 200 objects being tracked anyway. So I think what's important is that, with zero supervision, you could get a lot of really useful info in less than 1 GB of data per vehicle per month.
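As a toy illustration of that per-object budget, here's a byte-aligned packing of the record described above (16-bit x, 16-bit y, 16-bit bounding box, 32-bit category, a moving flag, optional 8-bit velocity components). The field layout is purely my invention, and byte alignment pushes it slightly above the ~84-bit estimate before any compression.

```python
# Toy packing of the per-object record sketched above. Byte-aligned (struct can't
# pack sub-byte fields), so a real bit-packed or compressed format would be smaller.
import struct

def pack_object(x, y, bbox, category, vx=None, vy=None):
    moving = vx is not None
    record = struct.pack(">HHHI?", x, y, bbox, category, moving)  # 11 bytes
    if moving:
        record += struct.pack(">bb", vx, vy)                      # +2 bytes of velocity
    return record

parked = pack_object(1000, 2000, 0x1234, 0xDEADBEEF)
moving = pack_object(1000, 2000, 0x1234, 0xDEADBEEF, vx=12, vy=-3)
print(len(parked) * 8, len(moving) * 8)  # 88 and 104 bits, uncompressed
```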
 
No assumption, I'm looking at the AP display of what it thinks it's seeing.
Don't; it's kind of disconnected from reality. Stay tuned for more information later today.

And there probably aren't 200 objects being tracked.
yeah, looks like it's only 48.

But you still vastly underestimate the data requirements, in my view. Even though full video is definitely not required, some still frames are mandatory, or you are totally oblivious to all the false positives and all the missed, undetected objects.
 
Similarly, Tesla gets free data whenever the driver adjusts the set speed. I turn down TACC when approaching a corner so that the car navigates it at the correct speed. That's free data again for Tesla: "When did the user deliberately reduce speed below the speed limit?" It might notice that when the windshield wipers are at setting "4", the average speed is LIMIT - 5 mph.

The problem here is "context". The car has recorded the reduction in speed, but it has to guess why you did that. Was it because of a speed sign it missed? The upcoming corner? Other traffic? Raw data without context is not very useful, assumed context is dangerous.
 
@im.thatoneguy: so, in light of the recent reveal of internal Autopilot workings in Seeing the world in autopilot, part deux,
I made an example video rendering for you that roughly shows the sort of metadata we are talking about. It's reduced: it does not show object types, attributes and a few other things, but you get the rough idea. Assume the attributes are really there; look at the other videos for an idea of what's possible to extract, and there's somewhat more data than that. You can assume fully radar-matched data for every moving tracked object (when available, which is not always).

Even this much data actually uses a lot more space than you estimated, but just looking at that visualization, I hope it's pretty evident that you need at least some "picture frames" to make sense of any of that.

 
I feel like I can understand very clearly what's happening without any picture info at all. I can't navigate (without a GPS map), and it needs to add more metadata and more objects for city driving (stop lights, traffic signs, etc.). But I feel like I'm seeing less data overall than I estimated, not more (except for the "drivable area", and a wavelet encoder could probably do wonders compressing that down to next to nothing).

Also, your 2D representation is underselling what the NN knows, since the car is still seeing in 3D as far as lane markers go, as you mention, and it has the metadata of velocity and distance (partially from radar). It also has supernatural senses like a velocity for each bounding box, so while it looks like the road is just kind of wiggling back and forth, the car of course "knows" that it's traveling at 30 mph. So if anything the illustration is under-representing how much is known.

You could call it dangerously unreliable; I call it unfiltered. Sure, the IDs are jumping around wildly and there is no continuity, but this level of metadata is the perfect starting point for temporal filtering.

The problem here is "context". The car has recorded the reduction in speed, but it has to guess why you did that. Was it because of a speed sign it missed? The upcoming corner? Other traffic? Raw data without context is not very useful, assumed context is dangerous.

I'm definitely not talking about driving through a city here. Tesla needs to solve EAP first and get interstate driving down before moving on to FSD, and their driving logic is terrible even for interstate driving. With a billion-mile dataset I think you could train a very effective NN without any supervision. And if you find outlier cases (disengagements), then you can do a supervised investigation to see whether your metadata is insufficient.

You can narrow down which contexts are important but not accounted for in your dataset by training a NN and testing it against a real-world benchmark almost instantly. That's the dream of data science: instead of worrying about your test dataset being non-representative, Tesla can just push out a shadow NN, see how it performs, and register the discrepancies. If everybody slows down at geohash abc1234 even though the NN says "full speed ahead", you can spend some of your valuable human resources to investigate. Maybe you did miss a sign. Great: your billion miles of unsupervised learning, tested against humans as the benchmark, just automatically identified an area of interest for focused investigation.
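A minimal sketch of that discrepancy check, under the assumption that you log (geohash, human speed, shadow-policy speed) triples; the data layout and the 10 mph review threshold are invented for illustration.

```python
# Flag geohashes where drivers consistently go much slower than the shadow policy would.
from collections import defaultdict
from statistics import mean

# (geohash, human_speed_mph, shadow_policy_speed_mph)
logs = [
    ("abc1234", 38, 55), ("abc1234", 41, 55), ("abc1234", 36, 55),
    ("xyz9876", 54, 55), ("xyz9876", 56, 55),
]

gaps = defaultdict(list)
for geohash, human_speed, policy_speed in logs:
    gaps[geohash].append(policy_speed - human_speed)

REVIEW_THRESHOLD_MPH = 10
flagged = [g for g, diffs in gaps.items() if mean(diffs) > REVIEW_THRESHOLD_MPH]
print(flagged)  # ['abc1234'] -> worth a human look: maybe a missed sign or a bad corner
```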

The question isn't whether or not the vision NN is perfect. It obviously isn't. The question is whether a billion-mile dataset of metadata could be used to improve the driving logic, and I think it could vastly improve Tesla's drive logic, because that's probably all the current logic is trained on, just a smaller dataset.

It would naturally capture many of the things humans do. If a semi encroached on your lane, it would move over a bit if the left lane wasn't also encroached. If there are two lanes and then 500 ms later only one wide lane, it would note that humans stay close to the lane they last saw rather than centering up in the giant lane. If there was one lane and then the lane got wider with an upcoming lane divider, it would move over to the right lane instead of waiting until it was right at the divider. I can see all of these things on the extremely filtered AP view today... but it's not learning any of it from us humans. A billion-mile dataset would contain thousands of examples of each of these scenarios. You don't need to know the "context", because it's just something that we do. And again, if you find contexts that are causing disengagements, then you go back to your lower-level visual systems and train up more metadata that will let the system infer that context.
 
The question isn't whether or not the vision NN is perfect. It obviously isn't.
Well, if it is NOT, then you obviously need to include at least some vision data to better understand things. Imagine that in the preview picture for the last video there's also an undetected car in the left lane... kind of changes the whole thing, really!
improve Tesla's drive logic because that's probably all that the current logic is trained on, just a smaller dataset
Huh? Currently the driving is not done by a NN, so you cannot exactly "train" it.

Can they do a real end-to-end NN? Maybe, but it appears to be a long way off.
 
Imagine that in the preview picture for the last video there's also an undetected car in the left lane... kind of changes the whole thing, really!

That would be a failing of a lower-level vision system. I'm talking about higher-level driving logic. Deciding whether it's "safe" to change lanes isn't based on a photo; it's based on metadata: "Are there cars in the neighboring lane? How fast are they approaching? Will I be rear-ended?" That sort of logic isn't going to be calculated from an IMAGE, it's going to be based on simple, low-bandwidth metadata. And training those vision systems doesn't need 1 billion miles of data, but it does require extensive human supervision.
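To make the "decision from metadata, not pixels" point concrete, here's a toy lane-change check that works purely on tracked-object metadata; the object fields and the gap/time-to-collision thresholds are invented for illustration, not anything from Tesla's stack.

```python
# Toy "is it safe to change lanes?" check operating on object metadata only.
from dataclasses import dataclass

@dataclass
class TrackedObject:
    lane_offset: int           # -1 = lane to our left, 0 = our lane, +1 = lane to our right
    gap_m: float               # longitudinal distance; negative = behind us
    closing_speed_mps: float   # positive = approaching us

def safe_to_change(objects, target_lane=-1, min_gap_m=10.0, min_ttc_s=3.0):
    for obj in objects:
        if obj.lane_offset != target_lane:
            continue
        if abs(obj.gap_m) < min_gap_m:
            return False   # someone is right beside us
        if obj.closing_speed_mps > 0 and abs(obj.gap_m) / obj.closing_speed_mps < min_ttc_s:
            return False   # they would close the gap too quickly
    return True

print(safe_to_change([TrackedObject(-1, -25.0, 10.0)]))  # False: fast car closing from behind
print(safe_to_change([TrackedObject(+1, 30.0, 0.0)]))    # True: nothing relevant in the left lane
```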

BUT... AI in games doesn't render out a viewpoint of the character for path finding; it works from an extremely sparse dataset that the AI, whether that's a NN or hard code, processes to path-find a route. Just as Google Maps obviously doesn't calculate routes from satellite imagery. If the NN that processes sat data into road splines is bad, sure, it'll route you badly, and those NNs need to be constantly improved; but if the route finding itself is bad, it could be improved with a good NN route-finding algorithm that has a crapload of data.

Tesla's "path finding" AI is still crap. And most of its problems aren't related to the vision NNs passing bad metadata (although that happens as well); most of the Tesla driving logic is just bad, even if the vision system were hypothetically absolutely perfect and never mislabeled something, missed something, or saw a phantom item.

but it appears to be a long way off.

Perhaps... 1 billion miles of training data.... ;)
 
I’ve had a hard time following the back-and-forth between im.thatoneguy and verygreen. Let me try to summarize im.thatoneguy’s point and see if I get it right:

If you assume that perception is solved and that path planning and/or control is the main problem, you don’t have to upload raw video. You can abstract away most of the video data and just leave a video game/computer simulation-style set of labelled objects, movement trajectories, speeds, lane lines, and so on, which can be compressed down to a few megabytes or even kilobytes. Rather than a video clip of a car moving, for example, you just have a label “vehicle” and a stored trajectory. This would allow Tesla to upload billions of miles of driving data without using an exorbitant amount of customers’ wifi or the cars’ cellular data.

I think verygreen’s point is that perception isn’t solved. Am I getting that right?