Google AI: Revisiting the Unreasonable Effectiveness of Data

shrineofchance · May 25, 2021

Why does the size of Tesla’s training fleet matter? It’s well over 1 million cars, with the fleets of all competitors worldwide amounting to a combined size of well under 10,000. But is data really that important? Can you really get much more with 1 million+ cars than with 100 or 1,000? Yes.

A paper and accompanying blog post by Google AI emphasize that neural network performance continues to improve logarithmically with the size of noisily labelled training datasets, up to at least the scale of 300 million examples, as long as the neural network has enough capacity (in terms of size/depth) to absorb the training signal:

Revisiting the Unreasonable Effectiveness of Data

ai.googleblog.com

From the blog post:

“Our first observation is that large-scale data helps in representation learning which in-turn improves the performance on each vision task we study. Our findings suggest that a collective effort to build a large-scale dataset for visual pretraining is important. It also suggests a bright future for unsupervised and semi-supervised representation learning approaches. It seems the scale of data continues to overpower noise in the label space.”

Another important excerpt:

“It is important to highlight that the training regime, learning schedules and parameters we used are based on our understanding of training ConvNets with 1M images from ImageNet. Since we do not search for the optimal set of hyper-parameters in this work (which would have required considerable computational effort), it is highly likely that these results are not the best ones you can obtain when using this scale of data. Therefore, we consider the quantitative performance reported to be an underestimate of the actual impact of data for all reported image volumes.”

Facebook AI later took this even further, using a dataset of 1 billion images.
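To make the logarithmic-scaling claim concrete, here is a minimal sketch of what "performance grows with the log of dataset size" implies: every 10x increase in data buys roughly the same fixed gain. The function name and all coefficients below are made up for illustration; they are not taken from the Google paper.

```python
import math

def accuracy_under_log_scaling(num_examples, base_accuracy, gain_per_decade, base_examples):
    """Hypothetical model: accuracy grows linearly in log10(dataset size)."""
    return base_accuracy + gain_per_decade * math.log10(num_examples / base_examples)

# Made-up coefficients for illustration only; NOT measurements from the paper.
BASE_EXAMPLES = 1_000_000
BASE_ACCURACY = 50.0
GAIN_PER_DECADE = 5.0

sizes = [1_000_000, 10_000_000, 100_000_000]
scores = [
    accuracy_under_log_scaling(n, BASE_ACCURACY, GAIN_PER_DECADE, BASE_EXAMPLES)
    for n in sizes
]

for n, s in zip(sizes, scores):
    print(f"{n:>11,} examples -> {s:.1f}")
```

Under this assumed model, going from 1M to 10M examples yields the same gain as going from 10M to 100M, which is why the argument hinges on orders of magnitude of data rather than small increments.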

In certain applications of deep learning — and my contention is that autonomous driving is one of them — the ceiling on neural network performance is imposed by the quantity of available data.

What about labelling? As many papers have shown, various techniques such as weak/noisy labelling, self-supervised learning, and automatic curation through active learning can leverage very large quantities of data for better neural network performance without a corresponding increase in hand annotation.

In the autonomous driving subdomain of planning, imitation learning and reinforcement learning can, at least in theory and somewhat in practice already, leverage real-world data to train neural networks without any human annotators in the loop (besides the drivers).

As a general principle of deep learning, it’s not controversial to say performance scales with data and neural network size, with no known limit as of now. It has become less controversial over the last few years that techniques like imitation learning, reinforcement learning, and self-supervised learning are highly promising, and virtually all autonomous vehicle companies as far as I’m aware have started to at least experiment with one or more of them, if not deploy them to their fleet.

If we apply this general principle in the specific case of Tesla, the inference is clear: Tesla is working under a much higher ceiling than everyone else. Everyone else combined, in fact.

It is my strong belief that this advantage will become plain to see over the next few years, perhaps starting as early as this year. I wouldn’t be surprised if, in a few years from now, people said it was always obvious this would happen.

shrineofchance · May 25, 2021

Common objections

What about lidar?

Before you bring up lidar, watch this video and then come back with an argument about why Levandowski is wrong:

Secondly, if lidar really is the secret sauce… Neural networks trained via Tesla’s production fleet can be deployed in any cars. What’s to stop Tesla from using lidar like everyone else, at the same scale as everyone else, with (non-lidar-related) neural networks that far surpass what everyone else has?

The proof is in the pudding!

Indeed, but this amounts to an argument against making predictions in general. Once everything has already happened, it’s too late to predict what will happen. Once the results are in, it’s too late to place a bet. Predicting the future inherently involves speculating about an uncertain unfolding of events.

Waymo already cracked it!

Really? Then why does Waymo not seem to think so? If they really believed they had solved autonomous driving, they would be focused on expanding, on scaling up. They don’t seem to be. Food for thought:

Why hasn’t Waymo expanded its driverless service? Here’s my theory

Suburban ride-hailing is a lousy business to be in.

arstechnica.com

Another company has more data than Tesla

No they don’t. Not the kind of data we’re talking about. For example, Uber has a massive amount of GPS data from cars, but that is useless for training an autonomous driving system. Other companies’ data collection operations are just not comparable at all. Mobileye, for instance, doesn’t have the ability to upload sensor data or push new firmware to the car.

Microterf · May 25, 2021

Great read, and I enjoyed the Levandowski interview.

I think you're correct in that they have the lead currently, but I am curious as to how long it would take someone partnering with say Toyota to eclipse that lead.

I'm surprised that Cruise hardware isn't in every GM vehicle for the last 3 years for the purpose of getting lots of real world data.

Bladerskb · May 25, 2021

shrineofchance said:
Why does the size of Tesla’s training fleet matter? It’s well over 1 million cars, with the fleets of all competitors worldwide amounting to a combined size of well under 10,000. But is data really that important? Can you really get much more with 1 million+ cars than with 100 or 1,000? Yes.

This is blatantly false.

Xpeng ALONE has over 25k P7 with 12x cameras, 5x 5th gen radars, 12 ultrasonics.
Tesla on the other hand has 8 cameras, 1x 4th gen radar, 12 ultrasonics.

This statement is built on misinformation, lack of knowledge, or complete disregard for what's going on in the AV industry.

You should be able to take anyone's thesis and apply it to any company, but your statements are completely biased and Tesla-centric. It doesn't matter what any other company does, because they are not Tesla.

For example, here is an article you wrote in 2017. This matters because it shows how you have consistently shifted your logic to cater to whatever Elon/Tesla is saying/doing. That's why HD maps, lidar, camera and radar are crutches and useless in the TSLA community.

Tesla Has An Immense Lead In Self-Driving (NASDAQ:TSLA)

In self-driving, the company with the most data will win. Tesla currently has access to the most driving data by far.

seekingalpha.com

"In self-driving, the company with the most data will win. Tesla currently has access to the most driving data by far. Its lead will widen further unless competitors move fast. Tesla's cars with "full self-driving hardware" (HW2) are currently driving over 1 million miles per day. (The math: Tesla sold at least 14,000 HW2 cars in Q4 2016. Assuming it has sold 1500 cars every week of production - the average for 2016 - since the beginning of the year, there are 27,500 HW2 cars driving an average of 37 miles per day.)"

Here you say that Tesla has an immense lead in autonomous driving based on these criteria:

1) They have 8 cameras & 1 forward radar on 27,500 cars
2) They have a 2016 half variant of Nvidia Drive PX2
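The fleet-mileage arithmetic in the 2017 quote above can be checked in a few lines. All inputs are the figures stated in the quote; the variable names are just labels chosen for this sketch.

```python
# Figures as stated in the quoted 2017 article.
Q4_2016_CARS = 14_000          # HW2 cars sold in Q4 2016
CARS_PER_WEEK = 1_500          # assumed weekly sales since the start of 2017
TOTAL_CARS = 27_500            # HW2 fleet size claimed in the quote
MILES_PER_CAR_PER_DAY = 37     # average daily miles per car

# How many weeks of 2017 sales the 27,500 figure implies.
weeks_implied = (TOTAL_CARS - Q4_2016_CARS) / CARS_PER_WEEK

# Total fleet miles per day under those assumptions.
fleet_miles_per_day = TOTAL_CARS * MILES_PER_CAR_PER_DAY

print(weeks_implied)        # 9.0
print(fleet_miles_per_day)  # 1017500
```

So the quoted "over 1 million miles per day" follows from about nine weeks of sales at 1,500 cars per week, and it scales linearly with whatever fleet size and daily mileage are plugged in.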

If we applied the same logic, Xpeng should have an immense lead in autonomous driving.
Even more so if we applied the logic that you have today: that SOTA ML is available to everyone, compute is easily attainable, and what matters is how much data you have.

If we converted your statements into unbiased logic, guess who is #2 in autonomous driving? Xpeng.
This isn't even taking into account sales of Xpeng G3 which has 8 cameras, 3 radars, etc.

shrineofchance said:
It is my strong belief that this advantage will become plain to see over the next few years, perhaps starting as early as this year. I wouldn’t be surprised if, in a few years from now, people said it was always obvious this would happen.

You said the same thing years ago...Why is it any different now?
You either don't know or can't acknowledge what other companies are working on and the details of their development and what they are about to release.

Bladerskb · May 25, 2021

shrineofchance said:
Common objections

What about lidar?
Before you bring up lidar, watch this video and then come back with an argument about why Levandowski is wrong:

What Anthony is saying and what people insinuate he is saying are two different things.

What people insinuate he is saying:

Lidar is not needed for L4+
Lidar adds nothing to autonomous driving and is therefore useless
Vision is all you need for L4/L5 and Anthony is proof that Tesla will have it by the end of 2019, 2020, 2021, etc.

What Anthony is actually saying:

"True level 4 or 5 vehicles will not arrive for many more years." (this has been repeated by all SDC companies for many years)
Prediction and Planning is the current problem in the AV industry (this has been repeated by all SDC companies for many years)
"Lidar provide amazing sensing"
"HD maps provide amazing sensing"
Lidar is expensive
HD maps are not scalable
Pronto.AI, Anthony's new company, is using cameras only to provide an L2 highway system for trucks.

Here is Anthony's timeline and work with Lidar

2007- January 2016 | He used Velodyne's $75k HDL-64E lidar on the Google Self Driving Test Cars.

January 2016 - July 2016 | He used Velodyne's $75k HDL-64E lidar and other lidars on OTTO truck that cost tens of thousands of dollars. All together the sensor hardware on the truck cost $150k-$200k.

July 2016 - May 30, 2017 | He used Velodyne's $75k V64 lidar among other lidars that cost tens of thousands of dollars on the Uber Self Driving Test Cars. All together the sensor hardware on the truck cost $150k-$200k.

2018 - He made this post about Pronto.AI that included statements about Lidar and the state of autonomous driving

Here is Lidar specs and cost and reliability when Anthony was working on it:

Velodyne HDL 64E - $75k

Key Features:

64 lines
50m (10% reflectivity), 120m (80% reflectivity) range
360° Horizontal FOV
26.9° Vertical FOV
0.08° angular resolution (azimuth)
<2cm accuracy
~0.4° Vertical Resolution
Breaks down regularly (weekly/monthly)

https://hypertech.co.il/wp-content/uploads/2015/12/HDL-64E-Data-Sheet.pdf

Here it is today:

Luminar Iris - $500

Key Features:

640 lines
500m max range
250m at <10% reflectivity
120° Horizontal FOV
30° Vertical FOV
0.07° horizontal resolution
1cm accuracy
0.03° Vertical Resolution
Dust & Water Ingress, Vibration & Shock certified
Production grade with hundreds of years of driving reliability

So why didn't Anthony's new company Pronto.AI use lidar? Well, it's easy. After being shunned by the industry, Anthony's new company wouldn't have been able to get the funding (tens/hundreds of millions) necessary to develop an L4 system, which at the time involved fleets of cars that cost $250k each. The only pivot he could make was to develop an L2 camera-based system without lidar. So he bought Comma.AI hardware and used Comma's ODBG interface software/hack to run his code. This is about as bare-bones as it could get.

In conclusion: Anthony repeated what everyone was already saying: that lidar was currently expensive, that HD maps weren't scalable, that camera-only is possible in the distant future as CV gets better, that lidar gives you amazing sensing, that HD maps give you amazing localization, and that prediction and planning, not perception (which is solved through sensor fusion), is the current Achilles heel of the AV industry.

Nothing he said contradicted anything anyone else has said.

What has changed since he wrote that article and gave that interview is that there are now production-grade lidars that are orders of magnitude better than the lidar he was using, yet cost orders of magnitude less while being more robust and reliable. So he can get that "precise amazing sensing" he says lidar has for only $100-$250 in mass production.

Secondly, HD maps, which were previously thought to be unscalable, have now been scaled by Mobileye all over the world, and the process, including creation and updating, is 100% automated.

This invalidates both of the reasons he gave.

As every company says prediction and planning is harder than perception. The one thing to note is that to solve prediction and then planning, you would need to solve perception and that has only been solved using sensor fusion.

shrineofchance said:
Secondly, if lidar really is the secret sauce… Neural networks trained via Tesla’s production fleet can be deployed in any cars. What’s to stop Tesla from using lidar like everyone else, at the same scale as everyone else, with (non-lidar-related) neural networks that far surpass what everyone else has?

What are you trying to say here?

ohmman · May 25, 2021

Moderator Note: Some posts moved away to snippiness. Not all were snippy, but they were part of a discussion that shouldn't have been in this thread. Thanks for staying away from the personal discussions and sticking on topic.

gearchruncher · May 25, 2021

Ok, so let's discuss data collection. You claim Tesla has a wealth of data, more than anyone else, and this creates a lead in autonomy.
But what about the quality of data? Every time I see this, I hear Tesla has X million/billion miles of data. Except all they generally have is metadata. They clearly are not uploading videos of every car driving all the time, or even just on AP. We can all tell that, since they don't upload gigabytes after every drive. Machine learning / vision systems need this raw video to do anything as the "training signal."

So why does it matter at all that Tesla has a bunch of cars out in the world? Aren't 1,000 vehicles logging data all the time with professional drivers giving feedback more useful than 1,000,000 cars just sending back disengagement metadata and a video clip now and then, with the trigger to send those video clips created by the very same network that is being trained?

99% of my disengagements with AP are just because AP is totally incapable of doing the thing I need it to (like slowing down for a red light or moving over for stopped traffic a quarter mile ahead). How do my disengagement stats even help to learn where AP needs to be trained more, or even form a useful metric on if changes are leading to a more useful system?

Tesla's own fracturing of autopilot among AP/EAP/FSD with different feature sets makes this even worse. My car doesn't stop for stop signs/red lights, but other cars do! So I have way more disengagements than a FSD car, just because I didn't want to pay $5K for this feature. Now my data is also worth less to Tesla. It seems if this data was so darn valuable to them, creating immense future value, that they would want as many people using as many features as much as possible, not maximizing their revenue this quarter.

qdeathstar · May 25, 2021

The real thing is: who cares.

No one has deployed full self driving. It is impossible to know what is going on inside of a company.

Having a lot of data isn’t important; exploiting the data you do have is what’s important.

If you have a million trees and no mill, you’re not going to produce as many studs as the dude next to you with 50,000 trees and a mill.

We really don’t know how well Tesla is exploiting the data they collect, so we have no way to know how meaningful that data is.

shrineofchance · May 25, 2021

Check out this blog post for an explanation of one technique for sourcing highly salient video clips out of an enormous quantity of video:

Scalable Active Learning for Autonomous Driving: A Practical Implementation and A/B Test

Learn how our scalable active learning approach streamlines training data selection for autonomous driving DNNs.

medium.com
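As a rough illustration of how active learning can pick salient clips out of a huge pool, here is a minimal uncertainty-sampling sketch: score each clip by the entropy of a model's predicted class probabilities and keep the most uncertain ones. This is a generic textbook technique, not necessarily the acquisition function used in the linked post or by Tesla; the clip ids, probabilities, and function names are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(clip_predictions, budget):
    """Return the ids of the `budget` clips whose predictions have the highest entropy."""
    ranked = sorted(
        clip_predictions,
        key=lambda clip_id: entropy(clip_predictions[clip_id]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical class probabilities from a hypothetical model for four clips.
clip_predictions = {
    "clip_a": [0.98, 0.01, 0.01],  # confident prediction, low value for labelling
    "clip_b": [0.40, 0.35, 0.25],
    "clip_c": [0.70, 0.20, 0.10],
    "clip_d": [0.34, 0.33, 0.33],  # near-uniform, model is most unsure here
}

selected = select_most_uncertain(clip_predictions, budget=2)
print(selected)  # ['clip_d', 'clip_b']
```

The design point is that only the selected clips need to be uploaded and hand-labelled, so the labelling budget is spent where the current model is least sure of itself.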

The first 15 minutes of this video give an overview of data curation at Tesla AI:

qdeathstar · May 25, 2021

@shrineofchance

Would like to hear your thoughts on this:

Xpeng ALONE having over 25k P7 in their fleet collecting data with 12x cameras, 5x 5th gen radars, 12 ultrasonics

shrineofchance · May 25, 2021

Is Xpeng using production fleet learning with these vehicles or not? And if so, in what form exactly? What sources substantiate this?

There are lots of vehicles equipped with cameras, radar, and ultrasonics that don't upload sensor data and don't do firmware updates.

gearchruncher · May 25, 2021

shrineofchance said:
Check out this blog post for an explanation of one technique for sourcing highly salient video clips out of an enormous quantity of video:

By definition, this means they are not processing all data. Does it really count to say you have 1B miles of data if 999M of those have been thrown away and can never be re-analyzed, and may have contained all sorts of edge cases that you hadn't yet considered and programmed your system to capture when they occurred?

Like I say, 1,000,000 Tesla vehicles could easily be equivalent to 10,000 professionally operated vehicles that collect 100% of data. So the claim that Tesla has more data than anyone is unproven.

I love that Tesla's own data shows that only 200,000 automated lane changes have ever been made by NoA. That's only 1 out of every 5 cars having *ever* done a single one. One automated lane change every 5,000 miles with NoA engaged. How long would it take a fleet of 100 cars to pull off 200,000 lane changes? Maybe a month doing 50-100 a day?

EDIT: Also, they blend their stats with total AP miles, but they have AP1, HW2, and HW2.5 that can't do a lot of what they base detection off of, so those miles are a lot less useful for future learning. All their cars up to 2019 are low value for this effort.
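The lane-change figures in this post can be sanity-checked with simple arithmetic. The inputs are the numbers quoted in the thread (200,000 lane changes, a fleet of about 1,000,000 cars, one lane change per 5,000 miles, and a hypothetical 100-car fleet doing 50-100 lane changes per car per day); the helper name is just a label for this sketch.

```python
LANE_CHANGES = 200_000          # automated lane changes quoted in the post
FLEET_SIZE = 1_000_000          # fleet size claimed earlier in the thread
MILES_PER_LANE_CHANGE = 5_000   # rate quoted in the post

# "1 out of every 5 cars": lane changes per car in the fleet.
changes_per_car = LANE_CHANGES / FLEET_SIZE

# NoA miles implied by one lane change every 5,000 miles.
implied_noa_miles = LANE_CHANGES * MILES_PER_LANE_CHANGE

def days_to_match(cars, changes_per_car_per_day):
    """Days a small test fleet would need to log the same number of lane changes."""
    return LANE_CHANGES / (cars * changes_per_car_per_day)

slow = days_to_match(100, 50)    # 40.0 days
fast = days_to_match(100, 100)   # 20.0 days

print(changes_per_car, implied_noa_miles, slow, fast)
```

With these inputs the 100-car fleet needs between 20 and 40 days, which is consistent with the post's "maybe a month" estimate.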

shrineofchance · May 25, 2021

gearchruncher said:
1,000,000 Tesla vehicles could easily be equivalent to 10,000 professionally operated vehicles that collect 100% of data.

No, because both fleets use the same kind of techniques to curate data.

gearchruncher said:
I love that Tesla's own data shows that only 200,000 automated lane changes have ever been made by NoA.

I believe you're referencing a very old figure that was disclosed not too long after Navigate on Autopilot was released.

qdeathstar · May 25, 2021

shrineofchance said:
Is Xpeng using production fleet learning with these vehicles or not? And if so, in what form exactly? What sources substantiate this?

Well, your thesis is that it is obvious that data is important. If it is obvious that data is important, why wouldn’t Xpeng collect the data?

shrineofchance · May 25, 2021

I see lots of references to Xpeng developing an "in-house" solution for autonomous driving, but I also found this on Nvidia's blog (April 2020):

“Development of the P7 began in Xpeng’s data center, with NVIDIA’s AI infrastructure for training and testing self-driving deep neural networks.

With high-performance data center GPUs and advanced AI learning tools, this scalable infrastructure allows developers to manage massive amounts of data and train autonomous driving DNNs.

Xpeng is also using NVIDIA DRIVE OS software, in addition to the DRIVE AGX Xavier in-vehicle compute, to run the XPilot 3.0 system. The open and flexible operating system enables the automaker to run its proprietary software while also delivering OTA updates for new driving features.”

gearchruncher · May 25, 2021

shrineofchance said:
I believe you're referencing a very old figure that was disclosed not too long after Navigate on Autopilot was released.

Right from the video you gave me just about one minute in. 1B miles on NoA (1/3 of all AP miles) and 200,000 automated lane changes. If it's 1/3 of all the AP miles it couldn't be that old.
Tesla released NoA in October 2018.

gearchruncher · May 25, 2021

shrineofchance said:
No, because both fleets use the same kind of techniques to curate data.

No they don't. Other systems can collect all data, store it, and re-run training on it, or new curation techniques. They can literally run a brand new algorithm on a full drive done years ago.

Tesla's data is gone forever if it's not flagged in real time.

This is why other systems have a much higher quality than Tesla's, and why the measurement of only miles driven or vehicles in fleet is meaningless as a metric of how much data they have.

shrineofchance · May 25, 2021

qdeathstar said:
well, your thesis is that it is obvious that data is important. If it is obvious data is important why wouldn’t Xpeng collect the data?

It's a good question. As Microterf said above, companies like GM/Cruise could do larger scale data collection but don't. It could be a number of things in the case of GM/Cruise: lack of investment (i.e. not wanting to hurt thin profit margins given conservative shareholders and a troubling financial situation overall), lack of software expertise (this has proven to be a debacle at Volkswagen), worry about privacy/security failures, etc.

It's also a good question why most auto OEMs still don't do OTA updates when Tesla's been doing them since dinosaurs roamed the Earth.

I don't know much about Xpeng, but they seem to be using a lot of software from Nvidia.

gearchruncher · May 25, 2021

shrineofchance said:
It's also a good question why most auto OEMs still don't do OTA updates when Tesla's been doing them since dinosaurs roamed the Earth.

Ahh yes, the company that has been selling OTA cars for 9 years, and as of 3 years ago had sold about 100k cars total.

It's reasonable to not focus your company on OTA. It is a double-edged sword. Some companies clearly see that OTA updates encourage software to be released before it's ready and tested, that OTA is expensive to support (Tesla's old HW is already out of date for modern 4G and WiFi networks), and that it drives an expectation from your customers that they will get updates. It also causes all sorts of crazy issues like removing features (charge rate reduction, range reduction, removing FSD from cars after they are sold).

shrineofchance · May 25, 2021

gearchruncher said:
Right from the video you gave me just about one minute in. 1B miles on NoA (1/3 of all AP miles) and 200,000 automated lane changes. If it's 1/3 of all the AP miles it couldn't be that old.
Tesla released NoA in October 2018.

Either in that video or a similar one from 2020 he says it's an old statistic.

He gives the same 200,000 lane changes stat in his PyTorch talk from November 2019 (see 10:20):

Clearly he's not updating this statistic.
