If Tesla really has uploaded ~9 billion images, the bigger difficulty and cost would be labelling them.
Let’s say 70% of AKnet’s classifications are correct, which is about GoogLeNet/Inception v1’s top-1 accuracy on ImageNet. Let’s say that for a correct classification, it only takes 5 seconds on average for a human labeller to look at the object and click the green checkmark (or whatever). In the other 30% of cases, let’s say it takes 15 seconds on average to select the correct label. (Going by GoogLeNet’s top-5 accuracy of almost 90%, the correct label will be one of the first 5 items in the drop-down menu in about 20% of all cases, i.e. in most of the misclassified ones.)
Let’s say each image has 5 objects on average.
5 objects * 70% = 3.5 correctly labelled objects
3.5 correctly labelled objects * 5 seconds = 17.5 seconds
5 objects * 30% = 1.5 mislabelled objects
1.5 mislabelled objects * 15 seconds = 22.5 seconds
17.5 seconds + 22.5 seconds = 40 seconds per image on average
1 hour / 40 seconds = 90 images per hour
9 billion images / 90 images per hour = 100 million hours of labour
100 million hours of labour * $11/hour (California minimum wage) = $1.1 billion
$1.1 billion / 8 quarters (Q4 2016 to Q4 2018) = ~$140 million per quarter
Tesla’s R&D budget is about $350 million per quarter. So it wouldn’t be impossible for Tesla to pay this much, although ~$140 million per quarter intuitively seems too high to me.
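For concreteness, here’s a minimal Python sketch of the same back-of-envelope calculation. Every constant is just one of the assumptions made above, not a real Tesla figure.

```python
# Back-of-envelope labelling-cost estimate. Every constant below is an
# assumption from the text, not a real Tesla figure.

IMAGES = 9e9                 # images uploaded
OBJECTS_PER_IMAGE = 5        # assumed average objects per image
TOP1_ACCURACY = 0.70         # assumed AKnet top-1 accuracy
CONFIRM_SECONDS = 5          # time to confirm a correct label
FIX_SECONDS = 15             # time to fix an incorrect label
WAGE_PER_HOUR = 11           # California minimum wage, USD
QUARTERS = 8                 # Q4 2016 to Q4 2018

seconds_per_image = OBJECTS_PER_IMAGE * (
    TOP1_ACCURACY * CONFIRM_SECONDS + (1 - TOP1_ACCURACY) * FIX_SECONDS
)                                                      # 5 * (0.7*5 + 0.3*15) = 40 s
hours_of_labour = IMAGES * seconds_per_image / 3600    # ~100 million hours
total_cost = hours_of_labour * WAGE_PER_HOUR           # ~$1.1 billion
cost_per_quarter = total_cost / QUARTERS               # ~$140 million

print(f"{seconds_per_image:.0f} s/image, {hours_of_labour / 1e6:.0f}M hours, "
      f"${total_cost / 1e9:.1f}B total, ${cost_per_quarter / 1e6:.0f}M per quarter")
```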
I wonder if in reality this work is outsourced to countries where wages are much lower. The Information reported that Tesla uses third-party labelling companies, but it didn’t give any details beyond that.
In India, for example, a typical wage for call centre employees is $2/hour. So, at Indian wages rather than California wages, 100 million hours of labour would cost $200 million in salaries, or $25 million per quarter over 8 quarters. Only about 7% of Tesla’s R&D budget.
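The same sketch with the wage swapped out (again, the $2/hour figure is just the assumption above):

```python
# Same labour estimate, but at an assumed $2/hour wage instead of $11/hour.
hours_of_labour = 100e6      # from the estimate above
wage_per_hour = 2            # assumed typical Indian call-centre wage, USD
quarters = 8                 # Q4 2016 to Q4 2018

total_cost = hours_of_labour * wage_per_hour                # $200 million
print(f"${total_cost / 1e6:.0f}M total, "
      f"${total_cost / quarters / 1e6:.0f}M per quarter")   # ~$25M per quarter
```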
I could be greatly underestimating or overestimating the time it takes to label images. AKnet’s top-1 accuracy on Tesla HW2 images is probably better than GoogLeNet’s top-1 accuracy on ImageNet, since 1) Tesla’s dataset probably has something like 1/10th to 1/20th as many semantic classes as ImageNet and 2) networks for the ImageNet challenge are trained on about 1 million images, whereas I’m postulating here that AKnet is trained on billions of images. Also, 3) the offline version of AKnet is more accurate than the online version.
On the other hand, it might be hard to maintain a 5-second pace for correctly labelled images over hours. There will be outliers where objects take minutes to label, and I don’t know how that will impact the average. On the other other hand, you could probably hit that green checkmark in about 1-2 seconds in a lot of cases.
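To get a feel for how sensitive the estimate is to those guesses, here’s a small sketch that varies the per-object times. The time pairs are made up for illustration, not measured labelling speeds.

```python
# Sensitivity check: vary the assumed per-object confirm/fix times and see
# how the total labelling cost moves. The time pairs are illustrative guesses.

IMAGES = 9e9
OBJECTS_PER_IMAGE = 5
TOP1_ACCURACY = 0.70
WAGE_PER_HOUR = 11           # USD, California minimum wage

for confirm_s, fix_s in [(2, 10), (5, 15), (10, 30)]:
    seconds_per_image = OBJECTS_PER_IMAGE * (
        TOP1_ACCURACY * confirm_s + (1 - TOP1_ACCURACY) * fix_s
    )
    cost = IMAGES * seconds_per_image / 3600 * WAGE_PER_HOUR
    print(f"{confirm_s}s confirm / {fix_s}s fix -> ${cost / 1e9:.1f}B total")
```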
The upshot, from an upload and labelling perspective, is that 9 billion labelled training images doesn’t seem like an unrealistic amount.
Now, how many GPUs would it take to train a neural network on 9 billion images?
Facebook used 336 GPUs to train ResNeXt-101 32x48d on 3.5 billion images over 22 days.
An Nvidia V100 GPU costs $19,000. Let’s say you want 3024 GPUs so you can train a network on ~10 billion images in about a week. 3024 * $19,000 = $57 million. No biggie with Tesla’s budget, assuming you accumulate these GPUs over time as you need them (i.e. as your dataset grows). Over 8 quarters, it’s $7 million per quarter.
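The same scaling arithmetic as a sketch, assuming throughput scales linearly with GPU count (optimistic, but fine for an order-of-magnitude estimate):

```python
# Scale Facebook's reported run (336 GPUs, 3.5B images, 22 days) to a
# ~10B-image dataset trained in about a week, assuming linear scaling.

FB_GPUS, FB_IMAGES, FB_DAYS = 336, 3.5e9, 22
TARGET_IMAGES, TARGET_DAYS = 10e9, 7
V100_PRICE = 19_000          # USD, approximate price per GPU
QUARTERS = 8                 # Q4 2016 to Q4 2018

gpus_needed = FB_GPUS * (TARGET_IMAGES / FB_IMAGES) * (FB_DAYS / TARGET_DAYS)
hardware_cost = gpus_needed * V100_PRICE

print(f"~{gpus_needed:.0f} GPUs, ~${hardware_cost / 1e6:.0f}M in hardware, "
      f"~${hardware_cost / QUARTERS / 1e6:.0f}M per quarter")
```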