
Neural Networks

Anyone else patiently waiting for @jimmy_d to provide some additional insight/response as a result of the new information Elon has shared?

お待たせしました! (which is Japanese for, roughly, "you rang?")

Wow, was not expecting anything out of the AP development black box. I guess Electrek articles get attention.

Elon comment: —————
“To be clear, actual NN improvement is significantly overestimated in this article. V9.0 vs V8.1 is more like a ~400% increase in useful ops/sec due to enabling integrated GPU & better use of discrete GPU.”
——————

My interpretation here is that he’s saying the computation requirement of the network which is used to drive on V9 is 4x more than what was required on V8.1. The network I wrote about might not be the one that’s currently driving the car.

We’ve known for a while that there are multiple NN’s distributed in firmware releases but we don’t know which of the installed networks is actually doing the driving. In 2018.12, for example, there were networks for main/narrow, fisheye, and repeater in one location and in another location there were also networks for fisheye, repeater, narrow, main, and pillar. So why are there redundant NNs? Obviously some of them are being used to drive the consumer version of the car. The others could possibly be for dev versions of the car, or they could be distributed to the fleet for testing, or they could be leftovers from some other work and not used at all.

The V9 network I wrote about in that long post that electrek picked up is definitely in V9 and, because it includes a metadata file for the network, I’m pretty confident about my assertions regarding that particular network. But V9 also includes other networks that don’t have metadata and are harder to analyze for that reason. I took a quick look at them yesterday and they are quite different than the gigantic, unified, high resolution, dual frame, camera agnostic network that I wrote about. From Elon’s comment I take it that these other networks, which are evolutionary rather than revolutionary, are what is actually driving the consumer version of the car for V9.

One thing that really nags at me about that monster network is that it seems to be *too big* to be able to run on HW2 (or HW2.5). My best guess is that it can only run at maybe 3fps. I think that’s not fast enough to be usable. But on 10x faster HW3 that would be 30fps, which sounds just right.
 
However, if you read the article Lambert wrote, the quoted section of Jimmy's post talked about the increased amount of training data compared to V8 and the factor-of-5 increase in weights. It was to that section that Lambert attached Elon's tweet about a 400% increase in "useful ops/sec due to enabling integrated GPU & better use of discrete GPU."

So there's a disconnect between Elon's response and the block of text Lambert attached it to.

Thanks Joe. I think you're right.

There are a few different dimensions to the issue which are easy to get mixed up. One is the size of the network, another is the amount of computation it takes to run the network in the car, and another is what resources (data/time/computation) are required to train the network. All of them grow, but to varying degrees.

The thing I find most impressive about the network I wrote about is the scale of resources needed to create it - that's the part of my post which strays into hyperbole. Second is the audaciousness of the architecture - it's ground breaking. The runtime resource requirement is also impressive, but it's the least shocking of the three. We know HW3 is coming and that it's a lot more powerful. This has to be because Tesla intends to run more resource intensive networks, so it's no surprise that the networks are getting more resource intensive.

Elon has specifically called out that the central improvement for HW3 is a custom NN processor, so it's certainly the NN processing which is principally lacking in HW2/2.5. And there's this - running the same realtime control NN 10x faster doesn't get you much of an improvement in performance. If you need 10x faster hardware it's because you want to run a categorically more powerful network. That network might well look like the one I wrote about.

But there's probably a more basic misunderstanding here too - I think Elon is probably talking about a different network than the one I was talking about (as detailed in my previous comment above).
 

I am surprised no one yet connected your discovery with what Elon said earlier that he was testing two NN and the simpler one was doing well, that is probably what V9 is.
 
Ok, this makes a lot of sense! If the v9 network you've been looking at has 5x the number of weights as the corresponding v8 network, and it's processing 13x as much input data, that's probably a 60x to 70x increase in the amount of compute you'd need to run it. Elon's saying they squeezed out a 4x performance improvement from the current hardware, which is super impressive, but not nearly enough to run this new network in real time. Definitely seems like it's targeted for their custom NN chip, but I guess they're running it in shadow mode on the current hardware to start testing everything out, even if it can only run at 1fps right now.
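Just to make that arithmetic explicit, here's a quick back-of-the-envelope in Python. The assumption that compute scales roughly as weights times input size is mine; the 5x, 13x, and 4x figures come from the posts above.

```python
# Back-of-the-envelope check of the ~60x-70x compute estimate above.
# Assumption (mine): for a conv net of fixed topology, per-frame FLOPs
# scale roughly as (number of weights) x (spatial input size), since
# each filter is swept over the whole feature map.

weight_ratio = 5      # ~5x the weights of the V8 net (from jimmy_d's post)
input_ratio = 13      # ~13x as much input data per camera (1280x960x3x2 vs 640x416x2)

compute_ratio = weight_ratio * input_ratio
print(f"Estimated compute increase: ~{compute_ratio}x")          # ~65x

software_speedup = 4  # Elon's quoted ~400% ops/sec gain on the current hardware
shortfall = compute_ratio / software_speedup
print(f"Still ~{shortfall:.0f}x short of the old frame rate")     # ~16x
```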
 
Not quite. Great analysis by Jimmy. I mean top notch. But I was going to come in here gunslinging to burst everyone's bubbles. But Elon sorta did it for me.
You see, it's not the size of the model that matters, it's the accuracy and efficiency!

Tesla's V9 reminds me of a brute-force approach. In comparison, EyeQ3 models ran on 0.25 TFLOPS. It's completely unheard of. Today phones easily come with 3 TFLOPS ASIC chips (for example the Pixel 2 and 3).

The EyeQ4 is only 2.5 TFLOPS and yet handles all 8 cameras, plus radar and lidar, and still has loads of headroom.

The NN architecture Mobileye was able to create, and the efficiency it runs at, is simply astonishing.

Amnon has been very vocal that his approach is different from the industry's brute-force approach, which requires 50+ TFLOPS chips.

Although he will gladly comply and market more powerful chips to them.

Reminds me of Betamax vs. VHS.
 
One thing that really nags at me about that monster network is that it seems to be *too big* to be able to run on HW2 (or HW2.5). My best guess is that it can only run at maybe 3fps. I think that’s not fast enough to be usable. But on 10x faster HW3 that would be 30fps, which sounds just right.

If it's processing the repeater and pillar cameras at only ~3fps that would explain a lot of the behavior I've seen. It's really sort of noticeably slow to react sometimes -- not something you want in a blind spot monitor honestly, especially not if you're going to do unassisted lane changes.

Call me pessimistic, but I think they're going to need HW3 to deliver on the full suite of EAP features, nevermind FSD.
 
and another is what resources (data/time/computation) are required to train the network. All of them grow, but to varying degrees.

The thing I find most impressive about the network I wrote about is the scale of resources needed to create it

Is it possible to come up with a ballpark estimate of how much labelled training data would be needed to train a network of that size? In terms of miles of video, or hours of video, or frames, or whatever.

Is Tesla using trillions of miles of virtual driving from its simulator? Like, what kind of scale are we talking about here?
 
Maybe the chip is not run in real time, but just used to post-process previous data when there is extra computational power, like when the car is charging. Instead they might have one network to see which data might be of interest for them to test on the FSD network, for example if the V9 network says brake and the customer in control keeps driving. Then they record this video sequence, test the FSD network on it, see if it disagrees with the customer as well, and if so it uploads the video sequence for training of the network.
 
Maybe the chip is not run in real time, but just used to post-process previous data when there is extra computational power, like when the car is charging. Instead they might have one network to see which data might be of interest for them to test on the FSD network, for example if the V9 network says brake and the customer in control keeps driving. Then they record this video sequence, test the FSD network on it, see if it disagrees with the customer as well, and if so it uploads the video sequence for training of the network.

Nope, they do not do the learning portion of machine learning on-vehicle. The only thing that happens on-vehicle is the real time inference, which needs to happen in real time. The cars have very limited ability to store the full video feed; they can only store very short snippets which can then be uploaded to the mothership for offline learning/training of future networks. If they tried to spool all the video to disk to analyze when the car was charging it would very quickly fill the disk.
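To put some rough numbers on that storage constraint, here's a quick estimate of what spooling uncompressed video from all eight cameras would look like. The resolutions and the side-camera factor follow the figures discussed in this thread; the bytes-per-pixel and free-disk numbers are purely my assumptions for illustration.

```python
# Rough estimate of the raw video data rate for 8 cameras, to show why
# spooling full video to disk isn't practical. Resolutions follow the
# V9 figures discussed in this thread; bytes per pixel and disk size
# are my assumptions only.

fps = 30
bytes_per_pixel = 3                # uncompressed RGB, assumed
front_back = 4 * 1280 * 960        # 4 full-resolution cameras (pixels per frame)
side = 4 * (1280 * 960) // 4       # 4 side cameras at ~1/4 the pixels, per the post

bytes_per_sec = (front_back + side) * bytes_per_pixel * fps
print(f"~{bytes_per_sec / 1e6:.0f} MB/s raw")                     # ~553 MB/s

disk_gb = 32                       # hypothetical free space on the APE
seconds = disk_gb * 1e9 / bytes_per_sec
print(f"A {disk_gb} GB buffer fills in ~{seconds:.0f} seconds")   # ~1 minute
```

Compression changes the constants but not the conclusion: you can only keep short snippets.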
 
Maybe the chip is not run in real time, but just used to post-process previous data when there is extra computational power, like when the car is charging. Instead they might have one network to see which data might be of interest for them to test on the FSD network, for example if the V9 network says brake and the customer in control keeps driving. Then they record this video sequence, test the FSD network on it, see if it disagrees with the customer as well, and if so it uploads the video sequence for training of the network.

That's an interesting idea.
 
Nope, they do not do the learning portion of machine learning on-vehicle. The only thing that happens on-vehicle is the real time inference, which needs to happen in real time. The cars have very limited ability to store the full video feed; they can only store very short snippets which can then be uploaded to the mothership for offline learning/training of future networks. If they tried to spool all the video to disk to analyze when the car was charging it would very quickly fill the disk.

That's a good point. The vehicle is certainly storage constrained and storing 8 cameras worth of raw video is not trivial. That said it's not impossible that AKNET_V9 could be run on small samples (even 15 seconds might be useful if you select carefully and do it at scale with large numbers of vehicles).

I kind of doubt they are doing this but it's worth keeping in mind.
 
Nope, they do not do the learning portion of machine learning on-vehicle. The only thing that happens on-vehicle is the real time inference, which needs to happen in real time. The cars have very limited ability to store the full video feed; they can only store very short snippets which can then be uploaded to the mothership for offline learning/training of future networks. If they tried to spool all the video to disk to analyze when the car was charging it would very quickly fill the disk.
In my hypothetical scenario there was no learning in the car, only inference. Inference would pretty much just be "upload video to cloud or not". Then in the cloud a human labeler would look at the video, see if it was a relevant scenario, decide what the correct label should be (for example "not car" where a metal bar in the street was something the radar thought was a car), and then finally the learning would happen in the cloud.
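Something like the following Python sketch, just to make the hypothetical concrete. None of this is Tesla's actual code; every name here is made up.

```python
# Minimal sketch of the hypothetical on-car triage described above.
# Inference only, no on-car learning; the only decision made in the car
# is "upload this clip or not". All names are invented for illustration.

def should_upload(clip, candidate_nn, radar_track, driver_action):
    """Decide whether a short saved clip is worth uploading for human labeling."""
    prediction = candidate_nn.infer(clip)          # e.g. "brake" vs "continue"
    # Disagreement between the candidate network and the human driver,
    # or between the camera NN and radar, flags a potentially interesting clip.
    disagrees_with_driver = (prediction.action != driver_action)
    disagrees_with_radar = (prediction.object_present != radar_track.object_present)
    return disagrees_with_driver or disagrees_with_radar

def triage_stored_clips(clips, candidate_nn):
    """Run offline (e.g. while charging) over the small buffer of saved clips."""
    return [c for c in clips
            if should_upload(c, candidate_nn, c.radar, c.driver_action)]
```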
 
NN Changes in V9 (2018.39.7)

The basic camera NN (neural network) arrangement is an Inception V1 type CNN with L1/L2/L3ab/L4abcdefg layer arrangement (architecturally similar to V8 main/narrow camera up to end of inception blocks but much larger)
  • about 5x as many weights as comparable portion of V8 net
  • about 18x as much processing per camera (front/back)
The V9 network takes 1280x960 images with 3 color channels and 2 frames per camera from, for example, the main camera. That’s 1280x960x3x2 as an input, or 7.3M. The V8 main camera was 640x416x2 or 0.5M - 13x less data.

For perspective, V9 camera network is 10x larger and requires 200x more computation when compared to Google’s Inception V1 network from which V9 gets its underlying architectural concept. That’s processing *per camera* for the 4 front and back cameras. Side cameras are 1/4 the processing due to being 1/4 as many total pixels. With all 8 cameras being processed in this fashion it’s likely that V9 is straining the compute capability of the APE. The V8 network, by comparison, probably had lots of margin.
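For anyone who wants to picture what an Inception-V1-style stem with that input shape looks like in code, here's a minimal PyTorch sketch. Only the input dimensions (1280x960, 3 colors, 2 frames) come from the description above; stacking the two frames on the channel axis and all the layer widths are my own illustrative guesses, not Tesla's actual network.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a GoogLeNet/Inception-V1-style stem taking the
# input shape quoted above (1280x960, 3 color channels, 2 frames stacked
# on the channel axis). Layer widths are guesses, not Tesla's.

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pool branches, concatenated on channels."""
    def __init__(self, c_in, c1, c3, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3, 1),
                                nn.Conv2d(c3, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5, 1),
                                nn.Conv2d(c5, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, cp, 1))
    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

class CameraStem(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                      # L1/L2-style downsampling
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(64, 192, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        self.inception = nn.Sequential(                 # stand-ins for L3ab/L4a-g blocks
            InceptionBlock(192, 64, 128, 32, 32),
            InceptionBlock(256, 128, 192, 96, 64),
        )
    def forward(self, x):                               # x: (N, 3*2, 960, 1280)
        return self.inception(self.stem(x))

frames = torch.randn(1, 6, 960, 1280)                   # 2 RGB frames from one camera
features = CameraStem()(frames)
print(features.shape)                                    # (1, 480, 120, 160)
```

The real network would of course have many more inception blocks plus whatever heads sit downstream; the point is just how quickly a 1280x960 dual-frame input drives up the per-camera compute relative to a 224x224 ImageNet-style input.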

What performance impact would you expect from using various feature detectors besides Inception v1? And how do various object detectors compare to what Tesla has done? I'm not in the ML space at all, so I only have a passing familiarity with most of these things, but what impact would implementing something like YOLO2 (or now v3) have on the overall performance? I understand YOLO isn't the most accurate detection you can get, but I wonder to what degree exacting detection matters versus being able to generally and reasonably detect "vehicle", "person", etc. I'd imagine that the diminishing returns matter much less than performance in a real world situation. Knowing a person is present versus a person on a bicycle doesn't add much value, I'd guess.

Anyway, yeah, I wonder what impact various algorithms would have, given nobody wants to be running the bleeding edge in a vehicle you may have to support for a decade or more.

One thing that really nags at me about that monster network is that it seems to be *too big* to be able to run on HW2 (or HW2.5). My best guess is that it can only run at maybe 3fps. I think that’s not fast enough to be usable. But on 10x faster HW3 that would be 30fps, which sounds just right.

30FPS at 90MPH is 4.4 feet per frame. It feels to me that you'd want to process a scene a bit faster than that. But if the NN is doing the double-frame that you mentioned, perhaps you don't have to wait 4-and-a-half feet to detect object motion?
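For anyone who wants to check the distance-per-frame numbers, a quick calculation (the 3 fps figure is the estimate from earlier in the thread):

```python
# Distance traveled per processed frame at a given speed and frame rate.
MPH_TO_FTPS = 5280 / 3600          # 1 mph = 1.4667 ft/s

def feet_per_frame(mph, fps):
    return mph * MPH_TO_FTPS / fps

print(feet_per_frame(90, 30))       # ~4.4 ft per frame at 30 fps
print(feet_per_frame(90, 3))        # ~44 ft per frame at the estimated 3 fps on HW2
```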

Maybe the chip is not run in real time, but just used to post-process previous data when there is extra computational power, like when the car is charging.

Extra power? The vehicle is sitting on top of the equivalent of three days worth of summer time electricity usage for my entire house. The batteries contain so much energy that people use them to store solar so they can operate every appliance in their house. Plugging in to a home charger charges at about the same speed the car drives through a town. This is all to say that these batteries offer an amazing amount of electrical charge, and that even if post-processing of data consumed 700W (we know it doesn't, but let's say it did), that's still only 2-3 miles of range per hour of processing.
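Rough sanity check on that range figure; the 700 W is the hypothetical from above, and the Wh/mile efficiency is my assumption:

```python
# Energy cost of an hour of hypothetical post-processing, in miles of range.
processing_watts = 700          # hypothetical figure from the post above
wh_per_mile = 280               # rough highway consumption, my assumption

miles_lost_per_hour = processing_watts / wh_per_mile
print(f"~{miles_lost_per_hour:.1f} miles of range per hour of processing")   # ~2.5
```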

In either case, Tesla has previously confirmed that they don't do processing on the vehicles right now. They could in the future maybe, by using the fleet as a massive distributed compute cluster, but as of right now they buy compute from AWS at least. This was also independently confirmed when their AWS account was compromised and was being used to mine cryptocurrency.
 
In my hypothetical scenario there was no learning in the car, only inference. Inference would pretty much just be "upload video to cloud or not". Then in the cloud a human labeler would look at the video, see if it was a relevant scenario, decide what the correct label should be (for example "not car" where a metal bar in the street was something the radar thought was a car), and then finally the learning would happen in the cloud.

OK, sorry, I was being over-sensitive because many many people are under the impression that the NNs in their cars "learn", and it's a bit of a pet peeve of mine. That said, my point about limited storage capacity still holds. It's possible they could keep a small amount of video and then once the car is stopped/charging they could run inference on the video to decide whether to waste the bandwidth uploading it. This seems unlikely to me because I'm not sure what standard they would apply in an automated fashion to decide whether to upload it that they couldn't do in real time -- if the standard is "the human did something really different than I would have done" then they can do that in real time before deciding to save the video in the first place. Also it's a really bad/biased standard for finding training data, IMO.
 
Is it possible to come up with a ballpark estimate of how much labelled training data would be needed to train a network of that size? In terms of miles of video, or hours of video, or frames, or whatever.

Is Tesla using trillions of miles of virtual driving from its simulator? Like, what kind of scale are we talking about here?

It's pretty hard to be confident of anything beyond a plus-or-minus 2 orders of magnitude estimate. So much uncertainty makes the numbers not very useful. But you can treat the public examples as a baseline and then make some guesses about how it scales - which is what I tried to do in the original post. The biggest public examples with good quality data and implementations are on the order of single digit millions of labeled images. If you needed a thousand times that much labeled data you'd be talking billions of images. If you needed a million times that much data you'd be talking trillions of labeled images. That's a big range.
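One way to give those numbers some scale is to convert labeled frames into hours of 8-camera video at 30 fps. This is order-of-magnitude only and assumes every captured frame gets labeled, which obviously wouldn't be the case in practice.

```python
# Order-of-magnitude conversion between "labeled images" and "hours of
# 8-camera fleet video". Frame rate and camera count follow this thread;
# the labeling-density assumption (every frame labeled) is unrealistic
# and only serves to bound the numbers.

fps = 30
cameras = 8
frames_per_hour = fps * cameras * 3600              # ~864,000 camera frames per hour

for labeled_images in (1e6, 1e9, 1e12):             # baseline / 1000x / 1,000,000x
    hours = labeled_images / frames_per_hour
    print(f"{labeled_images:.0e} labeled frames ≈ {hours:,.0f} hours of fleet video")
```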

One way to make labeled data is to have good labeling tools and an army of humans that you employ to label the images. This is the standard method and Karpathy has indicated Tesla is doing this, but the scale isn't known. Is it 100 people or 1000? And how much output can they generate in six months? Another way to make labeled data is to build a simulator that works with procedurally generated or database derived environments and use the simulator to synthesize labeled data. Tesla's job postings have advertised positions for these exact skills and this strongly suggests that they are also doing this. This latter system could potentially generate vast amounts of labeled data, but synthesized data is inferior to hand labeled data - you always need a substantial amount of hand labeled data to serve as a reference baseline.

Finally there are approaches which use unlabeled data and approaches which use a small amount of labeled data to leverage a much larger amount of unlabeled data. These techniques are probably the future but I would guess that right now they are not mature enough for Tesla given the operational approach that their continued use of Inception-V1 implies.
 
OK, sorry, I was being over-sensitive because many many people are under the impression that the NNs in their cars "learn", and it's a bit of a pet peeve of mine. That said, my point about limited storage capacity still holds. It's possible they could keep a small amount of video and then once the car is stopped/charging they could run inference on the video to decide whether to waste the bandwidth uploading it. This seems unlikely to me because I'm not sure what standard they would apply in an automated fashion to decide whether to upload it that they couldn't do in real time -- if the standard is "the human did something really different than I would have done" then they can do that in real time before deciding to save the video in the first place. Also it's a really bad/biased standard for finding training data, IMO.
I'm not the one deciding what data should be collected so I can't know what they need. Often it is that they are missing data on some specific scenario like snow + night time + police car. Or it could be something simple like V9 has low confidence, check if FSD also has low confidence and upload all low confidence data. If you haven't already, then listen to this video:
 
What performance impact would you expect from using various feature detectors besides Inception v1? And how do various object detectors compare to what Tesla has done? I'm not in the ML space at all, so I only have a passing familiarity with most of these things, but what impact would implementing something like YOLO2 (or now v3) have on the overall performance? I understand YOLO isn't the most accurate detection you can get, but I wonder to what degree exacting detection matters versus being able to generally and reasonably detect "vehicle", "person", etc. I'd imagine that the diminishing returns matter much less than performance in a real world situation. Knowing a person is present versus a person on a bicycle doesn't add much value, I'd guess.
...
30FPS at 90MPH is 4.4 feet per frame. It feels to me that you'd want to process a scene a bit faster than that. But if the NN is doing the double-frame that you mentioned, perhaps you don't have to wait 4-and-a-half feet to detect object motion?
...

InceptionV1 is just the structure used for the front end of the network that generates the higher level abstractions in the V8 networks. There's a set of deconvolution layers for generating bounding boxes and, probably, segmentation maps. The YOLO networks used simpler, and easier to train, front ends but the basic YOLO concept, which is to do a single forward pass to generate all your scene labels, is something that Tesla's implementation is also doing. In the AKNET_V9 net the deconvolutions are omitted - possibly moved to another downstream network or possibly deprecated in favor of some other segmentation approach. I haven't seen the insides of the other set of networks in the V9 distribution but I am expecting them to be evolutionary extensions of the V8 approach. We will see (maybe).

As for 30fps being fast enough - that seems to be about the speed that the cameras in HW2 cars are capable of. It works out to a frame spacing of 30ms or so. When you consider that our roads are designed for human reaction time, which is never less than 100ms and generally closer to 500ms, 30ms seems like it's probably ok.
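Spelling that comparison out:

```python
# Frame spacing vs. human reaction time, per the comparison above.
for fps in (3, 30):
    print(f"{fps} fps -> {1000 / fps:.0f} ms between frames")
# 3 fps  -> 333 ms  (comparable to a typical human reaction time)
# 30 fps -> 33 ms   (an order of magnitude faster than ~300-500 ms reactions)
```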