I hope that was useful in proportion to how long it took to read.
Yes it was.
Anyone else patiently waiting for @jimmy_d to provide some additional insight/response as a result of the new information Elon has shared?
Perhaps. However, if you read the article Lambert wrote, the quoted section of Jimmy's post talked about the increased amount of training data compared to V8 and the 5x increase in weights. Against that section, Lambert set Elon's tweet stating a 400% increase in "useful ops/sec due to enabling integrated GPU & better use of discrete GPU."
That tweet doesn't match up with the block of text Lambert attached it to.
Thanks Joe. I think you're right.
There are a few different dimensions to the issue which are easy to get mixed up. One is the size of the network, another is the amount of computation it takes to run the network in the car, and another is what resources (data/time/computation) are required to train the network. All of them grow, but to varying degrees.
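To make the distinction concrete, here's a toy Python sketch (my own illustrative numbers, nothing measured from the car) of how weight count, per-frame compute, and training compute scale for a single conv layer:

```python
# Toy numbers, not Tesla's: one 3x3 conv layer, to show how the
# three quantities scale differently.

def conv_stats(h, w, c_in, c_out, k=3):
    """Parameter count and per-frame multiply-accumulates for one conv layer."""
    params = k * k * c_in * c_out   # network size: independent of image size
    macs = params * h * w           # runtime compute: scales with resolution
    return params, macs

params, macs = conv_stats(h=960, w=1280, c_in=3, c_out=64)
print(f"weights: {params:,}")        # 1,728
print(f"MACs per frame: {macs:,}")   # ~2.1 billion

# Training cost: roughly (forward + backward ~ 3x inference) per example,
# times dataset size, times epochs -- illustrative figures only.
dataset_frames, epochs = 10_000_000, 10
print(f"training MACs: {3 * macs * dataset_frames * epochs:.2e}")
```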
The thing I find most impressive about the network I wrote about is the scale of resources needed to create it - that's the part of my post which strays into hyperbole. Second is the audaciousness of the architecture - it's groundbreaking. The runtime resource requirement is also impressive, but it's the least shocking of the three. We know HW3 is coming and that it's a lot more powerful. This has to be because Tesla intends to run more resource-intensive networks, so it's no surprise that the networks are getting more resource-intensive.
Elon has specifically called out that the central improvement for HW3 is a custom NN processor, so it's certainly the NN processing which is principally lacking in HW2/2.5. And there's this - running the same realtime control NN 10x faster doesn't get you much of an improvement in performance. If you need 10x faster hardware it's because you want to run a categorically more powerful network. That network might well look like the one I wrote about.
But there's probably a more basic misunderstanding here too - I think Elon is probably talking about a different network than the one I was talking about (as detailed in my previous comment above).
Not quite. Great analysis by Jimmy - I mean top notch. But I was going to come in here like a gunslinger to burst everyone's bubbles. Elon sorta did it for me, though.
You see, it's not the size of the model that matters, it's the accuracy and efficiency!
Tesla's V9 reminds me of a brute-force approach. In comparison, EyeQ3 models ran on 0.25 TFLOPS - completely unheard of. Today phones easily come with 3 TFLOPS ASIC chips (for example, the Pixel 2 and 3).
The EyeQ4 is only 2.5 TFLOPS and yet handles all 8 cameras, plus radar and lidar, and still has loads of headroom.
It's simply astonishing what an NN architecture Mobileye were able to create and the efficiency it runs at.
Amnon has been very vocal that his approach is different from the industry's brute-force approach requiring 50 TFLOPS+ chips.
Although he will gladly comply and build and market more powerful chips for those who want them.
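For a sense of scale, here's a quick back-of-envelope in Python using the EyeQ4 figures above; the 30 fps rate and the even split across cameras are my assumptions, not Mobileye's specs:

```python
# Per-camera, per-frame compute budget implied by the EyeQ4 numbers above.
# The 30 fps rate and even split across cameras are assumptions.
tflops = 2.5
cameras, fps = 8, 30
gflops_per_camera_frame = tflops * 1e3 / (cameras * fps)
print(f"~{gflops_per_camera_frame:.1f} GFLOPs per camera per frame")  # ~10.4
```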
One thing that really nags at me about that monster network is that it seems to be *too big* to be able to run on HW2 (or HW2.5). My best guess is that it can only run at maybe 3fps. I think that’s not fast enough to be usable. But on 10x faster HW3 that would be 30fps, which sounds just right.
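If you want to play with that estimate, here's the arithmetic in Python; the per-camera cost and HW2 throughput are placeholder values I picked so the output reproduces the 3 fps / 30 fps guess, not measured numbers:

```python
# Back-of-envelope for the "3 fps on HW2, 30 fps on HW3" guess.
# All inputs are placeholder assumptions, not measured values.
per_camera_gmacs = 35.0                # assumed per-frame cost, front/back cameras
total_gmacs = per_camera_gmacs * (4 + 4 * 0.25)  # side cameras at 1/4 the pixels

hw2_usable_gmacs_per_s = 500.0         # assumed usable throughput on HW2/2.5
hw3_usable_gmacs_per_s = 10 * hw2_usable_gmacs_per_s  # "10x faster" HW3

print(f"HW2: {hw2_usable_gmacs_per_s / total_gmacs:.1f} fps")  # ~2.9
print(f"HW3: {hw3_usable_gmacs_per_s / total_gmacs:.1f} fps")  # ~28.6
```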
Reminds me of Betamax vs. VHS.
Maybe the chip is not run in real time, but just postprocesses previous data when there is spare computational power, like when the car is charging. Instead they might have one network to see which data might be of interest for them to test on the FSD network - for example, if the V9 network says brake and the customer in control keeps driving. Then they record this video sequence, test the FSD network on it, and see if it disagrees with the customer too; if so, it uploads the video sequence for training of the network.
Nope, they do not do the learning portion of machine learning on-vehicle. The only thing that happens on-vehicle is the real time inference, which needs to happen in real time. The cars have very limited ability to store the full video feed; they can only store very short snippets which can then be uploaded to the mothership for offline learning/training of future networks. If they tried to spool all the video to disk to analyze when the car was charging it would very quickly fill the disk.
In my hypothetical scenario there was no learning in the car, only inference. Inference would pretty much just be "upload video to cloud or not". Then in the cloud a human labeler would look at the video, see if it was a relevant scenario, decide what the correct label should be - for example "not car" where there is a metal bar in the street the radar thought was a car - and then finally the learning would happen in the cloud.
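Something like this is all the on-car logic my scenario needs; the names and thresholds below are invented for illustration, not anyone's actual code:

```python
# A sketch of the disagreement trigger described above. Field names and
# the threshold are hypothetical, purely for illustration.
from dataclasses import dataclass

@dataclass
class FrameResult:
    nn_brake_prob: float   # network's confidence that braking is needed
    driver_braked: bool    # what the human at the wheel actually did

def should_upload(frames: list[FrameResult], thresh: float = 0.9) -> bool:
    """Flag a clip where the network strongly wanted to brake but the driver didn't."""
    return any(f.nn_brake_prob > thresh and not f.driver_braked for f in frames)

clip = [FrameResult(0.95, False), FrameResult(0.40, False)]
if should_upload(clip):
    print("queue clip for upload; a human labeler assigns ground truth in the cloud")
```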
NN Changes in V9 (2018.39.7)
The basic camera NN (neural network) arrangement is an Inception V1 type CNN with L1/L2/L3ab/L4abcdefg layer arrangement (architecturally similar to V8 main/narrow camera up to end of inception blocks but much larger)
The V9 network takes 1280x960 images with 3 color channels and 2 frames per camera from, for example, the main camera. That’s 1280x960x3x2 as an input, or 7.3M. The V8 main camera was 640x416x2 or 0.5M - 13x less data.
- about 5x as many weights as comparable portion of V8 net
- about 18x as much processing per camera (front/back)
For perspective, the V9 camera network is 10x larger and requires 200x more computation when compared to Google's Inception V1 network, from which V9 gets its underlying architectural concept. That's processing *per camera* for the 4 front and back cameras. Side cameras are 1/4 the processing due to having 1/4 as many total pixels. With all 8 cameras being processed in this fashion it's likely that V9 is straining the compute capability of the APE. The V8 network, by comparison, probably had lots of margin.
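To make those input numbers concrete, here's a minimal PyTorch sketch; the stem layer is just GoogLeNet's standard first conv, included purely for illustration and not the actual V9 architecture:

```python
import torch
import torch.nn as nn

# 1280x960 images, 3 color channels, 2 frames per camera, stacked on channels
frames = torch.randn(1, 2, 3, 960, 1280)   # (batch, frames, channels, H, W)
x = frames.reshape(1, 2 * 3, 960, 1280)    # 6-channel input per camera
print(x.numel())                           # 7,372,800 -- the "7.3M" figure

# Inception-V1-style stem: a 7x7 stride-2 conv, as in GoogLeNet's first layer
stem = nn.Conv2d(in_channels=6, out_channels=64, kernel_size=7, stride=2, padding=3)
print(stem(x).shape)                       # torch.Size([1, 64, 480, 640])
```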
Is it possible to come up with a ballpark estimate of how much labelled training data would be needed to train a network of that size? In terms of miles of video, or hours of video, or frames, or whatever.
Is Tesla using trillions of miles of virtual driving from its simulator? Like, what kind of scale are we talking about here?
I'm not the one deciding what data should be collected, so I can't know what they need. Often they are missing data on some specific scenario, like snow + night time + police car. Or it could be something simple: V9 has low confidence, check if FSD also has low confidence, and upload all low-confidence data. If you haven't already, then listen to this video:

OK, sorry, I was being over-sensitive, because many, many people are under the impression that the NNs in their cars "learn", and it's a bit of a pet peeve of mine. That said, my point about limited storage capacity still holds. It's possible they could keep a small amount of video and then, once the car is stopped/charging, run inference on it to decide whether to spend the bandwidth uploading it. This seems unlikely to me because I'm not sure what standard they could apply in an automated fashion to decide whether to upload that they couldn't apply in real time -- if the standard is "the human did something really different than I would have done" then they can check that in real time, before deciding to save the video in the first place. Also it's a really bad/biased standard for finding training data, IMO.
What performance impact would you expect from using various feature detectors besides Inception V1? And how do various object detectors compare to what Tesla has done? I'm not in the ML space at all, so I only have a passing familiarity with most of these things, but what impact would implementing something like YOLOv2 (or now v3) have on the overall performance? I understand YOLO isn't the most accurate detection you can get, but I wonder to what degree exacting detection matters versus being able to generally and reasonably detect "vehicle", "person", etc. I'd imagine that the diminishing returns matter much less than performance in a real-world situation. Knowing a person is present versus a person on a bicycle doesn't add much value, I'd guess.
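I can't speak to Tesla's actual net, but a cheap way to get a feel for relative backbone cost is to time a stock GoogLeNet (Inception V1) against a heavier backbone on the same camera-sized input; absolute numbers will vary wildly by machine:

```python
# Rough relative-cost comparison of two stock torchvision backbones.
# This is a feel-for-scale benchmark, not a statement about Tesla's network.
import time
import torch
import torchvision.models as models

x = torch.randn(1, 3, 960, 1280)   # one camera-sized frame

for name, net in [("googlenet", models.googlenet(weights=None)),
                  ("resnet50", models.resnet50(weights=None))]:
    net.eval()
    with torch.no_grad():
        net(x)                               # warm-up pass
        t0 = time.perf_counter()
        for _ in range(10):
            net(x)
        ms = (time.perf_counter() - t0) / 10 * 1000
    print(f"{name}: {ms:.1f} ms/frame on CPU")
```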
...
30 FPS at 90 MPH is 4.4 feet per frame. It feels to me that you'd want to process a scene a bit faster than that. But if the NN is doing the double-frame input that you mentioned, perhaps you don't have to wait four and a half feet to detect object motion?
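The arithmetic, for anyone who wants to vary the numbers:

```python
# Distance traveled between processed frames at a given speed and frame rate.
def feet_per_frame(mph: float, fps: float) -> float:
    return mph * 5280 / 3600 / fps   # mph -> ft/s, then per frame

print(feet_per_frame(90, 30))   # 4.4 ft of travel between frames
print(feet_per_frame(90, 3))    # 44 ft -- why ~3 fps on HW2 feels unusable
```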
...