Discussion in 'Autopilot & Autonomous/FSD' started by lunitiks, Nov 5, 2017.
Yes it was.
お待たせしました！ (Japanese for "sorry to have kept you waiting!", or, loosely, "you rang?")
Wow, was not expecting anything out of the AP development black box. I guess Electrek articles get attention.
Elon's comment:
“To be clear, actual NN improvement is significantly overestimated in this article. V9.0 vs V8.1 is more like a ~400% increase in useful ops/sec due to enabling integrated GPU & better use of discrete GPU.”
My interpretation here is that he's saying the computation requirement of the network used to drive on V9 is 4x more than what was required on V8.1. The network I wrote about might not be the one that's currently driving the car.
We’ve known for a while that there are multiple NN’s distributed in firmware releases but we don’t know which of the installed networks is actually doing the driving. In 2018.12, for example, there were networks for main/narrow, fisheye, and repeater in one location and in another location there were also networks for fisheye, repeater, narrow, main, and pillar. So why are there redundant NNs? Obviously some of them are being used to drive the consumer version of the car. The others could possibly be for dev versions of the car, or they could be distributed to the fleet for testing, or they could be leftovers from some other work and not used at all.
The V9 network I wrote about in that long post that electrek picked up is definitely in V9 and, because it includes a metadata file for the network, I’m pretty confident about my assertions regarding that particular network. But V9 also includes other networks that don’t have metadata and are harder to analyze for that reason. I took a quick look at them yesterday and they are quite different than the gigantic, unified, high resolution, dual frame, camera agnostic network that I wrote about. From Elon’s comment I take it that these other networks, which are evolutionary rather than revolutionary, are what is actually driving the consumer version of the car for V9.
One thing that really nags at me about that monster network is that it seems to be *too big* to be able to run on HW2 (or HW2.5). My best guess is that it can only run at maybe 3fps. I think that’s not fast enough to be usable. But on 10x faster HW3 that would be 30fps, which sounds just right.
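The 3 fps vs 30 fps point above is just linear throughput scaling. A quick sketch of the arithmetic; the ops figures are illustrative assumptions, not measured values:

```python
# Back-of-envelope: if a network costs N ops per frame and the hardware
# sustains R ops/sec, the achievable frame rate is R / N.
# Both numbers below are hypothetical, chosen to match the ~3 fps guess.

def achievable_fps(ops_per_frame, ops_per_sec):
    """Frames per second a given hardware budget can sustain."""
    return ops_per_sec / ops_per_frame

HW25_OPS = 3e12           # assumed sustained ops/sec on HW2.5 (illustrative)
NET_OPS_PER_FRAME = 1e12  # assumed cost of the big AKNET_V9-style network

fps_hw25 = achievable_fps(NET_OPS_PER_FRAME, HW25_OPS)
fps_hw3 = achievable_fps(NET_OPS_PER_FRAME, 10 * HW25_OPS)  # "10x faster HW3"

print(fps_hw25)  # 3.0  -> probably too slow to drive on
print(fps_hw3)   # 30.0 -> a plausible real-time rate
```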
Thanks Joe. I think you're right.
There are a few different dimensions to the issue which are easy to get mixed up. One is the size of the network, another is the amount of computation it takes to run the network in the car, and another is what resources (data/time/computation) are required to train the network. All of them grow, but to varying degrees.
The thing I find most impressive about the network I wrote about is the scale of resources needed to create it - that's the part of my post which strays into hyperbole. Second is the audaciousness of the architecture - it's ground breaking. The runtime resource requirement is also impressive, but it's the least shocking of the three. We know HW3 is coming and that it's a lot more powerful. This has to be because Tesla intends to run more resource intensive networks, so it's no surprise that the networks are getting more resource intensive.
Elon has specifically called out that the central improvement for HW3 is a custom NN processor, so it's certainly the NN processing which is principally lacking in HW2/2.5. And there's this - running the same realtime control NN 10x faster doesn't get you much of an improvement in performance. If you need 10x faster hardware it's because you want to run a categorically more powerful network. That network might well look like the one I wrote about.
But there's probably a more basic misunderstanding here too - I think Elon is probably talking about a different network than the one I was talking about (as detailed in my previous comment above).
I am surprised no one has yet connected your discovery with what Elon said earlier about testing two NNs, where the simpler one was doing well; that is probably what V9 is.
Ok, this makes a lot of sense! If the v9 network you've been looking at has 5x the number of weights as the corresponding v8 network, and it's processing 13x as much input data, that's probably a 60x to 70x increase in the amount of compute you'd need to run it. Elon's saying they squeezed out a 4x performance improvement from the current hardware, which is super impressive, but not nearly enough to run this new network in real time. Definitely seems like it's targeted for their custom NN chip, but I guess they're running it in shadow mode on the current hardware to start testing everything out, even if it can only run at 1fps right now.
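The 60x-70x figure follows from treating compute as roughly proportional to (weight count) x (input size). A quick sanity check of that multiplication, using the ratios quoted above (this is a crude proportionality model, not a FLOP count):

```python
# Rough scaling model: compute_ratio ~ weight_ratio * input_ratio.
weight_ratio = 5   # v9 net has ~5x the weights of the v8 net (from the post)
input_ratio = 13   # ~13x as much input data per forward pass

compute_ratio = weight_ratio * input_ratio
print(compute_ratio)  # 65 -> lands in the quoted 60x-70x range

# Elon's claimed software speedup on the existing hardware:
hw_speedup = 4
print(compute_ratio / hw_speedup)  # 16.25 -> still far short of real time
```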
Reminds me of Betamax vs VHS.
If it's processing the repeater and pillar cameras at only ~3fps that would explain a lot of the behavior I've seen. It's really sort of noticeably slow to react sometimes -- not something you want in a blind spot monitor honestly, especially not if you're going to do unassisted lane changes.
Call me pessimistic, but I think they're going to need HW3 to deliver on the full suite of EAP features, never mind FSD.
So make sure the car could deliver high-quality porn? Tesla's massive screen: check.
Is it possible to come up with a ballpark estimate of how much labelled training data would be needed to train a network of that size? In terms of miles of video, or hours of video, or frames, or whatever.
Is Tesla using trillions of miles of virtual driving from its simulator? Like, what kind of scale are we talking about here?
Maybe the chip is not run in real time, but just used to post-process previous data when there is extra computational power, like when the car is charging. Instead they might have one network to see which data might be of interest for them to test on the FSD network, for example if the V9 network says brake but the customer in control keeps driving. Then they record this video sequence, test the FSD network on it, and see if it also disagrees with the customer; if so, it uploads the video sequence for training of the network.
Nope, they do not do the learning portion of machine learning on-vehicle. The only thing that happens on-vehicle is the real time inference, which needs to happen in real time. The cars have very limited ability to store the full video feed; they can only store very short snippets which can then be uploaded to the mothership for offline learning/training of future networks. If they tried to spool all the video to disk to analyze when the car was charging it would very quickly fill the disk.
That's an interesting idea.
That's a good point. The vehicle is certainly storage constrained and storing 8 cameras worth of raw video is not trivial. That said it's not impossible that AKNET_V9 could be run on small samples (even 15 seconds might be useful if you select carefully and do it at scale with large numbers of vehicles).
I kind of doubt they are doing this but it's worth keeping in mind.
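The storage constraint is easy to make concrete. Assuming uncompressed 1280x960 8-bit frames at 30 fps (these camera figures are assumptions for illustration, not Tesla specs), even short 8-camera snippets add up fast:

```python
# Rough storage cost of buffering raw video from all eight cameras.
# Resolution, bit depth, and frame rate are illustrative assumptions.
WIDTH, HEIGHT = 1280, 960   # assumed per-camera resolution
BYTES_PER_PIXEL = 1         # assumed raw 8-bit capture
FPS = 30
CAMERAS = 8

bytes_per_sec = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS * CAMERAS
snippet_15s_gb = bytes_per_sec * 15 / 1e9
hour_gb = bytes_per_sec * 3600 / 1e9

print(round(bytes_per_sec / 1e6))  # ~295 MB/s across all cameras
print(round(snippet_15s_gb, 1))   # ~4.4 GB for a single raw 15 s clip
print(round(hour_gb))             # ~1062 GB/hour: spooling raw video is untenable
```

Compression changes the constants a lot, but not the conclusion: only short, carefully selected snippets are practical.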
In my hypothetical scenario there was no learning in the car, only inference. Inference would pretty much just be "upload video to cloud or not". Then in the cloud a human labeler would look at the video, see if it was a relevant scenario, and decide what the correct label should be, for example "not car" where there is a metal bar in the street that the radar thought was a car. Finally, the learning would happen in the cloud.
What performance impact would you expect from using various feature detectors besides Inception v1? And how do various object detectors compare to what Tesla has done? I'm not in the ML space at all, so I only have a passing familiarity with most of these things, but what impact would implementing something like YOLOv2 (or now v3) have on the overall performance? I understand YOLO isn't the most accurate detector you can get, but I wonder to what degree exact detection matters versus being able to generally and reasonably detect "vehicle", "person", etc. I'd imagine the diminishing returns matter much less than real-world performance. Knowing a person is present versus a person on a bicycle doesn't add much value, I'd guess.
Anyway, yeah, I wonder what impact various algorithms would have, given nobody wants to be running the bleeding edge in a vehicle you may have to support for a decade or more.
30FPS at 90MPH is 4.4 feet per frame. It feels to me that you'd want to process a scene a bit faster than that. But if the NN is doing the double-frame that you mentioned, perhaps you don't have to wait 4-and-a-half feet to detect object motion?
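The 4.4-feet-per-frame figure checks out, and it's worth seeing how it degrades at the ~3 fps guess from earlier in the thread:

```python
# Distance traveled between consecutive frames at a given speed and frame rate.
MPH_TO_FT_PER_SEC = 5280 / 3600  # feet per second per mph

def feet_per_frame(speed_mph, frame_rate):
    """How far the car moves between two frames."""
    return speed_mph * MPH_TO_FT_PER_SEC / frame_rate

print(round(feet_per_frame(90, 30), 1))  # 4.4 ft between frames at 30 fps
print(round(feet_per_frame(90, 3), 1))   # 44.0 ft at the ~3 fps guess above
```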
Extra power? The vehicle is sitting on top of the equivalent of three days worth of summer time electricity usage for my entire house. The batteries contain so much energy that people use them to store solar so they can operate every appliance in their house. Plugging in to a home charger charges at about the same speed the car drives through a town. This is all to say that these batteries hold an amazing amount of electrical energy, and that even if post-processing of data consumed 700W (we know it doesn't, but let's say it did), that's still only 2-3 miles of range per hour of processing.
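The 2-3 miles figure is easy to sanity check. Assuming roughly 250 Wh per mile of driving consumption (a typical ballpark, not an official number):

```python
# Energy cost of hypothetical on-board post-processing, in miles of range.
PROCESSING_W = 700   # assumed worst-case compute draw in watts (hypothetical)
WH_PER_MILE = 250    # assumed average driving consumption, ballpark figure

range_cost_per_hour = PROCESSING_W / WH_PER_MILE  # miles of range per hour
print(round(range_cost_per_hour, 1))  # 2.8 -> matches the "2-3 miles" above
```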
In either case, Tesla has previously confirmed that they don't do processing on the vehicles right now. They could in the future maybe, by using the fleet as a massive distributed compute cluster, but as of right now they buy compute from AWS at least. This was also independently confirmed when their AWS account was compromised and was being used to mine cryptocurrency.
OK, sorry, I was being over-sensitive because many many people are under the impression that the NNs in their cars "learn", and it's a bit of a pet peeve of mine. That said, my point about limited storage capacity still holds. It's possible they could keep a small amount of video and then once the car is stopped/charging they could run inference on the video to decide whether to waste the bandwidth uploading it. This seems unlikely to me because I'm not sure what standard they would apply in an automated fashion to decide whether to upload it that they couldn't do in real time -- if the standard is "the human did something really different than I would have done" then they can do that in real time before deciding to save the video in the first place. Also it's a really bad/biased standard for finding training data, IMO.
It's pretty hard to be confident of anything beyond a plus-or-minus 2 orders of magnitude estimate. So much uncertainty makes the numbers not very useful. But you can treat the public examples as a baseline and then make some guesses about how it scales - which is what I tried to do in the original post. The biggest public examples with good quality data and implementations are on the order of single digit millions of labeled images. If you needed a thousand times that much labeled data you'd be talking billions of images. If you needed a million times that much data you'd be talking trillions of labeled images. That's a big range.
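The width of that uncertainty band is the whole point. Taking "single digit millions" as the baseline, the two scaling guesses bound the range:

```python
# Order-of-magnitude bounds on labeled-data needs, anchored to the size of
# the largest public labeled datasets. The baseline is an assumption.
baseline_images = 5e6          # "single digit millions" of labeled images

low_estimate = baseline_images * 1e3   # if ~1000x the baseline is needed
high_estimate = baseline_images * 1e6  # if ~1,000,000x the baseline is needed

print(f"{low_estimate:.0e}")   # 5e+09 -> billions of images
print(f"{high_estimate:.0e}")  # 5e+12 -> trillions of images
```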
One way to make labeled data is to have good labeling tools and an army of humans that you employ to label the images. This is the standard method and Karpathy has indicated Tesla is doing this, but the scale isn't known. Is it 100 people or 1000? And how much output can they generate in six months? Another way to make labeled data is to build a simulator that works with procedurally generated or database derived environments and use the simulator to synthesize labeled data. Tesla's job postings have advertised positions for these exact skills and this strongly suggests that they are also doing this. This latter system could potentially generate vast amounts of labeled data, but synthesized data is inferior to hand labeled data - you always need a substantial amount of hand labeled data to serve as a reference baseline.
Finally there are approaches which use unlabeled data and approaches which use a small amount of labeled data to leverage a much larger amount of unlabeled data. These techniques are probably the future but I would guess that right now they are not mature enough for Tesla given the operational approach that their continued use of Inception-V1 implies.
I'm not the one deciding what data should be collected, so I can't know what they need. Often they are missing data on some specific scenario, like snow + night time + police car. Or it could be something simple: V9 has low confidence, so check whether FSD also has low confidence and upload all low-confidence data. If you haven't already, watch this video:
InceptionV1 is just the structure used for the front end of the network that generates the higher level abstractions in the V8 networks. There's a set of deconvolution layers for generating bounding boxes and, probably, segmentation maps. The YOLO networks used simpler, and easier to train, front ends but the basic YOLO concept, which is to do a single forward pass to generate all your scene labels, is something that Tesla's implementation is also doing. In the AKNET_V9 net the deconvolutions are omitted - possibly moved to another downstream network or possibly deprecated in favor of some other segmentation approach. I haven't seen the insides of the other set of networks in the V9 distribution but I am expecting them to be evolutionary extensions of the V8 approach. We will see (maybe).
As for 30fps being fast enough - that seems to be about the speed that the cameras in HW2 cars are capable of. It works out to a frame spacing of 30ms or so. When you consider that our roads are designed for human reaction time, which is never less than 100ms and generally closer to 500ms, 30ms seems like it's probably ok.
How big is the actual NN binary blob in V9?