
HW2.5 capabilities

But... and I may be missing something here... if it were me, I wouldn't just apply the same scale of image recognition to 8 different images...

You'd stitch the images together first to create the panorama, then do object recognition and path planning on a single (wide/spherical) image, and generate the surrounding localisation map from that? Seems easier than trying to do it on 8 separate image feeds, then combine those outputs later...

Stitching it all together with the correct skews for the various angles and focal lengths and focal point offsets is a substantial amount of processing effort, and I'm not seeing how you gain anything from it - especially since for normal driving there are a lot of things that are really important to find in the front images that you don't need to look for in the others - traffic signs, speed limits, traffic lights, lane lines, etc.
 
So if we cautiously assume the current network handles data at 60 fps (2 cameras at 30 fps each), and we'll get 2 more NNs of similar complexity at another 60 fps each (2x repeater + 2x pillar) for 180 fps in total, plus a hopefully simpler wide-camera NN, that just leaves the backup camera, which has a totally different picture pattern but, on the other hand, would not need to be used all the time.
It looks like there's a chance overall performance would be about where it needs to be, unless they drastically redo their NNs to make them much heavier, I imagine.
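For the curious, here's that frame budget as a few lines of back-of-envelope Python. The per-camera rate and which cameras share a network are just the assumptions from the paragraph above, not anything measured from the car.

```python
# Back-of-envelope frame budget; all numbers are assumptions from the post above.
fps_per_camera = 30

current_nn  = 2 * fps_per_camera   # main + narrow, already running: 60 fps
repeater_nn = 2 * fps_per_camera   # 2 repeater cameras: 60 fps
pillar_nn   = 2 * fps_per_camera   # 2 B-pillar cameras: 60 fps
wide_nn     = 1 * fps_per_camera   # fisheye, hopefully a lighter network
backup_cam  = 0                    # different image pattern, only needed intermittently

total = current_nn + repeater_nn + pillar_nn + wide_nn + backup_cam
print(total)  # 210 frames/second, of which 180 go through the heavier networks
```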

I think the hardware isn't underpowered at all.

So one very important thing this analysis leaves out is that you're basing your capacity estimate on the current system. Which, as previously discussed, is feeding very low-resolution images. As @jimmy_d rightly pointed out, NNs often perform quite well on what would seem to people to be very low-res images, but I also need to point out that the current system (which emulates Mobileye) clearly is looking at only what's pretty much immediately in front of the car, by which I mean objects at rather short range. You can tell when driving this thing that it can't see clearly very far away. If you want long-distance vision (and believe me, for L3+ you need long-distance vision), you need higher resolution. This is what the long camera is for, after all. From what I can tell from the info on this thread combined with my own experience with AP2, not only is it downsampling the resolution but it's also cropping the image to a narrow window directly in front of the car.

So fixing all of that, required for L3+ (or even just bearable, smooth, reliable L2), is going to dramatically increase the required GPU capacity.
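To put some rough numbers on the distant-object problem, here's an illustrative sketch of how many pixels a distant car subtends before and after downsampling. The field of view, image widths, and car size are placeholder values picked for the example, not Tesla's actual camera specs.

```python
import math

# Illustrative only: pixels on a distant car for a pinhole-camera model.
def pixels_on_target(object_width_m, distance_m, image_width_px, hfov_deg):
    focal_px = image_width_px / (2 * math.tan(math.radians(hfov_deg) / 2))
    return object_width_m * focal_px / distance_m

car_width = 1.8   # meters, typical passenger car
hfov = 50         # degrees, hypothetical "main" camera field of view

for width_px in (1280, 416):       # native-ish vs heavily downsampled input
    for dist in (50, 150):
        px = pixels_on_target(car_width, dist, width_px, hfov)
        print(f"{width_px}px-wide image, car at {dist} m: ~{px:.1f} px across")
```

At the downsampled width a car 150 m away covers only a handful of pixels, which is the crux of the long-range concern.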

I haven't seen a network that is known to be able to do FSD so it's only speculation on my part, but I'd expect this chip to be able to do it if the FSD algorithm is decently mature.

Wow. Let me rephrase that for you: "Nobody has ever done anything like this before, despite trying for 30 years, and nobody has even developed algorithms capable of it. Notwithstanding this, I am confident that I know how much computing power is required, based on a cursory analysis of a primitive system that as I'm about to point out is not even of the correct architecture to do the job and looks like it was put together by an intern copying and pasting from Stack Overflow." You're clearly a smart guy who knows more than a little about CV and ML, but this is wild speculation.

I mean seriously, it's like an intern did the network architecture for this. That's a big part of why I think they must be working on something else. Because this sure doesn't feel like the product of a world class team with tons of resources.

On this I agree.

I was really not expecting to see something like this. Which is kind of what makes it interesting. I mean, I spend A LOT of time studying DL networks on the cutting edge of research and I was excited to get this thing because, hey, what kind of amazing stuff must Tesla have under the hood? Elon founded OpenAI and those guys are friggin' rock stars. Karpathy came straight over from that group. I was expecting... not this.

They are slammed, under-resourced, and being asked to meet impossible deadlines for nearly-impossible tasks with inadequate supporting infrastructure. This is a complex system involving way more than just CV and ML algorithms -- a real-life software/hardware system requires a huge amount of more boring software infrastructure that they are probably also way behind on -- just the infrastructure to handle the large amounts of training data they're supposedly collecting from the fleet requires a solid team of software engineers and several months. This isn't an academic exercise; this is the real world.

But there's this one thing I can say: the hard part of this network is a total cut and paste from the single most popular demo network out there today. Google includes it with their free tensorflow framework when you download it. This is the demo network that launched a thousand "deep learning 101" class projects.

Which makes it not the hard part, as others have already alluded to. Perception is just the beginning, and of the very hard problems in the autonomy space, it's the closest to being "easy" given ample modern hardware (and, alas, Tesla does not have ample hardware...).
 
My mind is completely blown. What the heck have they been working on?? It hasn't been building a network able to distinguish motorcycles from cars, because they downloaded that network from the internet!

The Band-Aid theory that this is some fill-in for ME until the "Real EAP / AP 2.0 / FSD" is ready just doesn't sit right with me. It's been a year. If they have a hundred people working on autopilot and this NN represents one summer intern's worth of work, how can the other 99.7 people's worth of work have yielded absolutely no demos or development at all? I guess it is possible that's exactly what's happening, but it sure seems like a long shot.

FWIW, I'm on 2017.40.1 and I think it's good enough for highway driving. I feel like they're finally back to 17.17.4 reliability. o_O
 
All this is technical, but I have a very simple question: is the lower resolution used during processing the reason the car won't detect stopped cars at a red light when you come in too fast, because the car is too small for the pixel resolution?

Is it also why it gets confused by old highway lines vs new ones?

How could we know they processed high-res images to accurately do the FSD demo and actually see a stop sign from a safe distance away?
 
Isn't it possible, actually plausible, that the NN deployed and evaluated by us is a placeholder legacy version? I mean, it seems obvious to me that this is the case, that the future EAP/FSD stuff hasn't been deployed and thus we can't infer or validate its existence (nor can any competitors!). Yes, this is bad for the implied benefits of a "shadow mode" but I don't think it's necessarily bad for the suspected advantage of Tesla's fleet learning. Our cars could be collecting data and shipping it home where it is used to train the model(s) we've yet to see. That wouldn't require a NN at all, just the sensors/cameras and a data feed. It would also provide excellent data for the simulator we keep hearing about...

As an aside, and I admit I am not certain of the relevance, but I once interviewed one of the guys that built the original Forza Motorsport. He told me an interesting story. Up until that time all racing games had an AI "line" which the non-player cars would use to get around each track. So, when they built Forza they applied this strategy and built an ideal "line" for every track in the game. Because of the Xbox's increased compute power the game had far better physics - each car was its own model and behaved its own way - a major selling point when they pitched the game for funding. The physics were so good (relatively speaking) that the classic, generalized line that was used for the car to follow simply didn't work. Cars would go careening off the track or into walls. Amazing, right!? Thus, they had to rethink the car AI. They subsequently took each car and had to "train" it to drive around each track, each car producing its very own line for each track based on the conditions (wet, dry) and modifications (tires, suspension, engine, etc). Cool stuff. (And probably not all that different from Tesla's simulator.)

There was a post earlier in this thread about Tesla's pathing being algorithmic, with the NN (CNN, DLNN, whatever) building the world "model". That comment got me thinking about the Forza example. It also got me thinking about the recent air suspension woes in 2017.41 - it may answer why they are messing with air suspension in the first place; obviously suspension performance is a major input to the capabilities of the car, which matters in algorithmic pathing. You need to know exactly how the car is going to behave or it might go flying off the track. :)

Also, if an Xbox 360 can model these physics (for all the cars in a race at once, in real time), smartphones today probably can (Comma.ai), and AP2 or AP2.5 absolutely can.
 
All this is technical, but I have a very simple question: is the lower resolution used during processing the reason the car won't detect stopped cars at a red light when you come in too fast, because the car is too small for the pixel resolution?

Is it also why it gets confused by old highway lines vs new ones?

How could we know they processed high-res images to accurately do the FSD demo and actually see a stop sign from a safe distance away?

I am no expert but the resolution issue is actually not intuitive to us, which I think is the source of your question. Recent neural networks essentially break images down into lower resolutions (or other reductions like color to greyscale) and learn different things from those augmented images. To us they look funny or undetailed, but to the computer they, along with a number of other augmentations, collectively provide information that allows for accurate classification.

So, the answer (I think) is no, this is not why those things are happening.

There are probably better examples, but here are a couple of illustrations that show how CNNs break images down to learn:
https://i.ytimg.com/vi/7RIrsbu9yvc/maxresdefault.jpg
https://i.stack.imgur.com/Hl2H6.png
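If it helps, here's a toy sketch of the usual pattern those illustrations show: each stride-2 stage of a CNN halves the spatial resolution while the channel count grows. The layer sizes are generic examples, not anything from Tesla's network.

```python
# Minimal sketch of spatial resolution vs depth in a typical CNN (not Tesla's).
def conv_stage(h, w, c_out, stride=2):
    # each stage halves height/width and produces c_out feature channels
    return h // stride, w // stride, c_out

h, w, c = 416, 640, 3             # hypothetical downsampled camera frame
for c_out in (32, 64, 128, 256, 512):
    h, w, c = conv_stage(h, w, c_out)
    print(f"{h:>4} x {w:>4} x {c:<4} feature map")
```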
 
I would think this would be likely if it wasn't for the fact that it just changed in 2017.40. Maybe it's still legacy, just ... less legacy.

Supporting legacy software sometimes requires updates. Could be that. Although this might imply that legacy is going to last a bit longer than we expect (or want). Could also be them testing the deployment framework in anticipation of pushing out the real thing. We can't know for sure.
 
But... and I may be missing something here... if it were me, I wouldn't just apply the same scale of image recognition to 8 different image feeds...

Surely you'd stitch the images together first to create the surround panorama, then do object recognition and path planning on a single (wide/spherical) image, and generate your localisation and object map from that? We know there's significant overlap on some of the camera feeds, so there'd be pixels you'd be unnecessarily processing twice. Plus, I'd have assumed it's more efficient to do an object detection pass on one image rather than many, since you can make greater use of tracking and tweening etc., and remove objects being detected twice, etc.?

Seems easier than trying to do it on 8 separate image feeds, then combine those outputs later... Probably also easier to do things like pose and surround 3DVD that way too...
This was discussed elsewhere. I doubt any multi-camera implementation is doing stitching. It's useful for human use (for example for parking), but is a waste of processing cycles for the computer. You'd also be introducing stitching artifacts, which don't matter for humans but will trip up the computer.
AP2.0 Cameras: Capabilities and Limitations?
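To make the "don't stitch, just feed the frames separately" idea concrete, here's a rough sketch of batching per-camera frames through one detector pass. The frame shapes and the run_detector stand-in are purely hypothetical, not any real Tesla or NVIDIA API.

```python
import numpy as np

def run_detector(batch):
    # Placeholder: a real network would return boxes/classes per frame.
    return [f"detections for frame {i}" for i in range(batch.shape[0])]

# One dummy frame per camera, kept separate rather than stitched.
cameras = {name: np.zeros((416, 640, 3), dtype=np.uint8)
           for name in ["main", "narrow", "left_repeater", "right_repeater",
                        "left_pillar", "right_pillar", "wide", "backup"]}

batch = np.stack(list(cameras.values()))          # shape (8, 416, 640, 3)
results = dict(zip(cameras.keys(), run_detector(batch)))
print(results["main"])
```

No warping or blending step is needed, and each camera's outputs stay tied to its own geometry.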
 
Because I know things? Or because it's all over the internet / this thread. We have two GPUs but only one Pascal unit. However, today they are only using one of the Parkers; the other is idle.
The threads I read suggest there are two Parker units (keep in mind these have GPUs inside too), but the discrete GPU has not changed (still just one).

So how does that calculate to 2x the GPU capacity?

Just to do the math:
Nvidia said Parker x2 + Pascal dGPU x2 = 8 TFLOPs FP32
Nvidia said 1.3 TFLOPs for each Parker, but other sources say that figure is FP16, at up to 1.5 TFLOPs (which implies 0.75 TFLOPs FP32).
So that implies each Pascal dGPU is responsible for 3.25 TFLOPs FP32.
Inside the NVIDIA PX2 board on my HW2 AP2.0 Model S (with Pics!)
Introducing Parker, NVIDIA’s Newest SOC for Autonomous Vehicles | NVIDIA Blog
NVIDIA Details Next-Gen Tegra Parker SOC at Hot Chips

The GP106 that Tesla uses is quoted at 4 TFLOPs FP32, as a sanity check.
NVIDIA GeForce GTX 1060 Official Specifications and Benchmarks

So by my math, AP2.0 has Parker 0.75x1 + GP106 3.25x1 = 4 TFLOPs FP32.
AP2.5 (from my understanding) has Parker 0.75x2 + GP106 3.25x1 = 4.75 TFLOPs FP32.
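Spelling the same arithmetic out in a few lines (all figures are the NVIDIA marketing numbers quoted above, nothing measured):

```python
# TFLOPS bookkeeping from the quoted NVIDIA figures.
total_px2_fp32 = 8.0               # "Parker x2 + Pascal dGPU x2 = 8 TFLOPs FP32"
parker_fp16    = 1.5               # per Parker, FP16
parker_fp32    = parker_fp16 / 2   # 0.75 TFLOPs FP32 per Parker

dgpu_fp32 = (total_px2_fp32 - 2 * parker_fp32) / 2   # 3.25 TFLOPs per dGPU
ap20 = 1 * parker_fp32 + 1 * dgpu_fp32               # 4.0  TFLOPs FP32
ap25 = 2 * parker_fp32 + 1 * dgpu_fp32               # 4.75 TFLOPs FP32

print(dgpu_fp32, ap20, ap25)
```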
 
jimmy_d said:
I haven't seen a network that is known to be able to do FSD so it's only speculation on my part, but I'd expect this chip to be able to do it if the FSD algorithm is decently mature.

rephrase that for you: "Nobody has ever done anything like this before, despite trying for 30 years, and nobody has even developed algorithms capable of it. Notwithstanding this, I am confident that I know how much computing power is required

It seems to me that your rephrasing is changing the meaning of my statement.

That aside, I cannot fault what you say.

Aside from a small amount of data that I occasionally find, and some simpleminded analysis which is frequently wrong, my comments are entirely speculation.

Predictions are hard, especially about the future.
 
There seems to be an obvious question here that hasn't been asked (that I've seen). How do we know the neural net that is being discussed is really the neural net and not just some random neural net? Do you have hard evidence, jimmy_d, that this net is actually the one running on the hardware and not just some random code? For instance, perhaps it is a system that was thrown together by an intern as a failsafe algorithm to navigate off the road in the event of the main system crashing.

Addressing a couple points here.

Re small resolution: With a camera you have noise and you have resolution. When you downscale an image you're trading resolution for an improved signal-to-noise ratio. By operating on a low-resolution image you're automatically removing a lot of the visual noise, which probably isn't relevant anyway.
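A tiny demo of that resolution-for-SNR trade, assuming independent per-pixel noise; the image size and noise level are made up for illustration.

```python
import numpy as np

# Averaging 2x2 pixel blocks halves independent per-pixel noise (std drops ~2x).
rng = np.random.default_rng(0)
clean = np.full((512, 512), 100.0)
noisy = clean + rng.normal(0, 10, clean.shape)    # sigma = 10 per pixel

down = noisy.reshape(256, 2, 256, 2).mean(axis=(1, 3))   # 2x2 box downsample

print(noisy.std())   # ~10
print(down.std())    # ~5: noise averaged down, at the cost of resolution
```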

Re 8 cameras vs 1: Yes, it's computationally expensive to do a "good" stitch of multiple cameras, but that's not what needs to happen for a neural net; you just need to add the data to the system. For instance, the brain references your rear-view mirror data pretty efficiently even though it's a very low resolution and low FOV source of data compared to the windshield. And when you're driving you don't pick up "eyes in the back of your head" as a perfectly stitched view. Your rear-view data stream is included in your main visual data stream to your brain, but the brain segments it off and extracts meaningful data from it, knowing that "that little blob of vision is separate."

Often it's more efficient to bundle multiple data sources into software as a single data stream, even though they're technically different streams, just so that you're copying/operating on the data once instead of having the overhead of processing 8 different cameras. You could naively scale and crop the other cameras and then slap them onto the side of the main camera.

There is some interesting research on how compound-eye animals see. A fly is not creating a perfect spherical stitched image in its brain the way our brain synthesizes two eyes into one. How objects move between compound-eye views is actually in and of itself interesting to their vision systems.
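Here's a naive sketch of that "scale and crop the other cameras and slap them onto the side of the main camera" idea, so the whole thing moves through memory as one buffer. The resolutions are placeholders and the downscale is a crude pixel stride, just to show the single-buffer layout.

```python
import numpy as np

main = np.zeros((416, 640, 3), dtype=np.uint8)   # hypothetical main camera frame
rear = np.zeros((416, 640, 3), dtype=np.uint8)   # hypothetical backup camera frame

# Naive 4x downscale of the rear view by striding (no proper resampling).
rear_small = rear[::4, ::4]                       # (104, 160, 3)
pad = np.zeros((416 - 104, 160, 3), dtype=np.uint8)
side_strip = np.vstack([rear_small, pad])         # (416, 160, 3)

combined = np.hstack([main, side_strip])          # one (416, 800, 3) buffer
print(combined.shape)
```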
 
There seems to be an obvious question here that hasn't been asked (that I've seen). How do we know the neural net that is being discussed is really the neural net and not just some random neural net? Do you have hard evidence, jimmy_d, that this net is actually the one running on the hardware and not just some random code? For instance, perhaps it is a system that was thrown together by an intern as a failsafe algorithm to navigate off the road in the event of the main system crashing.
There is only one NN at this moment, and we can clearly observe it's the one being run all the time. So given the absence of any other neural nets on the ape at this time, this one must be the one, right?
 
I am no expert but the resolution issue is actually not intuitive to us, which I think is the source of your question. Recent neural networks essentially break images down into lower resolutions (or other reductions like color to greyscale) and learn different things from those augmented images. To us they look funny or undetailed, but to the computer they, along with a number of other augmentations, collectively provide information that allows for accurate classification.

So, the answer (I think) is no, this is not why those things are happening.

There are probably better examples, but here are a couple of illustrations that show how CNNs break images down to learn:
https://i.ytimg.com/vi/7RIrsbu9yvc/maxresdefault.jpg
https://i.stack.imgur.com/Hl2H6.png

The examples you give are talking about the intermediate representations in a deep network (hidden layers). Those can be either smaller or larger than the inputs. But in this case we know that the source input images are quite small. No matter how fancy your network, if that vehicle in the distance is reduced to occupying only part of a single downsampled pixel, the net is not going to recognize it.

(Well there have been some cool demos of single-pixel object recognition, which work because of context, but I think that in the real world you want to get at least a few distinct pixels on each distant object to reliably recognize/differentiate objects.)

To be clear, my concern here is entirely about distant objects. I think the resolution they're using is fine for things relatively close.
 
jimmy_d said:
I haven't seen a network that is known to be able to do FSD so it's only speculation on my part, but I'd expect this chip to be able to do it if the FSD algorithm is decently mature.



It seems to me that your rephrasing is changing the meaning of my statement.

That aside, I cannot fault what you say.

Aside from a small amount of data that I occasionally find, and some simpleminded analysis which is frequently wrong, my comments are entirely speculation.

Predictions are hard, especially about the future.

Yes, you're right, my "rephrasing" was definitely meant to be taken tongue-in-cheek. I should have used some emoji or other to indicate that but frankly I'm too old for that stuff.

Also, get off my lawn.
 
The examples you give are talking about the intermediate representations in a deep network (hidden layers). Those can be either smaller or larger than the inputs. But in this case we know that the source input images are quite small. No matter how fancy your network, if that vehicle in the distance is reduced to occupying only part of a single downsampled pixel, the net is not going to recognize it.

(Well there have been some cool demos of single-pixel object recognition, which work because of context, but I think that in the real world you want to get at least a few distinct pixels on each distant object to reliably recognize/differentiate objects.)

To be clear, my concern here is entirely about distant objects. I think the resolution they're using is fine for things relatively close.
The current resolution makes sense if the current network is intended to be pretty much a drop-in replacement for the old Mobileye unit, which uses lower-resolution cameras.

Outside of this network, is the code largely still the same for AP1 and AP2? That would suggest that this may only be a temporary measure.
 