v11 used the Occupancy Network. v12 E2E, in my understanding, does not. (Although the separate pipeline for on-screen UI visualization may still use it.) And even if the Occupancy Network has a latency of 10ms, there's a lot more latency added getting from Occupancy to Control, and it's the Photons-to-Control latency that counts, not Photons-to-Occupancy.
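To make the arithmetic concrete, here's a toy latency budget. Every number below is made up purely for illustration (including the 10ms occupancy figure); the only point is that the stage latencies add up on the way to control:

```python
# Hypothetical per-stage latencies (ms) in a modular pipeline.
# These are made-up illustrative numbers, not measured Tesla figures.
stages_ms = {
    "camera exposure + readout": 20,
    "image preprocessing":       5,
    "occupancy network":         10,
    "object detection":          10,
    "planning":                  15,
    "control actuation":         10,
}

perception_stages = ("camera exposure + readout",
                     "image preprocessing",
                     "occupancy network")

photons_to_occupancy = sum(v for k, v in stages_ms.items()
                           if k in perception_stages)
photons_to_control = sum(stages_ms.values())

print(f"photons-to-occupancy: {photons_to_occupancy} ms")  # 35 ms
print(f"photons-to-control:   {photons_to_control} ms")    # 70 ms
```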
From this article posted by others, Tesla's V12 E2E very much still uses the previous perception engine (including the occupancy network). They didn't just throw all that work out the window and start a black box from scratch.
Breakdown: How Tesla will transition from Modular to End-To-End Deep Learning
What is different is that planning previously used a combination of deep learning and traditional tree search, and they switched that to full deep learning. Then, instead of perception and planning being completely independent, they joined them together, so that the final actions also affect the perception network during training. This is what makes it "end-to-end". A sketch of what that joint training could look like is below.
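A minimal PyTorch sketch of what "joining them together" could mean in training code (hypothetical module names and shapes; this is not Tesla's actual architecture): the planner consumes the perception features directly, so the action loss backpropagates into the perception weights.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real networks.
perception = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                           nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1),
                           nn.Flatten())           # camera frame -> features
planner = nn.Sequential(nn.Linear(16, 64),
                        nn.ReLU(),
                        nn.Linear(64, 2))          # features -> [steer, accel]

params = list(perception.parameters()) + list(planner.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

frames = torch.randn(8, 3, 128, 128)    # batch of camera frames (dummy data)
expert_actions = torch.randn(8, 2)      # human driver's actions (dummy data)

optimizer.zero_grad()
features = perception(frames)
actions = planner(features)
loss = nn.functional.mse_loss(actions, expert_actions)

loss.backward()    # gradients flow through the planner *into* perception:
optimizer.step()   # this joint update is what makes the stack "end-to-end"
```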
The choice quote:

"But yes, it can seem more of a Black Box, but you can also see how we're still using Occupancy Networks and Hydranets and all of these, we're just assembling the elements together. So, it's a Black Box, but we can also, at any point in time, visualize the output of Occupancy, visualize the output of Object Detection, visualize the output of Planning, etc..."
Of course the above analysis is not necessarily correct, but it makes the most sense of the theories out there, given that there were relatively few regressions and most of the UI was able to remain the same (so Tesla is obviously able to pull intermediate data out from between the networks, just as the article claims).
The E2E network would take in the raw camera feed and the raw lidar/radar feed simultaneously, synced. What it does with that information is up to it, and it will learn by itself how to handle the various processing latencies. I think of the advantages of Lidar as somewhat akin to the human ear, which effectively performs a Fourier transform on incoming audio waveforms (rather than letting the brain's neurons do it, which would be a lot slower and lossier). Lidar/radar also compensate much better in situations where pure vision has fundamental difficulty: poor weather, low lighting, sun glare. The improved signal in those situations in particular might allow for a faster and more reliable reaction, because certainty about the environment can be gained more quickly from a less noisy signal.
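A rough sketch of that kind of early fusion (hypothetical class and shapes; assume the two feeds have already been timestamp-aligned upstream): each modality is encoded separately, the embeddings are concatenated, and training decides on its own how much weight each signal gets.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Toy fused camera + lidar/radar network (illustrative only)."""
    def __init__(self):
        super().__init__()
        # Camera branch: raw frames -> feature vector.
        self.cam_enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Lidar/radar branch: a rasterized range image -> feature vector.
        self.lidar_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, 2)   # fused features -> [steer, accel]

    def forward(self, cam, lidar):
        # Concatenate the two modality embeddings; training determines
        # how much each one matters (e.g. leaning on lidar in sun glare).
        fused = torch.cat([self.cam_enc(cam), self.lidar_enc(lidar)], dim=1)
        return self.head(fused)

net = EarlyFusionNet()
cam = torch.randn(1, 3, 128, 128)    # synced camera frame (dummy)
lidar = torch.randn(1, 1, 64, 256)   # synced lidar range image (dummy)
print(net(cam, lidar).shape)         # torch.Size([1, 2])
```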
As above, with a modular approach that lidar/radar feed would need to feed another network, which actually adds processing. When they were using point clouds, perhaps you could claim it would be a direct replacement for that module and thus not add processing demand (but not really: the camera-derived point clouds are already synced, while a separate lidar/radar feed still needs syncing). Sure, that lidar/radar feed may lead to better decisions and a lower chance of error, but that's not the same as saying it will require less processing power.
Which is why, to measure it accurately, it's necessary to artificially construct a situation (such as a pedestrian jumping out from behind a parked car) where the known correct behavior is for the NN to change what it's doing.
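A sketch of how such a measurement harness could work (everything here is hypothetical: `model` and the frame lists are stand-ins): replay a fixed scene twice, inject the pedestrian at a known frame, and count frames until the control output deviates from the baseline run by more than a threshold.

```python
import torch

def reaction_frames(model, base_frames, stimulus_frames, inject_at, thresh=0.1):
    """Frames elapsed between stimulus injection and the first control change.

    base_frames / stimulus_frames: lists of camera tensors for the same
    scene without / with the pedestrian (hypothetical test fixtures).
    """
    with torch.no_grad():
        for i, (base, stim) in enumerate(zip(base_frames, stimulus_frames)):
            frame = stim if i >= inject_at else base
            action = model(frame)        # e.g. [steer, accel]
            baseline = model(base)       # what it does with no pedestrian
            if i >= inject_at and (action - baseline).abs().max() > thresh:
                return i - inject_at     # reaction time, in frames
    return None  # never reacted within the clip
```

Multiplying the returned frame count by the camera's frame period (e.g. ~28ms per frame at 36fps) gives a photons-to-control reaction time for that constructed scenario.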
It depends hugely on whether the human is primed to react. Suppose I give you a button and ask you to tap it as soon as you see a bright flash. If you know the flash is coming in the next few seconds, your reaction time will be MUCH faster than if I tell you the flash is coming in the next few hours. Some driving situations resemble the former; some the latter. One of the strengths of autonomous systems is that they're always paying attention, so they can always be primed to react quickly.
No dispute on that. As linked above, yellow-light reaction time for humans in a traffic situation is around 2 seconds. I found another study, a far simpler experiment where people were just told to slam a button as soon as they saw a light flash. The reaction times were considerably faster in that case, averaging about 0.421 seconds for yellow lights. I suspect
@AlanSubie4Life is considering the latter scenario in estimating his own reaction time, not a scenario where the human is actually in traffic, processing many other objects on the road rather than watching solely for a flashing light.
https://csef.usc.edu/History/2004/Projects/J0332.pdf