v11 used the Occupancy Network. v12 E2E, in my understanding, does not. (Although the separate pipeline for on-screen UI visualization may still use it.) And even if the Occupancy Network has a latency of 10ms, there's a lot more latency added getting from Occupancy to Control, and it's the Photons-to-Control latency that counts, not Photons-to-Occupancy.
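To make the arithmetic concrete, here's a toy latency budget. Every number below is made up purely for illustration (including the 10ms occupancy figure); the only point is that the stage latencies add up on the way to control:

```python
# Hypothetical per-stage latencies (ms) in a modular pipeline.
# These are made-up illustrative numbers, not measured Tesla figures.
stages_ms = {
    "camera exposure + readout": 20,
    "image preprocessing":       5,
    "occupancy network":         10,
    "object detection":          10,
    "planning":                  15,
    "control actuation":         10,
}

perception_stages = ("camera exposure + readout",
                     "image preprocessing",
                     "occupancy network")

photons_to_occupancy = sum(v for k, v in stages_ms.items()
                           if k in perception_stages)
photons_to_control = sum(stages_ms.values())

print(f"photons-to-occupancy: {photons_to_occupancy} ms")  # 35 ms
print(f"photons-to-control:   {photons_to_control} ms")    # 70 ms
```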
From this article posted by others, Tesla's V12 E2E very much still uses the previous perception engine (including the occupancy network). They didn't just throw all that work out the window and start a black box from scratch.
Breakdown: How Tesla will transition from Modular to End-To-End Deep Learning
What is different is that planning previously used a combination of deep learning and traditional tree search, and they switched that to full deep learning. Then, instead of perception and planning being completely independent, they joined them together, so that the final actions also affect the perception network during training. This is what makes it "end-to-end". A sketch of what that joint training could look like is below.
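A minimal PyTorch sketch of what "joining them together" could mean in training code (hypothetical module names and shapes; this is not Tesla's actual architecture): the planner consumes the perception features directly, so the action loss backpropagates into the perception weights.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real networks.
perception = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                           nn.ReLU(),
                           nn.AdaptiveAvgPool2d(1),
                           nn.Flatten())           # camera frame -> features
planner = nn.Sequential(nn.Linear(16, 64),
                        nn.ReLU(),
                        nn.Linear(64, 2))          # features -> [steer, accel]

params = list(perception.parameters()) + list(planner.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

frames = torch.randn(8, 3, 128, 128)    # batch of camera frames (dummy data)
expert_actions = torch.randn(8, 2)      # human driver's actions (dummy data)

optimizer.zero_grad()
features = perception(frames)
actions = planner(features)
loss = nn.functional.mse_loss(actions, expert_actions)

loss.backward()    # gradients flow through the planner *into* perception:
optimizer.step()   # this joint update is what makes the stack "end-to-end"
```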
The choice quote:

"But yes, it can seem more of a Black Box, but you can also see how we're still using Occupancy Networks and Hydranets and all of these, we're just assembling the elements together. So, it's a Black Box, but we can also, at any point in time, visualize the output of Occupancy, visualize the output of Object Detection, visualize the output of Planning, etc..."
Of course the above analysis is not necessarily correct, but it makes the most sense of the theories out there, given that there were relatively few regressions and most of the UI was able to remain the same (so Tesla is obviously able to pull intermediate data out from between the networks, just as the article claims).
The E2E network would take in the raw camera feed and the raw lidar/radar feed simultaneously, synced. What it does with that information is up to it, and it will learn by itself how to handle the various processing latencies. I think of the advantages of Lidar as somewhat akin to the human ear, which effectively performs a Fourier transform on incoming audio waveforms (rather than letting the brain's neurons do it, which would be a lot slower and lossier). Lidar/radar also compensate much better in situations where pure vision has fundamental difficulty: poor weather, low lighting, sun glare. The improved signal in those situations in particular might allow for a faster and more reliable reaction, because certainty about the environment can be gained more quickly from a less noisy signal.
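A rough sketch of that kind of early fusion (hypothetical class and shapes; assume the two feeds have already been timestamp-aligned upstream): each modality is encoded separately, the embeddings are concatenated, and training decides on its own how much weight each signal gets.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Toy fused camera + lidar/radar network (illustrative only)."""
    def __init__(self):
        super().__init__()
        # Camera branch: raw frames -> feature vector.
        self.cam_enc = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Lidar/radar branch: a rasterized range image -> feature vector.
        self.lidar_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, 2)   # fused features -> [steer, accel]

    def forward(self, cam, lidar):
        # Concatenate the two modality embeddings; training determines
        # how much each one matters (e.g. leaning on lidar in sun glare).
        fused = torch.cat([self.cam_enc(cam), self.lidar_enc(lidar)], dim=1)
        return self.head(fused)

net = EarlyFusionNet()
cam = torch.randn(1, 3, 128, 128)    # synced camera frame (dummy)
lidar = torch.randn(1, 1, 64, 256)   # synced lidar range image (dummy)
print(net(cam, lidar).shape)         # torch.Size([1, 2])
```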
As above, with a modular approach that lidar/radar feed would need to feed another network, which actually adds processing. When they were using point clouds, perhaps you could claim it would be a direct replacement for that module and thus not add processing demand (but not really: the camera-derived point clouds are already synced, while a separate lidar/radar feed still needs syncing). Sure, that lidar/radar feed may lead to better decisions and a lower chance of error, but that's not the same as saying it will require less processing power.
Which is why, to measure it accurately, it's necessary to artificially construct a situation (such as a pedestrian jumping out from behind a parked car) where the known correct behavior is for the NN to change what it's doing.
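A sketch of how such a measurement harness could work (everything here is hypothetical: `model` and the frame lists are stand-ins): replay a fixed scene twice, inject the pedestrian at a known frame, and count frames until the control output deviates from the baseline run by more than a threshold.

```python
import torch

def reaction_frames(model, base_frames, stimulus_frames, inject_at, thresh=0.1):
    """Frames elapsed between stimulus injection and the first control change.

    base_frames / stimulus_frames: lists of camera tensors for the same
    scene without / with the pedestrian (hypothetical test fixtures).
    """
    with torch.no_grad():
        for i, (base, stim) in enumerate(zip(base_frames, stimulus_frames)):
            frame = stim if i >= inject_at else base
            action = model(frame)        # e.g. [steer, accel]
            baseline = model(base)       # what it does with no pedestrian
            if i >= inject_at and (action - baseline).abs().max() > thresh:
                return i - inject_at     # reaction time, in frames
    return None  # never reacted within the clip
```

Multiplying the returned frame count by the camera's frame period (e.g. ~28ms per frame at 36fps) gives a photons-to-control reaction time for that constructed scenario.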
It depends hugely on whether the human is primed to react. Suppose I give you a button and ask you to tap it as soon as you see a bright flash. If you know the flash is coming in the next few seconds, your reaction time will be MUCH faster than if I tell you the flash is coming in the next few hours. Some driving situations resemble the former; some the latter. One of the strengths of autonomous systems is that they're always paying attention, so they can always be primed to react quickly.
No dispute on that. As linked above, yellow-light reaction time for humans in a traffic situation is around 2 seconds. I found another study, a far simpler experiment where people were just told to slam a button as soon as they saw a light flash. The reaction times were considerably faster in that case, averaging about 0.421 seconds for yellow lights. I suspect
@AlanSubie4Life is considering the latter scenario in estimating his own reaction time, not a scenario where the human is actually in traffic, processing many other objects on the road rather than watching solely for a flashing light.
https://csef.usc.edu/History/2004/Projects/J0332.pdf