FSD v12.x (end to end AI)

Southpasfan · Mar 1, 2024

MP3Mike said:
The visualizations in v12 are just eye candy for you. It doesn't appear that FSDb V12 uses them, or that data, for any driving purposes. (It responds to items not visualized, and it doesn't respond to items visualized that aren't actually there.)

Hmmm......

Well, I suppose I agree with you. For sure its not like the chain of reaction is (1) camera input - (2) object in video to code - (3) code to screen - (4) screen to physical reaction, its obviously (3) code to reaction and maybe like (3a) code to screen.

But I didn't know that there was an underlying universe of translated video the car was actually responding to.

The reason was I didn't occur to me that once the video is translated to code its really a nothingburger to throw it up on the screen in the car, that's not the advanced technology. I mean, if it can distinguish shopping carts coding the visualization of a shopping cart to the screen is easy, so why not do it.

Huh.

MP3Mike · Mar 1, 2024

KArnold said:
Even if just a visual clue, I like seeing that the machine properly sees/categorizes potential hazards. Even with today's 2-year-old ability to see, just basics today. But useful to me.

But is it actually helpful? What if the visualization networks detect and display everything, but the end-to-end driving network does not "see" the object such that it will attempt to drive through it, even though it is displayed to you? It seems to me it is of no actual value, as you can't rely on it responding to items displayed.

Southpasfan said:
For sure its not like the chain of reaction is (1) camera input - (2) object in video to code - (3) code to screen - (4) screen to physical reaction, its obviously (3) code to reaction and maybe like (3a) code to screen.

Right, but it is likely that step 3 and step 3a are using completely separate neural networks, both sharing the same camera input. So you can't count on step 3 reacting to what step 3a shows you.

So, maybe more like:

(1) camera input
- (2a) video to planner NN
  - (3a) physical reaction to step 2a
- (2b) video to visualization NN that performs object recognition
  - (3b) result from 2b to screen

Southpasfan said:
The reason was I didn't occur to me that once the video is translated to code its really a nothingburger to throw it up on the screen in the car, that's not the advanced technology. I mean, if it can distinguish shopping carts coding the visualization of a shopping cart to the screen is easy, so why not do it.

The actual end-to-end driving planner doesn't identify objects, it has no idea what a stop sign, or shopping cart, or person, or car, or truck, etc. is; that is all done just to display the eye candy to you. And currently the eye candy doesn't match the driving actions.

Supcom · Mar 1, 2024

MP3Mike said:
But is it actually helpful? What if the visualization networks detect and display everything, but the end-to-end driving network does not "see" the object such that it will attempt to drive through it, even though it is displayed to you? It seems to me it is of no actual value, as you can't rely on it responding to items displayed.

From what I have seen in videos, the only thing in the visualization that is actually useful is the path planner. That seems to match the car's intent. Everything else looks to be output from a separate process that is not in the driving pipeline. One video I saw showed the car running through a phantom pedestrian that was in the visualization.

Tronguy · Mar 1, 2024

I got this thought.

With V11 and previous, the NN hardware in a Tesla is busy doing image and object recognition. Now, suppose that one wants to see the surrounds on the screen. Right at the point between the NN image recog stuff and the C++ code that Elon and company talks about seems to be a location where, with a minimal amount of work, one can get the surrounds to appear.

Next, suppose we got V12, with photons in and steering/brakes/accelerator out. Hmm.. I take a wild guess and say that that obvious division that used to be there isn't there so much, especially as the C++ code is supposedly gone. Hence, a bunch of new, independent code looking at.. what variables? is now displaying something about the surrounds, but not at the same fidelity as what we used to get.

Comments?

MP3Mike · Mar 1, 2024

Tronguy said:
Next, suppose we got V12, with photons in and steering/brakes/accelerator out. Hmm.. I take a wild guess and say that that obvious division that used to be there isn't there so much, especially as the C++ code is supposedly gone. Hence, a bunch of new, independent code looking at.. what variables? is now displaying something about the surrounds, but not at the same fidelity as what we used to get.

Yep, my guess is that they are running a parallel NN to detect objects and get the visualizations. But the two will likely never match 100%.

Daniel in SD · Mar 1, 2024

Supcom said:
From what I have seen in videos, the only thing in the visualization that is actually useful is the path planner. That seems to match the car's intent. Everything else looks to be output from a separate process that is not in the driving pipeline. One video I saw showed the car running through a phantom pedestrian that was in the visualization.

That makes sense if in training they minimize the error between the path planner output and the subsequent control output.

Southpasfan · Mar 1, 2024

MP3Mike said:
But is it actually helpful? What if the visualization networks detect and display everything, but the end-to-end driving network does not "see" the object such that it will attempt to drive through it, even though it is displayed to you? It seems to me it is of no actual value, as you can't rely on it responding to items displayed.

Right, but it is likely that step 3 and step 3a are using completely separate neural networks, both sharing the same camera input. So you can't count on step 3 reacting to what step 3a shows you.

So, maybe more like:

(1) camera input

(2a) video to planner NN

(3a) physical reaction to step 2a

(2b) video to visualization NN that performs object recognition

(3b) result from 2b to screen

The actual end-to-end driving planner doesn't identify objects, it has no idea what a stop sign, or shopping cart, or person, or car, or truck, etc. is; that is all done just to display the eye candy to you. And currently the eye candy doesn't match the driving actions.

FYI, I have never seen my car drive through any object, the display has always only had limited things, but once those things popped up they became more and more accurate. And boy, in 2019 its ability to identify objects was really limited. Astounding progress since. But anyway.

My thought wasn't that the end to end planner "knows" what any of the objects are, its just that I thought as the software gets better it moves from something I think I saw in a Lex Friedman podcast which looked like the move "Tron" where all surfaces were represented by grids, to what you see on the display. In the "Tron" representation a tree was just a big squared-off blob which is better than nothing but always struck me as not accurate enough.

Thats why accurately rendered cones switching to blobs was so remarkable.

For example, let's say that the car needs to pass a parked bus, and the bus has those super extended side view mirrors that stick out like two feet.

Its not that it has to recognize that its a "bus with a side view mirror" but it does have to identify the side of the bus plus anything protruding from the bus.

For FSD to work, it needs to be accurate to some degree, what, maybe six inches? And from what I have seen it is that accurate, again, once it properly identifies the object it needs to be six inches away from.

I guess there could be the "Tron-like" underlayment, where it does not care what any object actually is, it only cares where that object is located and how fast its moving relative to the car.

But I always thought the difficulty was locating the object in the first place, not the display of it the driver sees as a dog vs. a deer vs. a human. I will say, that if it does not ever identify humans as different than other moving objects it will never be ready for prime time. But that's because humans need to be avoided at all costs.

Supcom · Mar 1, 2024

Daniel in SD said:
That makes sense if in training they minimize the error between the path planner output and the subsequent control output.

I think it's leftover. And, since it appears that V11 is still used on limited access roads, and AP is still available, it's still needed on the car.

But, once everything is V12-based, the visualization becomes superfluous, and might be done away with. Use the display space for something else.

EVNow · Mar 1, 2024

JB47394 said:
AI DRIVR's latest drive. Worth a watch.

This was interesting - looks like V12 does reasonably well in congested winding roads.

But one thing I have to note - he says V12 starts moving at intersections before the other car finishes. Even V11 has started doing that in the latest releases.

heltok · Mar 1, 2024

MP3Mike said:
But is it actually helpful? What if the visualization networks detect and display everything, but the end-to-end driving network does not "see" the object such that it will attempt to drive through it, even though it is displayed to you? It seems to me it is of no actual value, as you can't rely on it responding to items displayed.

Right, but it is likely that step 3 and step 3a are using completely separate neural networks, both sharing the same camera input. So you can't count on step 3 reacting to what step 3a shows you.

So, maybe more like:

(1) camera input

(2a) video to planner NN

(3a) physical reaction to step 2a

(2b) video to visualization NN that performs object recognition

(3b) result from 2b to screen

The actual end-to-end driving planner doesn't identify objects, it has no idea what a stop sign, or shopping cart, or person, or car, or truck, etc. is; that is all done just to display the eye candy to you. And currently the eye candy doesn't match the driving actions.

There are some visualizations that the car doesn't see. For example ghost cars being added for extra safety etc. The V12 end2end network doesn't see these, but it probably imagines equivalent features.

I think what Tesla are running is:
Photon count -> general world model base layers ->
A. V11 visualization
B. V12 control

The visualization project will over time improve to better reflect what the engineers think that the neural network is basing its decisions on.

powertoold · Mar 1, 2024

V12.2.1 on HW3 really struggles with "in between" decisions, where there are 2 sorta-equally decent decisions and it needs to choose one, like two backed up left turn lanes and it needs to choose one, it just wobbles between the two decisions and ends up needing a disengagement

kabin · Mar 1, 2024

JB47394 said:
AI DRIVR's latest drive. Worth a watch.

That was a good video. Those construction flag men show the challenges E2E faces. How many years before v12 is trained to handle an intersection with a police officer directing traffic via eye contact/arm/hand/whistle?

It seemed to respond a bit late to the one pedestrian (@10:00) carrying a box across the residential street. I would guess the blind curve played a role.

LowlyOilBurner · Mar 1, 2024

EVNow said:
This was interesting - looks like V12 does reasonably well in congested winding roads.

But one thing I have to note - he says V12 starts moving at intersections before the other car finishes. Even V11 has started doing that in the latest releases.

Sometimes. I’ve had it stall, and wait for the oncoming vehicle to completely clear the intersection before proceeding. I’ve found myself using the go pedal a lot more in 11.4.9 than previously. This is on HW4.

Mardak · Mar 1, 2024

Odd failure for a seemingly basic stay in lane with disengagement for crossing over. Perhaps the lighting and faded double yellow lines resulted in 12.2.1 thinking it was another left turn lane or maybe a center turn lane? In the past, Tesla had data collection triggers for entering/exiting tunnels presumably because they're confusing for some reason, so maybe underpasses could be confusing too?

It's hard to tell from the visualization, but seems like 11.x is confidently rendering double yellow lines, so maybe yet another example of 12.x not making use of 11.x perception? If so, it's somewhat unfortunate that 11.x's extensive training and solid understanding of what reasonable road layouts look like have to be relearned from scratch for end-to-end.

Supcom · Mar 1, 2024

EVNow said:
This was interesting - looks like V12 does reasonably well in congested winding roads.

But one thing I have to note - he says V12 starts moving at intersections before the other car finishes. Even V11 has started doing that in the latest releases.

I expect V12 will nail Chuck Cook's narrow lane test.

powertoold · Mar 1, 2024

12.3 is our last hope for HW3!!

https://twitter.com/x/status/1760169716654780625

jeremymc7 · Mar 1, 2024

They’re still heavily pushing 11.4.9. Who knows when we’ll see 12.3.x at all, let alone widely.

EVNow · Mar 1, 2024

heltok said:
There are some visualizations that the car doesn't see. For example ghost cars being added for extra safety etc. The V12 end2end network doesn't see these, but it probably imagines equivalent features.

I think what Tesla are running is:
Photon count -> general world model base layers ->
A. V11 visualization
B. V12 control

The visualization project will over time improve to better reflect what the engineers think that the neural network is basing its decisions on.

It’s not clear what the end to end is.it could simply be a replacement for the planner/ control and gets feeds from the perception NN.

JulienW · Mar 2, 2024

Supcom said:
I expect V12 will nail Chuck Cook's narrow lane test.

Poor Chuck may be the last to get v12 (or at least just one before me). Kingdom rules: Never speak ill of the King or there will be hell to pay.

mongo · Mar 2, 2024

EVNow said:
It’s not clear what the end to end is.it could simply be a replacement for the planner/ control and gets feeds from the perception NN.

It could, and that's likely what the first demos were.
However, that approach throws away a lot of intermediate data due to the final consolidation layer of the perception layer that the C++ code interfaced with.
Going from C++ evaluating 2000 (for example) object values to a NN evaluating 2000 object values is less capable than a NN evaluating the full layer above that classification layer (many, many more values).

It also adds the complication of needing to label objects for that layer and violates the camera to driving representation.

It is possible they started with the existing perception NN as-was and then unlocked the weights such that the original value to object corellations no longer exist but the planning NN didn't need to deal with quite so many inputs.

FSD v12.x (end to end AI)

Member

Well-Known Member

Active Member

Active Member

Well-Known Member

(supervised)

Member

Active Member

Well-Known Member

Active Member

Active Member

Active Member

Active Member

Active Member

Active Member

Active Member

Active Member

Well-Known Member

Well-Known Member

Well-Known Member

Similar threads