How do you know from raw sensor data what "the driving situation" is at any time?
Raw sensor data would be handled by a perception network that cranks out objects. This mirrors the way we ourselves identify the objects significant to any task. That then becomes the basic "driving situation". Higher-order semantics about the "driving situation" are then determined by other networks. The perception network might actually be multi-level, first identifying lines, circles, and shapes, then using a second network to spot wheels, bicycles, cars, legs, arms, and so on. There may be other decompositions that would provide greater mileage, especially if the same networks are supposed to support something like Optimus.
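A minimal sketch of that two-level decomposition, with stubs standing in for the trained networks (the primitive detector, the composition rule, and all thresholds here are hypothetical, chosen only to make the idea concrete):

```python
from dataclasses import dataclass

@dataclass
class Primitive:
    kind: str   # "line", "circle", ... -- output of the first-level network
    x: float
    y: float

def detect_primitives(frame):
    # Stand-in for the first-level network: here we just read pre-tagged
    # tuples; a real system would run a detector over actual pixels.
    return [Primitive(kind, x, y) for (kind, x, y) in frame]

def compose_objects(primitives):
    # Stand-in for the second-level network: two nearby circles
    # ("wheels") are taken to suggest a two-wheeled vehicle.
    circles = [p for p in primitives if p.kind == "circle"]
    objects = []
    for i, a in enumerate(circles):
        for b in circles[i + 1:]:
            if abs(a.x - b.x) < 5.0 and abs(a.y - b.y) < 1.0:
                objects.append(("two-wheeler", (a.x + b.x) / 2))
    return objects

# Two wheel-like circles plus a stray line: the pipeline reports one object.
frame = [("circle", 0.0, 0.0), ("circle", 3.0, 0.5), ("line", 10.0, 2.0)]
print(compose_objects(detect_primitives(frame)))  # [('two-wheeler', 1.5)]
```

The point of the layering is that each stage's output ("circle", "two-wheeler") is a named, inspectable semantic, so a later network consumes labels rather than pixels.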
Is the "driving situation" something human labelled?
Yes. That's the way we understand it, so that's the way we would teach it to a learning automaton. In doing so, we can hope to instrument the system to see what it is doing and why, because the semantic transitions between the networks are exposed and familiar.
How do you get that? What happens if there is more than one at the same time, say an unprotected left with a child in the crosswalk while someone swerves into your lane?
That would be the highest semantic level and would likely be learned entirely by training, because there are no semantics we have developed other than "Do the right thing". That training would be based on high-level notions such as "executing unprotected left", "child", and "crosswalk", as opposed to "Here are some pixels". If somebody looking hard at this found higher-level semantics that could be trained and used by other networks, then so much the better.
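To make that concrete: the top level consumes named facts, not pixels. Here is a toy sketch where a hand-coded priority table stands in for the trained policy; the fact names, priorities, and actions are all invented for illustration, not a proposed design:

```python
# Hypothetical hazard ranking -- a trained network would learn these
# weightings (and much subtler interactions) rather than hard-code them.
HAZARD_PRIORITY = {
    "child_in_crosswalk": 3,
    "vehicle_swerving_into_lane": 2,
    "unprotected_left": 1,
}

ACTION = {
    "child_in_crosswalk": "stop",
    "vehicle_swerving_into_lane": "evade",
    "unprotected_left": "yield",
}

def decide(situation_facts):
    # Act on the most urgent fact present; a learned policy would
    # weigh all facts jointly instead of picking a single worst one.
    if not situation_facts:
        return "proceed"
    worst = max(situation_facts, key=lambda f: HAZARD_PRIORITY.get(f, 0))
    return ACTION.get(worst, "proceed")

# The compound case from the question: the child dominates.
print(decide({"unprotected_left", "child_in_crosswalk"}))  # stop
```

Because the inputs are exposed semantic labels, you can read off exactly which fact drove the decision, which is the instrumentation benefit claimed above.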
I prefer this sort of controlled and structured approach because something like an LLM just screams hubris. "We don't understand everything about it, but it does amazing stuff, so we're gonna ship it." I have the same problem with modern software engineers using platforms, packages, virtual machines and so on; they don't know what it is that they're creating, but it does what they want, so they sell it. Later they find out that it does other things that they didn't want.