One person's rewrite is another's increment?

I am curious how visualization works with E2E. If the car is taking in photons and outputting vehicle control, with no teaching for identifying specific things like stop signs, other vehicles, etc., how will the system create the visualization? Perhaps some of the legacy perception NNs are kept to run alongside the E2E stuff?
Hmmmm. Maybe I'm wrong on this, but I thought the "no teaching" thing was limited to rules, i.e., the car will stop at red lights because it has seen thousands of clips of other cars doing so. But in order to do this, the car will still need to be able to identify objects; it will need to be able to identify a human in the road so it can then refer to the clips that show other cars stopping for this object type (that may be a bit of an extreme example, as I'm sure there's still going to be some hard-coded logic for life-or-death situations).

So object recognition will still be at the heart of the system, but how it responds to those objects is what's changing.
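
To make the "learns from clips, not rules" idea concrete, here's a minimal behavior-cloning sketch (all names, shapes, and numbers are made up for illustration; obviously not Tesla's actual code). Note there is no "red light" rule anywhere: the loss just pushes the network's outputs toward whatever the recorded human driver did.

```python
# Minimal behavior-cloning sketch -- illustrative only, not Tesla's code.
# The only "teaching" is: predict the same controls the human produced in each clip frame.
import torch
import torch.nn as nn

class TinyDrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy stand-in for the real perception + planning stack:
        # 8 camera frames (64x64 RGB) flattened -> [steering, acceleration].
        self.net = nn.Sequential(
            nn.Linear(8 * 3 * 64 * 64, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, frames):
        return self.net(frames.flatten(start_dim=1))

policy = TinyDrivingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Fake "clips": camera frames plus what the human driver actually did.
frames = torch.randn(16, 8, 3, 64, 64)   # 16 samples, 8 cameras, 64x64 RGB
human_controls = torch.randn(16, 2)      # recorded steering/acceleration

optimizer.zero_grad()
pred = policy(frames)
loss = nn.functional.mse_loss(pred, human_controls)  # "drive like the humans did"
loss.backward()
optimizer.step()
print(loss.item())
```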
 
what Elon and Ashok explained: that there is no code for recognition of stop signs, traffic lights, lanes and so on
It's not totally clear whether they were referring specifically to planning/control or also perception. Looking through the YouTube transcript of the livestream, they do mention "no line of code" several times, e.g., "slow down for speed bumps," "give clearance to bicyclists," "stop at a stop sign or wait for another car." These are more clearly related to planning/control for the car, but it's not as clear if they were also referring to the perception/recognition of these things.

Later there are examples of "doesn't know what a scooter is" and "it doesn't know what paddles are," and these comments are more related to perception. But even if existing 11.x object recognition neural networks are reused in a modular end-to-end approach, a new planning network using these as inputs technically doesn't need to know that some inputs correspond to vulnerable road users vs vehicles vs static objects and their related attributes -- all of these are just numbers internally and don't need labels of what they're for. For example, the planning network could get as one of its inputs some numbers that happen to correspond to a pedestrian in front of the vehicle, and the new network learns to stop based on other training examples of similar inputs without ever knowing that the values correspond to "pedestrian" or "in front." Perhaps another way to think about this is that if the 11.x visualization shows something special, that's probably an output of perception.

One last example from the livestream: "we have never programmed in the concept of a roundabout" could actually be true for both perception and control, as even with the existing 11.x perception, a roundabout is "just" some lanes/curbs shaped and connected in a certain way. So the new network picks up on those input signals and learns how to drive through roads shaped like roundabouts without any explicit signal that it's a roundabout.
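
As a toy illustration of that "just numbers, no labels" point (hypothetical module names and sizes, nothing to do with the real stack): the planner only ever receives a feature vector, and any "pedestrian" meaning exists only implicitly in how training shaped both networks.

```python
# Sketch of a "modular end-to-end" handoff -- hypothetical names, toy sizes.
# The planner only ever sees a vector of numbers; nothing tells it which
# entries came from a "pedestrian" detection vs curb geometry.
import torch
import torch.nn as nn

perception = nn.Sequential(      # stand-in for reused 11.x perception networks
    nn.Linear(3 * 64 * 64, 128), nn.ReLU(),
    nn.Linear(128, 32),          # 32 unlabeled feature values
)
planner = nn.Sequential(         # new network trained end to end on driving clips
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 2),            # [steering, acceleration]
)

image = torch.randn(1, 3 * 64 * 64)
features = perception(image)     # some of these numbers may *happen* to encode
controls = planner(features)     # "pedestrian ahead"; the planner just learns
print(controls)                  # which outputs go with which input patterns
```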
 
So object recognition will still be at the heart of the system, but how it responds to those objects is what's changing.

I could be wrong but my understanding is that end-to-end still has perception. After all, you can't do autonomous driving without some sort of perception. The architecture of the perception is just different. End-to-end does not do the old style perception where humans manually label objects and you have an object detection NN module that draws 3D boxes around objects. Instead, end-to-end embeds the perception with the planning. So it "understands" that a group of pixels is a pedestrian or a car and knows from training how to respond when it sees that grouping of pixels behave in a certain way.
 
I am curious how visualization works with E2E. If the car is taking in photons and outputting vehicle control, with no teaching for identifying specific things like stop signs, other vehicles, etc., how will the system create the visualization?
For the vision general world model approach without an explicit intermediate perception module, the network can internally learn about stop signs, vehicles, etc. even without explicit training, and a perception output head can later be fine-tuned to generate visualizations alongside the planning network that outputs controls. There's an example of segmenting the road vs vehicles vs curbs vs traffic lights vs signs vs grass vs sidewalk vs sky:

[Image: world model panoptic segmentation]
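
A rough sketch of how such a shared backbone with a control head plus a fine-tuned visualization head might look (purely illustrative sizes and names, not Tesla's architecture):

```python
# Rough sketch of "one backbone, two heads" -- hypothetical, toy sizes.
# The control head is what drives; the segmentation head is fine-tuned later
# just to render a visualization of what the shared features already encode.
import torch
import torch.nn as nn

class WorldModelBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.control_head = nn.Linear(32, 2)              # [steering, acceleration]
        self.seg_head = nn.Conv2d(32, 8, kernel_size=1)   # 8 classes: road, car, curb, ...

    def forward(self, image):
        feats = self.encoder(image)                           # (B, 32, H, W)
        controls = self.control_head(feats.mean(dim=(2, 3)))  # pooled features -> controls
        segmentation = self.seg_head(feats)                   # per-pixel class scores for the UI
        return controls, segmentation

model = WorldModelBackbone()
controls, seg = model(torch.randn(1, 3, 64, 64))
print(controls.shape, seg.shape)   # torch.Size([1, 2]) torch.Size([1, 8, 64, 64])
```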
 
So object recognition will still be at the heart of the system, but how it responds to those objects is what's changing.
I was thinking about the 'end to end neural net' concept earlier and got stuck on something. It seems like this technique would work well for things like lane selection. It's much less clear to me how it would be useful for things like Chuck's unprotected left turn.

There's an intersection near me with a stoplight and a sweeping right turn lane with a yield at the end. FSD has yet to handle this for anything except a green light (meaning it's essentially a right turn on green). If the light is red, it stops and won't proceed. For a human, it's easy: if the light is red, you just make sure no one is coming and keep driving. How do they train a neural net for this? There are many other subtle concepts that do not lend themselves to training by example.
 
There's an intersection near me with a stoplight and a sweeping right turn lane with a yield at the end. FSD has yet to handle this for anything except a green light
10.69 tried to fix this:
  • Increased smoothness for protected right turns by improving the association of traffic lights with slip lanes vs yield signs with slip lanes. This reduces false slowdowns when there are no relevant objects present and also improves yielding position when they are present.
Perhaps the control heuristics for handling these yield-sign slip lanes do not match the perceived intersection layout in your situation, or potentially perception is failing to detect the yield sign, the association with the traffic light, etc. The potential for end-to-end is that there will be examples of people driving without stopping on red lights for slip lanes similar to yours.
 
What it most certainly does not mean is that there are only 3k lines of code (obviously!)
I'm not sure it's so obvious given what you and others have commented. Andrej Karpathy has implemented large language model inference of Llama 2 in ~500 lines. Theoretically this program could even exceed the capabilities of GPT-4 if given appropriate weights/parameters. Most of this code is converting input data into internal formats, doing a bunch of math, then converting to some output format, so for FSD, with richer inputs and presumably a more complicated internal neural network structure, expanding to ~3k lines of total code might not be that unreasonable.

This amount of code is probably not too different for either the world model or modular end-to-end approach, as even with modular, there are probably only a few pieces to connect together, each with its own thousand or so lines of glue.
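
For a sense of why the code itself can stay small, here's a toy version of the same idea in Python rather than C (made-up sizes): inference is basically input conversion, a loop of matrix math, and output conversion, with all the actual capability living in the weights.

```python
# Toy illustration of why inference code stays small: it's mostly plumbing
# around a loop of matrix math. A two-layer network in plain numpy:
import numpy as np

weights = [np.random.randn(784, 256), np.random.randn(256, 10)]  # the "parameters"

def forward(x):
    # All of the capability lives in the weight values, not in this code.
    for i, w in enumerate(weights):
        x = x @ w
        if i < len(weights) - 1:
            x = np.maximum(x, 0)   # ReLU
    return x

pixels = np.random.rand(784)       # "convert input data into internal formats"
scores = forward(pixels)           # "do a bunch of math"
print(int(scores.argmax()))        # "convert to some output format"
```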
 
10.69 tried to fix this ... The potential for end-to-end is that there will be examples of people driving without stopping on red lights for slip lanes similar to yours.
It may have tried, but it failed miserably, and not just at this intersection; it’s a pattern repeated frequently at many intersections. This intersection is particularly bad not just because it stops when it doesn’t need to but because it won’t proceed at all - I have to take over.
 
There are many other subtle concepts that do not lend themselves to training by example.
Exactly. It'll be interesting to hear what they come up with to handle these events and situations. And there are so many of them, too; I don't think this is a matter of "chasing the long tail of nines," but rather how to handle common situations that just don't lend themselves to "do it by example."

Hope this whole thing doesn't end up being yet another one of those infamous "false horizons."
 
Continuing my thoughts above, since I waited too long to edit the above post...

IMO, they're going to have to come up with two different ways to teach the machine:
1. Learning by example, which would be the thoroughly discussed "train it with a bunch of clips of people doing it right" method.
2. Learning by trial and error.

Have any of you guys/gals/other seen the hilarious clip of an AI learning to bowl? Scenario: set up a physics engine. Give the AI a rag doll that is capable of moving like a human. Set a goal for it to roll the ball down the lane and knock over the pins. Give it the rules of bowling. Hit the "Run" button.

What ensues is absolute hilarity... balls flying all over, rag dolls being exactly that. It's just hilarious. But eventually, after A LOT of trial and error, the thing learns to bowl. Just for emphasis, there is NO TRAINING BY EXAMPLE involved here; it didn't have the luxury of watching a whole bunch of clips of people "doing it right."

Again IMO, there will be situations (possibly quite a lot of them) where learning by example just isn't going to work. They're going to need to set up the simulator for certain situations, just like the above bowling example, and let the machine go nuts until it figures it out.

At a very basic level, this is how humans learn to drive. Teach the rules, show him/her/other how to do it, then let them learn the rest as they go. Since the goal is to have the machine drive like a human, I just don't see any other way... but then again, I'm not any sort of engineer or computer scientist and I could be completely wrong. I'm certainly no stranger to being wrong; I've been married for 30 years.
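
For what it's worth, the bowling clip is basically reinforcement learning / trial-and-error search: no demonstration clips, just a score and lots of attempts. A minimal sketch of that loop with a toy one-parameter "environment" (obviously nothing like a real driving simulator):

```python
# Toy "learn by trial and error" loop -- no demonstration clips anywhere.
# Hypothetical mini-environment: pick a release angle, get a score back.
import numpy as np

rng = np.random.default_rng(0)
TARGET = 0.7                       # the (hidden) angle that knocks down all the pins

def environment(angle):
    """Reward is higher the closer the throw is to the unknown ideal angle."""
    return -abs(angle - TARGET)

angle = rng.uniform(0.0, 1.0)      # start with a wild guess -> balls fly everywhere
best_reward = environment(angle)

for attempt in range(1000):
    candidate = angle + rng.normal(0.0, 0.05)   # perturb the current behavior a bit
    reward = environment(candidate)
    if reward > best_reward:                    # keep changes that scored better
        angle, best_reward = candidate, reward

print(round(angle, 3))             # ends up near 0.7 purely by trial and error
```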
 
there will be situations (possibly quite a lot of them) where learning by example just isn't going to work.
What about the one-offs? There's a UPL out of my neighborhood that has a somewhat unique setup, which FSDb fails today. There are many variations of a stop sign in NA - some are very infrequent. Or the "ladder in the road" scenario? If there are a million images of UPLs put into the sim, personally I doubt they would see one very close to mine.

I guess I'm saying, there will still be some one-off situations, right? And, like today, do we wait for the "next release" for corrections? And if urgent, could that be done quickly?
 
And there are so many of them, too; I don't think this is a matter of "chasing the long tail of nines," but rather how to handle common situations that just don't lend themselves to "do it by example."
Could you explain more of these common situations that you think end-to-end would have trouble learning from example? I didn't quite follow why the yield example with a slip lane would be problematic either, as presumably the network should learn when it's correct to yield vs stop based on what it sees of the intersection layout, signage, etc.
 
IMO, they're going to have to come up with two different ways to teach the machine: 1. Learning by example ... 2. Learning by trial and error.
There's a third option - use a hybrid of programmed and 'learned' heuristics.
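
In practice that hybrid could be as simple as a learned policy whose outputs pass through a thin layer of hand-written checks before reaching the controls. A minimal sketch, with hypothetical names and toy numbers (not anything Tesla has described):

```python
# One way a hybrid could look -- purely illustrative: a learned policy proposes
# controls, and a small layer of hand-written heuristics gets the final say.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # learned part

def safety_override(controls, obstacle_distance_m, speed_mps):
    """Programmed heuristic: never accelerate toward something close ahead."""
    steering, accel = controls
    if obstacle_distance_m < speed_mps * 2.0:   # rough two-second rule
        accel = min(accel, -1.0)                # force braking regardless of the NN
    return steering, accel

features = torch.randn(32)                      # whatever the perception stack emits
steering, accel = policy(features).tolist()     # learned proposal
steering, accel = safety_override((steering, accel),
                                  obstacle_distance_m=5.0, speed_mps=15.0)
print(steering, accel)
```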
 
It's much easier to test a geofenced system, and the potential unknowns for a system that can operate anywhere are orders of magnitude greater.

If you've been to different areas of the country, you can appreciate the significant differences in roads, driving styles, etc. A system geofenced to one city only needs to be 'taught'/programmed to deal with drivers and road styles of that city vs the entire country.
Just like NHTSA testing vehicles for crashworthiness, there would be an organization that would test and provide a rating for the driving under its curated set of situations. Insurance companies could use that to quote insurance rates.
 
One last example from the livestream: "we have never programmed in the concept of a roundabout" could actually be true for both perception and control...
Yes - this is why roundabouts still suck ;)

Will other things be as bad?

Terrible example, IMO!
 
But in order to do this, the car will still need to be able to identify objects; it will need to be able to identify a human in the road so it can then refer to the clips that show other cars stopping for this object type
Not really.

It's not like the code "understands" what a human is, anyway.

Basically, there is no requirement to first specifically identify objects and then do controls. They can all be put into a gigantic NN with "photons" as input and car controls as output.
 
Because that would give them the data to put in training that you said you didn't think they would ever have.
Ok. I'm trying to better understand.

I believe I'm likely the only Tesla making this unique UPL out of my neighborhood. Let's say that's once per day, always on FSDb. Two-thirds of the attempts (when there is traffic) end in an intervention. One-third (no traffic) work great. Where would they get a video of it being done correctly with traffic?

So Tesla has roughly 360 FSD UPL attempts per year at that unique location, all from me. Out of the million examples they may have for a UPL, it seems unlikely they would have any successful versions of it being done properly with traffic.

And really we could bump up a level. What I'm really curious about is how one-offs (we will have those, yes?) will be handled and fixes redistributed. Do we wait for the next release, like today?