One person's rewrite is another's increment?

I am curious how visualization works with E2E. If the car is taking in photons and outputting vehicle control, with no teaching for identifying specific things like stop signs, other vehicles, etc., how will the system create the visualization? Perhaps some of the legacy perception NNs are kept to run alongside the E2E stuff?
Hmmmm. Maybe I'm wrong on this, but I thought the "no teaching" thing was limited to rules, i.e., the car will stop at red lights because it has seen thousands of clips of other cars doing so. But in order to do this, the car will still need to be able to identify objects; it will need to be able to identify a human in the road so it can then refer to the clips that show other cars stopping for this object type (that may be a bit of an extreme example, as I'm sure there's still going to be some hard-coded logic for life-or-death situations).

So object recognition will still be at the heart of the system, but how it responds to those objects is what's changing.
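
To make the "learns from clips, not rules" idea concrete, here's a minimal behavior-cloning sketch (all names, shapes, and numbers are made up for illustration; obviously not Tesla's actual code). Note there is no "red light" rule anywhere: the loss just pushes the network's outputs toward whatever the recorded human driver did.

```python
# Minimal behavior-cloning sketch -- illustrative only, not Tesla's code.
# The only "teaching" is: predict the same controls the human produced in each clip frame.
import torch
import torch.nn as nn

class TinyDrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy stand-in for the real perception + planning stack:
        # 8 camera frames (64x64 RGB) flattened -> [steering, acceleration].
        self.net = nn.Sequential(
            nn.Linear(8 * 3 * 64 * 64, 256), nn.ReLU(),
            nn.Linear(256, 2),
        )

    def forward(self, frames):
        return self.net(frames.flatten(start_dim=1))

policy = TinyDrivingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Fake "clips": camera frames plus what the human driver actually did.
frames = torch.randn(16, 8, 3, 64, 64)   # 16 samples, 8 cameras, 64x64 RGB
human_controls = torch.randn(16, 2)      # recorded steering/acceleration

optimizer.zero_grad()
pred = policy(frames)
loss = nn.functional.mse_loss(pred, human_controls)  # "drive like the humans did"
loss.backward()
optimizer.step()
print(loss.item())
```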
 
what Elon and Ashok explained: that there is no code for recognition of stop signs, traffic lights, lanes and so on
It's not totally clear whether they were referring specifically to planning/control or also perception. Looking through the YouTube transcript of the livestream, they do mention "no line of code" several times, e.g., "slow down for speed bumps," "give clearance to bicyclists," "stop at a stop sign or wait for another car." These are more clearly related to planning/control for the car, but it's not as clear if they were also referring to the perception/recognition of these things.

Later there are examples of "doesn't know what a scooter is" and "it doesn't know what paddles are," and these comments are more related to perception. But even if existing 11.x object recognition neural networks are reused in a modular end-to-end approach, a new planning network using these as inputs technically doesn't need to know that some inputs correspond to vulnerable road users vs vehicles vs static objects and their related attributes -- all of these are just numbers internally and don't need labels of what they're for. For example, the planning network could get as one of its inputs some numbers that happen to correspond to a pedestrian in front of the vehicle, and the new network learns to stop based on other training examples of similar inputs without ever knowing that the values correspond to "pedestrian" or "in front." Perhaps another way to think about this is that if the 11.x visualization shows something special, that's probably an output of perception.

One last example from the livestream: "we have never programmed in the concept of a roundabout" could actually be true for both perception and control, as even with the existing 11.x perception, a roundabout is "just" some lanes/curbs shaped and connected in a certain way. So the new network picks up on those input signals and learns how to drive through roads shaped like roundabouts without any explicit signal that it's a roundabout.
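
As a toy illustration of that "just numbers, no labels" point (hypothetical module names and sizes, nothing to do with the real stack): the planner only ever receives a feature vector, and any "pedestrian" meaning exists only implicitly in how training shaped both networks.

```python
# Sketch of a "modular end-to-end" handoff -- hypothetical names, toy sizes.
# The planner only ever sees a vector of numbers; nothing tells it which
# entries came from a "pedestrian" detection vs curb geometry.
import torch
import torch.nn as nn

perception = nn.Sequential(      # stand-in for reused 11.x perception networks
    nn.Linear(3 * 64 * 64, 128), nn.ReLU(),
    nn.Linear(128, 32),          # 32 unlabeled feature values
)
planner = nn.Sequential(         # new network trained end to end on driving clips
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 2),            # [steering, acceleration]
)

image = torch.randn(1, 3 * 64 * 64)
features = perception(image)     # some of these numbers may *happen* to encode
controls = planner(features)     # "pedestrian ahead"; the planner just learns
print(controls)                  # which outputs go with which input patterns
```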
 
So object recognition will still be at the heart of the system, but how it responds to those objects is what's changing.

I could be wrong but my understanding is that end-to-end still has perception. After all, you can't do autonomous driving without some sort of perception. The architecture of the perception is just different. End-to-end does not do the old style perception where humans manually label objects and you have an object detection NN module that draws 3D boxes around objects. Instead, end-to-end embeds the perception with the planning. So it "understands" that a group of pixels is a pedestrian or a car and knows from training how to respond when it sees that grouping of pixels behave in a certain way.
 
I am curious how visualization works with E2E. If the car is taking in photons and outputting vehicle control, with no teaching for identifying specific things like stop signs, other vehicles, etc., how will the system create the visualization?
For the vision general world model approach without an explicit intermediate perception module, the network can internally learn about stop signs, vehicles, etc. even without explicit training, and a perception output head can later be fine-tuned to generate visualizations alongside the planning network that outputs controls. There's an example of segmenting the road vs vehicles vs curbs vs traffic lights vs signs vs grass vs sidewalk vs sky:

[Image: world model panoptic segmentation]
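
A rough sketch of how such a shared backbone with a control head plus a fine-tuned visualization head might look (purely illustrative sizes and names, not Tesla's architecture):

```python
# Rough sketch of "one backbone, two heads" -- hypothetical, toy sizes.
# The control head is what drives; the segmentation head is fine-tuned later
# just to render a visualization of what the shared features already encode.
import torch
import torch.nn as nn

class WorldModelBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.control_head = nn.Linear(32, 2)              # [steering, acceleration]
        self.seg_head = nn.Conv2d(32, 8, kernel_size=1)   # 8 classes: road, car, curb, ...

    def forward(self, image):
        feats = self.encoder(image)                           # (B, 32, H, W)
        controls = self.control_head(feats.mean(dim=(2, 3)))  # pooled features -> controls
        segmentation = self.seg_head(feats)                   # per-pixel class scores for the UI
        return controls, segmentation

model = WorldModelBackbone()
controls, seg = model(torch.randn(1, 3, 64, 64))
print(controls.shape, seg.shape)   # torch.Size([1, 2]) torch.Size([1, 8, 64, 64])
```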
 
So object recognition will still be at the heart of the system, but how it responds to those objects is what's changing.
I was thinking about the 'end to end neural net' concept earlier and got stuck on something. It seems like this technique would work well for things like lane selection. It's much less clear to me how it would be useful for things like Chuck's unprotected left turn.

There's an intersection near me with a stoplight and a sweeping right turn lane with a yield at the end. FSD has yet to handle this for anything except a green light (meaning it's essentially a right turn on green). If the light is red, it stops and won't proceed. For a human, it's easy: if the light is red, you just make sure no one is coming and keep driving. How do they train a neural net for this? There are many other subtle concepts that do not lend themselves to training by example.
 
There's an intersection near me with a stoplight and a sweeping right turn lane with a yield at the end. FSD has yet to handle this for anything except a green light
10.69 tried to fix this:
  • Increased smoothness for protected right turns by improving the association of traffic lights with slip lanes vs yield signs with slip lanes. This reduces false slowdowns when there are no relevant objects present and also improves yielding position when they are present.
Perhaps the control heuristics for handling these yield-sign slip lanes do not match the perceived intersection layout in your situation, or potentially perception is failing to detect the yield sign, the association with the traffic light, etc. The potential for end-to-end is that there will be examples of people driving without stopping on red lights for slip lanes similar to yours.
 
What it most certainly does not mean is that there are only 3k lines of code (obviously!)
I'm not sure it's so obvious given what you and others have commented. Andrej Karpathy has implemented large language model inference of Llama 2 in ~500 lines. Theoretically this program could even exceed the capabilities of GPT-4 if given appropriate weights/parameters. Most of this code is converting input data into internal formats, doing a bunch of math, then converting to some output format, so for FSD, with richer inputs and presumably a more complicated internal neural network structure, expanding to ~3k lines of total code might not be that unreasonable.

This amount of code is probably not too different for either the world model or modular end-to-end approach, as even with modular, there are probably only a few pieces to connect together, each with its own thousand or so lines of glue.
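
For a sense of why the code itself can stay small, here's a toy version of the same idea in Python rather than C (made-up sizes): inference is basically input conversion, a loop of matrix math, and output conversion, with all the actual capability living in the weights.

```python
# Toy illustration of why inference code stays small: it's mostly plumbing
# around a loop of matrix math. A two-layer network in plain numpy:
import numpy as np

weights = [np.random.randn(784, 256), np.random.randn(256, 10)]  # the "parameters"

def forward(x):
    # All of the capability lives in the weight values, not in this code.
    for i, w in enumerate(weights):
        x = x @ w
        if i < len(weights) - 1:
            x = np.maximum(x, 0)   # ReLU
    return x

pixels = np.random.rand(784)       # "convert input data into internal formats"
scores = forward(pixels)           # "do a bunch of math"
print(int(scores.argmax()))        # "convert to some output format"
```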
 
10.69 tried to fix this ... The potential for end-to-end is that there will be examples of people driving without stopping on red lights for slip lanes similar to yours.
It may have tried, but it failed miserably, and not just at this intersection; it’s a pattern repeated frequently at many intersections. This intersection is particularly bad not just because it stops when it doesn’t need to but because it won’t proceed at all - I have to take over.
 
There are many other subtle concepts that do not lend themselves to training by example.
Exactly. It'll be interesting to hear what they come up with to handle these events and situations. And there are so many of them, too; I don't think this is a matter of "chasing the long tail of nines," but rather how to handle common situations that just don't lend themselves to "do it by example."

Hope this whole thing doesn't end up being yet another one of those infamous "false horizons."
 
Continuing my thoughts above, since I waited too long to edit the above post...

IMO, they're going to have to come up with two different ways to teach the machine:
1. Learning by example, which would be the thoroughly discussed "train it with a bunch of clips of people doing it right" method.
2. Learning by trial and error.

Have any of you guys/gals/other seen the hilarious clip of an AI learning to bowl? Scenario: set up a physics engine. Give the AI a rag doll that is capable of moving like a human. Set a goal for it to roll the ball down the lane and knock over the pins. Give it the rules of bowling. Hit the "Run" button.

What ensues is absolute hilarity... balls flying all over, rag dolls being exactly that. It's just hilarious. But eventually, after A LOT of trial and error, the thing learns to bowl. Just for emphasis, there is NO TRAINING BY EXAMPLE involved here; it didn't have the luxury of watching a whole bunch of clips of people "doing it right."

Again IMO, there will be situations (possibly quite a lot of them) where learning by example just isn't going to work. They're going to need to set up the simulator for certain situations, just like the above bowling example, and let the machine go nuts until it figures it out.

At a very basic level, this is how humans learn to drive. Teach the rules, show him/her/other how to do it, then let them learn the rest as they go. Since the goal is to have the machine drive like a human, I just don't see any other way... but then again, I'm not any sort of engineer or computer scientist and I could be completely wrong. I'm certainly no stranger to being wrong; I've been married for 30 years.
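
For what it's worth, the bowling clip is basically reinforcement learning / trial-and-error search: no demonstration clips, just a score and lots of attempts. A minimal sketch of that loop with a toy one-parameter "environment" (obviously nothing like a real driving simulator):

```python
# Toy "learn by trial and error" loop -- no demonstration clips anywhere.
# Hypothetical mini-environment: pick a release angle, get a score back.
import numpy as np

rng = np.random.default_rng(0)
TARGET = 0.7                       # the (hidden) angle that knocks down all the pins

def environment(angle):
    """Reward is higher the closer the throw is to the unknown ideal angle."""
    return -abs(angle - TARGET)

angle = rng.uniform(0.0, 1.0)      # start with a wild guess -> balls fly everywhere
best_reward = environment(angle)

for attempt in range(1000):
    candidate = angle + rng.normal(0.0, 0.05)   # perturb the current behavior a bit
    reward = environment(candidate)
    if reward > best_reward:                    # keep changes that scored better
        angle, best_reward = candidate, reward

print(round(angle, 3))             # ends up near 0.7 purely by trial and error
```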
 
there will be situations (possibly quite a lot of them) where learning by example just isn't going to work.
What about the one-offs? There's a UPL out of my neighborhood that has a somewhat unique setup, which FSDb fails today. There are many variations of a stop sign in NA - some are very infrequent. Or the "ladder in the road" scenario? If there are a million images of UPLs put into the sim, personally I doubt they would see one very close to mine.

I guess I'm saying, there will still be some one-off situations, right? And, like today, do we wait for the "next release" for corrections? And if urgent, could that be done quickly?
 
And there are so many of them, too; I don't think this is a matter of "chasing the long tail of nines," but rather how to handle common situations that just don't lend themselves to "do it by example."
Could you explain more of these common situations that you think end-to-end would have trouble learning from example? I didn't quite follow why the yield example with a slip lane would be problematic either, as presumably the network should learn when it's correct to yield vs stop based on what it sees of the intersection layout, signage, etc.
 
IMO, they're going to have to come up with two different ways to teach the machine: 1. Learning by example ... 2. Learning by trial and error.
There's a third option - use a hybrid of programmed and 'learned' heuristics.
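
In practice that hybrid could be as simple as a learned policy whose outputs pass through a thin layer of hand-written checks before reaching the controls. A minimal sketch, with hypothetical names and toy numbers (not anything Tesla has described):

```python
# One way a hybrid could look -- purely illustrative: a learned policy proposes
# controls, and a small layer of hand-written heuristics gets the final say.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # learned part

def safety_override(controls, obstacle_distance_m, speed_mps):
    """Programmed heuristic: never accelerate toward something close ahead."""
    steering, accel = controls
    if obstacle_distance_m < speed_mps * 2.0:   # rough two-second rule
        accel = min(accel, -1.0)                # force braking regardless of the NN
    return steering, accel

features = torch.randn(32)                      # whatever the perception stack emits
steering, accel = policy(features).tolist()     # learned proposal
steering, accel = safety_override((steering, accel),
                                  obstacle_distance_m=5.0, speed_mps=15.0)
print(steering, accel)
```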
 
It's much easier to test a geofenced system, and the potential unknowns for a system that can operate anywhere are orders of magnitude greater.

If you've been to different areas of the country, you can appreciate the significant differences in roads, driving styles, etc. A system geofenced to one city only needs to be 'taught'/programmed to deal with drivers and road styles of that city vs the entire country.
Just like NHTSA testing vehicles for crashworthiness, there would be an organization that would test and provide a rating for the driving under its curated set of situations. Insurance companies could use that to quote insurance rates.
 
One last example from the livestream: "we have never programmed in the concept of a roundabout" could actually be true for both perception and control...
Yes - this is why roundabouts still suck ;)

Will other things be as bad?

Terrible example, IMO!
 
But in order to do this, the car will still need to be able to identify objects; it will need to be able to identify a human in the road so it can then refer to the clips that show other cars stopping for this object type
Not really.

It's not like the code "understands" what a human is, anyway.

Basically, there is no requirement to first specifically identify objects and then do controls. They can all be put into a gigantic NN with "photons" as input and car controls as output.
 
Because that would give them the data to put in training that you said you didn't think they would ever have.
Ok. I'm trying to better understand.

I believe I'm likely the only Tesla making this unique UPL out of my neighborhood. Let's say that's once per day, always on FSDb. Two-thirds of the attempts (when there is traffic) end in an intervention. One-third (no traffic) work great. Where would they get a video of it being done correctly with traffic?

So Tesla has roughly 360 FSD UPL attempts per year at that unique location, all from me. Out of the million examples they may have for a UPL, it seems unlikely they would have any successful versions of it being done properly with traffic.

And really we could bump up a level. What I'm really curious about is how one-offs (we will have those, yes?) will be handled and fixes redistributed. Do we wait for the next release, like today?