Enhanced Summon, where are you?

New video:
Wow. Pretty irresponsible to be doing this in a parking lot with other cars active.

Blows through a stop sign at 5:25 while driving on the left. In fact, most of the time is spent on the left, even with arrows on the road indicating the opposite direction.

Can see why the technical lead for summon just left. A fresh team should be able to get it working in no time at all. Looks good for a worldwide release next month. Really...
 
So it doesn't care about stop signs yet? At such low speed and that close up, it should be detecting it. Is there no sign reading involved in Enhanced Summon?

No, it seems Tesla has not implemented stop signs yet.

As the video alludes to, it seems that the current Enhanced Summon is really just an exercise in "free space pathfinding". The focus is on path finding and recognizing the "drivable space": Tesla is training the vision NN to know that this area of asphalt is drivable but that curb or grassy area is not. This is a necessary first step, especially for Tesla, which relies on cameras for perception. Before you can get to road signs and such, the car has to know where it can drive. And again, Tesla has to rely exclusively on cameras for this, which is why it is taking time to train the vision NN. The current Enhanced Summon is primitive, but over time it will get better. Once Tesla nails the "free space pathfinding" part, they can incorporate stop signs, traffic rules, and auto park, and then Enhanced Summon will be able to self-drive and self-park on its own.
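To make the "free space pathfinding" idea concrete, here is a rough toy sketch of what a drivable-space segmentation net could look like (PyTorch; the architecture, sizes, and labels are placeholders I made up, not anything Tesla has described):

```python
# Toy sketch of a "drivable space" segmentation net (NOT Tesla's actual NN).
# Input: camera image; output: per-pixel probability that the pixel is drivable.
import torch
import torch.nn as nn

class DrivableSpaceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),  # 1 channel: drivable logit
        )

    def forward(self, image):
        return self.decoder(self.encoder(image))  # logits, same H x W as the input

# One training step: labeled masks say which pixels are asphalt you can drive on
# versus curb / grass / obstacle.
model = DrivableSpaceNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

images = torch.rand(8, 3, 128, 128)                      # batch of camera frames (dummy data)
masks = torch.randint(0, 2, (8, 1, 128, 128)).float()    # 1 = drivable, 0 = not drivable

optimizer.zero_grad()
loss = loss_fn(model(images), masks)
loss.backward()
optimizer.step()
```

A planner would then just search for a path through the ground-plane area the net marks as drivable, which is roughly the behavior the current Enhanced Summon seems to show.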
 
I keep feeling the City NOA team and the Enhanced Summon team are not on speaking terms. We know the car has been able to recognize stop signs and even traffic lights for months, but that has not been incorporated into Summon.

My take is that Musk (and others on the team) thought Summon was very simple and had just a handful of people working on it with procedural code. They have faced an endless stream of difficulties, as one would expect, and had to train new NN tasks to identify things like grass and curbs.

What I'd like to see is "DoJo" put to work on this summon-in-a-parking-lot problem. If e2e is ever going to work, Summon would be a very good test.
 
New video:

Seems to me the car is going fast enough now. It still hesitates at points for no reason. There is still some work to do on driving-area recognition, and of course a lot more work is needed to drive on the right side, etc.

BTW, I should say it is not at all uncommon to see people drive on the wrong side or cut across parking spots in a parking lot.
 
What I'd like to see is "DoJo" put to work on this summon-in-a-parking-lot problem. If e2e is ever going to work, Summon would be a very good test.

I suspect that for DoJo to work, the car first needs excellent vision. After all, if the car is wrong about what it thinks it is seeing, that will throw everything else off. My understanding is that once vision is good enough, Tesla can feed massive amounts of fleet data (steering and pedal controls matched with vision) to train the NN. In essence, the machine will learn what steering controls should be applied based on what it sees. For example: if I see a lane merge up ahead and the traffic around me doing this, I should move the steering wheel this much and press the accelerator/brake pedal this way. Hotz mentions in his interview that with enough data you can eliminate the "bad drivers" because they will be outliers. The data from "good drivers" will clump together. So you eliminate the outliers from the bad drivers and only use the data from the good drivers to train the NN.

If my understanding of DoJo is correct, it should allow Tesla to replace the hard-coded lateral and longitudinal controls with an NN and thereby have an NN control all aspects of driving, from perception to steering and braking. So lane changes, lane merges, etc. can be handled more smoothly and intelligently by the car. It will be exciting to see what DoJo is able to contribute to FSD.
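To make the "eliminate the bad drivers" idea concrete, here is a toy sketch of fleet imitation learning with a crude outlier filter (entirely my own guess at the shape of it; the z-score filter and the tiny policy net are placeholders, not a description of how DoJo actually works):

```python
# Toy imitation-learning sketch (my guess, not how DoJo actually works).
# Fleet logs: perception features paired with what the human driver did.
import numpy as np
import torch
import torch.nn as nn

# Pretend fleet data: per-sample perception features + (steering, accel) labels.
features = np.random.randn(10000, 64).astype(np.float32)   # vision NN outputs (stand-in)
controls = np.random.randn(10000, 2).astype(np.float32)    # [steering, accel/brake]

# "Eliminate the bad drivers": keep only samples whose controls are close to what
# most drivers did (a crude z-score outlier filter standing in for the clumping idea).
mean, std = controls.mean(axis=0), controls.std(axis=0)
keep = (np.abs((controls - mean) / std) < 2.5).all(axis=1)
features, controls = features[keep], controls[keep]

# Small policy net: perception features in, steering/pedal commands out.
policy = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

x, y = torch.from_numpy(features), torch.from_numpy(controls)
for _ in range(10):                                  # a few passes over the filtered data
    pred = policy(x)
    loss = nn.functional.mse_loss(pred, y)           # match what the good drivers did
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```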
 
Here is an interesting comment from an EAP owner on Reddit. If this kind of reaction is widespread, we won't be able to use Summon in anything but empty parking lots.

You say that now, and generally it's true. Casual observers are amazed. But when your driverless car is rolling toward another driver who just rounded a corner and they panic and throw their car into reverse to avoid the oncoming unmanned vehicle, there's a sobering moment when you realize there's more to Enhanced Summon than just the car avoiding obstacles to find its way to you. Unfortunately, until Summon can use proper parking lot etiquette, people's reactions to it can be embarrassing (for the summoner) and unsafe.
Enhanced Summon "v4" is looking MUCH better. : teslamotors
 
I suspect that for DoJo to work, the car first needs excellent vision. After all, if the car is wrong about what it thinks it is seeing, that will throw everything else off. My understanding is that once vision is good enough, Tesla can feed massive amounts of fleet data (steering and pedal controls matched with vision) to train the NN.
No. The way e2e would work is that you just feed in the sensor outputs and the driver actions. You don't feed the current vision NN output as the input. You basically don't assume anything (no lanes, no driving paths, no other cars, nothing, just raw pixels) and let the NN figure it out. Obviously, something of this complexity hasn't been attempted before, so no one knows whether it would work, how long it would take, or how much training data is needed.
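In code terms, the difference is roughly this: the network gets raw camera frames as input and is trained to reproduce the driver's steering/pedal outputs directly, with no lane or object representation in between. A minimal sketch, assuming a simple supervised setup (the sizes and layers are arbitrary):

```python
# Toy end-to-end sketch: raw pixels in, steering/pedal out, nothing in between.
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),   # no lane or object concepts,
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),  # just pixels
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(),              # infers the flattened size
            nn.Linear(100, 2),                          # [steering, accel/brake]
        )

    def forward(self, pixels):
        return self.net(pixels)

policy = EndToEndPolicy()
frames = torch.rand(4, 3, 66, 200)            # raw camera frames (dummy sizes)
driver_actions = torch.rand(4, 2)             # what the human driver actually did

pred = policy(frames)
loss = nn.functional.mse_loss(pred, driver_actions)   # imitate the logged driver
loss.backward()
```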
 
Thanks for the explanation. Based on what you wrote, e2e is actually learning perception and driving controls at the same time, because it is learning to interpret what the raw pixels mean together with the steering controls. Personally, I think that might be going too far. After all, if you've already done all the perception work, why start from scratch?

So I think it would make sense to feed the vision NN output together with the driver actions. That way, the system can combine what the car is seeing with what the driver is doing in order to learn how to drive.
 
Because you don't know whether the work already done is the correct thing.

Take, for example, the lanes thing. How do you handle roads with no markings, like in parking lots? Now you have to introduce imaginary lanes. Then how do you handle double parking? You have to introduce crossing lanes in "some scenarios".

Anyway, as the name implies, e2e is end to end: sensor data as input, trained by rewarding the network for matching known driver behavior.

BTW, I have yet to read that Nvidia paper linked in the DoJo thread.
 
Thanks. But I don't think I am understanding e2e. For example, the car sees a certain pattern of raw pixels and the driver moved the steering wheel to cross a line of pixels (lane change). But the car does not know that those pixels are lane lines or that those other pixels are cars. And just because I see that pattern of pixels does not mean I should automatically make that action (lane change). It feels like driving blind. It feels to me like it is lacking context which is critical to driving. So I am not seeing how e2e works to solve FSD.
 
This is a toy example, but it's an interesting demonstration of reward-driven machine learning.

The only inputs are the pixel graphics from the game, along with a reward (points) for making forward progress in the desired direction. With only those inputs, the NN learns amazingly complex driving behavior.
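For anyone curious what that looks like in code, here is a minimal sketch of the same idea: a policy network that sees only pixels and gets rewarded for forward progress. The game environment is a made-up stand-in, and the update is a bare-bones REINFORCE-style policy gradient, so treat it as an illustration only:

```python
# Minimal sketch of learning from pixels plus a "forward progress" reward.
# The game environment is a made-up stand-in; the update is bare-bones REINFORCE.
import torch
import torch.nn as nn

class ToyDrivingGame:
    """Fake game: the observation is a pixel frame, the reward is forward progress."""
    def step(self, action):                      # action: 0=left, 1=straight, 2=right
        obs = torch.rand(3, 64, 64)              # next frame (random stand-in)
        reward = 1.0 if action == 1 else 0.1     # more points for going forward
        return obs, reward

policy = nn.Sequential(                          # raw pixels -> action logits
    nn.Conv2d(3, 8, 5, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 3),                   # 64x64 frame -> 8 x 30 x 30 feature map
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

env, obs = ToyDrivingGame(), torch.rand(3, 64, 64)
for _ in range(100):
    logits = policy(obs.unsqueeze(0))
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    obs, reward = env.step(action.item())
    loss = (-dist.log_prob(action) * reward).mean()  # reinforce actions that score points
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```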
 
I think I get it now. Basically, e2e matches the driving controls with the sensor input to learn driving. It might not know what lane lines are or what a car is, but it could discern from the steering controls that the car is staying between two parallel lines (lane keeping), or steering to cross one line and then centering between the two adjacent parallel lines (lane change). So it could learn different driving behaviors.
 
Right. It's very much a "black box" approach to autonomous driving: sensor inputs in, human-like driving behavior out. And you could elicit specific steering behavior by changing the reward structure for the car: some points for continuing straight, some points for evading collisions, negative points for cutting off other drivers, lots of points for changing lanes. The tricky part is that you still need a heuristic to know what a lane is and what a lane change looks like. In the Mario Kart example, the reward system still had to know what "forward" was.
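As a made-up illustration of that reward structure (the events and point values are arbitrary):

```python
# Made-up illustration of a driving reward structure (events and points are arbitrary).
def driving_reward(event):
    rewards = {
        "kept_straight": 1.0,           # some points for continuing straight
        "avoided_collision": 5.0,       # some points for evading collisions
        "cut_someone_off": -10.0,       # negative points for cutting off other drivers
        "completed_lane_change": 20.0,  # lots of points for changing lanes
    }
    return rewards.get(event, 0.0)

# The catch from the post: deciding that "completed_lane_change" even happened
# still requires some notion of what a lane is, just as the game example still
# had to define what "forward" meant.
print(driving_reward("completed_lane_change"))   # 20.0
```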
 
I wonder how e2e would handle weird edge cases, like a construction zone where they have painted over the lane lines. A human driver can discern which lines are the real lane lines and which are the old ones to ignore. How would e2e figure that out? It would just see that the driver applied steering to center between a particular set of lines, but that would not always be the right set.

Also, you would need external navigation instructions to direct the "big picture" of where the car should go.

But I certainly hope DoJo yields positive results for Tesla's FSD effort. Perhaps Tesla can combine the work they have already done with machine learning and neural nets with DoJo to get to FSD faster?

And getting back on topic, yes, I could see how e2e could help a lot with enhanced summon.
 
Thanks. But I don't think I am understanding e2e. For example, the car sees a certain pattern of raw pixels and the driver moved the steering wheel to cross a line of pixels (lane change). But the car does not know that those pixels are lane lines or that those other pixels are cars. And just because I see that pattern of pixels does not mean I should automatically make that action (lane change). It feels like driving blind. It feels to me like it is lacking context which is critical to driving. So I am not seeing how e2e works to solve FSD.
Right. I'm not sure how the target location is specified. Maybe you do it in parts, i.e., as different sets of tasks. The first task would be to go along the road, with points for following the rules and for how long it takes. The second would be to turn onto a different road. So the NN figures out how to do the tasks, but the procedural software sets the next target depending on the navigation points.
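Something like this split, maybe (toy sketch, all names and numbers are mine): procedural code walks the navigation waypoints and hands the NN one local target at a time, so the NN only has to learn the low-level task of getting there.

```python
# Toy sketch of the split: procedural code picks the next local target from the
# navigation route; a learned policy drives toward it. All names are made up.
import math
import torch
import torch.nn as nn

nav_waypoints = [(10.0, 0.0), (10.0, 30.0), (-5.0, 30.0)]   # route from the nav system

# Policy: sensor encoding (64 dims, stand-in) + local target (x, y) -> controls.
policy = nn.Sequential(nn.Linear(64 + 2, 64), nn.ReLU(), nn.Linear(64, 2))

def next_target(car_position, waypoints, reached_radius=3.0):
    """Procedural part: drop waypoints as the car reaches them, return the next one."""
    while waypoints and math.dist(car_position, waypoints[0]) < reached_radius:
        waypoints.pop(0)
    return waypoints[0] if waypoints else None

car_position = (0.0, 0.0)
sensor_features = torch.rand(64)                 # stand-in for a vision/sensor encoding
target = next_target(car_position, nav_waypoints)
if target is not None:
    controls = policy(torch.cat([sensor_features, torch.tensor(target)]))
    print(controls)                              # untrained [steering, accel/brake] output
```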