Pardon? The 300k+ lines of control code have no perception capabilities whatsoever; they just react to inputs provided by the planner.

Here's my logic on this, someone please break it down:

1) the 300k lines of code pertain to both planning and control, essentially all heuristics

2) the heuristics are dependent on perceptual objects labeled in a human-understandable way. For example, the vision perception NN will predict hundreds of objects with human-understandable labels ("stop sign", "traffic light", "pedestrian", etc.) and place them into a BEV vector space.

3) V12 gets rid of all these human-understandable heuristics, so there's no way to code the planner against human-understandable objects. Per Ashok, V12 has internal representations of these human concepts, but there's no good way to ask V12 to respond to a "stop sign" or "traffic light", for example. All of V12's normal driving functions are based only on video training of good drivers.

4) so this raises the question: if perceptual objects are no longer represented with human-understandable labels for any of the driving, then what do we make of all the labeled objects from V11?
 
Here's my logic on this, someone please break it down:

1) the 300k lines of code pertain to both planning and control, essentially all heuristics
Yes, the 300k lines of code are planning and control.
2) the heuristics are dependent on perceptual objects labeled in a human-understandable way. For example, the vision perception NN will predict hundreds of objects with human-understandable labels ("stop sign", "traffic light", "pedestrian", etc.) and place them into a BEV vector space.
Yes, but if you strip the human labels from the detected objects, you still end up with objects and their positions in space. Let's say you have the three objects you listed and their locations:

Object #1: id: 34354, label: pedestrian, location:xyz
Object #2: id: 54656, label:traffic light green, location:xyz
Object #3: id: 97356, label:stop sign, location:xyz

By going away from human-programmed C++ heuristics, you no longer need the label.
But you still have objects, or you can call them vectors. The system isn't being "told" what a traffic light is, or to stop at a stop sign, or how long to stop, or even to pay attention to stop signs. Instead, it learned from correlation that whenever id 97356 appears at certain locations/orientations, its training data comes to a smooth stop.

This satisfies all the statements by Ashok/Elon. What they are doing is gathering video data from the fleet and processing it with their ground-truth NN (which is bigger than the one that runs live on the car); those outputs are then used to train the driving-policy NN. But no, the raw pixels of the entire image are not being used, as there would be too much noise; the signal-to-noise ratio would be very low. What gets fed into training looks more like:

id: 34354, location:xyz
id: 54656, location:xyz
id: 97356, location:xyz
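
Roughly, in code, the planner's training input would look something like the sketch below. This is just a hypothetical illustration of the idea; the ids, field names, and shapes are made up, not Tesla's actual format.

# Hypothetical sketch: strip human labels from perception outputs and keep
# only an opaque class id plus geometry, which is what a learned planner
# would consume. Ids, fields, and shapes are illustrative only.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DetectedObject:
    class_id: int          # opaque id, e.g. 97356; no "stop sign" string anywhere
    position: np.ndarray   # (x, y, z) in the BEV/vector space
    heading: float         # orientation in radians

def to_planner_features(objects: List[DetectedObject]) -> np.ndarray:
    """Flatten detections into the anonymous vectors a learned planner trains on."""
    rows = [np.concatenate(([o.class_id], o.position, [o.heading])) for o in objects]
    return np.stack(rows) if rows else np.empty((0, 5))

scene = [
    DetectedObject(34354, np.array([12.0, -1.5, 0.0]), 0.0),   # was "pedestrian"
    DetectedObject(54656, np.array([40.0,  3.0, 5.0]), 0.0),   # was "traffic light green"
    DetectedObject(97356, np.array([35.0,  4.0, 1.5]), 1.57),  # was "stop sign"
]
print(to_planner_features(scene))  # the planner only ever sees numbers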


3) V12 gets rid of all these human-understandable heuristics, so there's no way to code the planner against human-understandable objects. Per Ashok, V12 has internal representations of these human concepts, but there's no good way to ask V12 to respond to a "stop sign" or "traffic light", for example. All of V12's normal driving functions are based only on video training of good drivers.
But the raw video itself is not being used in the training, just the outputs after it has been processed by the ground-truth NN, as I outlined above.
 
By going away from human-programmed C++ heuristics, you no longer need the label.
But you still have objects, or you can call them vectors. The system isn't being "told" what a traffic light is, or to stop at a stop sign, or how long to stop, or even to pay attention to stop signs. Instead, it learned from correlation that whenever id 97356 appears at certain locations/orientations, its training data comes to a smooth stop.

So you're saying all objects are still represented as generic objects in the same vector space as V11, but they're not given human semantics like stop sign, lane line, or traffic light?

I guess that makes sense, although I'd have to see if that agrees with all that was said during the livestream.
 
Again, Tesla has never mentioned any intent on localization. They've stated the opposite.

This is a fantasy or wish list, nothing more.
Yeah, right.

That is why they are hiring ADAS drivers across all major metropolitan areas. Someone should learn to listen to others too.

“Drive local training efforts to continuously

Support the data collection effort to improve vehicle performance

Analyze test data, triage software issues and abnormal vehicle behaviors using Tesla in-house designed proprietary software tools“

I think I am done speaking here. I will come back to this thread in 12 months.
 
Personally, I think Tesla is probably working on V11 and V12 in parallel. When Elon talked about replacing 300k lines of code with NN, he was probably talking about V11 and replacing the previous code for planning and control with NN. So same modular architecture, just with new NN for planning and control. But then Tesla engineers decided to experiment with true end-to-end. So they started training a new NN with video to drive purely based on video input. They had some good initial results. So this became V12 alpha that Elon tested during the livestream.

But I think a more interesting question is can end-to-end achieve the safety and reliability needed for driverless? Right now, we know modular can achieve safe and reliable driverless. We don't know about E2E yet since it is still a very new approach. So far, E2E is showing a lot of promise but has yet to achieve the high reliability needed for driverless. And many experts believe there will be challenges to get there. So we need to wait and see. My view is use whatever architecture gets you to safe and reliable driverless. If that is modular, then use modular. If it is E2E, then use that.
 
Personally, I think Tesla is probably working on V11 and V12 in parallel. When Elon talked about replacing 300k lines of code with NN, he was probably talking about V11 and replacing the previous code for planning and control with NN. So same modular architecture, just with new NN for planning and control. But then Tesla engineers decided to experiment with true end-to-end. So they started training a new NN with video to drive purely based on video input. They had some good initial results. So this became V12 alpha that Elon tested during the livestream.

But I think a more interesting question is can end-to-end achieve the safety and reliability needed for driverless? Right now, we know modular can achieve safe and reliable driverless. We don't know about E2E yet since it is still a very new approach. So far, E2E is showing a lot of promise but has yet to achieve the high reliability needed for driverless. And many experts believe there will be challenges to get there. So we need to wait and see. My view is use whatever architecture gets you to safe and reliable driverless. If that is modular, then use modular. If it is E2E, then use that.
I personally think Tesla will give up on driverless software (but never admit so); they will sell the robotaxi platform to others who want to try and lose money doing it.

E2E will make it much easier to build a pretty decent L2++++ driver-assistance product across many localities.
 
The OpenPilot guys drove to Taco Bell using "end-to-end" last year, disengagement free, so of course it works; it just needs a lot more work before it's "good". They still have some hardcoded physics, and they still have classic controls instead of RL, but it's mostly end-to-end. They even have end-to-end navigation, where the model literally takes in a picture of a map with the navigation path drawn on it and outputs a vector that goes into the driving policy. It's not clear to me whether Tesla is doing end-to-end for navigation in V12 or not.
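
For a rough idea of how that map-image navigation input can work, here is a toy sketch. To be clear, this is my own illustration, not openpilot's or Tesla's code, and every module name and dimension is made up: the route gets rasterized onto a small top-down image, a tiny CNN encodes it, and the resulting vector is concatenated with the camera features before the policy head predicts waypoints.

# Hypothetical sketch of map-image navigation conditioning, not actual
# openpilot or Tesla code. A route raster is encoded by a small CNN and
# the resulting vector is fused with camera features in the policy.
import torch
import torch.nn as nn

class NavMapEncoder(nn.Module):
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, nav_map: torch.Tensor) -> torch.Tensor:
        # nav_map: (batch, 1, 128, 128) raster with the route drawn as bright pixels
        return self.fc(self.conv(nav_map).flatten(1))

class DrivingPolicy(nn.Module):
    def __init__(self, cam_dim: int = 512, nav_dim: int = 64, horizon: int = 33):
        super().__init__()
        self.horizon = horizon
        self.nav_encoder = NavMapEncoder(nav_dim)
        self.head = nn.Sequential(
            nn.Linear(cam_dim + nav_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),  # (x, y) for each future waypoint
        )

    def forward(self, cam_features: torch.Tensor, nav_map: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([cam_features, self.nav_encoder(nav_map)], dim=1)
        return self.head(fused).view(-1, self.horizon, 2)

policy = DrivingPolicy()
waypoints = policy(torch.randn(1, 512), torch.rand(1, 1, 128, 128))
print(waypoints.shape)  # torch.Size([1, 33, 2])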
 
@powertoold I know there has been a big debate in this thread about whether V12 is true end-to-end or not. This paper may shed some light on that. It specifies that e2e does not have to be one big black box with only planning/control outputs and no intermediate outputs; e2e can be modular:

Note that the end-to-end paradigm does not necessarily indicate one black box with only planning/control outputs. It could be modular with intermediate representations and outputs (Fig. 1 (b)) as in the classical approach. In fact, several state-of-the-art systems [1, 2] propose a modular design but optimize all components together to achieve superior performance.

Source: https://arxiv.org/pdf/2306.16927.pdf

Based on this information, V12 does not have to be some new big black box that just takes in sensor input and directly outputs controls. Tesla did not have to throw away V11 and start from scratch with V12. Tesla replaced the planning code with NN and is now training the entire stack at once. V12 could be both e2e and modular. What makes V12 e2e is that it is now NN from sensor input to control output, and the entire stack can be trained at the same time, together.
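
To make that distinction concrete, here is a toy sketch with entirely made-up module names and sizes (this is not Tesla's code) of a stack that stays modular, with an intermediate representation between perception and planning, yet is end-to-end in the paper's sense because a single loss backpropagates through both modules in one training step.

# Toy illustration of "modular but end-to-end": perception and planning are
# separate modules with an intermediate representation between them, but one
# optimizer updates both from a single planning loss. Names/sizes are made up.
import torch
import torch.nn as nn

class Perception(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 128))

    def forward(self, sensor_input):
        return self.encoder(sensor_input)  # intermediate BEV-like representation

class Planner(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 20 * 2))

    def forward(self, scene_features):
        return self.head(scene_features).view(-1, 20, 2)  # 20 future waypoints

perception, planner = Perception(), Planner()
optimizer = torch.optim.Adam(list(perception.parameters()) + list(planner.parameters()), lr=1e-4)

# One training step: gradients from the planning loss flow back through perception too.
sensor_input = torch.randn(8, 1024)           # stand-in for camera features
expert_waypoints = torch.randn(8, 20, 2)      # stand-in for human-driver trajectories
pred = planner(perception(sensor_input))
loss = nn.functional.mse_loss(pred, expert_waypoints)
optimizer.zero_grad()
loss.backward()
optimizer.step()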
 
I'm on vacation right now, so I haven't had time to analyze bladerskb's suggestion about labeling objects generically :)

I guess it could be possible, but I don't understand the advantage of labeling a stop sign as an "object-3728" vs "stop sign-3728."
 
End-to-end Diffusion Policy, however, is very interesting.
 
I wonder if a big push for V12's end-to-end / world model was to avoid ending up with a similar 300k+ lines of control code complexity for Optimus?


During the V12 demo, when they were talking about capabilities like recognizing your face to pick you up and following your verbal instructions to adjust driving, it seemed a bit "too future" to be worked on now, but those types of interactions are somewhat required for Optimus to interact with the world.

This combination of vision and language inputs allows the world model to predict future vision based on language like what Tesla demoed at CVPR23: prompting to "drive straight" vs "change lane to right." Presumably this should also allow for predicting explanation language based on vision like what Wayve demoed at CVPR23: "I'm slowing down because there's a cyclist ahead."
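
As a purely hypothetical sketch of the idea (nothing like Tesla's or Wayve's actual models; the architecture and dimensions are invented), a language-conditioned world model boils down to predicting the next frame's latent from the current frame's latent plus an embedded prompt, so different prompts yield different predicted futures.

# Hand-wavy sketch of a language-conditioned world model: given the latent of
# the current video frame and an embedded text prompt ("drive straight" vs
# "change lane to right"), predict the latent of a future frame. Entirely
# illustrative; dimensions and architecture are invented for this example.
import torch
import torch.nn as nn

class LanguageConditionedWorldModel(nn.Module):
    def __init__(self, frame_dim: int = 256, vocab_size: int = 1000, text_dim: int = 64):
        super().__init__()
        self.text_embed = nn.EmbeddingBag(vocab_size, text_dim)  # crude bag-of-words prompt encoder
        self.dynamics = nn.Sequential(
            nn.Linear(frame_dim + text_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim),  # predicted latent of the next frame
        )

    def forward(self, frame_latent: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        prompt = self.text_embed(prompt_tokens)
        return self.dynamics(torch.cat([frame_latent, prompt], dim=1))

model = LanguageConditionedWorldModel()
frame_latent = torch.randn(1, 256)                # encoded current camera frame
prompt_tokens = torch.tensor([[12, 47, 3]])       # token ids for e.g. "change lane to right"
next_latent = model(frame_latent, prompt_tokens)  # different prompts -> different futures
print(next_latent.shape)  # torch.Size([1, 256])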
 
I wonder if a big push for V12's end-to-end / world model was to avoid ending up with a similar 300k+ lines of control code complexity for Optimus?
I wonder if they decided to use neural networks for the robot from the very start and, after some time with it, figured that they could apply it equally well to the autonomy stuff.

I loved the fluid movement in that demo.
 
I wonder if they decided to use neural networks for the robot from the very start and, after some time with it, figured that they could apply it equally well to the autonomy stuff
I believe Tesla first shared this video, including "End-to-end manipulation; Images -> Joint angles", back in May, so they indeed could have had it working on Optimus and then applied it to driving. This was recently re-shared as part of Phil Duan's CVPR23 talk when introducing Vision Foundation Models for Autonomous Driving:

[Image: end-to-end optimus.jpg]
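
For a sense of what "Images -> Joint angles" means at the code level, here is a minimal, purely illustrative sketch; it is not Tesla's Optimus network, and the joint count and layer sizes are arbitrary.

# Illustrative-only sketch of "images -> joint angles": a small CNN maps a
# camera frame directly to a vector of joint angle targets.
import torch
import torch.nn as nn

class ImageToJointAngles(nn.Module):
    def __init__(self, num_joints: int = 28):  # joint count chosen arbitrarily
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_joints)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image))  # one angle target per joint

model = ImageToJointAngles()
angles = model(torch.randn(1, 3, 224, 224))
print(angles.shape)  # torch.Size([1, 28])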
 
I wonder if they decided to use neural networks for the robot from the very start and, after some time with it, figured that they could apply it equally well to the autonomy stuff.

I don't think that is the right way to look at it. I think Tesla decided to use NN for both robots and AVs simply because it is the most promising approach to solving the problem. As I stated before, heuristics are too cumbersome and inflexible. Machine learning is the only way to solve AI. That is why everyone is moving more and more to ML to solve AI problems. In the case of AVs, some might do more modular, while others like Tesla are doing e2e, but everyone is using ML in part or all of their stack.
 
 

Yeah, maybe Optimus started the exploration of e2e. I am just saying that it makes sense for Tesla to use e2e for both Optimus and FSD. And keep in mind that robotics and AVs are similar fields. They are actually both robots, just specialized to different tasks. AVs are essentially robots in vehicle form specialized to driving whereas Optimus is a humanoid robot specialized to humanoid tasks. So it makes perfect sense for the Optimus team and the FSD team to collaborate.
 
The OpenPilot guys drove to Taco Bell using "end-to-end" last year, disengagement free, so of course it works; it just needs a lot more work before it's "good". They still have some hardcoded physics, and they still have classic controls instead of RL, but it's mostly end-to-end. They even have end-to-end navigation, where the model literally takes in a picture of a map with the navigation path drawn on it and outputs a vector that goes into the driving policy. It's not clear to me whether Tesla is doing end-to-end for navigation in V12 or not.
I remember that video, and it was nothing to boast about. Very crude: light traffic at the time, but it seemed to be in the way of traffic, even semis pulled away from it, sluggish starts at stop lights, difficulty staying within the lanes, especially on turns... The driver should have intervened.
 
Here's one obvious point we need to clarify:

- does V12 reduce both control and planning heuristics, or just control heuristics?

My thought is that both planning and control heuristics were reduced. This is based on the many comments during the livestream about how long to wait at stop signs and roundabouts, etc. These are decisions related to planning. Let me know if I'm incorrect.
 
Here's one obvious point we need to clarify:

- does V12 reduce both control and planning heuristics, or just control heuristics?

My thought is that both planning and control heuristics were reduced. This is based on the many comments during the livestream about how long to wait at stop signs and roundabouts, etc. These are decisions related to planning. Let me know if I'm incorrect.
I’m not sure there is any evidence of any remaining written code that is substantial or needs/will be removed in the future.
 