Pardon? The 300k+ lines of control code have no perception capabilities whatsoever; they just react to inputs provided by the planner.

Here's my logic on this, someone please break it down:

1) the 300k lines of code pertain to both planning and control, essentially all heuristics

2) the heuristics are dependent on perceptual objects labeled in a human-understandable way. For example, the vision perception NN will predict hundreds of objects with human-understandable labels ("stop sign", "traffic light", "pedestrian", etc.) and place them into a BEV vector space.

3) V12 gets rid of all these human-understandable heuristics, so there's no way to code the planner against human-understandable objects. Per Ashok, V12 has internal representations of these human concepts, but there's no good way to ask V12 to respond to a "stop sign" or "traffic light", for example. All of V12's normal driving functions are based only on video training of good drivers.

4) so this raises the question: if perceptual objects are no longer represented with human-understandable labels for any of the driving, then what do we make of all the labeled objects from V11?
 
Here's my logic on this, someone please break it down:

1) the 300k lines of code pertain to both planning and control, essentially all heuristics
Yes, the 300k lines of code are planning and control.
2) the heuristics are dependent on perceptual objects labeled in a human-understandable way. For example, the vision perception NN will predict hundreds of objects with human-understandable labels ("stop sign", "traffic light", "pedestrian", etc.) and place them into a BEV vector space.
Yes, but if you strip the human labels from the detected objects, you still end up with objects and their positions in space. Let's say you have the three objects you listed and their locations:

Object #1: id: 34354, label: pedestrian, location:xyz
Object #2: id: 54656, label:traffic light green, location:xyz
Object #3: id: 97356, label:stop sign, location:xyz

By going away from human-programmed C++ heuristics, you no longer need the label.
But you still have objects, or you can call them vectors. The system isn't being "told" what a traffic light is, or to stop at a stop sign, or how long to stop, or even to pay attention to stop signs. Instead, it learned from correlation that whenever id 97356 appears at certain locations/orientations, its training data comes to a smooth stop.

This satisfies all the statements by Ashok/Elon. What they are doing is gathering video data from the fleet and processing it with their ground-truth NN (which is bigger than the one that runs live on the car); those outputs are then used to train the driving-policy NN. But no, the raw pixels of the entire image are not being used, as there would be too much noise; the signal-to-noise ratio would be very low. What gets fed into training looks more like:

id: 34354, location:xyz
id: 54656, location:xyz
id: 97356, location:xyz
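
Roughly, in code, the planner's training input would look something like the sketch below. This is just a hypothetical illustration of the idea; the ids, field names, and shapes are made up, not Tesla's actual format.

# Hypothetical sketch: strip human labels from perception outputs and keep
# only an opaque class id plus geometry, which is what a learned planner
# would consume. Ids, fields, and shapes are illustrative only.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DetectedObject:
    class_id: int          # opaque id, e.g. 97356; no "stop sign" string anywhere
    position: np.ndarray   # (x, y, z) in the BEV/vector space
    heading: float         # orientation in radians

def to_planner_features(objects: List[DetectedObject]) -> np.ndarray:
    """Flatten detections into the anonymous vectors a learned planner trains on."""
    rows = [np.concatenate(([o.class_id], o.position, [o.heading])) for o in objects]
    return np.stack(rows) if rows else np.empty((0, 5))

scene = [
    DetectedObject(34354, np.array([12.0, -1.5, 0.0]), 0.0),   # was "pedestrian"
    DetectedObject(54656, np.array([40.0,  3.0, 5.0]), 0.0),   # was "traffic light green"
    DetectedObject(97356, np.array([35.0,  4.0, 1.5]), 1.57),  # was "stop sign"
]
print(to_planner_features(scene))  # the planner only ever sees numbers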


3) V12 gets rid of all these human-understandable heuristics, so there's no way to code the planner against human-understandable objects. Per Ashok, V12 has internal representations of these human concepts, but there's no good way to ask V12 to respond to a "stop sign" or "traffic light", for example. All of V12's normal driving functions are based only on video training of good drivers.
But the raw video itself is not being used in the training, just the outputs after it has been processed by the ground-truth NN, as I outlined above.
 
By going away from human-programmed C++ heuristics, you no longer need the label.
But you still have objects, or you can call them vectors. The system isn't being "told" what a traffic light is, or to stop at a stop sign, or how long to stop, or even to pay attention to stop signs. Instead, it learned from correlation that whenever id 97356 appears at certain locations/orientations, its training data comes to a smooth stop.

So you're saying all objects are still represented as generic objects in the same vector space as V11, but they're not given human semantics like stop sign, lane line, or traffic light?

I guess that makes sense, although I'd have to see if that agrees with all that was said during the livestream.
 
Again, Tesla has never mentioned any intent on localization. They've stated the opposite.

This is a fantasy or wish list, nothing more.
Yeah, right.

That is why they are hiring ADAS drivers across all major metropolitan areas. Someone should learn to listen to others too.

“Drive local training efforts to continuously

Support the data collection effort to improve vehicle performance

Analyze test data, triage software issues and abnormal vehicle behaviors using Tesla in-house designed proprietary software tools“

I think I am done speaking here. I will come back to this thread in 12 months.
 
Personally, I think Tesla is probably working on V11 and V12 in parallel. When Elon talked about replacing 300k lines of code with NN, he was probably talking about V11 and replacing the previous code for planning and control with NN. So same modular architecture, just with new NN for planning and control. But then Tesla engineers decided to experiment with true end-to-end. So they started training a new NN with video to drive purely based on video input. They had some good initial results. So this became V12 alpha that Elon tested during the livestream.

But I think a more interesting question is can end-to-end achieve the safety and reliability needed for driverless? Right now, we know modular can achieve safe and reliable driverless. We don't know about E2E yet since it is still a very new approach. So far, E2E is showing a lot of promise but has yet to achieve the high reliability needed for driverless. And many experts believe there will be challenges to get there. So we need to wait and see. My view is use whatever architecture gets you to safe and reliable driverless. If that is modular, then use modular. If it is E2E, then use that.
 
Personally, I think Tesla is probably working on V11 and V12 in parallel. When Elon talked about replacing 300k lines of code with NN, he was probably talking about V11 and replacing the previous code for planning and control with NN. So same modular architecture, just with new NN for planning and control. But then Tesla engineers decided to experiment with true end-to-end. So they started training a new NN with video to drive purely based on video input. They had some good initial results. So this became V12 alpha that Elon tested during the livestream.

But I think a more interesting question is can end-to-end achieve the safety and reliability needed for driverless? Right now, we know modular can achieve safe and reliable driverless. We don't know about E2E yet since it is still a very new approach. So far, E2E is showing a lot of promise but has yet to achieve the high reliability needed for driverless. And many experts believe there will be challenges to get there. So we need to wait and see. My view is use whatever architecture gets you to safe and reliable driverless. If that is modular, then use modular. If it is E2E, then use that.
I personally think Tesla will give up on driverless software (but never admit so); they will sell the robotaxi platform to others who want to try and lose money doing it.

E2E will make it much easier to build a pretty decent L2++++ driver-assistance product across many localities.
 
The OpenPilot guys drove to Taco Bell using "end-to-end" last year, disengagement free, so of course it works; it just needs a lot more work before it's "good". They still have some hardcoded physics, and they still have classic controls instead of RL, but it's mostly end-to-end. They even have end-to-end navigation, where the model literally takes in a picture of a map with the navigation path drawn on it and outputs a vector that goes into the driving policy. It's not clear to me whether Tesla is doing end-to-end for navigation in V12 or not.
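
For a rough idea of how that map-image navigation input can work, here is a toy sketch. To be clear, this is my own illustration, not openpilot's or Tesla's code, and every module name and dimension is made up: the route gets rasterized onto a small top-down image, a tiny CNN encodes it, and the resulting vector is concatenated with the camera features before the policy head predicts waypoints.

# Hypothetical sketch of map-image navigation conditioning, not actual
# openpilot or Tesla code. A route raster is encoded by a small CNN and
# the resulting vector is fused with camera features in the policy.
import torch
import torch.nn as nn

class NavMapEncoder(nn.Module):
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, nav_map: torch.Tensor) -> torch.Tensor:
        # nav_map: (batch, 1, 128, 128) raster with the route drawn as bright pixels
        return self.fc(self.conv(nav_map).flatten(1))

class DrivingPolicy(nn.Module):
    def __init__(self, cam_dim: int = 512, nav_dim: int = 64, horizon: int = 33):
        super().__init__()
        self.horizon = horizon
        self.nav_encoder = NavMapEncoder(nav_dim)
        self.head = nn.Sequential(
            nn.Linear(cam_dim + nav_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * 2),  # (x, y) for each future waypoint
        )

    def forward(self, cam_features: torch.Tensor, nav_map: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([cam_features, self.nav_encoder(nav_map)], dim=1)
        return self.head(fused).view(-1, self.horizon, 2)

policy = DrivingPolicy()
waypoints = policy(torch.randn(1, 512), torch.rand(1, 1, 128, 128))
print(waypoints.shape)  # torch.Size([1, 33, 2])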
 
@powertoold I know there has been a big debate in this thread about whether V12 is true end-to-end or not. This paper may shed some light on that. It specifies that e2e does not have to be one big black box with only planning/control outputs and no intermediate outputs; e2e can be modular:

Note that the end-to-end paradigm does not necessarily indicate one black box with only planning/control outputs. It could be modular with intermediate representations and outputs (Fig. 1 (b)) as in the classical approach. In fact, several state-of-the-art systems [1, 2] propose a modular design but optimize all components together to achieve superior performance.

Source: https://arxiv.org/pdf/2306.16927.pdf

Based on this information, V12 does not have to be some new big black box that just takes in sensor input and directly outputs controls. Tesla did not have to throw away V11 and start from scratch with V12. Tesla replaced the planning code with NN and is now training the entire stack at once. V12 could be both e2e and modular. What makes V12 e2e is that it is now NN from sensor input to control output, and the entire stack can be trained at the same time, together.
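
To make that distinction concrete, here is a toy sketch with entirely made-up module names and sizes (this is not Tesla's code) of a stack that stays modular, with an intermediate representation between perception and planning, yet is end-to-end in the paper's sense because a single loss backpropagates through both modules in one training step.

# Toy illustration of "modular but end-to-end": perception and planning are
# separate modules with an intermediate representation between them, but one
# optimizer updates both from a single planning loss. Names/sizes are made up.
import torch
import torch.nn as nn

class Perception(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 128))

    def forward(self, sensor_input):
        return self.encoder(sensor_input)  # intermediate BEV-like representation

class Planner(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 20 * 2))

    def forward(self, scene_features):
        return self.head(scene_features).view(-1, 20, 2)  # 20 future waypoints

perception, planner = Perception(), Planner()
optimizer = torch.optim.Adam(list(perception.parameters()) + list(planner.parameters()), lr=1e-4)

# One training step: gradients from the planning loss flow back through perception too.
sensor_input = torch.randn(8, 1024)           # stand-in for camera features
expert_waypoints = torch.randn(8, 20, 2)      # stand-in for human-driver trajectories
pred = planner(perception(sensor_input))
loss = nn.functional.mse_loss(pred, expert_waypoints)
optimizer.zero_grad()
loss.backward()
optimizer.step()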
 
I'm on vacation right now, so I haven't had time to analyze bladerskb's suggestion about labeling objects generically :)

I guess it could be possible, but I don't understand the advantage of labeling a stop sign as an "object-3728" vs "stop sign-3728."
 
End-to-end Diffusion Policy, however, is very interesting.
 
I wonder if a big push for V12's end-to-end / world model was to avoid ending up with a similar 300k+ lines of control code complexity for Optimus?


During the V12 demo, when they were talking about capabilities like recognizing your face to pick you up and following your verbal instructions to adjust driving, it seemed a bit "too future" to be worked on now, but those types of interactions are somewhat required for Optimus to interact with the world.

This combination of vision and language inputs allows the world model to predict future vision based on language like what Tesla demoed at CVPR23: prompting to "drive straight" vs "change lane to right." Presumably this should also allow for predicting explanation language based on vision like what Wayve demoed at CVPR23: "I'm slowing down because there's a cyclist ahead."
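
As a purely hypothetical sketch of the idea (nothing like Tesla's or Wayve's actual models; the architecture and dimensions are invented), a language-conditioned world model boils down to predicting the next frame's latent from the current frame's latent plus an embedded prompt, so different prompts yield different predicted futures.

# Hand-wavy sketch of a language-conditioned world model: given the latent of
# the current video frame and an embedded text prompt ("drive straight" vs
# "change lane to right"), predict the latent of a future frame. Entirely
# illustrative; dimensions and architecture are invented for this example.
import torch
import torch.nn as nn

class LanguageConditionedWorldModel(nn.Module):
    def __init__(self, frame_dim: int = 256, vocab_size: int = 1000, text_dim: int = 64):
        super().__init__()
        self.text_embed = nn.EmbeddingBag(vocab_size, text_dim)  # crude bag-of-words prompt encoder
        self.dynamics = nn.Sequential(
            nn.Linear(frame_dim + text_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim),  # predicted latent of the next frame
        )

    def forward(self, frame_latent: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        prompt = self.text_embed(prompt_tokens)
        return self.dynamics(torch.cat([frame_latent, prompt], dim=1))

model = LanguageConditionedWorldModel()
frame_latent = torch.randn(1, 256)                # encoded current camera frame
prompt_tokens = torch.tensor([[12, 47, 3]])       # token ids for e.g. "change lane to right"
next_latent = model(frame_latent, prompt_tokens)  # different prompts -> different futures
print(next_latent.shape)  # torch.Size([1, 256])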
 
I wonder if a big push for V12's end-to-end / world model was to avoid ending up with a similar 300k+ lines of control code complexity for Optimus?
I wonder if they decided to use neural networks for the robot from the very start and, after some time with it, figured that they could apply it equally well to the autonomy stuff.

I loved the fluid movement in that demo.
 
I wonder if they decided to use neural networks for the robot from the very start and, after some time with it, figured that they could apply it equally well to the autonomy stuff
I believe Tesla first shared this video, including "End-to-end manipulation; Images -> Joint angles", back in May, so they indeed could have had it working on Optimus and then applied it to driving. This was recently re-shared as part of Phil Duan's CVPR23 talk when introducing Vision Foundation Models for Autonomous Driving:

[Image: end-to-end optimus.jpg]
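
For a sense of what "Images -> Joint angles" means at the code level, here is a minimal, purely illustrative sketch; it is not Tesla's Optimus network, and the joint count and layer sizes are arbitrary.

# Illustrative-only sketch of "images -> joint angles": a small CNN maps a
# camera frame directly to a vector of joint angle targets.
import torch
import torch.nn as nn

class ImageToJointAngles(nn.Module):
    def __init__(self, num_joints: int = 28):  # joint count chosen arbitrarily
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_joints)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image))  # one angle target per joint

model = ImageToJointAngles()
angles = model(torch.randn(1, 3, 224, 224))
print(angles.shape)  # torch.Size([1, 28])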
 
I wonder if they decided to use neural networks for the robot from the very start and, after some time with it, figured that they could apply it equally well to the autonomy stuff.

I don't think that is the right way to look at it. I think Tesla decided to use NN for both robots and AVs simply because it is the most promising approach to solving the problem. As I stated before, heuristics are too cumbersome and inflexible. Machine learning is the only way to solve AI. That is why everyone is moving more and more to ML to solve AI problems. In the case of AVs, some might do more modular, while others like Tesla are doing e2e, but everyone is using ML in part or all of their stack.
 
 

Yeah, maybe Optimus started the exploration of e2e. I am just saying that it makes sense for Tesla to use e2e for both Optimus and FSD. And keep in mind that robotics and AVs are similar fields. They are actually both robots, just specialized to different tasks. AVs are essentially robots in vehicle form specialized to driving whereas Optimus is a humanoid robot specialized to humanoid tasks. So it makes perfect sense for the Optimus team and the FSD team to collaborate.
 
The OpenPilot guys drove to Taco Bell using "end-to-end" last year, disengagement free, so of course it works; it just needs a lot more work before it's "good". They still have some hardcoded physics, and they still have classic controls instead of RL, but it's mostly end-to-end. They even have end-to-end navigation, where the model literally takes in a picture of a map with the navigation path drawn on it and outputs a vector that goes into the driving policy. It's not clear to me whether Tesla is doing end-to-end for navigation in V12 or not.
I remember that video, and it was nothing to boast about. Very crude: light traffic at the time, but it seemed to be in the way of traffic, even semis pulled away from it, sluggish starts at stop lights, difficulty staying within the lanes, especially on turns... The driver should have intervened.
 
Here's one obvious point we need to clarify:

- does V12 reduce both control and planning heuristics, or just control heuristics?

My thought is that both planning and control heuristics were reduced. This is based on the many comments during the livestream about how long to wait at stop signs and roundabouts, etc. These are decisions related to planning. Let me know if I'm incorrect.
 
Here's one obvious point we need to clarify:

- does V12 reduce both control and planning heuristics, or just control heuristics?

My thought is that both planning and control heuristics were reduced. This is based on the many comments during the livestream about how long to wait at stop signs and roundabouts, etc. These are decisions related to planning. Let me know if I'm incorrect.
I’m not sure there is any evidence of any remaining written code that is substantial or needs/will be removed in the future.
 