All humans have their good side and their bad side. With Elon, though, this is vastly amplified in both directions (so yes, he has done and still does super GREAT things). AI is an example of his "bad" side showing. Elon is envious that he is not the center of AI development, so he disingenuously rails against it (even signing the letter calling for a six-month moratorium on development) only because he is not credited for AI development. This tweet is just an example of him trying to gain credit for AI development, since there is no exact definition of where Tesla's AI software ends and its non-AI software begins.
 
So yet another architectural rewrite? Will Tesla start from scratch or try to just merge the different NNs into one big one?

I am not surprised that Elon wants to do end-to-end AI. It is very consistent with his vision and AI-centric approach. I am sure he sees it as the ideal end goal for autonomous driving.

Wayve is already doing end-to-end AI for their self-driving. But all they have are some nice demos, nothing they can actually deploy to the public yet. Experts like Waymo's Anguelov say that the industry is moving in the direction of end-to-end AI, but that doing reliable autonomous driving everywhere with pure end-to-end AI is probably still far off. End-to-end AI is difficult to troubleshoot, difficult to train, and likely requires new AI techniques. End-to-end AI will likely happen at some point, but I think we can expect it to take a while. I think we should expect delays and plenty of missed Elon deadlines before V12 actually makes it to general release. It is also likely that V12 will not be pure end-to-end AI if Tesla cannot solve it in time. So we will see progress towards end-to-end AI, but not the "finished" stack yet.
 
Does diffusion come first or combined with end to end?
 
I am not sure what you mean. As I see it, Tesla has two options:

1) Start from scratch. Basically, train a brand-new end-to-end NN from zero and, when it gets good enough, replace the old stack with the new end-to-end stack. The downside of this approach is that we could see a major regression in features until the new end-to-end stack catches up with the features of the old stack.

2) Try to consolidate the existing NNs until they become just one big end-to-end NN. With this approach, they might combine NNs and reduce the number of NNs over time. So they might replace parts of the stack with end-to-end NNs piece by piece, for example replacing the traffic light and stop sign control with end-to-end AI, and keep doing that until the entire stack is end-to-end. This approach would likely prevent major regressions since it is more gradual. It might be difficult to do, though, since you would need to manage the pieces of the stack that are end-to-end alongside those that are not end-to-end yet. And I don't know if there could be unintended consequences from one part of the stack affecting another.
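
Just to make option 2 concrete, here is a minimal sketch (PyTorch-style Python with made-up names like HeuristicPlanner and PlannerNN, nothing Tesla has published) of a pipeline where one hand-coded piece at a time is swapped for a small NN behind the same interface:

import torch
import torch.nn as nn

# Hypothetical sizes, purely for illustration.
FEATURES, CONTROLS = 256, 2  # 2 controls: steering, accel/brake

class HeuristicPlanner:
    """Stand-in for hand-written C++ planning logic."""
    def __call__(self, features: torch.Tensor) -> torch.Tensor:
        return torch.zeros(features.shape[0], CONTROLS)  # rule-based decisions would go here

class PlannerNN(nn.Module):
    """A learned replacement for one piece of the stack."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEATURES, 128), nn.ReLU(), nn.Linear(128, CONTROLS))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

class DrivingStack(nn.Module):
    """Pipeline where modules are replaced with NNs one at a time."""
    def __init__(self, use_nn_planner: bool = False):
        super().__init__()
        self.perception = nn.Sequential(nn.Flatten(), nn.LazyLinear(FEATURES), nn.ReLU())  # already a NN today
        self.planner = PlannerNN() if use_nn_planner else HeuristicPlanner()

    def forward(self, camera_frames: torch.Tensor) -> torch.Tensor:
        features = self.perception(camera_frames)
        return self.planner(features)

# Same interface before and after the swap, so the rest of the stack keeps working
# while pieces are migrated one by one.
stack = DrivingStack(use_nn_planner=True)
controls = stack(torch.randn(1, 3, 64, 64))  # fake camera input
print(controls.shape)  # torch.Size([1, 2])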
 
Tesla are also likely swapping over to diffusion in place of transformers:
V12? Vision running on diffusion
 
I don't think this means they have singular models that are trained end-to-end; it just means that they're replacing some control/planning C++ code and heuristics (say, for speed control) with ML. So in V12 there are cases where you have ML models all the way down from pixel to steering/accelerator. But they will still have this human-engineered "vector space" layer between perception and planning, i.e. the objects and lane markings and road edges and so forth that you see on the screen.
 
I would think latency concerns would require multiple smaller NNs, with some of them running more frequently than others. The path to that could easily be replacing current C++ code with new NNs one-for-one. It is easy to train each of them individually given their limited functionality. After that, NNs can be combined if that results in more efficient or better performance.

I would not expect a single NN to replace everything right now.
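
Purely to illustrate the multi-rate idea (hypothetical module names and rates, not anything Tesla has described), smaller NNs can hang off a simple tick scheduler so that perception runs every camera frame while heavier modules run less often:

import random

def run_perception(frame):           # stand-in for a fast NN, runs every tick
    return {"objects": random.randint(0, 5)}

def run_planner(world_state):        # stand-in for a heavier NN, runs every 2nd tick
    return {"target_speed": 20.0 + world_state["objects"]}

def run_route_refresh(world_state):  # stand-in for a slow module, runs every 10th tick
    return {"route_ok": True}

def drive_loop(num_ticks: int = 10):
    world, plan = {"objects": 0}, {"target_speed": 0.0}
    for tick in range(num_ticks):
        world = run_perception(frame=None)   # full rate
        if tick % 2 == 0:
            plan = run_planner(world)        # half rate
        if tick % 10 == 0:
            run_route_refresh(world)         # one-tenth rate
        print(tick, world, plan)

drive_loop()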
 
I agree with this caution about the term "end-to-end". I certainly don't claim to know more about machine learning than Elon or any of his team, but my impression has been that the term is used to mean a system that is trained with a unified model for the complete task. This has numerous implications for performance and is arguably desirable in some respects, but it challenges the ability to understand, optimize and troubleshoot the inner workings of the system during development.

I think there's a need for intermediate outputs to achieve the visualization as you mentioned, but also more broadly to have a manageable project in development.

From my understanding, even if the engineer-understandable outputs along the pipeline are provided only for monitoring, their very presence implies some constraints on the architecture, and this would not be considered a true end-to-end approach in the proper application of that term.

Maybe someone with more background in ML can clarify whether Elon is using that term properly, or just as a catch-phrase to mean that every element along the pipeline is individually implemented with an ML/NN approach. It would be unfortunate if Elon is polluting the use of the end-to-end descriptor to mean something it doesn't - but like I said, I don't really know what others in the field would think.
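
To make that distinction concrete, here is a minimal PyTorch-style sketch (a generic toy model, not Tesla's architecture): the same two-stage network can be trained "modularly", with a supervised loss on an engineer-readable intermediate output, or purely "end-to-end", with only the final control loss. Keeping that intermediate output supervised and inspectable is exactly the kind of architectural constraint mentioned above.

import torch
import torch.nn as nn

class TinyDriver(nn.Module):
    def __init__(self):
        super().__init__()
        self.perception = nn.Linear(32, 16)  # pixels -> intermediate "vector space"
        self.planner = nn.Linear(16, 2)      # intermediate -> steering, accel

    def forward(self, pixels):
        intermediate = torch.relu(self.perception(pixels))
        return intermediate, self.planner(intermediate)

model = TinyDriver()
pixels = torch.randn(8, 32)          # fake batch of inputs
object_labels = torch.randn(8, 16)   # fake labels for the intermediate output
control_labels = torch.randn(8, 2)   # fake labels for the final controls

intermediate, controls = model(pixels)

# Modular training: the intermediate output is supervised, so it stays interpretable.
loss_modular = (nn.functional.mse_loss(intermediate, object_labels)
                + nn.functional.mse_loss(controls, control_labels))

# Pure end-to-end training: only the final loss; the intermediate representation is
# free to become whatever helps driving, and may no longer be human-readable.
loss_e2e = nn.functional.mse_loss(controls, control_labels)
loss_e2e.backward()  # gradients flow through planner and perception together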
 
Here is Elon's comment again:

Arguably, v11.4 should be v12.0, as there are so many major improvements. v12 is reserved for when FSD is end-to-end AI, from images in to steering, brakes & acceleration out.


However, Yann LeCun, Chief AI Scientist at Meta, has said the following:

As I have said numerous times over the last few months, hallucinations are an inevitable property of auto-regressive LLMs. That's not a major problem if you use them as writing aids or for entertainment purposes. Making them factual and controllable will require a major redesign.


The comment from Yann LeCun refers to models trained on text, but I do wonder: would hallucinations in the AI that Tesla would use be completely preventable? If not, could you even solve FSD with end-to-end AI?
 
Generally the vision systems aren't being used in generative mode (with hallucinations) during real time use.

LeCun is arguing at the level of advanced machine learning research (which is his job) regarding the practice and architectures of LLMs and similar models. Right now the LLMs ('autoregressive') predict p(word_n+1 | word_n, word_n-1, word_n-2, ...) (on 'tokens' technically, but words are good enough for here), which is optimized in the training phase. Then in the scoring/inference phase, i.e. when ChatGPT is running, it's evaluating these conditional distributions and then sampling from them or a function of them. If you ever see the word 'temperature' applied in these contexts, or see the slider/choice between 'precise' and 'creative' in LLM generation, that refers to the modification of the multinomial probability model (usually softmax) from the base model when it is used for forward simulation. A cold temperature ('precise') means it will almost always choose the most probable token; a hot temperature ('creative') will sample more from less probable tokens.
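
As a rough illustration of the temperature idea (a generic sketch, not any vendor's actual implementation), temperature simply rescales the logits before the softmax distribution that the next token is sampled from:

import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng=None) -> int:
    """Sample a token index from temperature-scaled softmax probabilities."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature        # cold T -> sharper, hot T -> flatter
    scaled -= scaled.max()               # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.1])       # made-up scores for 3 candidate tokens
print(sample_next_token(logits, temperature=0.1))  # 'precise': almost always token 0
print(sample_next_token(logits, temperature=2.0))  # 'creative': other tokens appear often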

LeCun is talking about more sophisticated models where the planning and the prediction of the future are also governed by other types of neural networks with different architectures, going beyond naive forward token-by-token simulation and probabilistic generation.

To some degree, driving modules are already one of those models, as they are trying to optimize some functional of where the future plan is supposed to go.

By 'end-to-end' ML systems, Elon almost certainly means the lesser version of that: taking out hard-coded logic for normal driving policy (except likely as outlier 'boundary' enforcement), so that there is an ML model for perception and for policy, and for implementing the policy as vehicle inputs.

True 'end-to-end' ML is something more difficult and very likely impractical. It would mean doing global gradient-descent learning on a single master loss function ("did you drive the best way") and using that to learn all subtasks, from perception to policy. That's closer to DeepMind's AlphaZero (not AlphaGo), which bootstraps itself into doing both policy and perception for its closed-world game tasks.

That would be impractical: all of the vision labelling and training would be rendered useless, and you'd have to get it indirectly from policy-error backpropagation.

That's not going to work: even humans have strong unsupervised experience on general vision and motion tasks outside driving (i.e. living from birth), and probably quite a bit is built in, evolutionarily created or constrained.

However, getting at least some sort of less ambitious system, with all subcomponents ML-trained, might allow an end-to-end fine-tuning pass that modifies some perception and intermediate policy representations to perform better on the final policy outcomes.

After all, biological evolution has tuned the internal representations of animals to help their 'policy' (eat, live and propagate) more successfully.
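
A rough sketch of that less ambitious version (hypothetical modules, assuming each piece was already trained separately on its own labels): load the pretrained perception and policy networks, then fine-tune them jointly on the final driving loss with a small learning rate, so their internal representations get nudged toward the end objective rather than overwritten.

import torch
import torch.nn as nn

# Stand-ins for subcomponents that were each pretrained on their own labels.
perception = nn.Sequential(nn.Linear(32, 16), nn.ReLU())  # pretrained weights assumed
policy = nn.Linear(16, 2)                                 # pretrained weights assumed

# One optimizer over both modules; small learning rate for gentle fine-tuning.
optimizer = torch.optim.Adam(
    list(perception.parameters()) + list(policy.parameters()), lr=1e-5
)

for step in range(3):                    # toy loop with fake data
    pixels = torch.randn(8, 32)
    good_controls = torch.randn(8, 2)    # e.g. what a skilled human driver did
    controls = policy(perception(pixels))
    final_loss = nn.functional.mse_loss(controls, good_controls)

    optimizer.zero_grad()
    final_loss.backward()                # gradients reach perception via the policy
    optimizer.step()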
 
The "end to end world model" means it's not really "end-to-end" which would be training from sensor inputs into motor and steering control signals.

It's an end-to-end perception model. This will be the 3rd major perception model (1st was frames and CNNs, 2nd is the combined video and occupancy networks now).

There will undoubtedly be some performance degradations at first, as the various policy networks and logic have been tuned to work on the prior perception models. Long term it might have higher limits and be better, but that's the inevitable consequence of major changes.

If humans had invasive brain surgery they'd likely need some new training to get back to old performance even if this surgery was to enhance their capabilities.
 
I think Elon means end2end vision:
y = f(x), where y is the output labels (such as lane lines, occupancy, etc.), x is the input video sequences, and f is the function that you are trying to estimate. There are no intermediate steps during training that you have to worry about, such as IMU preprocessing of the stack of previous images, etc.

Then you have the case of end2end neural networks for control:
y = f(x), where x is the video and y is the reference steering angle and velocity. That's in the future.
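
A tiny sketch of those two function shapes (hypothetical type names, just to make the distinction concrete):

from typing import NamedTuple, Sequence

Frame = bytes  # placeholder for one camera image

class VisionLabels(NamedTuple):
    lane_lines: list   # e.g. polylines in vehicle coordinates
    occupancy: list    # e.g. an occupied-space grid

class Controls(NamedTuple):
    steering_angle: float  # radians
    velocity: float        # m/s

# "end2end vision": video sequence in, perception labels out.
def f_vision(video: Sequence[Frame]) -> VisionLabels:
    raise NotImplementedError  # stands in for a trained network

# "end2end control": video sequence in, driving commands out (the future case).
def f_control(video: Sequence[Frame]) -> Controls:
    raise NotImplementedError  # stands in for a trained network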

Hallucinations with an LLM happen when you ask it a question that is outside of its dataset and it does its best at answering anyway. ChatGPT is imo already not doing a terrible job at avoiding this, which can be frustrating when it will not even answer my questions badly, like I want it to. With Tesla FSD the dataset is massive, and it's also more of an iterative process of building the dataset than with GPT-4, which is mostly just a huge dataset of everything they had available.
 