I think Elon mentioned that the lines of code being replaced were for planning and control, so it would only be a rewrite of the planning/control component, not the entire stack. However, Elon also specified that V12 would be end-to-end, i.e. vision in, control out, which would imply a total rewrite. Although it is possible that Elon was simply referring to the individual components (perception, planning and control) being NN based and not actual end-to-end where there is only one NN for the entire stack. I suspect that is what he meant: I think Tesla is keeping separate perception, planning and control components, just each component will be all NN now. I don't think Tesla will actually do true end-to-end where it is just one NN from vision to control. I feel like that would require too big of a rewrite.
Lol. I think Elon heard "end-to-end-AI" in some discussions with various tech bro rich dudes, and ordered his team to do it.

Yes, they will be training subnets for tasks like before, and there won't be any new single 'Brain Net' retrained from scratch. Some kinds of management want to hear "incremental improvements" when you turn the insides out because you need to, and other kinds of management want to hear "total rewrite" when you've refactored one portion and cleaned up the rest a bit.
 
empirical evidence that the tone of his words and their implications often varies widely from what later turns out to be the facts.

He's more careful about SpaceX, as I think he understands that better (though maybe less so with time), and there are more cautionary caveats from him about rockets.
With his comments regarding AlphaGo Zero (self play, fast progress), having a brain, not needing more engineers, his positive experience with V12, V12 getting tested now after still being in the oven a few days ago and talking about how the solution ended up being simple ... he's obviously excited by their progress and hopeful that they have the solution to FSD. If it was still a year away, I don't think Elon would be chirping about it the way he is. But, none of us really knows, we're only aggregating derivatives of the truth and trying to make good guesses.
 
Just making sure folks realise how little human-written code is in this build. It is literally not written as traditional lines of code. No debugging, etc.

No debugging!? That's gonna save so much time /s

I get your point here... but are you suggesting the term "bug" shouldn't be used for systems that are sufficiently AI-based? It will be hard to change how people talk about bugs. And more importantly, an AI-based system can still have problems that require engineering investigation and remediation (aka "bug fixing", traditionally). The tech stack, tooling, and techniques change, sure, but that doesn't necessarily translate to faster or easier development.
 
Although it is possible that Elon was simply referring to the individual components (perception, planning and control) being NN based and not actual end-to-end where there is only one NN for the entire stack
What makes individual components neural networks that start with video from one end and output controls on the other end "not actual end-to-end?"

If one neural network feeds its outputs to the next neural network, how is that different from one larger network with complex internal structure?
 
What makes individual components neural networks that start with video from one end and output controls on the other end "not actual end-to-end?"

If one neural network feeds its outputs to the next neural network, how is that different from one larger network with complex internal structure?

I think it is a question of architecture. If you have one big NN that takes in sensor input and outputs control, then it is E2E. If you split it up into separate NNs, it is not E2E. So yes, the one large NN will have complex internal structure, but it is still one NN, not separate components. One NN with complex structure is E2E; separate NNs are called modular, not E2E. At least, that is my understanding of the distinction between the two architectures.
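
To make that concrete, here's a minimal PyTorch sketch of the two architectures. Purely illustrative: the module names, layer sizes and the hand-specified 32-value interface are my own assumptions, not anything Tesla has disclosed.

```python
# Purely illustrative sketch -- toy module names and layer sizes, not Tesla's architecture.
import torch
import torch.nn as nn

# Modular: separate nets with a hand-defined interface (a fixed "scene description") between them.
class PerceptionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                                 nn.Linear(256, 32))   # 32-value hand-specified interface

    def forward(self, frames):
        return self.net(frames)

class PlannerControlNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # steer, accel

    def forward(self, scene):
        return self.net(scene)

# E2E: one network, pixels in, controls out, no human-defined intermediate representation.
class EndToEndNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
                                 nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, frames):
        return self.net(frames)

frames = torch.randn(1, 3, 64, 64)                               # fake camera input
controls_modular = PlannerControlNet()(PerceptionNet()(frames))  # two nets, fixed interface
controls_e2e = EndToEndNet()(frames)                             # one net, pixels to controls
```

The inference chain looks similar either way; the difference shows up in how the pieces are trained and what the interface between them is allowed to be.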
 
I think it is a question of architecture. If you have one big NN that takes in sensor input and outputs control, then it is E2E. If you split it up into separate NNs, it is not E2E. So yes, the one large NN will have complex internal structure, but it is still one NN, not separate components. One NN with complex structure is E2E; separate NNs are called modular, not E2E. At least, that is my understanding of the distinction between the two architectures.

I agree that's the generally understood definition of end-to-end, but the current practice for constructing large ML models like LLMs actually consists of a pipeline of multiple individual neural network modules trained and used in tandem.

For example, check out Andrej Karpathy's simplified implementation of a block of self attention for recreating GPT2: nanoGPT/model.py at master · karpathy/nanoGPT

A single forward pass involves passing the tensors first through the CausalSelfAttention module, and then through an MLP module. Each is a neural network in its own right, but no one would argue that this implementation of GPT2 isn't end-to-end.

Maybe the real distinction between a system that's end-to-end and one that's not is whether the modules are trained and used for inference as one, or whether they are trained separately.
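
Roughly what that composition looks like, as a heavily simplified sketch in the spirit of nanoGPT's Block (not Karpathy's actual code; a stock nn.MultiheadAttention stands in for CausalSelfAttention and the causal mask is omitted):

```python
# Simplified transformer block: two sub-modules, one forward pass, one trainable model.
# Not the actual nanoGPT code; nn.MultiheadAttention stands in for CausalSelfAttention.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd=64, n_head=4):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
                                 nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        h = self.ln_1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # sub-module 1: self-attention
        x = x + self.mlp(self.ln_2(x))                      # sub-module 2: MLP
        return x

x = torch.randn(2, 16, 64)   # (batch, tokens, embedding)
y = Block()(x)               # both sub-modules run in one pass and share one autograd graph
```

Whether you call self.attn and self.mlp "separate networks" or "internal structure" is mostly semantics; what matters is that one loss trains all of it at once.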
 
do you think that warrants some sort of HW3+ retrofit?
There will be no retrofits! Why do people think this will happen? There will undoubtedly be lawsuits about retrofits, but that’s a different thing.

hopeful that they have the solution to FSD.
Anyone who thinks this (and Elon clearly does not think this) is…well…overly optimistic…to be charitable.
If it was still a year away, I don't think Elon would be chirping about it the way he is.
😂
 
With his comments regarding AlphaGo Zero (self play, fast progress), having a brain, not needing more engineers, his positive experience with V12, V12 getting tested now after still being in the oven a few days ago and talking about how the solution ended up being simple ... he's obviously excited by their progress and hopeful that they have the solution to FSD. If it was still a year away, I don't think Elon would be chirping about it the way he is. But, none of us really knows, we're only aggregating derivatives of the truth and trying to make good guesses.
Related to FSD, there's not much truth that comes from Elon's mouth. He is the master of over promising, under delivering, and never looking back.
 
I agree that's the generally understood definition of end-to-end, but the current practice for constructing large ML models like LLMs actually consists of a pipeline of multiple individual neural network modules trained and used in tandem.

For example, check out Andrej Karpathy's simplified implementation of a block of self attention for recreating GPT2: nanoGPT/model.py at master · karpathy/nanoGPT

A single forward pass involves passing the tensors first through the CausalSelfAttention module, and then through an MLP module. Each is a neural network in its own right, but no one would argue that this implementation of GPT2 isn't end-to-end.

Maybe the real distinction between a system that's end-to-end and one that's not is whether the modules are trained and used for inference as one, or whether they are trained separately.

The base foundation net really is true end-to-end, as it's simply token prediction with a single loss function for the entire training run. Now people are adding on additional fine-tunings and other modifications which have different criteria, loss functions and requirements.
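
A sketch of that single-loss point, with a toy two-layer stand-in for the foundation net (the vocabulary size, shapes and model are made up, just to show the shape of next-token pretraining):

```python
# Next-token prediction with one cross-entropy loss over the whole model.
# Toy stand-in model and sizes, not any particular foundation net.
import torch
import torch.nn as nn

vocab_size = 100
model = nn.Sequential(nn.Embedding(vocab_size, 32),   # token ids -> embeddings
                      nn.Linear(32, vocab_size))      # embeddings -> next-token logits

tokens = torch.randint(0, vocab_size, (4, 17))        # a batch of token sequences
logits = model(tokens[:, :-1])                        # predict a next token at every position
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))
loss.backward()   # the single loss reaches every trainable parameter in the net
```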
 
What makes individual components neural networks that start with video from one end and output controls on the other end "not actual end-to-end?"

If one neural network feeds its outputs to the next neural network, how is that different from one larger network with complex internal structure?

Good point. It can be end-to-end forward, but not backwards. The backwards phase only happens in the lab in development.

I think in this case Elon means it will be end-to-end in the forward phase: the data flow doesn't involve neural networks feeding into a hard-coded rules engine, then into more neural networks, then more rules engines, but will have ML-trained soft decisioning systems everywhere in common cases.

What ML scientists mean by 'end-to-end' is about the training phase, and that's a much more difficult problem: you find a loss function (optimization target) that you can backpropagate errors through to every single trainable parameter, and that solves the entire problem.
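
A toy illustration of that difference, with a hypothetical hand-coded planner sandwiched between two stand-in nets (names and shapes are made up):

```python
# Sketch of the forward-pass vs. training-phase distinction. Hypothetical pipeline,
# made-up shapes; "rules_engine" stands in for a hand-coded planner.
import torch
import torch.nn as nn

perception = nn.Linear(100, 2)   # stand-in perception net: features -> scene summary
control = nn.Linear(2, 2)        # stand-in control net: plan -> [steer, accel]

def rules_engine(scene):
    # Hand-coded planner built on thresholded, discrete decisions. Comparisons carry no
    # gradient, so a loss on the control output cannot be backpropagated through here
    # into the perception net -- the forward flow is end-to-end, the training is not.
    too_close = (scene[:, 0] > 0.0).float()
    target_speed = 1.0 - too_close            # stop if too close, else cruise
    return torch.stack([target_speed, too_close], dim=1)

features = torch.randn(8, 100)
actions = control(rules_engine(perception(features)))
actions.pow(2).mean().backward()
print(control.weight.grad is not None)   # True: gradients reach the control net
print(perception.weight.grad is None)    # True: the rules engine cuts the gradient path
```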
 
What ML scientists mean by 'end-to-end' is on the training phase---that's a much more difficult problem. Like you find a loss function (optimization target) that you you can backpropagate errors through to every single trainable parameters and that solves the entire problem.
Ideally ML scientists and everyone else should have the same definition? It makes meaningful discussion pretty hard otherwise...

Like you say, End to End (e2e) learning in the context of AI and ML is a technique where the model learns all the steps between the initial input phase and the final output result. This is a deep learning process where all of the different parts are simultaneously trained instead of sequentially. The whole system needs to be in the same back-prop loop for it to be end to end.

[Attached image: e2e.jpeg]



I seriously doubt anyone will deploy a single e2e network for perception, planning and driving to production in the coming three years, if ever. As with any software architecture, there are trade-offs between a modular architecture and a monolith.
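
A toy contrast between the same-back-prop-loop case and training the modules sequentially (module sizes, targets and the hand-labelled interface data are assumptions for illustration):

```python
# "Same back-prop loop" vs. stage-wise training. Made-up modules, targets and sizes.
import torch
import torch.nn as nn

def make_stack():
    return nn.Linear(10, 5), nn.Linear(5, 2)   # upstream module, downstream module

x, y = torch.randn(32, 10), torch.randn(32, 2)

# End-to-end: one loss, one optimizer step updates every parameter in both modules.
up, down = make_stack()
opt = torch.optim.SGD(list(up.parameters()) + list(down.parameters()), lr=0.1)
nn.functional.mse_loss(down(up(x)), y).backward()
opt.step()

# Modular / sequential: train the upstream module against its own target first,
# then freeze it and train the downstream module on its outputs.
up, down = make_stack()
intermediate_target = torch.randn(32, 5)               # hand-labelled interface data
opt_up = torch.optim.SGD(up.parameters(), lr=0.1)
nn.functional.mse_loss(up(x), intermediate_target).backward()
opt_up.step()

opt_down = torch.optim.SGD(down.parameters(), lr=0.1)
nn.functional.mse_loss(down(up(x).detach()), y).backward()   # upstream is frozen here
opt_down.step()
```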
 
I think it is a question of architecture. If you have one big NN that takes in sensor input and outputs control, then it is E2E. If you split it up into separate NNs, it is not E2E. So yes, the one large NN will have complex internal structure, but it is still one NN, not separate components. One NN with complex structure is E2E; separate NNs are called modular, not E2E. At least, that is my understanding of the distinction between the two architectures.
Armchair thoughts:

A modular NN can be one monolithic NN once completed (layers all directly stacked). The difference between the two cases is that the intermediate layers have a fixed meaning in the modular case, so training back-propagates only from that boundary layer to the previous one, and GIGO applies.
Success of a modular approach requires that the designers correctly determine what data is important to pass on and what can be discarded. (Making it difficult on themselves.)

Without the quantized intermediates, training occurs over the whole NN and validation of the intermediates becomes difficult. The internal structure itself may be more uniform due to the removal of the category filters. They may be able to get some clues using an fMRI-type heat map of the NN while stimulating it: which images propagate through? The training suite would need to monitor this, as it's beyond human scale. (Making it compute heavy.)

Modular: the NN recognizes the X types of road signs, recognizes lane lines in our Y test cases, follows rational paths in the Z scenarios.
E2E: the NN followed the road safely and legally in the Z scenarios, but what data it acted on is less clear.
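
For what it's worth, that kind of "fMRI heat map" inspection is straightforward to sketch with forward hooks (toy model, layer names and the crude per-layer statistic are purely illustrative):

```python
# Record activations with forward hooks while feeding the net a stimulus, then look at
# which layers "light up". Toy model, not anything resembling the real FSD stack.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU(),
                      nn.Linear(32, 2))

activations = {}

def record(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(record(name))

stimulus = torch.randn(1, 100)             # e.g. features from a frame with a stop sign
model(stimulus)
for name, act in activations.items():
    print(name, act.abs().mean().item())   # crude per-layer "heat" for this stimulus
```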
 
I can't see an ideal end-to-end single-NN solution for FSD, since the front end (perception) is more of a high-bandwidth synchronous process while the back end (vehicle control) is more of a low-bandwidth asynchronous process. The front end will occupy more of the time in the NN hardware core space - it could be a hundreds-to-one ratio.
 
I can't see an ideal end-to-end single-NN solution for FSD, since the front end (perception) is more of a high-bandwidth synchronous process while the back end (vehicle control) is more of a low-bandwidth asynchronous process. The front end will occupy more of the time in the NN hardware core space - it could be a hundreds-to-one ratio.
I don't follow. Assume stability control and such live outside the NN*. The time to go from photons to steering and thrust vector is 1 camera frame (well, 1 frame worth of processing time). The output (basically two values) is less data than the input (8 camera streams), but it doesn't matter if 90% of the processing time (NN size) is primarily visual and the other 10% is path planning or some other split.

*The drive unit control code is running a faster loop than the camera framerate (IIRC).
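
A tiny sketch of that two-rate structure (the rates and the simulated loop are assumptions for illustration, not Tesla's actual numbers):

```python
# The NN output updates once per camera frame; a faster control loop keeps acting on the
# latest plan in between. Rates and the simulation are assumptions, not Tesla's numbers.
CAMERA_HZ, CONTROL_HZ = 36, 100

def run_nn_on_frame(frame_index):
    # Stand-in for the vision/planning network: returns (steer, accel) for this frame.
    return (0.0, 1.0)

latest_plan = (0.0, 0.0)
last_frame = -1
for tick in range(CONTROL_HZ):                 # simulate one second of control ticks
    frame = tick * CAMERA_HZ // CONTROL_HZ     # newest camera frame available at this tick
    if frame != last_frame:
        latest_plan = run_nn_on_frame(frame)   # ~36 NN runs per second
        last_frame = frame
    steer, accel = latest_plan                 # ~100 control updates per second
```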
 
I don't follow. Assume stability control and such live outside the NN*. The time to go from photons to steering and thrust vector is 1 camera frame (well, 1 frame worth of processing time). The output (basically two values) is less data than the input (8 camera streams), but it doesn't matter if 90% of the processing time (NN size) is primarily visual and the other 10% is path planning or some other split.

*The drive unit control code is running a faster loop than the camera framerate (IIRC).
It can be implemented like that, but there's no benefit. All that random indecisiveness (noise) is uncomfortable and less confidence-inspiring for passengers and roadway occupants. Imagine a pedestrian in the crosswalk as all that noise drives steering, braking, and acceleration.