Everything that James Douma said resonated with me. I have every expectation that they've replaced the control module with a neural network that has been trained separately. Andrej Karpathy said that a monolithic neural network would suffer from loss of signal during training. That's why you start by training the individual chunks. Once you've got them where you want them, you can consider allowing the borders between those chunks to shift as additional training dictates.

Which is what a V11 system with a neural control module would be. Neural networks all the way.
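To make that concrete, here's a minimal PyTorch-style sketch of the "train the chunks, then let the borders shift" recipe (all modules and sizes are hypothetical toys, not Tesla's actual architecture):

import torch
import torch.nn as nn

# Hypothetical stand-ins for the separately trained "chunks" (toy sizes).
perception = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(64),
)
control = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))  # e.g. steer, accel

_ = perception(torch.randn(1, 3, 64, 64))  # materialize the lazy layer

# Stage 1: train each chunk on its own objective; freeze perception
# while the control chunk learns against its outputs.
for p in perception.parameters():
    p.requires_grad_(False)
# ... train control here ...

# Stage 2: unfreeze everything and fine-tune jointly, letting the
# "border" between the chunks shift as gradients flow end to end.
for p in perception.parameters():
    p.requires_grad_(True)
model = nn.Sequential(perception, control)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)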

I can't see them duplicating the V11 visualization without relying on the V11 software. When they have a monolithic neural network solution, I wonder if they'll even bother with a visualization. I've always thought that the visualization stuff was just an engineering diagnostic that they turned into a feature (which only serves to distract the driver). In other words, the design of the software happened to have that data lying around, so they created a visualization. In a monolithic system, that data won't naturally come into being. If they want to keep the visualization they're going to have to train the system to provide it.

I'm very confident that V12 is not what Douma thinks it is in that video.

I'm not saying V12 is a monolithic neural network, but it is vastly different than V11 in that it gets rid of human semantics and heuristics in all parts of the architecture.

The goal of V12 is to make it like ChatGPT. Elon has made this clear many times.

ChatGPT is not explicitly taught what a definition is, what a noun, subject, verb, grammar structure, etc. etc. It makes internal representations of these human concepts based on the vast data it's given.

Likewise, V12 makes no use of human concepts like lanes, stop signs, vehicles, birds eye view, pedestrian, cyclist, etc.

V11 was totally dependent on autolabeled and manually-labeled human concepts like this.

You can't feed V12 "dirty" ideas like the V11 BEV and autolabeled world representation and expect a good output. e2e thrives on pure, raw data and you need to massage the architecture and data to get the outputs you desire.
 
When they have a monolithic neural network solution, I wonder if they'll even bother with a visualization. I've always thought that the visualization stuff was just an engineering diagnostic that they turned into a feature (which only serves to distract the driver).

If we want to take the "elevator automation" example, the visualization is the equivalent of the voice saying "Floor 5." Gives people who were used to an elevator operator something calming to focus on.
 
If we want to take the "elevator automation" example, the visualization is the equivalent of the voice saying "Floor 5." Gives people who were used to an elevator operator something calming to focus on.
Perhaps the equivalent in spirit, but not at all in practice, which is my point. I would welcome announcements of things that the software is initiating (such as lane changes), but the visualization itself is a distraction, a flaw. I'm vaguely surprised that NHTSA allows it. Its only real purpose is as a diagnostic for the developers, and we aren't the developers. So to us it is something to look at when we should be looking at the world around us.

Speaking of diagnostics, I wonder how you diagnose the behavior of a monolithic neural network such as a Large Language Model. Can an autonomy system ever go completely opaque to its developers, or will the network always have to kick out diagnostic information above and beyond that required for operating the vehicle? If the system continues to deliver diagnostic information, then that suggests that the neural network would be influenced by that need. It would be like the language of the system, with the need to maintain and respect the concepts of those diagnostic elements.

For example, if the developers want to know where the cars are that are nearby, then the neural network would be trained to conjure up that information. That implies that the neural network would leverage that knowledge directly in order to control the car, as opposed to just going with some instinctive notion of whether to turn here or there.
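In ML terms, that would just be an auxiliary output head trained alongside the control head. A minimal sketch of the idea (all names and shapes are hypothetical):

import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.LazyLinear(feat_dim), nn.ReLU())
        self.control_head = nn.Linear(feat_dim, 2)          # e.g. steer, accel
        # Auxiliary "diagnostic" head: e.g. x, y offsets of up to 8 nearby cars.
        self.nearby_cars_head = nn.Linear(feat_dim, 8 * 2)

    def forward(self, x):
        feats = self.backbone(x)
        # Both heads share features, so training the diagnostic head
        # shapes the very representation the control head relies on.
        return self.control_head(feats), self.nearby_cars_head(feats)

controls, nearby_cars = DrivingNet()(torch.randn(1, 512))

Because the heads share a backbone, the diagnostic requirement does exactly what I described: it pulls the internal representation toward those human-legible concepts.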
 
I'm very confident that V12 is not what Douma thinks it is in that video.

I'm not saying V12 is a monolithic neural network, but it is vastly different than V11 in that it gets rid of human semantics and heuristics in all parts of the architecture.

The goal of V12 is to make it like ChatGPT. Elon has made this clear many times.

ChatGPT is not explicitly taught what a definition is, what a noun, subject, verb, grammar structure, etc. etc. It makes internal representations of these human concepts based on the vast data it's given.

Likewise, V12 makes no use of human concepts like lanes, stop signs, vehicles, birds eye view, pedestrian, cyclist, etc.

V11 was totally dependent on autolabeled and manually-labeled human concepts like this.

You can't feed V12 "dirty" ideas like the V11 BEV and autolabeled world representation and expect a good output. e2e thrives on pure, raw data and you need to massage the architecture and data to get the outputs you desire.
I am trying to draw a parallel between the two sets of arguments here, between what powertoold says and what JB47394 says.

Learning to speak or write English sentences can be done in one of two ways:

- by simply listening to others speak as you grow up as a child, by the very fact that you live in an environment where everyone speaks English. In that environment a kid learns to speak fluent English without needing to know any concepts of what a noun, verb, or adjective is. You can have zero education in formal grammar but still speak perfectly grammatical sentences, all learnt over many years simply by listening to adults.

- The other method is to learn the proper grammar and then learn to form sentences using that grammar. Typically this happens in countries in Asia and Africa where English is a second language. As you can empirically see, this method produces speakers who are less fluent than the former.
 
V12 is an entirely different approach. That's why we have that new excerpt from the biography.

If V12 were simply a neural planner on top of V11's labeled perception, there wouldn't have been such a stir on the AP team.

The old paradigm was basically to work on perception, planning, and control separately, as human concepts. V12 is not this.


“Amazing work, guys,” Musk said at the end. “This is really impressive.” They all then went to the weekly meeting of the Autopilot team, where 20 guys, almost all in black T-shirts, sat around a conference table to hear the verdict. Many had not believed that the neural network project would work. Musk declared that he was now a believer and they should move their resources to push it forward.
 
I don't think the new world model evolved from the occupancy network
At both AI Day 2022 and CVPR 2023, these ideas of foundation/world models for computer vision were brought up in the context of explaining occupancy networks. I agree that the existing V11 architecture supporting occupancy networks wasn't sufficient to be the world model, but it's probably closer in architecture than, say, the large language models behind ChatGPT.

One potential change to the existing occupancy networks architecture is incorporating diffusion techniques, and perhaps not coincidentally, Musk mentioned diffusion around the time of the April 2023 end-to-end internal demo. Additionally, Elluswamy at CVPR briefly mentioned diffusion before showing videos of the world model predicting the future.

My guess is diffusion helps train the world model in a self-supervised fashion, similar to large language models predicting the next word based on previous words; here it predicts the next frame based on previous frames. This part of training doesn't need labels for objects, lanes, etc., as the internal structure of the model learns to be consistent across all cameras. Then, starting from this pre-trained model with world understanding, the "next frame" head can be replaced with heads actually useful for self-driving, fine-tuning on other tasks such as end-to-end control.
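Setting aside the diffusion piece, a minimal sketch of that "predict the next frame, then swap the head" recipe might look like this (purely illustrative; toy sizes and a toy recurrent encoder, not Tesla's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

frames = torch.randn(4, 8, 3 * 64 * 64)  # batch, time, flattened pixels (toy sizes)

encoder = nn.GRU(input_size=3 * 64 * 64, hidden_size=512, batch_first=True)
next_frame_head = nn.Linear(512, 3 * 64 * 64)  # self-supervised pretraining head

# Stage 1: self-supervised pretraining -- predict frame t+1 from frames <= t.
hidden, _ = encoder(frames[:, :-1])
loss = F.mse_loss(next_frame_head(hidden), frames[:, 1:])  # no labels needed

# Stage 2: swap the head for one actually useful for driving, then fine-tune.
control_head = nn.Linear(512, 2)        # e.g. steering, acceleration
controls = control_head(hidden[:, -1])  # act on the latest world state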
 
Looking forward to hearing more about v12, but I suspect v12 will have many of the same issues as v11.x and then some. It's not confidence-inspiring to hear all the mind-blowing praise while going this far down the path, and then suddenly switch to another direction.

Something tells me the light at the end of the v11.x tunnel required a desperate move to buy more development time.
 
One potential change to the existing occupancy networks architecture is incorporating diffusion techniques, and perhaps not coincidentally, Musk mentioned diffusion around the time of the April 2023 end-to-end internal demo. Additionally, Elluswamy at CVPR briefly mentioned diffusion before showing videos of the world model predicting the future.

It's difficult to parse what Elon is alluding to in his posts because he frequently jumps between V12 and V11 concepts.

If we are to take everything Elon said in the X spaces with Omar and also during the livestream, we get a clear picture of V12's approach.

I've said this before, I have no idea how V12 works because to me, it seems like it shouldn't work well (like many on the core AP team).

That said, there are logical concepts at hand when thinking about V12:

1) Elon said that V12 has no human-like concept of a stop sign/light/pedestrian/etc. All these concepts pertain to perception, so what he's saying is: throw V11's perception objects out the window. No more "VRUs", "lane network", all the stuff we saw in V11 release notes.

2) Then he talks about V12's ability to plan and navigate simply based on a GPS point, without any maps or anything. Well, that means we need to throw all of V11's language of lanes, coarse lane maps, etc. out the window as well. V11's planner is no longer relevant to V12.

3) V12 takes in video + metadata (including the navigation destination point and maps guidance data) from good drivers, and creates the whole perception, planning, and control outputs in "one" go. Perception, planning, and control are no longer separate concepts: just one raw stream of video in, and all the decisions out (see the sketch at the end of this post).

4) The occupancy network is the same way. It's a flawed human concept. The ON was made to create volumes in space where the concept was that geometry is more important than ontology for arbitrary road obstacles. However, ON isn't working out well because there are many instances where opaque pixels don't necessarily mean an obstacle to avoid (Ashok brought up smoke as an example).

5) The solution to arbitrary obstacle avoidance in V12 is to "simply" include many many video examples of good drivers avoiding obstacles... no more occupancy network middle-man.

To think about V12 properly, you can use the sensor fusion example. When you have lidar, radar, and vision, and one disagrees with another, what do you do? This is a V11 concept. V11's BEVs, autolabels, occupancy network, lane networks, etc. can be thought of as different "sensors" that need to be fused. V12 tries to get rid of all these and goes all-in on vision, in a "first principles" kind of way.
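To illustrate items 3 and 5 above, here's a toy sketch of what "one raw stream in, decisions out" means structurally (everything here is hypothetical; a real system would use a serious video backbone, not these stand-ins):

import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    # Raw video plus route metadata in, controls out.
    # No lane, object, or occupancy outputs anywhere in between.
    def __init__(self):
        super().__init__()
        self.video_enc = nn.LazyLinear(256)  # stand-in for a real video backbone
        self.meta_enc = nn.Linear(4, 32)     # e.g. GPS destination + nav hints
        self.policy = nn.Sequential(
            nn.Linear(256 + 32, 128), nn.ReLU(), nn.Linear(128, 2),
        )

    def forward(self, video, meta):
        z = torch.cat([self.video_enc(video), self.meta_enc(meta)], dim=-1)
        return self.policy(z)                # e.g. steering angle, acceleration

controls = EndToEndPolicy()(torch.randn(1, 3 * 64 * 64), torch.randn(1, 4))

Obstacle avoidance then isn't a module you can point to; it either emerges from the training examples or it doesn't.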
 
What really blows my mind is that until this v12 AI revelation by the twit and his team earlier this year they didn’t have a solution. They were futzing around with software that wasn’t going to get them to L3+ and they knew it.

FSDb 11.4.x is a trainwreck and that was their most recent “false horizon.” They’ve been blowing smoke up our backsides for years and now here we are with yet another complete rewrite coming down the tubes.

This v12 edition is a new horizon, I guess. Time will tell if it’s false.
 
What really blows my mind is that until this v12 AI revelation by the twit and his team earlier this year they didn’t have a solution. They were futzing around with software that wasn’t going to get them to L3+ and they knew it.

FSDb 11.4.x is a trainwreck and that was their most recent “false horizon.” They’ve been blowing smoke up our backsides for years and now here we are with yet another complete rewrite coming down the tubes.

This v12 edition is a new horizon, I guess. Time will tell if it’s false.
You’re overthinking the situation. No question: FSD is a research project, where research is defined as, “The process of running up alleys to find out if they’re blind.”

Nobody has invented a working L5 (or whatever it’s called) system yet. And there’s this thing: there may very well be theoretical limits on what an FSD system requires, and the V3 computer may well exceed them. But just because the V3 clears that theoretical bar doesn’t mean that building practical application software that approaches the limit is some kind of easy slam dunk. Or that it’s obvious, as you’re implying, that a particular method hasn’t gotten, or is not likely to get, to that level of performance.

What happens in an environment like this is that it becomes critical that people be smart, realize that a particular approach isn’t going to work, and be creative as all get out in coming up with new approaches that get around the perceived bottlenecks. (And, believe you me, it’s HARD to back off from a failing approach. There’s always that one new thing to try... been there, done that, got the T-shirt. As for coming up with new approaches, good engineers who can do that are literally worth their weight in gold, while whole companies have foundered for lack of good ideas.)

What Musk was doing, acting as a with-it sounding board for the near-geniuses doing the heavy lifting, but not so close as to become enraptured of any particular approach, willing to accept that an approach may fail and simply pivot to something else, is priceless work.

I’m retired now, but I’m pretty sure I would run out of both fingers and toes counting up the managers I’ve known who couldn’t do that, or who did it poorly.

So, nah, this wasn’t some kind of scam. This is finding that this alley’s full of bricks, so let’s try this bit more promising one, with a bunch of management and engineers that can make it happen.

Fun times.
 
What really blows my mind is that until this v12 AI revelation by the twit and his team earlier this year they didn’t have a solution. They were futzing around with software that wasn’t going to get them to L3+ and they knew it.

FSDb 11.4.x is a trainwreck and that was their most recent “false horizon.” They’ve been blowing smoke up our backsides for years and now here we are with yet another complete rewrite coming down the tubes.

This v12 edition is a new horizon, I guess. Time will tell if it’s false.
V12 will be released 6 months after this thread ends. 😂
 
The goal of V12 is to make it like ChatGPT. Elon has made this clear many times.

ChatGPT is not explicitly taught what a definition is, what a noun, subject, verb, grammar structure, etc. etc. It makes internal representations of these human concepts based on the vast data it's given.
ChatGPT is *not* a good example here, since it suffers badly from hallucinations ... something that, for obvious reasons, is not desirable in a self-driving car.

The fact is, there is room for BOTH NN and traditional code-driven logic .. the former is probabilistic but more malleable, the latter deterministic but more brittle. Things like emergency braking you want to be as deterministic as possible, while stuff like which lane to choose can be more relaxed, since the consequences of being wrong are mostly just inconvenience, whereas AEB can mean (literally) life or death.
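A toy sketch of that split, with all thresholds and names invented for illustration: a probabilistic NN proposes controls, and a small deterministic layer keeps veto power over the safety-critical case.

def safe_controls(nn_steer: float, nn_accel: float,
                  obstacle_distance_m: float, speed_mps: float) -> tuple:
    # Deterministic safety wrapper around a probabilistic NN planner (toy numbers).
    MIN_GAP_S = 1.0  # hypothetical time-gap threshold for emergency braking
    if speed_mps > 0 and obstacle_distance_m / speed_mps < MIN_GAP_S:
        return nn_steer, -1.0   # hard-coded AEB: full braking, no NN override
    return nn_steer, nn_accel   # otherwise defer to the (malleable) NN output

# e.g. safe_controls(0.02, 0.3, obstacle_distance_m=8.0, speed_mps=15.0) -> (0.02, -1.0)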
 
ChatGPT is *not* a good example here, since it suffers badly from hallucinations ... something that, for obvious reasons, is not desirable in a self-driving car.

The fact is, there is room for BOTH NN and traditional code-driven logic .. the former is probabilistic but more malleable, the latter deterministic but more brittle. Things like emergency braking you want to be as deterministic as possible, while stuff like which lane to choose can be more relaxed, since the consequences of being wrong are mostly just inconvenience, whereas AEB can mean (literally) life or death.

ChatGPT is the example given by Elon multiple times in relation to V12 though.

And ChatGPT hallucinates for many reasons, the main ones are it's fed lots of data mostly indiscriminately, and the human prompts are sparse and flawed.

V12 is different in that the data is well-massaged / curated with a tight feedback loop on failures with the data engine. V12's data consists of extreme examples to the mundane. Also, there's no flawed human prompts, the "prompts" are the same types of data-rich pixel streams as its training set.
 
And ChatGPT hallucinates for many reasons, the main ones are it's fed lots of data mostly indiscriminately, and the human prompts are sparse and flawed.

V12 is different in that the data is well-massaged / curated with a tight feedback loop on failures with the data engine. V12's data consists of extreme examples to the mundane. Also, there's no flawed human prompts, the "prompts" are the same types of data-rich pixel streams as its training set.
How do you know this about V12? Or are you just speculating? I'm also unclear on what your last sentence means.
 
ChatGPT is the example given by Elon multiple times in relation to V12 though.

And ChatGPT hallucinates for many reasons, the main ones are it's fed lots of data mostly indiscriminately, and the human prompts are sparse and flawed.

V12 is different in that the data is well-massaged / curated with a tight feedback loop on failures with the data engine. V12's data consists of extreme examples to the mundane. Also, there's no flawed human prompts, the "prompts" are the same types of data-rich pixel streams as its training set.
It's worth noting that ChatGPT (GPT-3.5) is trained with Reinforcement Learning from Human Feedback (RLHF).

But in regard to Tesla's V12: remember that AlphaGo, AlphaZero, etc. were only possible because there were great Go and chess engines beforehand. Crazy Stone, for example, was a Go-playing AI system that played at a professional level. These systems' existence made it possible for the engineers to understand how such systems achieve professional-level play and where they can fail. Then, replicating those algorithms in neural network architectures became possible, which comes with the multiplier improvement that ML offers, automatically catapulting them to world-class level.

But if these systems hadn't existed, it would have been harder and taken longer for the engineers to replicate what needed to be done in the neural network. Remember, it's humans who architect these networks, and they architect them based on what they believe is necessary to achieve a particular result. If they don't know, then they won't be able to assemble a set of architectures that hits the mark they're looking for. It would literally be playing darts in pitch darkness.

The same was the case with AlphaZero. AlphaZero was only possible BECAUSE of the lessons from AlphaGo, which let the team see what made AlphaGo work and then take the over-engineered parts and replace them with RL self-play.

What Tesla is doing, however, is rushing to do an E2E planner without actually having a human-level system (or something close to it) as a working foundation to replicate. So let's be generous and say that FSD Beta as a whole has a disengagement every 100 miles. Changing from a C++ planner to a 99% NN planner isn't going to give you a 1,000x improvement. Let's say in a fairy-tale scenario that it gives them a 10x improvement. They are still left with a system that fails every 1,000 miles, which is far short of what they need to go driverless.

Take, in contrast, Waymo. They took a system that was ~100% NN for perception, mostly NN for prediction, and mostly C++ code for the planner. Basically, the planner was just like the Go and chess engines before AlphaGo. But at least they had a system that worked in suburbs. The problem is that its limits were very low, so it would fail when it ran into construction. But because Waymo had a foundation of a system with a robotics planner that was at the very least at human level, they can now turn the robotic planner into an ML planner piece by piece while retaining the performance and getting all the ML benefits for free.

So they went from mostly C++ code to an "ML first" planner. I would think of it as a ~60% NN planner. This has allowed them to go from only being able to do driverless in suburban environments, in light rain, with zero construction, to handling city and urban environments, heavy rain, heavy fog, construction, storms, debris, roadblocks, road detours, dead ends, etc. All while being driverless.

But if they didn't already have a robotics (C++) planner that achieved human-level performance or better, then yes, they could transition to a ~100% NN planner, but it wouldn't give the system human-level performance. It's the knowledge of what you can implement as a NN that matters, not the fact that you have an NN planner. That is why Wayve, even though it has a 100% NN planner, is nowhere near human-level performance. This is why OpenPilot, which constantly markets end-to-end so it can hitch onto Tesla and sell devices, doesn't even have a working system.

TLDR: By building system 1.0, you learn how to properly build system 2.0. This is the case in everything. Whether you are building a web application, phone application, game, regular software, ML software, hardware, building architecture, cooking, etc.

People have falsely thought that if you geofence to a city, you automatically get human-level performance. If that were true, then everyone would be driving driverless. It's clearly not true: for driverless cars in the US, it's basically Waymo, then Cruise, and then everyone else. While the gap between Waymo and Cruise is huge, the gap between Waymo and everyone else is unimaginable.

So yes, V12 will provide improvements and add new features (pullover, parking lots, u-turns maybe, emergency vehicle handling maybe, 3-point turns maybe, dead ends maybe, etc.), but the performance will be similar to or only incrementally better than V11 (2-3x).
 
I don't think the new world model evolved from the occupancy network. Ashok pointed out the shortcomings of the occupancy network in his recent CVPR talk.

Yeah, not sure why people are confusing this. But you're right, it's completely different.
The world model is just a generative AI model that generates a time-series output (video, segmentations, etc.) and can be conditioned on language and other parameters.

It's similar to what Wayve is working on, called GAIA.



For details about the world model, watch this. Start at 27 mins 30 secs.
 
This is why OpenPilot, which constantly markets end-to-end so it can hitch onto Tesla and sell devices, doesn't even have a working system.
At risk of sounding like I’m defending Tesla, George Hotz *did* say recently in an interview that Tesla FSD will always be a few years ahead of the OpenPilot team because Tesla has better access to compute and “they’re not doing anything wrong.” Take that with a grain of salt.
 
How do you know this about V12? Or are you just speculating? I'm also unclear on what your last sentence means.

I'm only pointing out the differences in the training data and also inputs during inference.

The potential for hallucinations in ChatGPT vs. V12 is totally different, because everything about the data is different, so it's difficult to make that analogy with respect to hallucinations.

For example, ChatGPT is trained on everything from fiction novels to encyclopedias, so it has lies, fantasy, sci-fi, and everything in between.
 
I'm only pointing out the differences in the training data and also inputs during inference.

The potential for hallucinations in ChatGPT vs. V12 is totally different, because everything about the data is different, so it's difficult to make that analogy with respect to hallucinations.

For example, ChatGPT is trained on everything from fiction novels to encyclopedias, so it has lies, fantasy, sci-fi, and everything in between.
That was my point ... ChatGPT is NOT a good example, as I noted in my post. I still remain unclear on what you mean by this:

"V12 is different in that the data is well-massaged / curated with a tight feedback loop on failures with the data engine. V12's data consists of extreme examples to the mundane. Also, there's no flawed human prompts, the "prompts" are the same types of data-rich pixel streams as its training set."

Care to expand on this?