That's not an example of "good"; that's an example of "familiar".

How can any non-local ever drive through those situations? How does any driver become a local?

By adding notes to their internal map.

You're right, I think a good but unfamiliar driver would err on the side of getting in the line, assess the situation and then move out when he sees that it's just a line for the Costco rather than the left turn ahead that he wants to make.

This still seems impossible for V12 to reason out. I guess the alternative is just to skip that left turn and reroute ahead. Or attempt to squeeze in near the end of the line.

All of these options seem difficult to generalize though with the V12 approach.
 
You're right, I think a good but unfamiliar driver would err on the side of getting in the line, assess the situation and then move out when he sees that it's just a line for the Costco rather than the left turn ahead that he wants to make. ...
That may be handled at the navigation level. If it can't get in the left turn lane it wants, keep going and circle back. Not the most efficient route, but a route nonetheless.
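Here's a rough sketch of what that navigation-level fallback could look like, with all the names and thresholds made up for illustration (this is obviously not Tesla's actual routing code):

# Hypothetical "missed the turn lane, circle back" fallback.
from dataclasses import dataclass

@dataclass
class Maneuver:
    description: str      # e.g. "left turn onto Main St"
    required_lane: int    # lane index the car must be in
    distance_m: float     # distance remaining to the maneuver point

def next_action(upcoming: Maneuver, current_lane: int,
                min_merge_distance_m: float = 50.0) -> str:
    """Execute the maneuver if feasible, otherwise ask the router for a new route."""
    if current_lane == upcoming.required_lane:
        return f"execute: {upcoming.description}"
    if upcoming.distance_m > min_merge_distance_m:
        return "attempt lane change toward required lane"
    # Can't get over in time: skip the turn and recompute ("circle back").
    return "continue straight; request reroute from current position"

print(next_action(Maneuver("left turn onto Main St", required_lane=0, distance_m=30.0),
                  current_lane=2))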
 
I saw the jjricks videos you're referring to. The car doesn't just see a block and make a decision by itself. When software makes a decision by itself, it's basically instantaneous. When there's a remote intervention or approval involved, there's a delay. Simple as that.

The root problem is we have no way to conclude one way or another because Waymo is secretive. But we can make educated guesses based on what we know about software and other info.

With that logic you might as well call every action by the car "remote intervention", including the ones below:

 
On the topic of localization, it seems Tesla will have a very difficult time solving the following:

1) in some locales, a good driver will see a line of cars and get behind it because he knows all those cars are waiting for the left turn ahead

2) in another locale in a similar situation, a good driver avoids the line because he knows it's just a line of people waiting to turn into the Costco

How will Tesla ever be able to solve both cases?
Exactly why localization is a must
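One way to picture "localization" here is as the "notes to their internal map" mentioned earlier in the thread: per-location hints layered on top of a generic policy. A toy sketch with made-up keys and strings, not anyone's real map format:

# Toy locale-specific hints overriding a generic lane choice.
DEFAULT_ADVICE = "queue in the left lane for the upcoming left turn"

# Hints keyed by (road segment id, time-of-day bucket).
LOCAL_HINTS = {
    ("segment_1234", "weekend_morning"):
        "avoid the left-lane queue: it is the line for the Costco entrance",
}

def lane_advice(segment_id: str, time_bucket: str) -> str:
    return LOCAL_HINTS.get((segment_id, time_bucket), DEFAULT_ADVICE)

print(lane_advice("segment_1234", "weekend_morning"))  # the local note wins
print(lane_advice("segment_9999", "weekday_evening"))  # falls back to the default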
 
Please someone tell me my interpretation of Douma is wrong and why :)

It comes down to time. It is only a few months' work to replace the "control" heuristics with a NN. It would take years or decades to produce one end-to-end NN that does everything; some of that computer science doesn't even exist yet.

Instead, they'll have all the foundations they've built already, plus the smoothness and control capabilities of a NN. This needs to ship by end of year and get Europe and China running FSD; then they move on to HDW 4, all BEFORE the vast majority of Dojo compute is online.

No way they redo everything. Elon is just talking out his ass about a rewrite of everything. They are physically not capable of doing that.

There are three main NN functions:
  1. Perception (that's "Tesla Vision")
  2. Planning (that's mostly the "language of lanes" based on LLM tech), and
  3. Control (that was 300+K lines of C++ code; with v12 it will be a NN driving).
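A minimal sketch of that claimed three-way split, with invented function names and stubbed outputs (this is my reading of v12, not anything Tesla has published):

# Three separate NN functions handing off to each other; only the control
# stage changes from C++ heuristics to a learned net. Everything here is made up.

def perception(camera_frames):
    """'Tesla Vision': raw pixels -> scene representation (lanes, objects, occupancy)."""
    return {"lanes": ["ego_lane", "left_turn_lane"], "lead_vehicle_gap_m": 25.0}

def planner(scene, route):
    """'Language of lanes': scene + nav route -> desired trajectory and speed."""
    return {"trajectory": "follow ego_lane 200 m, then turn left",
            "target_speed_mps": 12.0}

def control_heuristics_v11(plan):
    """Pre-v12: the ~300K lines of hand-written C++ reduced to a caricature."""
    return {"steering": 0.0, "accel": 0.3 if plan["target_speed_mps"] > 10 else 0.1}

class TinyControlNet:
    """Stand-in for the v12 learned control policy (weights omitted)."""
    def forward(self, plan):
        # A real net would regress steering/accel from the planned trajectory.
        return {"steering": 0.0, "accel": min(0.5, plan["target_speed_mps"] / 30.0)}

def drive_step(camera_frames, route, use_v12=True):
    scene = perception(camera_frames)
    plan = planner(scene, route)
    return TinyControlNet().forward(plan) if use_v12 else control_heuristics_v11(plan)

print(drive_step(camera_frames=None, route="home -> office"))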
If an analogy helps you understand things better, think of how a rally race car team works:
there is a Driver and a Navigator. The Navigator looks at the maps, knows where they're going, and follows the progress, giving timely instructions like turn left, get in the center lane, slow down dumbass.

Passing these instructions takes a certain amount of time, the Navigator must give the Driver time to react, and to make things happen. But it is the driver that twitches with the steering wheel when a big damn rock appears over the crest of a hill, he doesn't wait for the Navigator to tell him he needs to react. Oh, and they both have eyes (Tesla Vision), it's just their roles and responsibilities are separate, but they work as a team. The driver in realtime, doing the twitchwork, the navigator following progress closely and giving instructions, while gathering information from maps, signs, weather, road conditions, traffic-alerts, messages from passengers, emergency vehicles and flagmen, ... ;)

Should I post a video here? :D
 
Elon specifically said that they are designing FSD to be able to operate w/o a cellular data connection. That means you can not depend on access to remote operators.
Waymo doesn't need an always-connected cellular connection, and they still use remote ops. Cruise is on their own with that requirement.
It comes down to time. It is only a few months' work to replace the "control" heuristics with a NN. ...
I too believe that Tesla isn't doing end-to-end as Elon claims, but actually mid-to-mid as Waymo outlined years ago with ChauffeurNet.

It still satisfies all of Elon’s quotes like “no one told it to stop at signs…etc” or “no one told it what a traffic light is”

 
It comes down to time. It is only a few months' work to replace the "control" heuristics with a NN. ...

Yes, with human logic and heuristics, you make sense, but what you're saying isn't congruent with what was said during the livestream. The James Douma interviews should have taken specific quotes from the livestream and analyzed them, rather than speculating about e2e in general.

For example, how do we reconcile this statement by Ashok wrt perceptual objects during the livestream?

Ashok: "Internal to its mind, it might know all these concepts and how we think about it and lanes and like labels and things like those, but we have just not explicitly asked for it." 23:45

You don't simply eliminate 297k lines of planning / control code without a major change to perception itself. The whole approach has been changed.
 
Yes, with human logic and heuristics, you make sense, but what you're saying isn't congruent with what was said during the livestream. ...
From the biography, the 300k was the path-planning code that could potentially be replaced with a NN, which they took for a test drive in April. Same front-end perception stack (AFAICT).
 
Yes, with human logic and heuristics, you make sense, but what you're saying isn't congruent with what was said during the livestream. The James Douma interviews should have taken specific quotes from the livestream and analyzed them, rather than speculating about e2e in general.
Elon did the livestream on a whim, was 4 hrs late to start, then filmed it on an iPhone. How much 'prep' do you think he actually did for this event from which you are dissecting quotes? Here's an Elon quote for you: "I'm wrong often." Better to focus on the technicals.

What you're ignoring is that there is not enough compute in the universe to do a complete end-to-end retraining starting with zero weights. IT HAS TO BE based on prior work, all b.s. aside.
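The "has to be based on prior work" point is basically the standard transfer-learning argument: freeze (or lightly fine-tune) the already-trained perception weights and train only the new control piece, instead of starting from zero weights. A generic PyTorch sketch of that idea, with toy layer sizes and no claim that this resembles Tesla's actual training setup:

# Generic transfer-learning sketch: reuse a pretrained backbone, train only the
# new head. Layer sizes are toys; this is not Tesla code or data.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())   # stands in for the trained vision stack
for p in backbone.parameters():
    p.requires_grad_(False)                                 # freeze: keep the prior work as-is

control_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))  # steer, accel
optimizer = torch.optim.Adam(control_head.parameters(), lr=1e-4)  # only the new piece is trained

# One toy training step on fake data: features of 8 clips -> human steering/accel labels.
frames = torch.randn(8, 512)
human_actions = torch.randn(8, 2)
loss = nn.functional.mse_loss(control_head(backbone(frames)), human_actions)
loss.backward()
optimizer.step()
print(f"toy loss: {loss.item():.3f}")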

For example, how do we reconcile this statement by Ashok wrt perceptual objects during the livestream?

Ashok: "Internal to its mind, it might know all these concepts and how we think about it and lanes and like labels and things like those, but we have just not explicitly asked for it." 23:45
How is that a conflict? They didn't change the "Planner" neural net, just interfaced it to the new "Control" NN.

You don't simply eliminate 297k lines of planning / control code without a major change to perception itself. The whole approach has been changed.
Pardon? The 300+K lines of control code have no perception capabilities whatsoever. They just react to inputs provided by the planner. And in v11 the planner takes inputs from perception to create its occupancy network. It remains to be seen if this function is retained or re-architected, but it is and will remain separate from the "Control" NN, which is the v12 innovation.

If you need to hang on to something Elon said during the livestream, he said "nothing but nets, baby". As in plural, more than 1 net. Not a single end-to-end net. Most informed commenters take this to mean that all the work related to driving in v12 will be done by NNs, and they are separated by function into 3 areas:
  1. Perception
  2. Planning, and
  3. Control
It shouldn't surprise you that Elon didn't communicate that clearly on Aug 24th. The ad-hoc nature of the livestream should inform you of that. Remember what he said as he was apologizing for the dead space and repetition during the 45 min drive? He said 'somebody else will fix that.' And promptly, YouTubers released 'supercuts' of the livestream.

TL;dr Don't overthink what Elon said, 'cuz he didn't.
 
Elon specifically said that they are designing FSD to be able to operate w/o a cellular data connection. That means you can not depend on access to remote operators.

I think there is a difference between an AV that requires a constant data connection to drive all the time and an AV that has the ability to connect to remote operators if needed. Elon does not want the car to need a constant cellular connection, for the obvious reason that you do not want the AV dependent on a cell signal to drive. Otherwise, you could run into a situation like Cruise had, where all your AVs stall when the network goes down. So he is right about that.

But AVs will never be perfect. It is inevitable that they will get "stuck" at some point; it is just a matter of how rarely it happens. So if Elon thinks that Tesla will simply train FSD until it never gets stuck, he is fooling himself.

Now, if there is a human in the driver seat, they can take over if the car gets stuck. So you can do "eyes off" FSD in consumer cars with a human in the driver seat, and you do not need remote assistance. And as we see with FSD beta, you can have the human in the driver seat be the "safety driver" to take over as needed. So you do not need remote assistance while you are developing FSD, and you can keep a safety driver until the system is ready to go driverless.

But if you want to go driverless, like robotaxis, or have a car with no steering wheel or pedals at all, then you absolutely need remote assistance. With no human in the driver seat and no steering wheel or pedals, you need a way to control the car if it gets stuck or in an emergency. The only solution is remote assistance. So if you are doing driverless/robotaxis, you need the ability to connect to a remote operator if the robotaxi has a problem; you just don't need a constant data connection in order to drive all the time.
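To make the distinction concrete, here's a hypothetical sketch of "drive offline, phone home only when stuck" (not a description of Waymo's, Cruise's, or Tesla's actual protocols):

# The driving loop runs entirely onboard; connectivity only matters on the stuck path.
import time

STUCK_TIMEOUT_S = 30.0   # how long with no progress before asking a human for guidance

class Robotaxi:
    def __init__(self):
        self.last_progress = time.monotonic()

    def drive_step(self):
        made_progress = self.onboard_stack_step()   # perception/planning/control, no network
        if made_progress:
            self.last_progress = time.monotonic()
        elif time.monotonic() - self.last_progress > STUCK_TIMEOUT_S:
            self.request_remote_assistance()        # the only place a connection is needed

    def onboard_stack_step(self) -> bool:
        return True   # placeholder: returns False when the car can't find a safe path

    def request_remote_assistance(self):
        # A remote operator reviews the scene and suggests a waypoint or path;
        # the car still executes the maneuver itself.
        print("stuck: requesting remote guidance")

Robotaxi().drive_step()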
 
ChatGPT and similar have spawned a "field" of prompt engineers, who find the best ways to massage the correct outputs from GPTs.

One of the reasons ChatGPT "hallucinates" is that our human-made prompts are "flawed" to the GPT. For example, if you ask, "did Lincoln eat an apple in 1850," this input into the GPT is very data sparse, not based in reality, not "real world" so to speak. It's a flawed prompt, after all.

The fundamental reason the LLMs 'hallucinate' is that they have no valid internal knowledge that distinguishes "this is real" from "this is something that I've read": the distinction that even children understand, since they can truthfully say whether something is "pretend" or "real".

It's a major philosophical and technical problem.

V12 only deals with real-world data. Its data streams used for inference are only based in the real world and are a million times more data-rich than a human-made prompt to a GPT.
The prompt isn't the issue here.

The sentence "did Lincoln eat an apple in 1850" is only a couple of bytes at one instance in time, whereas an 8-camera raw data stream is gigabytes per second.
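Back-of-envelope numbers for that comparison, with the camera count, resolution, frame rate and bit depth all assumed for illustration rather than taken from Tesla's specs:

# Rough data-rate comparison; every camera parameter below is an assumption.
prompt = "did lincoln eat an apple in 1850"
prompt_bytes = len(prompt.encode("utf-8"))        # the whole prompt, delivered once

cameras = 8
width, height = 1280, 960     # assumed resolution
fps = 36                      # assumed frame rate
bytes_per_pixel = 1.5         # assumed ~12-bit raw

camera_bytes_per_s = cameras * width * height * fps * bytes_per_pixel
print(f"prompt: {prompt_bytes} bytes, total")
print(f"cameras: {camera_bytes_per_s / 1e9:.2f} GB every second")
print(f"per-second ratio: {camera_bytes_per_s / prompt_bytes:,.0f}x")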
The high bit rate of the cameras can make a generative AI produce plausible notions of "what it would look like if I were to go *there*", literally the target loss function for generative video modeling, but it doesn't help with the underlying problem: "*should I want* to go there, or somewhere else?"

The key problems are not on the perception side (though high resolution cameras are needed), they are on the policy side, and a generative video architecture doesn't help. The inputs to the policy (where did the car drive when humans did it) are comparatively very sparse and do not sample the space of potential pathways and evaluate them. The low bit rate and low information coverage of this has been mentioned by Karpathy as a significant deficiency. He's made some comments on Twitter which indirectly seem to doubt the success of a completely blind end-to-end ML system.
 
You're right, I think a good but unfamiliar driver would err on the side of getting in the line, assess the situation and then move out when he sees that it's just a line for the Costco rather than the left turn ahead that he wants to make. ...
The hack around this, in a densely populated area, is crowdsourced routing. I.e. not a neural network approach, but a big data approach.

Back at the Mothership Cloud, the server would know that most people who wanted to end up at a location along your desired route (say, a waypoint past the turn) made *this* lane choice here 75% of the time, on average. The autodriver would then choose the most commonly chosen path as its preferred route.

So in the case above, it wouldn't be able to reason through the reality of "this is Costco", but it would know that on average human drivers who needed to get further do not typically get in that lane.

This is an intensive data-heavy solution, but could work in areas where there are many Teslas driving and recording and uploading. It would need high quality sub-lane localization data.

It would have to download lane-routing suggestions when you started the navigation. It's a complex programming problem, but it would make use of the high density of Teslas.
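A toy version of that aggregation, with an invented key scheme and thresholds, just to show the "most commonly chosen lane wins" idea:

# Toy crowdsourced lane-choice aggregation. Keys and thresholds are made up.
from collections import Counter, defaultdict

# (road segment id, downstream waypoint id) -> counts of observed lane choices
observations = defaultdict(Counter)

def record_drive(segment_id, downstream_waypoint, lane_index):
    """Called once per uploaded fleet trace that passed this segment."""
    observations[(segment_id, downstream_waypoint)][lane_index] += 1

def suggested_lane(segment_id, downstream_waypoint, min_samples=100):
    counts = observations[(segment_id, downstream_waypoint)]
    total = sum(counts.values())
    if total < min_samples:
        return None                                  # too little data: fall back to map defaults
    lane, n = counts.most_common(1)[0]
    return lane if n / total >= 0.5 else None        # only trust a clear majority

# e.g. 75% of drivers continuing past the Costco avoided the queue lane (lane 0)
for _ in range(75):
    record_drive("seg_42", "wp_past_costco", lane_index=1)
for _ in range(25):
    record_drive("seg_42", "wp_past_costco", lane_index=0)
print(suggested_lane("seg_42", "wp_past_costco"))    # -> 1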