I can’t see there being any chance of it getting certified in Europe without being able to evidence its intent so the driver can intervene if it’s wrong. Bear in mind NOA is not even permitted to make lane changes autonomously here, it has to request the change and have the driver actively confirm it first.

That's visualization. The software is working with a changing scene and moving objects, so it has to have things figured out far enough into the future to allow for human reaction.
 
That's visualization. The software is working with a changing scene and moving objects, so it has to have things figured out far enough into the future to allow for human reaction.
But isn't that the problem with the E2E approach? Video goes in, driving decisions (apply brakes, turn wheel etc) come out. There's no intermediary state where the car has formed a plan, it just 'knows' what the right thing to do is. I thought that was basically the definition of E2E.

Unless I've got this wrong, which is very possible. I know very little about how NNs work.
 
I must admit as a heuristics-based programmer, I have little idea how V12 is architected or works.

If what Elon says is true (V12 is being tested internationally), and his livestreamed drive is what he says it is (an end-to-end approach), my mind is blown.

When I think about nothing but NNs, things like GPT4 come to mind. But GPT4 has a tendency to imagine things or make things up (hallucinate), and it's not great at being accurate or precise, which are important for self-driving.

However, one of the main reasons why GPT4 hallucinates is that the inputs (human prompts) are very sparse and carry very little data.
Hallucinations can also happen due to noise in the training data or ambiguous statistical patterns. The problem is that currently there is no systematic way to identify the exact cause of specific hallucinations in models of non-trivial size, which means there is also no systematic way to fix them. All you can do is try to use different and/or more training data, or apply manually designed restrictions to the output data (there is some interesting research to use different models to "supervise" each other, but that's still early).

I don't know what kind of models Tesla uses for driving policy, so it's not clear to me how what we see with LLMs translates to them. However, the fact that a hallucinating car can potentially kill people (which is not the case for ChatGPT, at least as long as it doesn't get access to the nuclear codes 😉) is certainly concerning.
 
I can’t see there being any chance of it getting certified in Europe without being able to evidence its intent so the driver can intervene if it’s wrong. Bear in mind NOA is not even permitted to make lane changes autonomously here, it has to request the change and have the driver actively confirm it first.
Declaring intent would be a great thing, especially in the early stages of autonomy. I've suggested as much myself.

Confirming actions just isn't going to scale. For now, sure, because autonomy isn't very advanced. But we're rapidly approaching the time where most actions planned by the software are going to be the correct ones, and if the driver is obliged to confirm everything it does then drivers will simply turn it off. I'm trying to imagine confirming every turn, stop or start that my car makes on my drives. I certainly wouldn't use it for long.

But isn't that the problem with the E2E approach? Video goes in, driving decisions (apply brakes, turn wheel etc) come out. There's no intermediary state where the car has formed a plan, it just 'knows' what the right thing to do is. I thought that was basically the definition of E2E.
What is the concern? That it works, but we can't point to a line of code that controls a given behavior? If so, isn't it more important that there is a means of solving problems that pop up? As I see things, it is when problems cannot be solved that we need to be worried. For example, will "hallucinations" be a problem in autonomy?
 
But isn't that the problem with the E2E approach? Video goes in, driving decisions (apply brakes, turn wheel etc) come out. There's no intermediary state where the car has formed a plan, it just 'knows' what the right thing to do is. I thought that was basically the definition of E2E.

Unless I've got this wrong, which is very possible. I know very little about how NNs work.
You've got things wrong. The E2E approach does not mean the software is not forming a plan; it just means one NN does all the tasks/subtasks of perception, prediction and planning instead of disparate modular NNs.

[Image: pipeline.png]
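To make the distinction concrete, here's a minimal toy sketch in PyTorch. Nothing here reflects Tesla's actual architecture; the module names and sizes are invented purely to contrast "separate modular NNs glued together" with "one jointly trained NN from pixels to controls".

# Minimal sketch (PyTorch). All module names and sizes are hypothetical,
# purely to illustrate "modular NNs" vs "one end-to-end NN".
import torch
import torch.nn as nn

class Perception(nn.Module):          # video frames -> scene features
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(8, 32))
    def forward(self, frames):
        return self.net(frames)

class Planner(nn.Module):             # scene features -> planned trajectory
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 16)
    def forward(self, feats):
        return self.net(feats)

class Controller(nn.Module):          # trajectory -> steering/accel commands
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(16, 2)
    def forward(self, plan):
        return self.net(plan)

# Modular pipeline: separate networks with hand-defined interfaces between them.
def modular_drive(frames):
    feats = Perception()(frames)
    plan = Planner()(feats)
    return Controller()(plan)

# End-to-end: one network trained jointly from pixels to controls.
# The "plan" still exists, but as internal activations rather than
# an explicit, human-defined interface between separate programs.
end_to_end = nn.Sequential(Perception(), Planner(), Controller())

frames = torch.randn(1, 3, 96, 96)    # a fake camera frame
print(modular_drive(frames).shape, end_to_end(frames).shape)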
 
How would you ever safety-validate an AI system that is basically "we don't know how, but it works like magic"?
There are many working on interpretability for neural networks of all sizes, including Wayve examining intermediate states/layers of their module-less end-to-end learning. Karpathy showed some of these hidden states with feature channel activity visualizations at AI Day 2021. So for a particular video input (and fixed weights), one could track down what resulted in a control decision and even adjust inputs to see how the outputs change.
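As a toy illustration of that workflow (the model below is a generic stand-in, not Wayve's or Tesla's network): fix the weights, hook the hidden layers, then perturb the input and watch how the hidden state and the output move.

# Toy interpretability probe (PyTorch). The model here is a stand-in, not
# any production driving network; it just shows the "fixed weights, vary
# the input, inspect hidden states" workflow described above.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(),   # pretend "perception" layer
    nn.Linear(16, 8), nn.ReLU(),    # pretend hidden "plan" state
    nn.Linear(8, 2),                # pretend control output
)
model.eval()

activations = {}
def capture(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Hook a hidden layer so we can see what the network computed internally.
model[2].register_forward_hook(capture("hidden_plan"))

x = torch.randn(1, 10)              # a fixed "scene" input
with torch.no_grad():
    baseline = model(x)

# Perturb one input feature and watch how hidden state and output shift.
x_perturbed = x.clone()
x_perturbed[0, 3] += 1.0
with torch.no_grad():
    perturbed = model(x_perturbed)

print("hidden plan activations:", activations["hidden_plan"])
print("control delta:", perturbed - baseline)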

It's not totally clear, but it seems likely Tesla is building on top of their existing perception modules that output intermediate vector space values that feed into control (for full end-to-end), visualizations, shadow mode triggers, etc. These intermediate outputs should simplify training control and improve interpretability, as they're explicitly human-defined features, so one set of failures could be tracked back to a perception issue. And along with other interpretability work, it's less "magic" and more "just math."

Even for traditional code, tracking down what out of 300k+ lines of code is causing some behavior like entering on a red light isn't as straightforward as "code says traffic light is red, so never enter" as there could be some special case path for giving space to emergency vehicles or obeying directed traffic hand signals or even some general collision avoidance behavior.

Practically, with traditional code and especially for end-to-end controls, there will need to be a suite of validation techniques ranging from comprehensive automated scenario testing, shadow mode evaluation and adversarial simulator scenarios (potentially with fuzzing/random approaches) to real-world deployment with billions of miles of experience.
 
But isn't that the problem with the E2E approach? There's no intermediary state where the car has formed a plan, it just 'knows' what the right thing to do is. I thought that was basically the definition of E2E.
End-to-end learning allows training neural networks with target outputs connected all the way back to their matching inputs. One extreme is that all intermediate states are "hidden"; another form is to have neural network modules with intermediate outputs/states that can be jointly trained. For example, the V12 demo still showing visualizations for lanes and objects potentially means these are still intermediate outputs that are also passed along to the new control network.
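One way to picture that second form, as a hedged sketch with invented losses and shapes rather than anything Tesla has described: a single network supervised both on its final control output and on an intermediate, human-defined output, with gradients flowing end to end through both.

# Hedged sketch: joint end-to-end training with an auxiliary intermediate
# output. Losses, shapes and targets are invented for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class E2EWithIntermediate(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(64, 32)      # stands in for perception
        self.lane_head = nn.Linear(32, 4)      # intermediate, human-defined output
        self.control_head = nn.Linear(32, 2)   # final steering/accel output
    def forward(self, x):
        feats = torch.relu(self.backbone(x))
        return self.lane_head(feats), self.control_head(feats)

model = E2EWithIntermediate()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 64)            # fake sensor features for a batch
lane_target = torch.randn(8, 4)   # fake "lane geometry" labels
control_target = torch.randn(8, 2)

lanes, controls = model(x)
# Jointly train: the intermediate output keeps an interpretable handle
# on the network while gradients still flow end to end into the backbone.
loss = F.mse_loss(controls, control_target) + 0.5 * F.mse_loss(lanes, lane_target)
opt.zero_grad()
loss.backward()
opt.step()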
 
You've got things wrong. The E2E approach does not mean the software is not forming a plan; it just means one NN does all the tasks/subtasks of perception, prediction and planning instead of disparate modular NNs.

[Image: pipeline.png]
When Musk says v12 is "Nets all the way baby" I think he means (at least) one NN is doing perception (occupancy network and vector space), another is doing planning and another is doing control.

So the output of one NN is the input of the next. This explains why the car can still show a visualisation of what it sees and acts upon.

I believe that is what Ashok also talked about. Not one NN to do it all. But v12 is "only NNs, no heuristics" as opposed to "many NNs and some heuristics" in v11.


Edit: if the above is correct, in theory one could go another step further and build one giant NN, but:
- I don't know if you could still see what the car (thinks it) sees;
- You lose the ability to, for example, use the perception NN worldwide but have a separate planning NN in each jurisdiction to account for differing rules of the road (see the sketch below).
So v13 is not a great idea I guess.
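For what it's worth, here's a hypothetical sketch of that second bullet (module names and sizes are made up, not Tesla's design): one perception network shared worldwide, with a small planning head swapped in per jurisdiction.

# Hypothetical sketch of the "shared perception, per-jurisdiction planner"
# idea from the list above. Module names and sizes are invented.
import torch
import torch.nn as nn

shared_perception = nn.Sequential(nn.Linear(64, 32), nn.ReLU())  # same weights everywhere

planners = nn.ModuleDict({
    "US": nn.Linear(32, 2),   # e.g. allows right turn on red
    "EU": nn.Linear(32, 2),   # e.g. different roundabout / priority rules
})

def drive(sensor_features, region):
    feats = shared_perception(sensor_features)
    return planners[region](feats)          # pick the region-specific planner

x = torch.randn(1, 64)
print(drive(x, "US"), drive(x, "EU"))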
 
I think exactly how this sort of thing will manifest in driving is unclear (to me at least). But I also think that occasional unexplained (or easily explained by a human) errors are likely with an unconstrained approach.
Mispredictions for control can be similar in nature to incorrect perception. E.g., the earlier example of mailboxes predicted as pedestrians is likely because the training data happens to have a good number of pedestrians standing in front of mailboxes, and the neural network incorrectly learns to treat mailboxes themselves as a pedestrian signal.

So for control, a common behavior could be that you start to accelerate when another vehicle next to you starts to accelerate, and this could have been what happened for the entering red light scenario in the demo with that "move forward with other vehicles" signal being stronger than "stop at red light." Potentially Tesla in these early stages of end-to-end took a random sampling of driving video without much curation, and it could even have included examples of humans incorrectly entering on red when the left turn lane starts to move forward.
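If curation really is the lever, the fix could be as mundane as a metadata filter over the clip library. A hypothetical example (all field names invented) that drops clips where the human driver entered an intersection on red:

# Hypothetical data-curation filter over clip metadata (field names invented):
# drop clips where the human entered an intersection on red, so the
# "follow the car next to you" signal can't be learned from bad examples.
clips = [
    {"id": "a1", "light_state": "red",   "entered_intersection": True},
    {"id": "b2", "light_state": "red",   "entered_intersection": False},
    {"id": "c3", "light_state": "green", "entered_intersection": True},
]

def is_clean(clip):
    return not (clip["light_state"] == "red" and clip["entered_intersection"])

training_set = [c for c in clips if is_clean(c)]
print([c["id"] for c in training_set])   # -> ['b2', 'c3']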
 
Are you skeptical that Musk and Elluswamy demoed something still mostly being controlled with traditional code instead of neural networks? Or are you skeptical that when they're able to actually release something, there will need to be a lot of traditional code added back?

What do you think about the repeated comments from the live stream:

There's no line of code that says slow down for speed bumps
There is no line of code that says give clearance to bicyclists
There is no line of code that says stop at a stop sign, wait for another car, who came first, wait X number of seconds
We have never programmed in the concept of a roundabout
We just showed a whole bunch of videos of roundabouts
The mind-blowing thing is that there's no heuristics
It doesn't know what a scooter is
It doesn't know what paddles are
There is no line of code that says this is a roundabout
There is nothing that says wait X number of seconds
Just because there's no lines of code doesn't mean that it's uncontrollable
It's still quite controllable on what you want by just adding data now
We've never programmed in the notion of a turn lane or even a lane
There's no line of code about traffic lanes at all
Based on the video that it's received, at the end of the destination you pull over to the side and park

With the various examples of not having code, it would seem like they're quite aware of all the intricate control logic that has existed up through 11.x, including special handling for crosswalk paddles placed in the middle of the road to remind people that state law requires stopping/yielding for pedestrians within the crosswalk.
I'm skeptical about that number - 99%. And what it really means. We know EM is well known for exaggeration.

For example, there were supposed to be some 300k lines of code in V11. Does 99% mean - now there are just 3k lines of code?

Does it mean - if there are 100 features in planning, 99 have been moved to NN ?

Do they just input the route to be taken and NN figures out everything else ? How about traffic lights ... right turn on red ?

How about school zones - which they don't handle now. Will NN handle that in future ?

BTW, as for roundabouts, I hope the NN handles them better than the current code does ;) The vast majority of my interventions are now to do with roundabouts. The random lane changes have mostly been fixed in 4.4.

Anyway - what I'd like to hear is .... what is still controlled by code. That tells us clearly what is done by NN. My sense is that short term planning is done by NN (exact path to take when driving on a particular road). Some of that was already done using NN (or demonstrated on AI day). In other words - V12 is more evolutionary rather than revolutionary.
----

I'm also somewhat skeptical of a NN being able to achieve the kind of reliability needed for non-supervised driving. We are talking about making no more than one mistake per year!
 
When Musk says v12 is "Nets all the way baby" I think he means (at least) one NN is doing perception (occupancy network and vector space), another is doing planning and another is doing control.

So the output of one NN is the input of the next. This explains why the car can still show a visualisation of what it sees and acts upon.

Technically speaking, there's no reason why an end-to-end network needs to have only one type of output. Everyone's assuming that the visualizations are taken from an intermediate network in the middle of the process, but you could have:

- Video in
- Controls and visualizations (and semantic planning explanations [e.g. "Stopping for stop light"]) out simultaneously

Internally, the planning/control network would be aware of the space around it one frame ahead of the visualization; but the human in the driver seat isn't any worse off for being given the visuals with a 1/36 second delay.
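A minimal sketch of that multi-output idea, with invented heads and shapes rather than anything Tesla has confirmed: one trunk, several heads, all computed in the same forward pass.

# Sketch of a single network emitting controls, a coarse visualization and a
# semantic "explanation" label in one forward pass. Everything here (heads,
# sizes, labels) is invented for illustration.
import torch
import torch.nn as nn

class MultiOutputDriver(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
        self.control_head = nn.Linear(64, 2)        # steering, accel
        self.viz_head = nn.Linear(64, 16 * 16)      # coarse top-down "map"
        self.reason_head = nn.Linear(64, 3)         # e.g. {cruise, stop_light, yield}
    def forward(self, x):
        h = self.trunk(x)
        return (self.control_head(h),
                self.viz_head(h).view(-1, 16, 16),
                self.reason_head(h).softmax(dim=-1))

controls, viz, reason = MultiOutputDriver()(torch.randn(1, 128))
print(controls.shape, viz.shape, reason.shape)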
 
V12 makes little sense to me, but then again, there are few details about it so far.

With the info we have so far, it seems V12:

1) No longer makes use of simulation
2) Moves the heuristics upstream (in data curation parameters) vs downstream (feature engineering based on perception)
3) Can no longer be "super human" because it only makes use of videos from "good" drivers. It is only "super human" in that it can potentially drive like a "good" human all the time.
4) Whatever happened to "360"-degree cameras that have better vantage points vs humans? There are certain situations where the cameras can see an oncoming car that a human can't.
 
1) No longer makes use of simulation
2) Moves the heuristics upstream (in data curation parameters) vs downstream (feature engineering based on perception)
3) Can no longer be "super human" because it only makes use of videos from "good" drivers. It is only "super human" in that it can potentially drive like a "good" human all the time.
4) Whatever happened to "360"-degree cameras that have better vantage points vs humans? There are certain situations where the cameras can see an oncoming car that a human can't.

I agree with your point 2, but I'm not sure where you're getting the other points.

On point 1, Tesla can still make use of simulation for regression testing. This network literally just takes video inputs, so the simulator would only need to simulate those camera viewpoints in virtual environments. It's unclear at the moment whether simulated training data would enhance learning, but it's possible. People used to think that LLMs would degrade when training on LLM-generated text, but so far the opposite has proved true.

On point 3, even if v12 only ever learned to drive like the average good driver in their training data, it's still super human because it never gets distracted, never gets tired, can begin responding to a situation in 1/36th of a second, and is looking in all directions at once. The average human reaction time is ~0.2 seconds. HW3's should be an order of magnitude shorter at ~0.02 seconds.

On point 4, HW3 does have full 360 coverage. There's not much overlap between the viewpoints, but as soon as an object leaves one view it appears in another. And most of the cameras are mounted above a typical driver's vantage point, so it has a better view.
 
Technically speaking, there's no reason why an end-to-end network needs to have only one type of output. Everyone's assuming that the visualizations are taken from an intermediate network in the middle of the process, but you could have:
The visualization data probably comes from multiple networks, according to their respective tasks. Surely the shortest compute path is preferred; pull the visualization data out as soon as it has been computed and then don't send it any farther through the system.
 
I agree with your point 2, but I'm not sure where you're getting the other points.

On point 1, Tesla can still make use of simulation for regression testing. This network literally just takes video inputs, so the simulator would only need to simulate those camera viewpoints in virtual environments. It's unclear at the moment whether simulated training data would enhance learning, but it's possible. People used to think that LLMs would degrade when training on LLM-generated text, but so far the opposite has proved true.

On point 3, even if v12 only ever learned to drive like the average good driver in their training data, it's still super human because it never gets distracted, never gets tired, can begin responding to a situation in 1/36th of a second, and is looking in all directions at once. The average human reaction time is ~0.2 seconds. HW3's should be an order of magnitude shorter at ~0.02 seconds.

On point 4, HW3 does have full 360 coverage. There's not much overlap between the viewpoints, but as soon as an object leaves one view it appears in another. And most of the cameras are mounted above a typical driver's vantage point, so it has a better view.

1) No longer makes use of simulation
>> I mean this in the context of video training data. In the past, Tesla used simulation video/images in their training sets (AI Day).

On point 3, even if v12 only ever learned to drive like the average good driver in their training data, it's still super human because it never gets distracted, never gets tired, can begin responding to a situation in 1/36th of a second, and is looking in all directions at once. The average human reaction time is ~0.2 seconds. HW3's should be an order of magnitude shorter at ~0.02 seconds.
>> Tesla can no longer depend on the reaction time of the perception NN if they are only training on human "good" drivers. There's no way for them to "simulate" a different outcome (i.e. improve on the reaction time of the "good" driver training videos).

On point 4, HW3 does have full 360 coverage. There's not much overlap between the viewpoints, but as soon as an object leaves one view it appears in another. And most of the cameras are mounted above a typical driver's vantage point, so it has a better view.
>> If Tesla is only including "good" driver videos in the training set, then the 360-degree cameras no longer provide any benefit vs a human's two eyes on a slow gimbal.

I know most of my answers are missing a lot of context. But the gist is that with V12, it seems Tesla has thrown the baby out with the bathwater and gone in a totally different direction. The whole approach relies on real-world videos of good drivers and essentially imitation learning based on these videos. Based on my intuition, there's no way to "fake" or simulate these videos. If you want a clean NN that works in the real world, you need to use real-world videos. There's no way to simulate the nuances of individual pixels.
 
With the caveat that human drivers have both peripheral vision and mirrors, so we really do have 360-ish vision. It should suffice as a starting point for training.

I agree, I think if the dataset is well-curated, the result will be a very very good and attentive human-like driver, but again, it won't be "super human" in the way that a 36hz "360" camera would be.
 
Tesla can no longer depend on the reaction time of the perception NN if they are only training on human "good" drivers. There's no way for them to "simulate" a different outcome (i.e. improve on the reaction time of the "good" driver training videos).

If Tesla is only including "good" driver videos in the training set, then the 360-degree cameras no longer provide any benefit vs a human's two eyes on a slow gimbal.

I think you might have a fundamental misunderstanding of how neural networks generalize.

V12 is not a parrot. It's not recording exact movements and playing them back. For any given timeframe of video data, it's trained to predict the subsequent frames of video for successfully maneuvering the situation, and the controls necessary to achieve that position. If, in one frame, it detects a red-light-running vehicle about to intersect with its path, it will predict the controls necessary to avoid a collision, even if it's never seen a case of a vehicle running a red light. All it needs to have been trained on is data of drivers correctly judging trajectories and avoiding intersecting paths of other vehicles.

I hate to continue drawing parallels to LLMs, but the above is like saying "It takes me 2 days to write an essay, so ChatGPT cannot write an essay in less than 2 days."
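For anyone unfamiliar with the training setup described a couple of paragraphs up, here's a stripped-down, generic behavior-cloning sketch (this is not Tesla's pipeline; shapes and data are synthetic placeholders): the network is trained to map frames to the controls the human actually applied, then queried on a frame it has never seen.

# Generic behavior-cloning sketch (not Tesla's pipeline): learn a mapping
# from camera frames to the controls a human driver applied, then query it
# on a frame it has never seen. Shapes and data are synthetic placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

policy = nn.Sequential(
    nn.Conv2d(3, 8, 5, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),                       # [steering, acceleration]
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# "Good driver" demonstrations: frames paired with the controls they used.
frames = torch.randn(16, 3, 96, 96)
human_controls = torch.randn(16, 2)

for _ in range(10):                        # tiny training loop
    loss = F.mse_loss(policy(frames), human_controls)
    opt.zero_grad(); loss.backward(); opt.step()

# At inference time the policy outputs controls for a novel scene; the hope
# is that it has learned the underlying regularities (trajectories, gaps),
# not memorized the individual clips.
novel_frame = torch.randn(1, 3, 96, 96)
print(policy(novel_frame))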