The assumption is that timing is just as important as any other parameter of the training data. What is it about the training process that would encourage the system to want to react faster than the data as presented? Why not crowd a lane divider more tightly on a turn? Why not accelerate away from a traffic light more quickly? The idea of departing from the training parameters seems non-intuitive.

If you can accept that v12 can already generalize with respect to things like location, shape, color, shadow, why is it so hard to understand that it could also generalize with respect to timing?

We saw Elon's live stream. V12 seemed to have a pretty good handle on what other cars looked like and that it shouldn't try to drive through them. We know it's not remotely possible for Tesla to have collected training data on every make, model, and color of car for the network to recognize them.

Part of the mechanism that could allow it to generalize with respect to time is being able to see the "future" of training data. Even if a human in the training data doesn't respond quickly, the network can see the future consequences of not responding quickly to those types of scenarios; and will learn to respond to them faster to avoid them.
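One purely hypothetical way to operationalize that idea in a data pipeline (not anything Tesla has described) is to weight training clips by what happens in the seconds that follow, so demonstrations whose late reactions end in a hard-brake event count for less:

```python
# Hypothetical sketch: weight imitation-learning clips by their future outcome.
# Nothing here reflects Tesla's actual pipeline; names and thresholds are invented.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    t: float        # timestamp in seconds
    accel: float    # longitudinal acceleration in m/s^2 (negative = braking)


def clip_weight(frames: List[Frame], horizon_s: float = 3.0,
                hard_brake: float = -4.0) -> float:
    """Down-weight a clip if a hard-brake event occurs within `horizon_s`
    of its start, on the theory that the demonstrated reaction came too late."""
    start = frames[0].t
    future = [f for f in frames if f.t - start <= horizon_s]
    if any(f.accel <= hard_brake for f in future):
        return 0.2   # still usable, but well-timed demonstrations dominate
    return 1.0


# A clip where the driver slams the brakes two seconds in gets a low weight.
late_reaction = [Frame(0.0, 0.1), Frame(1.0, 0.0), Frame(2.0, -5.5)]
print(clip_weight(late_reaction))   # -> 0.2
```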
 

I think you're attributing to the training set characteristics that haven't been discussed by Elon or Ashok, etc.

The training set is simple, based on what we know: videos of good drivers along with all the current vehicle metadata. The NN is asked to output the vehicle controls based on the stream of video / metadata it's given.

Any "future" prediction is only based on what the human drivers predicted in the videos. The NN is confined by what it is trained on. It doesn't understand "consequences" of any actions and doesn't understand what a future is or means.

V12 is essentially a very "dumb" pixel pattern recognizer that has no understanding of human concepts like cars or lanes, etc. It's just very good at recognizing the smallest nuances in pixel flow and outputting vehicle controls to generalize what good human drivers did in those cases.
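As a rough, toy-scale sketch of what "videos and metadata in, controls out" training looks like in the abstract (a minimal behavior-cloning loop; the real network, inputs and losses are unknown):

```python
# Toy behavior-cloning sketch: map a short video clip to steering/accel targets.
# Purely illustrative; Tesla's actual architecture and training code are unknown.
import torch
import torch.nn as nn

class TinyDrivingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # 8 frames x 3 channels stacked on the channel axis, 64x64 "video"
        self.backbone = nn.Sequential(
            nn.Conv2d(8 * 3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 2)   # [steering, acceleration]

    def forward(self, clips):
        return self.head(self.backbone(clips))

model = TinyDrivingNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Fake batch: 4 clips of 8 RGB frames, plus the controls the human driver applied.
clips = torch.randn(4, 8 * 3, 64, 64)
human_controls = torch.randn(4, 2)

pred = model(clips)
loss = loss_fn(pred, human_controls)   # "output what the good human driver did"
loss.backward()
optimizer.step()
print(float(loss))
```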
 
  • Like
Reactions: Goose66 and JB47394
If you can accept that v12 can already generalize with respect to things like location, shape, color, shadow, why is it so hard to understand that it could also generalize with respect to timing?
The distinction lies in inputs versus outputs. Generalizing on inputs is mainstream neural network stuff: pattern matching. In contrast, I'm not familiar with extrapolation techniques on outputs. I know that they exist, but I get the impression that they are not an emergent behavior of a network.
 
V12 is very fascinating and may be the story behind why people like Karpathy and Kate Park (the data-engine project manager from AI Day 2) left Tesla. There's also John Emmons, who talked about the language of lanes during AI Day and is nowhere to be seen nowadays.

V12 is the realization that Tesla can't create heuristic rules that cover all driving environments across the world. The task is too daunting, and a single code base covering all these situations is difficult to maintain.

Instead, V12 will recognize nuanced pixel patterns that give it locale context as well. So it will be able to recognize that it's in Paris, for example, and follow the driving patterns of good drivers there. Tesla will essentially include thousands/millions of driving videos from all locales in the world in the dataset, and they'll distill it into an NN with a predetermined number of parameters (~1 billion).
 
  • Like
Reactions: Captkerosene
My take on the recent v12 news is very positive. Honestly quite exciting and should generate optimism - that's not to say victory is right around the corner.

On the first page of this thread, I posted a concern, well placed at the time, that Elon might be misappropriating the term "end-to-end" to mean simply that every software block, in the complex architectural flow diagram, would be implemented with localized machine learning methodology, which is not the same as real end-to-end training and solution finding.

However, the latest descriptions recorded by Elon and Ashok, coupled with other tidbits* from Elon's characteristically enigmatic tweets, have me pretty well convinced that v12 is, after all, a real example of end-to-end system implementation.

* What comes to mind are a few things:
  • His seemingly new attitude that engineering resources are not the constraint now but training bandwidth clearly is, which I found a bit puzzling on the heels of two consecutive AI Days that were openly and strongly identified as AI-engineer recruitment events. In the recent context this shift in emphasis makes more sense (though I'm sure they still want to grow with top-notch engineering talent)
  • The apparent loss of momentum with the 11.4.x releases. Obviously it's impossible to know whether this represents some level of resource shifting rather than simply failure to squash some stubborn bugs - I can only say that it feels like development of this branch is now half-hearted. In this context, that's not annoyed whining (there's plenty of that in the v11 discussion thread) but just an observation about the slow pace, just at a time when there would otherwise be real urgency to bring the latest and greatest FSD to 2023 production on both HW3 and (in particular) HW3.5 / 4
  • Elon's comments about the team coming to a new realization leading to a fundamental simplification, but more lamenting what they'd been overlooking rather than bragging about the brilliance of a new breakthrough. Sorry, I cannot find this tidbit right now, but I think it's fairly recent, maybe a verbal quote rather than a tweet.
The list of Elon's hints above is interesting in retrospect, but really I think the key is Ashok's simple comment about programming through data instead of code. No, it's not an original insight, but it's concise: a simple clarification coming from someone who has enormous credibility and personal investment in all kinds of clever code implementations to solve complex tasks.
(I shouldn't have to spend much time in his defense, but I'll just note the unbelievably insulting and contemptible denigrations of this man's capability and motivations just in this thread and at least one other. I've tried to make some effort to see the positives from the bitterest and most cynical critics here, but it's just amazing what kind of mudslinging we see from dedicated scoffers. Skepticism is fine, but in my book you will never, ever, be counted as "right" by couching your arguments in that kind of childish smearing.)

So, although I wasn't sure at the time, I officially take back the implication that it was a misuse of the term "end-to-end." I no longer think that, even if we could proceed to a third-order discussion of what that term does or should mean.

What I got from the video is that it was remarkably competent, convincingly so, and very encouraging for what is obviously a true alpha version. I frankly disagree with some of the suggestions over the weekend that the driving was actually not very skillful in general. Some of the cited examples are simply not bad driving; others are arguably improper, but so prevalent in human driving that it becomes pointless or hypocritical to object to the car doing things that nearly everyone does, nearly all the time. Of course it wasn't flawless; to me that wasn't the point. It was/is an incredibly significant existence proof of a very different approach; credit and not derision is called for when the development team can re-examine their basic assumptions and put serious effort into something that upends so much prior work.

I have a lot of questions about how the inputs, outputs and required overrides are interfaced with the system (the rather sparse conversation in the video suggests that overrides can only be done through distorted-data training, but I'm not sure that principle will hold). This also brings up the sticky problem of misplaced regulatory requirements vs. appropriate caution and oversight. It may be too much to hope that appointed techno-bureaucrats can be open-minded in the way that Ashok and team seem to be, but I hope they can make themselves part of the solution as everyone is learning how this new technology really needs to work.

Overall, I'm very much looking forward not only to continued demonstrations and then early releases, but also to presentations that we occasionally get from Tesla. I think the content of those will be very different but I want to see what they can tell us about what they've learned.
 
Well said and thought out view. Refreshing from the traditional forum love/hate dialogue.
 
Regarding simulation (much of yesterday's topic here) and whether or not it can be applied to end-to-end training: I don't see why not. Vast training from real-world scenes should, I expect, set a good foundation for which entities are the critical actors, threats or vulnerabilities, thus naturally sifting out irrelevant details of what the sidewalks, lawns and buildings exactly look like, yet at the same time, for example, reinforce the ability to interpret their highly correlating shadows as non-threats, and to differentiate those from other dark blobs that don't seem to belong in the road.

I'm not deeply familiar with research in this area, but it strikes me that the imperfections and simplifications of simulated video might actually reinforce the training in an unexpected way. It tells the trainee what are the important features on which to concentrate, in order to achieve a high score. And it teaches that reactions to unimportant details, realistic or not, only lower the score.

Perhaps a huge set of videos with a relatively less diverse set of traffic actors and controls, but with endless deliberately randomized background scene features, or even just blotchy noise, becomes a useful reinforcement as the training moves beyond the initial basics.
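As a toy illustration of that randomized-background idea (purely speculative with respect to Tesla's pipeline), an augmentation step might keep a crude "road" region of each frame and replace everything else with blotchy noise, so only driving-relevant pixels stay consistent across samples:

```python
# Toy domain-randomization sketch: keep the lower "road" region of each frame,
# replace the background with blotchy noise so only driving-relevant pixels stay
# consistent across training samples. Illustrative only; not Tesla's method.
import numpy as np

def randomize_background(frame, road_fraction=0.4, rng=None):
    """frame: (H, W, 3) uint8 image. Keeps the bottom `road_fraction` rows,
    fills the rest with low-resolution noise."""
    rng = rng or np.random.default_rng()
    h, w, _ = frame.shape
    out = frame.copy()
    cut = int(h * (1.0 - road_fraction))      # rows above this are "background"
    # Low-resolution noise upsampled so it looks blotchy rather than per-pixel static.
    blotches = rng.integers(0, 256, size=(cut // 8 + 1, w // 8 + 1, 3), dtype=np.uint8)
    out[:cut] = np.kron(blotches, np.ones((8, 8, 1), dtype=np.uint8))[:cut, :w]
    return out

frame = np.zeros((128, 256, 3), dtype=np.uint8)   # stand-in for a camera frame
augmented = randomize_background(frame)
print(augmented.shape)                            # (128, 256, 3)
```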

Simulators can't get everything perfect, and in human training they can't impart full realism (think g-forces, aromas and the like); still I think they've proven highly useful in training pilots, war fighters and other operators. I recall that the army found early on, that kids who scored high in combat video games - not super realistic at the time - would statistically do better in conventional military training and evaluations (obviously after weeding out the ones with unrecoverable deficits in physical or emotional fitness).
 
To push back against the talk of 'victory laps', 'breakthrough' and 'game set match'.

ALVINN was the first car that used an end-to-end NN and a camera to drive, back in 1989.
Obviously the system was constrained by hardware and NN architectural limitations.



Decades later, NVIDIA's BB8 (aka DAVE-2) also drove using just cameras and an end-to-end NN.
It had much improved hardware and a much more advanced NN architecture.
As you can see, it could make turns, handle construction zones, etc.,
although it was simply using the SOTA architecture available back then in 2016 (CNNs).

Obviously, NN architectures have improved tremendously today (attention, GANs, NeRFs, Transformers, diffusion), thanks largely to Google AI/DeepMind. What you are seeing now is just the same idea that has long existed, implemented with the latest SOTA model architectures. That's it.

I will repeat the same thing I have said a hundred times. Same things I have said after the 'mind blown' hype of versions 8, 9, 10, 11.

Version 12 will release showing improvements in some areas and regressions in others. It might improve performance 2-5x, for example. But there's a HUGE gap between a system going 100 miles between safety disengagements (if we are being generous to the current v11) and the necessary X00,000 miles between safety disengagements to remove the driver.

Version 12 would have to provide a 1,000x improvement compared to version 11, which isn't happening.
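The arithmetic behind that claim, as a quick back-of-the-envelope check (both mileage figures are the assumptions stated above, not measurements):

```python
# Back-of-the-envelope: improvement factor needed to go from the assumed
# miles-per-safety-disengagement today to a driverless-worthy figure.
# Both numbers are assumptions from the post, not measured values.
current_miles_per_disengagement = 100        # generous guess for v11
target_miles_per_disengagement = 100_000     # "X00,000" with X = 1

required_improvement = target_miles_per_disengagement / current_miles_per_disengagement
print(f"Required improvement: {required_improvement:,.0f}x")   # -> 1,000x

# A 2-5x jump, however welcome, barely moves the needle on that scale.
for jump in (2, 5):
    print(f"After a {jump}x jump: {current_miles_per_disengagement * jump} miles "
          f"per disengagement vs. {target_miles_per_disengagement:,} needed")
```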

What will likely occur is that, after a couple of months, just like what preceded with v8 (surround video), v9 (vision only), v10 (mind blowing), v11 (one stack to rule them all) and now v12 (end to end), a new hype-cycle slogan will be created for v13.

End-to-end is a popular idea that a lot of companies/players have looked into over the years and continue to look into (Wayve, etc.).
But it's just not viable yet, even with the latest SOTA architectures.


Waymo famously did research into a mid-to-mid approach called ChauffeurNet back in 2019.
As you can see, it 'works'. But for autonomous driving, if it's not 'working' for hundreds of thousands of miles between safety disengagements, then it doesn't actually 'work'.

(Please watch both videos)


 
You certainly have a lot of certainty based on assumptions from a 30-minute video. Lol. I suspect many here would cheer a 5x improvement over V11, being honest. Also, I don't think many care who created it so long as Tesla delivers it, but you keep searching for the negatives.
 
Regarding simulation (much of yesterday's topic here) and whether or not it can be applied to end-to-end training: I don't see why not. Vast training from real-world scenes should, I expect, set a good foundation for which entities are the critical actors, threats or vulnerabilities, thus naturally sifting out irrelevant details of what the sidewalks, lawns and buildings exactly look like, yet at the same time, for example, reinforce the ability to interpret their highly correlating shadows as non-threats, and to differentiate those from other dark blobs that don't seem to belong in the road.

I'm not deeply familiar with research in this area, but it strikes me that the imperfections and simplifications of simulated video might actually reinforce the training in an unexpected way. It tells the trainee what are the important features on which to concentrate, in order to achieve a high score. And it teaches that reactions to unimportant details, realistic or not, only lower the score.

Perhaps a huge set of videos with a relatively less diverse set of traffic actors and controls, but with endless deliberately randomized background scene features, or even just blotchy noise, becomes a useful reinforcement as the training moves beyond the initial basics.

Simulators can't get everything perfect, and in human training they can't impart full realism (think g-forces, aromas and the like); still I think they've proven highly useful in training pilots, war fighters and other operators. I recall that the army found early on, that kids who scored high in combat video games - not super realistic at the time - would statistically do better in conventional military training and evaluations (obviously after weeding out the ones with unrecoverable deficits in physical or emotional fitness).

I don't think Tesla is using simulation videos for training because:

1) Tweets from Tim Zaman and others only mention/imply training on real-world videos

2) Elon and Ashok only mention good driver videos during the test drive

3) Elon / Ashok only mentioned using V12 in shadow mode in real cars/videos and seeing if there's a mismatch between the driver and V12's decisions

4) Elon mentions curating the hard-to-find examples of people stopping to 0 mph at stop signs. If simulation were possible, they'd just simulate 1-2 mph as 0 mph, for example, by inserting some extrapolated frames/vehicle metadata (see the sketch below).

It's intuitive to me that simulation videos aren't used in this case because of how strictly the NN is trained on the nuances of the pixels.
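For what it's worth, the stop-sign curation mentioned in point 4 would amount to a filter of roughly this flavor (the clip format, field names and thresholds are invented for illustration):

```python
# Hypothetical curation filter: keep only clips where the driver actually
# reached ~0 mph near a stop sign. Field names and thresholds are invented.
from typing import Dict, Iterable, List

def full_stop_clips(clips: Iterable[Dict]) -> List[Dict]:
    """Each clip dict is assumed to carry a speed trace (mph) and a flag for
    whether a stop sign was detected in that clip."""
    keep = []
    for clip in clips:
        if clip["stop_sign_present"] and min(clip["speed_mph"]) <= 0.5:
            keep.append(clip)
    return keep

clips = [
    {"id": "a", "stop_sign_present": True,  "speed_mph": [25, 12, 3, 1, 0, 4]},
    {"id": "b", "stop_sign_present": True,  "speed_mph": [25, 14, 6, 2, 2, 8]},  # rolling stop
    {"id": "c", "stop_sign_present": False, "speed_mph": [40, 41, 39]},
]
print([c["id"] for c in full_stop_clips(clips)])   # -> ['a']
```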
 
I think you misunderstand my question, "Do you have any examples of NNs being better than their human-derived training set?"

There are plenty of examples of NNs being better than an average / typical human. That's not what I'm asking though.

I mean, is there an example of an NN trained on a human-derived dataset that is better than any human at that particular task? In our example, we're talking about reaction time / 360-degree awareness.

For example, V12 can definitely be 2x-10x *safer* than a typical human (based on the fact that it never gets tired, always drives based on "good" humans, etc.), but I don't think it can react faster than its training set.
I think you're assuming the car learns delays from the drivers, however:

If the car learns to go or continue on green, stop or continue on yellow, and stop on red, what will its reaction time be to a changing light?

If the car learns hitting objects (or being hit) is bad, how quickly will it respond to an intersecting object? (Including avoiding getting rear ended)

Human perception lags reality by 750 ms.
Human reaction lags perception by another 750 ms.
NNs can react as soon as the data hits the threshold.

NHTSA | Safety 1N Num3ers - August 2015 | Speeding
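Putting rough numbers on that comparison, using the figures above plus a guessed frame-capture and inference budget for the car (all of these are illustrative assumptions, not measurements):

```python
# Illustrative latency comparison using the figures from the post plus guessed
# camera and inference budgets for the car. None of these are measured values.
human_perception_ms = 750      # from the post
human_reaction_ms = 750        # from the post
camera_frame_ms = 1000 / 36    # assumed ~36 fps camera
inference_ms = 30              # assumed per-frame network latency

human_total_ms = human_perception_ms + human_reaction_ms
car_total_ms = camera_frame_ms + inference_ms

print(f"Human: ~{human_total_ms:.0f} ms from event to control input")
print(f"Car:   ~{car_total_ms:.0f} ms (frame capture + inference)")
print(f"Ratio: ~{human_total_ms / car_total_ms:.0f}x faster, if the numbers hold")
```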
 

I think it all comes down to how much you think the network is capable of generalizing from the training data:

No generalization would be: "Look for green pixels at position X, Y in camera Z at location lat/long, wait 2 seconds like the training data does, accelerate."

Some generalization would be: "Having some implicit understanding of what an intersection is, wait until the light closest to vertically above your lane turns green, wait for 2 seconds like the training data does, accelerate."

Full generalization would be: "Having an understanding of what an intersection is, what a traffic light means, and the rules of the road, accelerate when it's your turn to go."

I know it's hard to fathom neural networks having anything close to an "understanding" of the world, but they do appear to be capable of achieving some sort of process that approximates understanding. There are some academics writing on this topic, e.g. Do Large Language Models Understand Us?

"It is sometimes claimed, though, that machine learning is “just statistics,” hence that, in this grander ambition, progress in AI is illusory. Here I take the contrary view that LLMs have a great deal to teach us about the nature of language, understanding, intelligence, sociality, and personhood. Specifically: statistics do amount to understanding, in any falsifiable sense."
 
  • Like
Reactions: JB47394
To illustrate the concept of human thinking creeping into it: we've both assumed green lights matter*.
Why not: stop on red, stop if safely able on solid yellow?

*They do in terms of detecting a failed tricolor traffic light or direction-specific signalling, but in a "go unless a blocking condition exists" logic flow, they are just the else/default case.
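Spelling out that else/default framing as a tiny decision flow (an illustration of the point, not actual control logic):

```python
# Toy "go unless a blocking condition exists" flow: red and stoppable-yellow are
# explicit blocking cases; green never needs its own rule because it falls
# through to the default. Illustrative only.
def proceed(light: str, can_stop_safely: bool, path_clear: bool) -> bool:
    if light == "red":
        return False
    if light == "yellow" and can_stop_safely:
        return False
    if not path_clear:               # blocked intersection, crossing traffic, etc.
        return False
    return True                      # the else/default case

print(proceed("green", can_stop_safely=True, path_clear=True))    # True
print(proceed("yellow", can_stop_safely=True, path_clear=True))   # False
print(proceed("yellow", can_stop_safely=False, path_clear=True))  # True
```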
 
  • Like
Reactions: scaesare
Version 12 will release showing improvements in some areas and regressions in others. It might improve performance 2-5x, for example. But there's a HUGE gap between a system going 100 miles between safety disengagements (if we are being generous to the current v11) and the necessary X00,000 miles between safety disengagements to remove the driver.

Version 12 would have to provide a 1,000x improvement compared to version 11, which isn't happening.

What will likely occur is that, after a couple of months, just like what preceded with v8 (surround video), v9 (vision only), v10 (mind blowing), v11 (one stack to rule them all) and now v12 (end to end), a new hype-cycle slogan will be created for v13.

End-to-end is a popular idea that a lot of companies/players have looked into over the years and continue to look into (Wayve, etc.).
But it's just not viable yet, even with the latest SOTA architectures.

I sorta agree with you on this.

V12 rubs me the wrong way.

V11 essentially generated an HD map on-the-fly and drove on it with NN-based heuristics (if <car_dbl_parked>, overtake).

It's difficult for me to see how you can get an NN to make consistently reliable decisions based on a pixel flow. I'm not sure how it can generalize across the millions of "edge" cases in pixels (which can look different in all sorts of lighting / weather / reflective situations).
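To make the contrast concrete, a caricature of the two styles (neither resembles Tesla's real code):

```python
# Caricature of the two approaches described above. Purely illustrative.

# V11-style: a perception stack emits structured objects, then explicit rules act on them.
def v11_plan(perceived):
    if perceived.get("car_double_parked_ahead"):
        return "overtake"
    if perceived.get("lead_car_stopped"):
        return "stop"
    return "follow_lane"

# V12-style: a single learned function from raw pixels (and metadata) to controls.
def v12_plan(pixels, metadata, learned_net):
    return learned_net(pixels, metadata)   # no named concepts anywhere in between

fake_net = lambda pixels, metadata: {"steer": 0.02, "accel": 0.3}
print(v11_plan({"car_double_parked_ahead": True}))
print(v12_plan([[0.1, 0.2]], {"speed_mph": 25}, fake_net))
```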

For example, people will think that the disengagement in Elon's livestream is minor, but to me, it's a telling sign of the difficulty of using pixel nuances to drive a car.

The other thing that bothers me as well is that Tesla is taking the heuristics upstream: into data curation. How they decide what counts as a "good" driver, and how many examples of each case to include, is somewhat problematic as well.

Anyway, Ashok seems to have very high confidence in V12, and I trust him, so we'll see.
 
If the car learns hitting objects (or being hit) is bad, how quickly will it respond to an intersecting object? (Including avoiding getting rear ended)
I'm wondering whether the training will result in the car never getting into situations where it would have to test its reaction time. We react when we become aware of a situation only at the last moment. When will that happen to a car with ~30 millisecond reaction times and 360° vision - and where there is a constructive reaction to the situation in that short timeframe?
 
  • Like
Reactions: Mullermn
For example, people will think that the disengagement in Elon's livestream is minor, but to me, it's a telling sign of the difficulty of using pixel nuances to drive a car.
Somebody observed that the training set may have told the car to move at a light when other cars around it move. So it may not have been (unintentionally) trained to watch for the location and color of the light so much as the coarser cues of moving cars. For me, that's the screwy thing about this kind of training - you're not sure what lessons it is learning because you're not walking it through a series of steps of discovery about driving. It's all just "watch and figure it out for yourself".

Which reminds me of the Starman scene.