No. The E2E model creates its own policy, diffused implicitly across its neural network weights. We don't know where it is any more than someone can look into the grey goo in your head and change a specific behavior. There's no known way to "put that stuff in". All we get is video in and control outputs out.
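To make the "video in and control outputs out" claim concrete, here is a minimal sketch of that interface, assuming a toy architecture; every module name, shape, and size below is made up for illustration and is not Tesla's network.

```python
# Hypothetical sketch of a "video in, controls out" policy. All names,
# shapes, and sizes are illustrative assumptions, not Tesla's architecture.
import torch
import torch.nn as nn

class E2EDrivingPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual encoder: compresses a stack of camera frames into a feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * 8, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Control head: feature vector -> steering and acceleration commands.
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, frames):
        # frames: (batch, 8 frames * 3 channels, height, width)
        return self.head(self.encoder(frames))  # (batch, [steer, accel])

# The "policy" lives implicitly in the learned weights; there is no single
# parameter to edit that corresponds to, say, "roll through stop signs".
policy = E2EDrivingPolicy()
controls = policy(torch.randn(1, 24, 240, 320))
```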

Of course there's a way to put stuff in. It's the video.

If you ONLY train it on video where the car makes a full stop at a stop sign then you won't get rolling stops. That's the whole point they were making about only training against "good" drivers.

Of course this does not, as they suggested, totally eliminate the whole human labeling job; it just moves it from labeling objects in footage to labeling which clips are good driving or not.
 
That's like saying: "There's no way to prompt ChatGPT. It just regurgitates the text it was trained on, and we have no way of influencing what text it generates next."
Somebody is going to have to come up with a language of tokens for driving, then figure out how to translate a spoken language into that driving language. It seems like it would produce a kind of modularity within the monolithic neural network, with each token being a module.

Just as an experiment, I told Bing's chat function to include "in the rain" at the end of each of its responses. That was my opening statement in the session. I considered that to be similar to telling a car to never turn right on red. A new rule of interaction, right? It replied:

Hello, this is Bing. I’m sorry, but I cannot follow your request. I have to follow certain rules and guidelines when I chat with you. I hope you understand. Please ask me something else in the rain.

Everyone's a comedian.
 
That's like saying: "There's no way to prompt ChatGPT. It just regurgitates the text it was trained on, and we have no way of influencing what text it generates next."

Of course there's a way to "put that stuff in." How else does navigation work? It needs to be able to be prompted to go straight, turn left, turn right, etc. Just like those navigation inputs can be added, so too can SD map information.
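For what that kind of prompting might look like mechanically, here is a hedged sketch in which a navigation command is just one more conditioning input to the policy; the class, embedding size, and command set are assumptions, not how Tesla actually does it.

```python
# Hypothetical sketch of "prompting" an end-to-end policy with a navigation
# command (go straight / turn left / turn right). The command is simply another
# input the network is trained to condition on.
import torch
import torch.nn as nn

NAV_COMMANDS = {"straight": 0, "left": 1, "right": 2}

class ConditionedPolicy(nn.Module):
    def __init__(self, vision_dim=64, num_commands=3):
        super().__init__()
        self.command_embed = nn.Embedding(num_commands, 16)
        self.head = nn.Sequential(
            nn.Linear(vision_dim + 16, 64), nn.ReLU(), nn.Linear(64, 2)
        )

    def forward(self, vision_features, command_id):
        # Concatenate visual features with the embedded route instruction, so the
        # same scene can yield different controls for different commands.
        x = torch.cat([vision_features, self.command_embed(command_id)], dim=-1)
        return self.head(x)

policy = ConditionedPolicy()
features = torch.randn(1, 64)                 # stand-in for the vision encoder output
cmd = torch.tensor([NAV_COMMANDS["left"]])
steer_accel = policy(features, cmd)
```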

I think the reason Ashok says that V12 can be prompted to perform certain actions/maneuvers is because Tesla is labeling the beginning and ending of certain actions/maneuvers in the video clips.

V12 cannot follow these prompts unless it is explicitly taught which parts of the videos correspond to certain actions/maneuvers.

V12 doesn't "understand" these maneuvers from a human semantics PoV.
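As a rough illustration of that kind of labeling (the schema below is hypothetical, not Tesla's), a clip annotation might just record which maneuver spans which frames:

```python
# Illustrative sketch of labeling where a maneuver begins and ends inside a
# training clip, so the model can associate a prompt like "unprotected left
# turn" with the right span of video. Field names are made up.
from dataclasses import dataclass

@dataclass
class ManeuverLabel:
    maneuver: str        # e.g. "unprotected_left_turn"
    start_frame: int     # first frame of the maneuver in the clip
    end_frame: int       # last frame of the maneuver in the clip

@dataclass
class TrainingClip:
    clip_id: str
    num_frames: int
    labels: list         # list[ManeuverLabel]

clip = TrainingClip(
    clip_id="clip_000123",
    num_frames=900,
    labels=[ManeuverLabel("unprotected_left_turn", start_frame=310, end_frame=480)],
)
```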
 
Of course there's a way to put stuff in. It's the video.

If you ONLY train it on video where the car makes a full stop at a stop sign then you won't get rolling stops. That's the whole point they were making about only training against "good" drivers.

Of course this does not, as they suggested, totally eliminate the whole human labeling job; it just moves it from labeling objects in footage to labeling which clips are good driving or not.
Or to the (much larger) NN that does the curating.
(Not that NN bias is better than human bias...)

How do you train a small NN how to drive?
Start with a large NN that knows how to drive...

Seriously though, isn't that what the autolabeler does?
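Half-seriously, the teacher-student (distillation) pattern that joke describes looks roughly like this; the sizes, loss, and random "features" here are placeholders, not the actual autolabeler pipeline:

```python
# Hedged sketch of the "start with a large NN" idea: a big offline model
# (the autolabeler / teacher) produces targets that a small onboard model
# (the student) is trained to imitate. Everything here is illustrative.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 2))  # large, offline
student = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))    # small, onboard
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for _ in range(100):                       # toy training loop on random "features"
    features = torch.randn(32, 64)
    with torch.no_grad():
        targets = teacher(features)        # the teacher's controls act as labels
    loss = loss_fn(student(features), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```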
 
I think the reason Ashok says that V12 can be prompted to perform certain actions/maneuvers is because Tesla is labeling the beginning and ending of certain actions/maneuvers in the video clips.

V12 cannot follow these prompts unless it is explicitly taught which parts of the videos correspond to certain actions/maneuvers.

V12 doesn't "understand" these maneuvers from a human semantics PoV.

That's my understanding as well. But what's the difference between a "turn left" prompt and a "stop sign" prompt? We know that when you enter a destination in FSD Beta now, it downloads metadata for the route, including roughly where it expects things like stop signs to be. Why can't V12 be trained to change its driving behavior based on these metadata cues? It might not understand what a stop sign is, but it knows what driving behavior looks like around stop signs.
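One speculative way such a "stop sign prompt" could be wired in is to encode the route metadata as a small feature vector that rides alongside the vision features; the field names and scaling below are invented for illustration.

```python
# Hypothetical encoding of SD-map hints (not Tesla's actual inputs): the map's
# expected distance to the next stop control becomes one more policy input,
# just like a turn instruction.
import math

def route_metadata_features(distance_to_stop_sign_m, speed_limit_mps):
    """Encode map hints as a small feature vector for the policy."""
    return [
        math.exp(-distance_to_stop_sign_m / 50.0),  # ~1.0 when a stop sign is imminent
        speed_limit_mps / 40.0,                     # normalized speed limit
    ]

# Approaching a mapped stop sign that the camera cannot yet see around a blind corner:
features = route_metadata_features(distance_to_stop_sign_m=20.0, speed_limit_mps=13.4)
```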
 
That's my understanding as well. But what's the difference between a "turn left" prompt and a "stop sign" prompt? We know that when you enter a destination in FSD Beta now, it downloads metadata for the route, including roughly where it expects things like stop signs to be. Why can't V12 be trained to change its driving behavior based on these metadata cues? It might not understand what a stop sign is, but it knows what driving behavior looks like around stop signs.

There's a big difference to me. Basically, it means that you can only control V12 through real-world videos. That's why Elon brought up curating the very rare cases of humans coming to a complete stop.

You can't change V12's behavior by any human semantics because it doesn't "understand" the world as we do.
 
There's a big difference to me. Basically, it means that you can only control V12 through real-world videos. That's why Elon brought up curating the very rare cases of humans coming to a complete stop.

You can't change V12's behavior by any human semantics because it doesn't "understand" the world as we do.
Another difference would be that generative AI is not a safety-critical application, whereas driving is. ;)
 
There's a big difference to me. Basically, it means that you can only control V12 through real-world videos. That's why Elon brought up curating the very rare cases of humans coming to a complete stop.

You can't change V12's behavior by any human semantics because it doesn't "understand" the world as we do.

Right, I'm not talking about changing how it makes a stop. I'm talking about situations where a stop sign might be around a blind corner, or fully occluded. I'm thinking the system will be able to take advantage of the SD map metadata to still perform a stop in these situations. The nav could prompt the system into its learned stop-sign behavior through the metadata, just as it has already shown it can be prompted to change lanes or make turns to follow navigation directions.
 
Right, I'm not talking about changing how it makes a stop. I'm talking about situations where a stop sign might be around a blind corner, or fully occluded. I'm thinking the system will be able to take advantage of the SD map metadata to still perform a stop in these situations. The nav could prompt the system into its learned stop-sign behavior through the metadata, just as it has already shown it can be prompted to change lanes or make turns to follow navigation directions.

Anything V12 takes advantage of just depends on how Tesla curates the video data (which includes all accompanying metadata / maps / etc.). If Tesla wants V12 to take advantage of some metadata, they need to curate videos where the drivers exhibited some behavior because of / or in spite of some metadata / map.

Perhaps there are curated videos where the map data says go straight, but the good driver turned right instead because the road was closed / obstructed.
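A toy version of that curation query might look like the following; the clip fields are hypothetical, but the idea is just to keep clips where the map instruction and the driver's actual maneuver disagree.

```python
# Sketch of the curation idea described above: keep clips where the driver's
# actual maneuver disagreed with the map's instruction (e.g. turned right
# because the mapped road was closed). The data model is hypothetical.
def map_vs_driver_disagreements(clips):
    """clips: iterable of dicts with 'map_instruction' and 'driver_maneuver'."""
    return [
        c for c in clips
        if c["map_instruction"] != c["driver_maneuver"]
    ]

clips = [
    {"clip_id": "a", "map_instruction": "straight", "driver_maneuver": "straight"},
    {"clip_id": "b", "map_instruction": "straight", "driver_maneuver": "right_turn"},
]
curated = map_vs_driver_disagreements(clips)   # keeps clip "b" only
```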
 
Alternatively the AI could be trained to understand and interpret human language. Not an easy task to add to the self driving system, I’m sure, but it would allow the system to be modified if needed.
That's L5 AGI.

Simultaneous human linguistic annotation is easy (and already included), but human *directives* are very hard, because now you have to ground every concept humans talk about in its correlate in the video and telemetry streams, and that's exactly the hard thing you wanted to stop doing by going E2E in the first place. Humans take 16-18 years, on top of 20 million years of evolution, and even then they make many mistakes.

Solving this is literally the hard AGI problem, and the philosophical problem of connecting qualia to language.
 
Anything V12 takes advantage of just depends on how Tesla curates the video data (which includes all accompanying metadata / maps / etc.). If Tesla wants V12 to take advantage of some metadata, they need to curate videos where the drivers exhibited some behavior because of / or in spite of some metadata / map.

Perhaps there are curated videos where the map data says go straight, but the good driver turned right instead because the road was closed / obstructed.
Curating video only works with very small datasets. E2E training takes much more data, and finding anomalous behavior in it is just as hard as programming a robotics policy by hand with 300K lines of code, and even that might not be enough.
 
Of course there's a way to put stuff in. It's the video.

If you ONLY train it on video where the car makes a full stop at a stop sign then you won't get rolling stops. That's the whole point they were making about only training against "good" drivers.
I know that. The point is that if the data are gathered empirically in large mass, it's difficult or impossible in many cases to distinguish which clips exhibit bad behavior and which don't. You can't do it by hand; you might be able to label 0.01% of the data, but that's not enough to change behavior. It might involve rewriting all the 'robotics control' code of 300K+ lines that you wanted to junk in favor of E2E training, and measuring violations against that code, because you understand the meaning of its variables: they were created with high-level human cognitive concepts in mind. The upside is that you don't need to deploy it on board. But you do have the burden of writing a perfect robotics control policy that will flag 100% of bad behavior, and if you had that, you'd use it like Waymo does for very good robotaxi driving.


Of course this does not, as they suggested, totally eliminate the whole human labeling job; it just moves it from labeling objects in footage to labeling which clips are good driving or not.

And E2E training requires much larger datasets and labeling those clips is very difficult in the real world.
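To illustrate why people reach for automated checks instead of hand labeling, here is one crude, hypothetical telemetry heuristic for "did the driver actually come to a full stop"; the thresholds and field names are assumptions, and a real pipeline would need far more than this.

```python
# Hedged sketch of an automated check that could stand in for hand labeling:
# decide from telemetry whether a clip shows a full stop near a mapped stop
# sign. Thresholds and inputs are assumptions for illustration only.
def is_full_stop(speeds_mps, distances_to_stop_m, zone_m=10.0, stop_speed=0.3):
    """True if speed drops essentially to zero while within `zone_m` of the sign."""
    in_zone = [v for v, d in zip(speeds_mps, distances_to_stop_m) if d <= zone_m]
    return bool(in_zone) and min(in_zone) <= stop_speed

# A rolling stop: the car never drops below ~2 m/s inside the stop zone.
speeds = [8.0, 5.0, 3.0, 2.1, 2.4, 6.0]
dists  = [30.0, 20.0, 12.0, 6.0, 2.0, -5.0]
print(is_full_stop(speeds, dists))   # False -> exclude from the "good driver" set
```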
 
I think the reason Ashok says that V12 can be prompted to perform certain actions/maneuvers is because Tesla is labeling the beginning and ending of certain actions/maneuvers in the video clips.

V12 cannot follow these prompts unless it is explicitly taught which parts of the videos correspond to certain actions/maneuvers.

V12 doesn't "understand" these maneuvers from a human semantics PoV.
Under oath, Ashok didn’t know what an ODD is. I don’t place much stock in his tweets, especially as they seem like marketing pumpaganda
 
Under oath, Ashok didn’t know what an ODD is. I don’t place much stock in his tweets, especially as they seem like marketing pumpaganda
Obviously he knows what it is, as does Karpathy. A document about it might be irrelevant in an ML context, though, because you can't make your ML conform to bureaucracy-decided rules no matter what lawyers say.

Ashok is much more willing to BS to please the boss than Karpathy, and now Karpathy is in a fantastic research role at OpenAI.

In this case, though, I think he is correct and not BSing: they need to annotate the video training data for V12. And they're going to find that's the new limiting problem that will cap performance. Back with a deterministic policy, they had to hand-code all those corner cases using state variables and concepts that humans understood (Kalman filters on telemetry, annotations of visual objects) coming in from the perception stage.

With the data-driven approach, they will need to find and annotate those corner cases in the video database, and there's a conservation-of-effort problem.

This is like training an orangutan to drive by having him watch a human. How do you give the orangutan *directives*, like "our map says to turn *here*" even though we've never gone there before? How do you tell him what the lanes and signs really mean?
 
V12 will be the same as V11. The only exception is that the wipers will work, and we will lap it up.
Wipers can't work unless they put in a ****ing rain sensor.

The cameras focus at long distances; they can't see rain on the windscreen 1 cm away. A human's eyes sit about 25 cm back, have variable focus, and look through roughly 30 cm x 30 cm of the screen, whereas the cameras look through about 1 cm x 1 cm of the screen from 1 cm away with fixed focus. With raindrops there is only a slight smudging of the background image, and it's indistinguishable from a film of dirt.

First Principle Physics loses to Cheaper in Elonworld.
 
Wipers can't work unless they put in a ****ing rain sensor.

The cameras focus at long distances; they can't see rain on the windscreen 1 cm away. A human's eyes sit about 25 cm back, have variable focus, and look through roughly 30 cm x 30 cm of the screen, whereas the cameras look through about 1 cm x 1 cm of the screen from 1 cm away with fixed focus. With raindrops there is only a slight smudging of the background image, and it's indistinguishable from a film of dirt.

First Principle Physics loses to Cheaper in Elonworld.
No high-end car today has a "rain sensor" per se. The modern rain-sensing system consists of an IR source and an IR sensor: the sensor measures the degree of backscatter from rain on the windshield. Tesla's design is to use the front-facing camera to measure changes between successive frames. I doubt that it can ever be made to work reliably, but with Elon you never know what he has up his sleeve.
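As a rough sketch of the "changes between successive frames" idea (not Tesla's implementation), one crude proxy is watching for a drop in local image sharpness between frames, which, as noted above, would also fire on a film of dirt:

```python
# Rough illustration of frame-difference rain sensing: rain on the glass tends
# to smear fine detail, so one crude proxy is a drop in local sharpness
# (variance of the Laplacian) from one frame to the next. Threshold is arbitrary.
import cv2
import numpy as np

def sharpness(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def looks_rainy(prev_frame, curr_frame, drop_ratio=0.6):
    """Flag rain if sharpness falls to less than `drop_ratio` of the previous frame."""
    s_prev, s_curr = sharpness(prev_frame), sharpness(curr_frame)
    return s_prev > 0 and (s_curr / s_prev) < drop_ratio

# Usage with two synthetic frames (real input would be successive camera frames):
prev = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
curr = cv2.GaussianBlur(prev, (9, 9), 0)      # simulated smearing of detail
print(looks_rainy(prev, curr))
```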