There is plenty of data for the auxiliary task of predicting images 50 ms ahead, but that doesn't give a driving system. Predicting words one token ahead is very close to a useful chatbot, but the auxiliary task in driving is much further away.
There have been self-supervised training approaches that input and output the same modality, e.g., partial images -> full image prediction, that can train on a lot more unlabeled data, and then later the self-supervised representation can be used to train a relatively simple linear classifier to output a different modality, e.g., image -> text labels.

Do you think if Tesla is able to get a video -> video model to work consistently, an end-to-end video -> control might not be much further away?
 
Do you think moving to self-supervised pre-training takes even more advantage of Tesla's potential for fleet data and training compute? Instead of "just" building from existing human knowledge of 11.x's network predictions for objects, occupancy, lanes, traffic controls, etc., the neural network architecture and training data could be reshaped to be more general, so it could potentially learn things that human understanding might have overlooked.

I wonder if self-supervised training including videos of crashes would even help the neural network better understand which situations are more likely to result in accidents. If there's enough examples for that level of understanding, then a finetuned control head could potentially even generally learn to avoid these dangerous situations.
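To make that two-stage idea concrete, here's a toy sketch of the recipe (self-supervised video pretraining, then a small supervised control head). All module names, shapes, and losses are my own assumptions for illustration, not anything Tesla has described:

```python
# Toy two-stage recipe: (1) self-supervised pretraining by predicting a
# held-out future frame from a clip, (2) finetuning a small control head
# on a much smaller labeled set. Shapes/names are illustrative assumptions.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Toy encoder: flattens a clip of frames into a latent vector."""
    def __init__(self, frames=8, h=32, w=32, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(frames * 3 * h * w, dim), nn.ReLU())

    def forward(self, clip):  # clip: (B, frames, 3, h, w)
        return self.net(clip)

class FrameDecoder(nn.Module):
    """Predicts the masked/future frame from the latent."""
    def __init__(self, h=32, w=32, dim=256):
        super().__init__()
        self.h, self.w = h, w
        self.net = nn.Linear(dim, 3 * h * w)

    def forward(self, z):
        return self.net(z).view(-1, 3, self.h, self.w)

encoder, decoder = VideoEncoder(), FrameDecoder()
control_head = nn.Linear(256, 2)  # e.g. (steering, accel) targets

# Stage 1: self-supervised -- abundant unlabeled fleet video.
clip = torch.randn(4, 8, 3, 32, 32)     # stand-in for fleet clips
future = torch.randn(4, 3, 32, 32)      # the held-out next frame
pretrain_loss = nn.functional.mse_loss(decoder(encoder(clip)), future)

# Stage 2: supervised finetune -- small labeled set of human controls.
controls = torch.randn(4, 2)
with torch.no_grad():                   # optionally freeze the encoder
    z = encoder(clip)
finetune_loss = nn.functional.mse_loss(control_head(z), controls)
print(pretrain_loss.item(), finetune_loss.item())
```

The point of the sketch is just the asymmetry: stage 1 can consume arbitrarily much unlabeled video, while stage 2 only has to learn a thin mapping on top of it.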

In theory it would seem to be the case: learning to predict video of future (or randomly masked) time intervals from a training set could learn a really compressed representation of many aspects of driving on roads. However, I'm not sure Tesla is going to focus on this. Look at these tweets:

[Attached screenshot of the tweets, 2023-12-04]


I interpret auto-regressive to mean self-supervised in this sense, and Elon thinks it's "too complicated". To me this signals the team is really focused on the supervised approach alone - it is the 'simplest', IMO. In response to this tweet:



Yes, but only when the amount of supervised labels (which, in that case, is token-ahead prediction) grows in direct proportion to that.

The problem with L4 is the significant out-of-bounds and controllability requirements. GPT-4 works really well because almost any statement it needs to think about is not outside its training domain, as it has ingested almost all computer-readable text.

We can't get there like that with just thin human control as the only supervision signal. There is plenty of data for the auxiliary task of predicting images 50 ms ahead, but that doesn't give a driving system. Predicting words one token ahead is very close to a useful chatbot, but the auxiliary task in driving is much further away.

This self-supervised general-purpose approach will be good at making a very natural L2+++ ADAS system, and then it will probably get stuck, and then it's a grey goo we don't understand with no path to true controllability. Humans need to guarantee controllability against ideal laws in order to pass regulation and gain acceptance, not statistical behavior on near-training-set inputs.

Potentially, if massive compute is able to generate full-fidelity simulated data indistinguishable from real in the lab (simulating control of the ego and other objects' responses, and the photon response of the sensors including weather and glare), then the massive compute-and-search directives of Sutton would work, as that would give the massive quantity of simulated correct and incorrect labels that adhere to the regulatory requirements.


What level of compute is needed? If it is 500x what is on board today, it might not work, as Moore's law has stopped.

Well, generating a volume of supervised labels is relatively easy compared to other AV players, but yeah, who knows what will be enough? I agree self-driving is harder because it's a more high-dimensional, sparse space, and the amount of data needed to converge on a robust solution could be extreme. But if Tesla can grab enough 'edge' cases to have multiple samples for almost every possible condition, I think simulation can help augment (they will need multiple samples of everything, though).

And I agree the prediction isn't simply one step into the future; there are a lot of non-linear dynamics involved that make it much harder for every additional second into the future you go.

I think an L2++ is close to guaranteed, but I have no idea when L4/L5 will happen in a generalized sense. However, if you truly believe Waymo has figured out something close to general L4 with NN + heuristics, I believe NN alone will reach it too with a few orders of magnitude more data.

And I'm definitely not informed enough to determine what level of compute will be needed! That's why I have no investment in Tesla based on robotaxis. And if the market ever decides robotaxis are coming and prices it into the stock, I would certainly sell, as it's likely a longer tail than people will think at that time.
 
I think Tesla is solely using good-driver videos both to generate the world model and to output the controls. I haven't heard Ashok or Elon say anything contrary to this.
Phil Duan did say quality is important, but it sounds like diversity is also important, especially for the world model:

The other thing that really matters is data diversity. If you only have boring data, this thing will rarely work. This is also one of our biggest advantages where we can get all this data back from the fleet. This is just some examples of random Tesla customers driving. I'll just pause for a second; let you guys see what kind of crazy scenarios we can see here. A lot of the stuff you will rarely run into in real life, but once you aggregate over 4 million cars, weird stuff happens every single day. And what we do here is that we put this data back in the data engine directly, run the auto labeling pipeline, fit into the data set and model training.
Once you put all this stuff together, you'll be able to build a very large, highly-diversified, high-quality data set, and we use this to train the foundation model. In our opinion, this is the path forward to build a foundation model for autonomous driving as well as eventually the whole embodied AI.

The video clip of examples is the same one shared by Ashok Elluswamy and Tesla AI, so could it be that diversity is as important as quality, at least for training the world model before finetuning for control?
 
Phil Duan did say quality is important, but it sounds like diversity is also important, especially for the world model:

The whole problem is more or less getting enough data to represent everything that can happen in the ODD, i.e., the geo at all types of weather and all types of light conditions. :)

Unless you limit the ODD, it’s pretty much impossible to provide safety guarantees, imho.

With cameras only, I doubt it’s possible to get to autonomy in any meaningful ODD.
 
The whole problem is more or less getting enough data to represent everything that can happen in the ODD, i.e., the geo at all types of weather and all types of light conditions
Presumably not all combinations require comprehensive coverage in the training data if the networks are able to generalize from enough situation pairs. For example, driving on a straight road at night, or on a curved road in inclement weather, or turning at an intersection with heavy traffic might have many more examples than making a turn at night in snow, but ideally the relatively small number of good examples of a slower-than-usual approach for the snowy turn could be enough to teach increased caution in general wet/snowy/icy conditions.

If it turns out it's insufficient in certain condition combinations, might general end-to-end data collection of disengagements or shadow mode find these problematic situations that need additional training without expending resources on "easier" situations?

I suppose the trickier aspect is if there are certain conditions that require completely different driving behavior?
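For instance, a shadow-mode trigger could compare the model's planned controls against what the human actually did and upload only the clips where they diverge. A minimal sketch; every field name and threshold here is a made-up assumption:

```python
# Hypothetical shadow-mode trigger: flag clips where the model's planned
# control diverges sharply from the human driver's actual control.
from dataclasses import dataclass

@dataclass
class Sample:
    model_steer: float   # planned steering angle (rad)
    human_steer: float   # actual steering angle (rad)
    model_accel: float   # planned acceleration (m/s^2)
    human_accel: float   # actual acceleration (m/s^2)

def should_upload(s: Sample, steer_thresh=0.1, accel_thresh=1.5) -> bool:
    """Upload only clips with large disagreement, so training resources
    go to the hard condition combinations rather than the easy ones."""
    return (abs(s.model_steer - s.human_steer) > steer_thresh
            or abs(s.model_accel - s.human_accel) > accel_thresh)

# A snowy turn where the human braked much harder than the model planned:
print(should_upload(Sample(0.02, 0.03, 0.5, -2.0)))  # True -> mine this clip
```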
 
Presumably not all combinations require comprehensive coverage in the training data if the networks are able to generalize from enough situation pairs. For example, driving on a straight road at night or curved road with inclement weather or turning at an intersection with heavy traffic might have many more examples than making a turn at night with snow, but ideally the relatively small number of good examples of slower-than-usual approach for the snowy turn could be enough to learn about increased caution from general wet/snowy/icy conditions.
Sim helps a lot. You can simulate the same scene in artificially created weather.
If it turns out it's insufficient in certain condition combinations, might general end-to-end data collection of disengagements or shadow mode find these problematic situations that need additional training without expending resources on "easier" situations?

I suppose the trickier aspect is if there are certain conditions that require completely different driving behavior?
Yeah, there is a reason Waymo is avoiding places with black ice (but apparently they're working on it).
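On the weather-simulation point: a real simulator rerenders the whole scene, but even crude pixel-level augmentation gestures at the idea of multiplying condition coverage from one clip. A toy sketch, illustrative only:

```python
# Crude stand-ins for simulated weather: composite toward gray for fog,
# scale brightness for night. A real sim would rerender geometry/lighting.
import numpy as np

def add_fog(frame: np.ndarray, density: float = 0.5) -> np.ndarray:
    """Blend the frame toward a flat gray 'fog' color."""
    fog = np.full_like(frame, 200.0)
    return (1 - density) * frame + density * fog

def darken(frame: np.ndarray, factor: float = 0.3) -> np.ndarray:
    """Very rough night-time proxy: scale brightness down."""
    return frame * factor

frame = np.random.randint(0, 256, (32, 32, 3)).astype(np.float32)
foggy_night = darken(add_fog(frame, 0.4), 0.35)  # same scene, new "weather"
```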
 
It's semantics; the training of either approach is the same. You still have backprop through the modules; it's just that each piece has been pre-trained. That's not a bad thing: it assuredly allows your total model to converge to decent results more rapidly. And it can still allow the perception component to evolve over time; it doesn't have to be static.



Feedback from pedal & wheel control backward toward figuring out which component of the network needs updating is indeed a weak signal. But this is counterbalanced by the sheer volume of data available. If the labeled data shows 100k cases where the driver changes lanes when they see an object (and 0 where they run over the object), then the NN will learn to always move over for the object. Even if there are other reasons you might change lanes, which a few training samples wouldn't be sufficient to disambiguate, the volume wins out.

I don't think people are really understanding what is happening in machine learning with transformers and compute. Look at this post going around these days in the ML world, as well as this paper on scaling laws.

Intuition and domain knowledge eventually lose out to compute and data. These models just keep improving in a consistent, predictable fashion as long as you allow enough compute, data, and parameters in the model. The actual architecture / depth / width matters less!
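For reference, that consistent improvement is usually summarized with a parametric form like the one Hoffmann et al. (2022) fit for language models: L(N, D) = E + A/N^alpha + B/D^beta. The constants below are their published language-model fit, shown only to illustrate the shape of the curve, not as numbers for driving:

```python
# Chinchilla-style scaling law (Hoffmann et al., 2022): loss falls
# predictably as parameters N and training tokens D grow together.
def chinchilla_loss(N: float, D: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / N**alpha + B / D**beta

for n in (1e8, 1e9, 1e10):  # scale params and data together
    print(f"N={n:.0e}, D={20 * n:.0e}: loss={chinchilla_loss(n, 20 * n):.3f}")
```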

Self-driving cars at scale is one of the hardest problems, which is why it will need to be solved by ML models. There is no reason to think it will behave differently than other complex models. What wins? Compute and data. Clean data. The attention mechanism / transformer models have shown a strong ability to generalize from a data set and store nuanced understanding, given enough data.

Worried about navigation? Just literally feed a picture of the navigation screen into the training, like a human.

The main issues going forward I see are:

1) Cleaning the data. You are ingesting so much data, and you want to remove contradictory data and data from, say, distracted drivers (a toy filtering sketch follows this list).

2) Inference compute. All of what I said above does not imply a great model can run on the limited onboard compute resources.
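On point 1), a cleaning pass might look something like this toy filter; every field and threshold is an illustrative assumption, not anything Tesla has described:

```python
# Hypothetical filter: keep only smooth, completed demonstrations,
# dropping clips that suggest a distracted or contradictory driver.
from dataclasses import dataclass

@dataclass
class Clip:
    max_jerk: float               # m/s^3, harsh pedal spikes
    lane_wobble: float            # lateral deviation variance (m^2)
    ended_in_disengagement: bool  # demonstration was aborted

def is_clean(c: Clip) -> bool:
    return (c.max_jerk < 4.0 and c.lane_wobble < 0.05
            and not c.ended_in_disengagement)

clips = [Clip(2.1, 0.01, False), Clip(6.3, 0.20, False), Clip(1.5, 0.02, True)]
training_set = [c for c in clips if is_clean(c)]  # keeps only the first clip
```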
Openpilot does something like this for E2E navigation: they draw a map view with the intended driving path for the model, like this...

[Attached image: openpilot's rendered map view with the intended driving path]
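In the same spirit, here's a minimal sketch of rasterizing a route and stacking it with the camera input as an extra channel. It mirrors openpilot's idea only loosely; the helper names and shapes are my own assumptions:

```python
# Rasterize the intended route as a top-down mask and concatenate it with
# the camera frame, so navigation enters the net as just another image.
import numpy as np

def render_route(waypoints, size=64):
    """Draw route waypoints (x, y in [0, 1]) onto a single-channel canvas."""
    canvas = np.zeros((size, size), dtype=np.float32)
    for x, y in waypoints:
        canvas[int(y * (size - 1)), int(x * (size - 1))] = 1.0
    return canvas

camera = np.random.rand(3, 64, 64).astype(np.float32)       # stand-in frame
route = render_route([(0.5, 1.0), (0.5, 0.6), (0.6, 0.3)])  # upcoming turn
model_input = np.concatenate([camera, route[None]], axis=0)  # (4, 64, 64)
```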
 
There have been self-supervised training approaches that input and output the same modality, e.g., partial images -> full image prediction, that can train on a lot more unlabeled data, and then later the self-supervised representation can be used to train a relatively simple linear classifier to output a different modality, e.g., image -> text labels.

Do you think if Tesla is able to get a video -> video model to work consistently, an end-to-end video -> control might not be much further away?
Yes, that's exactly what I think: success at perception, even self-supervised, still leaves a large gap to the next stage of control.

They will get video -> video autoregressive prediction working effectively; that's within current knowledge and feasibility.
Video -> control at a sufficiently reliable level for L4 robotaxis is much further away. Video -> some kind of control is likely reasonably close (this will be what they deploy as L2 ADAS), but video -> control at robotaxi customer & regulatory acceptance level will be much further out.

And that last gap is because the video prediction task, where they have lots of inexpensive data, does not directly help.

Already, the current Tesla system, which makes lots of labels for perception, is not the problem most of the time. Policy, not perception, is the problem now. Giant video world models are redoing what is already working well enough. Only if these somehow enable major new generations of policy technology that were unavailable with the existing perception stack is this going to be a breakthrough.

Most of the time now the problem isn't that it sees the wrong thing, but that it does the wrong thing. Some of that is potentially perception errors, but mostly it's control and policy, and especially external mapping errors.

I'm not saying that Tesla is doing anything wrong, because other than the ill-considered words of one person, Tesla is exclusively working on L2+ ADAS, which fits within the budget and sensor bounds of the existing fleet.

Waymo tries to solve: "What do we need to do to make a robotaxi business that customers will pay for and regulators will accept?"

Tesla tries to solve: "What can we do to help sell cars but add minimal incremental hardware cost to what we do now?"

The problem is just Elon's hyping and fibbing, and any internal interference positioning or designing Tesla away from what they are actually doing: successful L2 driver assist. The other problem is the unwillingness (Elon's) to pay for maps good and clean enough for even reliably good L2, or to pay to correct them internally.

No other automaker is saying they can deliver, or even intend to deliver, L4 commercial-level autonomy on the hardware they are shipping now or intend to ship in mainstream cars.
 
I'm not saying that Tesla is doing anything wrong, because other than the ill-considered words of one person, Tesla is exclusively working on L2+ ADAS, which fits within the budget and sensor bounds of the existing fleet.
I get this, and I also get that Mr. Musk is a visionary who trusts his people and what they say. What is totally cool is that he takes the heat for slipped schedules and engineering misses, and still supports his design and engineering team. That's what made Tesla, SpaceX and Starlink the successes they are. Send-people-to-the-moon scoreboard: Apollo 6, Artemis 0. Musk is a hero to me even though I spent $10,000 for "FSD" in early 2020. Bravo. Pioneers usually have a lot of arrows in their backs. Too bad Tesla bet on California for their leading-edge tech manufacturing. Compare to California High Speed Rail, ha ha ha ha ha ha ha ha. Even Newsom killed it at one point... (moderator edit)
 
Ha ha, I guess the mods didn't like… (moderator note: if the moderators delete a post and you repost it, that’s a pretty easy way to get your account suspended)… I should have said that I am generally happy with what I have: L2 Driver Assist.
 
I get this, and I also get that Mr. Musk is a visionary who trusts his people and what they say. What is totally cool is that he takes the heat for slipped schedules and engineering misses, and still supports his design and engineering team. That's what made Tesla, SpaceX and Starlink the successes they are. Send-people-to-the-moon scoreboard: Apollo 6, Artemis 0. Musk is a hero to me even though I spent $10,000 for "FSD" in early 2020. Bravo. Pioneers usually have a lot of arrows in their backs. Too bad Tesla bet on California for their leading-edge tech manufacturing. Compare to California High Speed Rail, ha ha ha ha ha ha ha ha. Even Newsom killed it at one point... (moderator edit)
That's an oversimplification. Elon has in the past not listened to his engineering team, for example with the radar removal (which, per rumors, the engineers opposed but Elon insisted on).
 
I get this, and I also get that Mr. Musk is a visionary who trusts his people and what they say. What is totally cool is that he takes the heat for slipped schedules and engineering misses, and still supports his design and engineering team. That's what made Tesla, SpaceX and Starlink the successes they are. Send-people-to-the-moon scoreboard: Apollo 6, Artemis 0. Musk is a hero to me even though I spent $10,000 for "FSD" in early 2020. Bravo. Pioneers usually have a lot of arrows in their backs.
He is different now, and worse than he used to be, back when he counteracted major flaws with strengths. The flaws have amplified, and the strengths are no longer as valuable.

SpaceX is doing great and needs only his money, which has instead been blown on Twitter. Nobody is going to Mars now, because of an impulsive shitposting tic.

Too bad Tesla bet on California for their leading-edge tech manufacturing.
Why? The manufacturing which made them what they are? Where the factory was close to the engineering design? Where they had Palo Alto engineers from 2010 on, and not Detroit or Munich engineers? The ones who could write car software right? The ones who got the battery design right long before others did? (GM is still Ultium-ing away with pouch cells in modules.)

California vs. the Midwest or Japan made a huge difference in Tesla's success.

Compare to California High Speed Rail, ha ha ha ha ha ha ha ha. Even Newsom killed it at one point... (moderator edit)
Sure, rail costs way too much to build, but that's true everywhere in the USA.
 
Phil Duan did say quality is important, but it sounds like diversity is also important, especially for the world model:

The other thing that really matters is data diversity. If you only have boring data, this thing will rarely work. This is also one of our biggest advantages where we can get all this data back from the fleet. This is just some examples of random Tesla customers driving. I'll just pause for a second; let you guys see what kind of crazy scenarios we can see here. A lot of the stuff you will rarely run into in real life, but once you aggregate over 4 million cars, weird stuff happens every single day. And what we do here is that we put this data back in the data engine directly, run the auto labeling pipeline, fit into the data set and model training.
Notice 'auto labeling pipeline', which is not the pure vision transformer or pure end-to-end training from the control signal exclusively back to the cameras. An auto-labeling pipeline is more like the current Karpathy-era solution.


Once you put all this stuff together, you'll be able to build a very large, highly-diversified, high-quality data set, and we use this to train the foundation model. In our opinion, this is the path forward to build a foundation model for autonomous driving as well as eventually the whole embodied AI.
They will get diversity in video for perception; the problem is, as always, the control problem.
 
Yes, for control, quality is more important than diversity.

You may get debris or some other obstruction in your path in a million ways. But how you deal with it is limited. Not sure how end-to-end handles it ...
This is why I think they will still need the 300K lines of policy code somewhere. If not on board, they will be executing somewhere in the training simulation to produce simulated training data with an appropriately safe policy that meets external requirements.

Remember, there is a major difference between end-to-end nets and end-to-end net *training*. The central question is how the policy network is going to be trained, and against what ground-truth targets and what loss function. The set of observed human driving is likely *not* enough for when it matters.
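One concrete version of that: a hand-coded policy acts as the 'expert' inside the simulator, and its state-action pairs become the supervised targets for the policy net. A toy sketch, with made-up state variables and thresholds:

```python
# Rule-based 'expert' policy generating simulated training labels that
# encode an explicit safety rule (brake hard when time-to-gap is short).
import random

def coded_expert_policy(gap_m: float, speed_mps: float) -> float:
    """Returns target acceleration (m/s^2) from hand-written rules."""
    ttc = gap_m / max(speed_mps, 0.1)   # crude time-to-collision proxy
    if ttc < 2.0:
        return -3.0                     # strong braking: the safety rule
    if ttc < 4.0:
        return -1.0
    return 0.5                          # gentle accel toward desired speed

# Roll out random simulated states; the expert's outputs are the labels.
dataset = []
for _ in range(1000):
    gap, speed = random.uniform(5, 100), random.uniform(1, 30)
    dataset.append(((gap, speed), coded_expert_policy(gap, speed)))
# `dataset` can now train a policy net to imitate the coded rules at scale.
```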

We do not train human aircraft pilots by observing 1,000 easy commercial takeoffs and landings and then training a network to reproduce the throttle and stick behavior for those.

Aircraft pilots have a strong understanding of underlying physics-based phenomena and carry a physics model in their heads (which is what existing hard-coded car policy does, and what they're trying to get rid of), and they are explicitly tested on all sorts of rare corner cases which require sophisticated thought and understanding. I don't think any car system is designed yet to deal much with this, but aircraft pilots learn to safely deal with significant instrumentation failures and misleading data, as well as hardware failures which alter the dynamical response of the aircraft, so that within a couple of seconds they can fly in a way different from how they normally would, because they have an idea of what happened and how to counteract it.
 
the video prediction task, where they have lots of inexpensive data, does not directly help
You qualify the video -> video prediction task as mainly helping perception and not control policy, but would you agree that for video predictions to be accurate, the model probably has at least some general internalization of control-related concepts? For example, if you provided video leading up to a red light vs. a green light, the predicted subsequent video frames could reflect slowing down vs. maintaining speed, even though it was not explicitly trained to output controls. Expand that to many other video prediction situations, with or without a lead vehicle, stop signs, crossing traffic, etc., where it needs to predict video frames reflecting speed control.

I totally agree that good control will need dedicated training data, but potentially the amount needed could be significantly less, because the pre-trained world model already ends up with at least basic concepts of average control, as opposed to introducing a completely new idea trained from scratch.
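One cheap way to test that hypothesis is a linear probe: freeze the pretrained model and fit only a linear map from its latents to, say, ego speed one second ahead. A toy sketch; the 'world model' here is a random stand-in, not anyone's actual network:

```python
# Linear-probe test: if a frozen video model's latents linearly predict
# future ego speed, the world model already encodes control-relevant
# concepts (e.g., slowing for red lights) before any control finetuning.
import torch
import torch.nn as nn

world_model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 3 * 32 * 32, 256))
for p in world_model.parameters():
    p.requires_grad = False            # frozen: no control training signal

probe = nn.Linear(256, 1)              # predicts speed 1 s ahead
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

clips = torch.randn(64, 8, 3, 32, 32)  # stand-ins for red/green-light clips
future_speed = torch.randn(64, 1)
for _ in range(100):
    loss = nn.functional.mse_loss(probe(world_model(clips)), future_speed)
    opt.zero_grad(); loss.backward(); opt.step()
# Low probe error on held-out clips would suggest the representation
# already "knows" about slowing down, before a control head is trained.
```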
 
Yes, for control, quality is more important than diversity.

You may get debris or some other obstruction in your path in a million ways. But how you deal with it is limited. Not sure how end-to-end handles it ...
I don't think you get what they're doing. At some point in the not-too-distant future (or maybe right now), the program will understand what's happening the same way as (or better than) a human. Sounds like a fantasy - like so many other modern marvels - but that's what's happening. The bad news is that they need to solve AGI to get autonomy working. The good news is that they're solving AGI.
 
Notice 'auto labeling pipeline', which is not the pure vision transformer or pure end-to-end training from the control signal exclusively back to the cameras. An auto-labeling pipeline is more like the current Karpathy-era solution.
Yeah, I noticed that too and wondered if it's used for another fine-tuned output head adjacent to control. Explicit labeled training targets for lanes, objects, signals, etc. could potentially strengthen the world model's internal weights for these concepts and speed up learning, and these stronger signals could then be more useful for other downstream tasks like control policy.

One practical use of another head trained from labeled data is to produce visualizations. Another use of labeled data is to make it searchable if Tesla is looking for certain types of behaviors, e.g., examples of an adjacent green turn signal when you're the first vehicle at the stop line and the driver did not go.
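A rough sketch of what such a multi-head setup could look like; the head names, shapes, and uses are purely illustrative assumptions:

```python
# Shared trunk with a control head plus auxiliary heads trained on
# auto-labels; the auxiliary outputs double as visualization sources
# and as searchable tags over fleet clips.
import torch
import torch.nn as nn

class MultiHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.control = nn.Linear(dim, 2)   # steering, accel
        self.lanes = nn.Linear(dim, 4)     # e.g. four lane-boundary offsets
        self.signal = nn.Linear(dim, 3)    # red / yellow / green logits

    def forward(self, frame):
        z = self.trunk(frame)
        return self.control(z), self.lanes(z), self.signal(z)

model = MultiHead()
ctrl, lanes, signal = model(torch.randn(1, 3, 32, 32))
# The signal head's argmax can drive the on-screen visualization, and its
# logged outputs can index the fleet for "green light, first car didn't go."
```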
 
I don't think you get what they're doing. At some point in the not-too-distant future (or maybe right now), the program will understand what's happening the same way as (or better than) a human. Sounds like a fantasy - like so many other modern marvels - but that's what's happening. The bad news is that they need to solve AGI to get autonomy working. The good news is that they're solving AGI.

I see the “wishful thinking” stage is in full swing. :D