You qualify the video -> video prediction task as mainly helping perception and not the control policy, but would you agree that for video predictions to be accurate, the model probably has to internalize at least some general control-related concepts? For example, given video leading up to a red light vs. a green light, the predicted subsequent frames would have to reflect slowing down vs. maintaining speed, even though the model was never explicitly trained to output controls. Expand that to many other video prediction situations where there's a lead vehicle or not, stop signs, crossing traffic, etc., where it needs to predict frames that reflect speed control.
I mean sure, you could have the video -> video net take in throttle and steering inputs and it will predict where the car is going to go---that's learning a physics network of the vehicle dynamics. This wasn't really a big problem before, though, as the ground-truth physics model came from regular control-systems theory, and yes, accurately learned physics is at least as good. Rediscovering Newton is cool but not the problem.
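
To make that concrete, here's a minimal sketch (PyTorch, with made-up module names and sizes) of what such a learned dynamics/"physics" network amounts to: it takes an encoding of the recent frames plus throttle/steering and predicts the next latent state.

```python
# Minimal sketch of a learned dynamics ("physics") model: given a latent
# encoding of recent frames plus throttle/steering, predict the next latent
# state. All names and sizes here are hypothetical.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, latent_dim=256, control_dim=2, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + control_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, u_t):
        # z_t: latent state from a (frozen) video encoder, u_t: [throttle, steering]
        return self.net(torch.cat([z_t, u_t], dim=-1))

# Training target: the encoder's latent for the *next* frame, so the model
# effectively rediscovers vehicle dynamics from data.
dyn = LatentDynamics()
z_t, u_t, z_next = torch.randn(8, 256), torch.randn(8, 2), torch.randn(8, 256)
loss = nn.functional.mse_loss(dyn(z_t, u_t), z_next)
loss.backward()
```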

The issue remains with policy: where *should* the car go?


I totally agree that good control will need dedicated training data, but the amount needed could be significantly less, because the pre-trained world model already ends up with at least basic concepts of average control, as opposed to introducing a completely new idea trained from scratch.
 
The central question is how is the policy network going to be trained, and against what ground truth targets and what loss function?
Presumably, when fine-tuning the policy network, the training data includes the video, navigation route, and other context, and the objective is to minimize the error from the human's controls. The video is forwarded through the world model to generate an internal understanding of the environment, which acts as additional context/input for deciding controls.

Are you suggesting this would likely not work for even relatively basic tasks or are you raising concerns for complex and rare situations?

For example, do you think something like this could learn that it should switch left into a faster lane when it's far enough from the upcoming highway exit, and similarly switch right to prepare to exit when it's closer? I suppose one concern is that many humans are happy to follow behind slower traffic and not bother with the faster lane?
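
For what it's worth, here's a rough behavior-cloning sketch of the fine-tuning I'm describing, with hypothetical names and shapes: world-model features plus route context go into a small policy head trained to match the logged human controls.

```python
# Hypothetical behavior-cloning setup: a pre-trained world model supplies
# features, and a small policy head is trained to match recorded human
# controls. Names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn

class PolicyHead(nn.Module):
    def __init__(self, feat_dim=512, route_dim=32, control_dim=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + route_dim, 256),
            nn.ReLU(),
            nn.Linear(256, control_dim),   # e.g. [steering, acceleration]
        )

    def forward(self, world_feat, route_feat):
        return self.mlp(torch.cat([world_feat, route_feat], dim=-1))

policy = PolicyHead()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

# One imitation step: world_feat would come from the (frozen) world model run
# on the clip's video; human_controls are the logged driver inputs.
world_feat = torch.randn(16, 512)
route_feat = torch.randn(16, 32)
human_controls = torch.randn(16, 2)

pred = policy(world_feat, route_feat)
loss = nn.functional.mse_loss(pred, human_controls)   # supervision = human driving
opt.zero_grad()
loss.backward()
opt.step()
```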
 
Presumably, when fine-tuning the policy network, the training data includes the video, navigation route, and other context, and the objective is to minimize the error from the human's controls. The video is forwarded through the world model to generate an internal understanding of the environment, which acts as additional context/input for deciding controls.

Are you suggesting this would likely not work for even relatively basic tasks or are you raising concerns for complex and rare situations?
That is true end-to-end training, where the error (supervision signal) is the difference between ego and the human-chosen path & velocity, and that loss is backpropagated all the way back to perception. That's really hard, because the bandwidth of the supervision signal (the human path) is really low compared to video. They could add a secondary task of video -> next-frame video prediction in order to build good internal perceptual models, but in the end the net should only use capacity (the quantity of weights devoted to a task) for the task it actually needs to do: drive, not predict video. Predicting video is fine for an offline generative simulator that makes representative movies.
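
If they do keep video prediction around, the usual way to express that trade-off is a weighted multi-task loss, something like this sketch (the weight lambda_aux is a hypothetical knob for how much capacity the auxiliary task is allowed to claim):

```python
# Rough sketch of the trade-off above: low-bandwidth control supervision
# augmented by an auxiliary next-frame prediction loss, combined with a
# hypothetical weighting factor lambda_aux.
import torch
import torch.nn.functional as F

def total_loss(pred_controls, human_controls, pred_next_frame, next_frame,
               lambda_aux=0.1):
    control_loss = F.mse_loss(pred_controls, human_controls)   # the real task: drive
    video_loss = F.mse_loss(pred_next_frame, next_frame)       # auxiliary perception task
    return control_loss + lambda_aux * video_loss
```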

I think it would work for basic L2 ADAS after a while. It would be throwing out most of what they have done for the last 10 years and starting afresh, but if they do it well it could make for a good autopilot that feels more natural, since that's the weak spot now. It could be a good product direction for an L2.

But it might be much, much harder to guarantee controllability and safety in the other cases needed to get to L4. It will fail in weird ways and they won't really have any idea how to fix it. They could try adding more examples of those failure cases to the dataset and retraining, but that could make other things work worse in unpredictable ways. Some people will see performance decline on their routes with new versions and complain, while others see improvements. And then it fluctuates back again. (In my own work, not cars but much simpler ML training, merely retraining nets with different random seeds for initialization and data shuffling can produce quite different networks even with the same dataset. The AUC of a classifier might be about the same, but the ordering of the scores within each class can change more than one would hope depending on the seed.)
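
To illustrate the seed effect I mean, here's a toy check (scikit-learn + scipy, not my actual setup): two classifiers that differ only in their random seed can land at nearly the same AUC while ranking individual examples noticeably differently.

```python
# Toy seed-sensitivity check: retrain the same small net with two different
# random seeds, compare AUC (usually similar) and the per-example score
# ordering (often less similar than you'd hope).
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = []
for seed in (1, 2):
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                        random_state=seed).fit(X_tr, y_tr)
    s = clf.predict_proba(X_te)[:, 1]
    scores.append(s)
    print(f"seed {seed}: AUC = {roc_auc_score(y_te, s):.3f}")

# Similar AUC does not imply the per-example rankings agree.
rho, _ = spearmanr(scores[0], scores[1])
print(f"Spearman rank correlation between seeds: {rho:.3f}")
```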

Because with end-to-end it's all an incomprehensible grey neural goo. Our own brains are similarly incomprehensible at the neuron level, but we have enough natural intelligence that people can literally be told the rules and generally understand and obey them.

Hypothetically, with a really huge net and a really huge training dataset of absolutely beautiful data, most of it from highly realistic simulations covering many rare cases and showing correct behavior, you might get there. But even then, I think Mobileye and probably Waymo run multiple simultaneous algorithms over different sensor sets, with radar & lidar safety-backup constraints that are probably not neural but transparent, physics-coded, and that override the baseline control when necessary.

For example, do you think something like this could learn that it should switch left into a faster lane when it's far enough from the upcoming highway exit, and similarly switch right to prepare to exit when it's closer? I suppose one concern is that many humans are happy to follow behind slower traffic and not bother with the faster lane?

If there are varying examples in the dataset, it will do both behaviors at different times for unknown reasons. We don't know if it will learn enough of the context that human drivers know (a human might think "this specific exit is difficult to merge back into at 6pm on weekdays because I've seen it many times before"), i.e., whether something is a special case or a general pattern it should follow.

It would *feel* like quick progress, as human driving used as supervision will file off the rough edges in the current policy, and what people experience day to day, which resembles the bulk of the training data (benign driving situations), could feel better than the current policy systems.

But progressing beyond that in a demonstrable way---and not Elon's gut experience which has governed it so far---is very hard.

Consider again the comparison to skilled human commercial pilots. They don't train on 1,000 ordinary takeoffs and landings.
 
Yeah, I noticed that too and wondered if it's used for another fine-tuned output head adjacent to control. Explicit labeled training targets for lanes, objects, signals, etc. could boost the world model's internal weights for these concepts and speed up learning, and these stronger signals could then be more useful for other downstream tasks like the control policy.

One practical use of another head trained from labeled data is to produce visualizations. Another use of labeled data is to make the fleet data searchable when Tesla is looking for certain types of behaviors, e.g., examples of an adjacent green turn signal when you're the first vehicle at the stop line and the driver did not go.

I think the most likely scenario is that Ashok is hyping something up to Elon as revolutionary and amazing when in truth it's yet another evolutionary development.
 
Based on Teslascope comments I think there is a good chance that V12 is going out to employees now as part of holiday update.

We find out tomorrow.
My prediction: what Ashok is hyping as end-to-end in V12 isn't end-to-end training. Instead it uses evolutions of the existing perception with existing autolabels for intermediate objects, plus a neural-network distillation of the existing optimization- and rule-based policy planner. Now the planner lives back on the training servers, where it can have a higher computational budget (if that was limiting it before) to generate the training labels for the policy network. Maybe some very cleaned-up human driving clips can be added too.
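
If that guess is right, the training loop would be plain distillation, roughly like this sketch (legacy_planner and all shapes are stand-ins I made up): the server-side planner produces target trajectories, and a compact policy net learns to imitate them.

```python
# Minimal distillation sketch: an offline, expensive rule-based/optimization
# planner produces target trajectories on the training servers, and a compact
# student policy net is trained to imitate them. Everything here is a
# hypothetical stand-in.
import torch
import torch.nn as nn

def legacy_planner(scene_feat):
    # Stand-in for the server-side optimization/rule-based planner; in the
    # speculated setup its output becomes the training label.
    return torch.tanh(scene_feat[:, :20])   # e.g. a 10-waypoint (x, y) plan

student = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 20))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

scene_feat = torch.randn(32, 512)                    # perception features for a clip
teacher_plan = legacy_planner(scene_feat).detach()   # expensive, but computed offline
loss = nn.functional.mse_loss(student(scene_feat), teacher_plan)
opt.zero_grad()
loss.backward()
opt.step()
```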

It's technically 'nets all the way through', but not in the grand, general control->photon sense during training.

I'm not saying this is a bad strategy; it's probably correct for their current position. Remember to interpret Elon's words as fuzzy generalities spun into hype.

With time they can move video->video predictors into the perception side, but still with some of the existing autolabeled intermediate objects used in policy.
 
Based on Teslascope comments I think there is a good chance that V12 is going out to employees now as part of holiday update.

We find out tomorrow.
I know you are super-bullish about everything Tesla - still ....

Anyway, I personally hope they send V12 to everyone only when it meets the benchmark they used for V11. They made sure there was hardly any regression in V11 compared to before the "single stack".

In other words, the gate should be quality and not calendar.
 
I know you are super-bullish about everything Tesla - still ....

Anyway, I personally hope they send V12 to everyone only when it meets the benchmark they used for V11. They made sure there was hardly any regression in V11 compared to before the "single stack".

In other words, the gate should be quality and not calendar.
Agreed. What is the rush, other than to have something new to complain about? Test it, refine it, and release it when it's ready, not when we want it.
 
These 2023 Holiday Update Park Assist visualizations (at a supercharger?) are much more detailed than FSD Beta 11.x occupancy network visualizations. I wonder if this is a preview of FSD 12.x visualizations?

v12 park assist.png
 
You can't see a practical use for the park assist?
Yeah, this is a much better representation of the raw occupancy network (ON) output than converting the data into virtual ultrasonic signals to reuse the existing squiggly-line UI.

This is much closer to the 3D 360 parking view that many people have been calling for (and other cars have).

From the looks of it, they may have used the suggestion I gave a long time ago (lower the draw distance and increase the resolution when in parking mode). Even though it made sense to just reuse the FSD ON on the first go, that would be a very logical optimization for Park Assist: when parking, you only care about your immediate area, not about things far away that you would never come close to hitting. The remaining concern is to improve how the car handles the front blind spot.
 
As displayed it's not that useful, but hopefully there is a lot more to it, and it will warn you (without hazardous latency, like there is in the rear camera feed) before you hit something (most importantly), so you don't have to look at it!
We'll see. Of course, a working, fast (and most importantly, available) Autopark would be good too.
 
Yeah, this is a much better representation of the raw occupancy network (ON) output than converting the data into virtual ultrasonic signals to reuse the existing squiggly-line UI.
I think I'd want my car to be rendered translucent so I don't have to interact with the screen to see everything, but I do like the proximity coloring. I can't wait to see what it does in my driveway with bushes, light poles, the garage door, and such.
 
lower the draw distance and increase the resolution when in parking mode
Ashok Elluswamy answered a question about 3D voxel sizes at CVPR: "Even for between the robot and the car, you can configure different sizes as they have different needs… Optionally, it can be made queryable as in you can get arbitrary precision…"

optimus occupancy.jpg


Over a year ago at AI Day 2022, Tesla showed Optimus with ~2" voxels, and the new Park Assist visualization seems to use roughly the same 2" size as well. It'll be interesting to see whether Tesla is using a single model that supports multiple voxel sizes, and whether multiple sizes can be used simultaneously to focus on close and far objects, e.g., when squeezing past vehicles as you prepare for an unprotected turn.
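
One way "queryable at arbitrary precision" could work in practice is an implicit occupancy function that you evaluate at arbitrary 3D points, so the same model can be rendered into coarse or fine voxel grids on demand. A toy sketch (the tiny network and the grid sizes are purely illustrative):

```python
# Toy version of "queryable at arbitrary precision": an implicit occupancy
# function evaluated at arbitrary 3D points, so the same model can be sampled
# into coarse (far-field) or fine (~2 in, near-field) voxel grids on demand.
import torch
import torch.nn as nn

occupancy_fn = nn.Sequential(     # stand-in for a learned implicit occupancy head
    nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
)

def voxelize(extent_m=8.0, voxel_m=0.20):
    # Sample the implicit function on a regular grid at the requested resolution.
    n = int(2 * extent_m / voxel_m)
    axis = torch.linspace(-extent_m, extent_m, n)
    xyz = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    with torch.no_grad():
        return occupancy_fn(xyz.reshape(-1, 3)).reshape(n, n, n)

coarse = voxelize(extent_m=8.0, voxel_m=0.20)   # far field, ~8 in voxels
fine = voxelize(extent_m=2.0, voxel_m=0.05)     # near field, ~2 in voxels
print(coarse.shape, fine.shape)
```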

I suppose Park Assist is currently used primarily when FSD Beta isn't active, so this resolution switching is potentially something Tesla engineers are coding/deciding ahead of time based on heuristics. But potentially even relatively soon, end-to-end will need to adjust resolution dynamically to handle parking lots for Smart Summon and "parking anywhere."