From Tesla's CVPR presentations about the general world model / vision foundation model, these seem to be trained generically rather than for specific control behaviors, so they can use the enormous amount of video available to Tesla without an extra preprocessing step to select which videos to use. The example of predicting future video from past video seems like a relatively straightforward self-supervised "pre-training" step for developing a general world model.
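To make that concrete, here's a minimal sketch of what that kind of self-supervised pre-training can look like: predict the next frame from the previous few, so the "labels" come for free from the video stream itself. The tiny conv net, the clip shapes, and the random tensors standing in for real video are all invented for illustration; nothing here is Tesla's actual architecture.

    import torch
    import torch.nn as nn

    class NextFramePredictor(nn.Module):
        """Toy model: stack the past frames on the channel axis and regress the next RGB frame."""
        def __init__(self, context_frames=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3 * context_frames, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, 3, 3, padding=1),
            )

        def forward(self, context):                   # context: (batch, time, 3, H, W)
            b, t, c, h, w = context.shape
            return self.net(context.reshape(b, t * c, h, w))

    model = NextFramePredictor()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)

    for step in range(3):                             # a few toy steps
        clip = torch.rand(8, 5, 3, 64, 64)            # stand-in for a batch of raw dashcam clips
        context, target = clip[:, :4], clip[:, 4]     # past frames -> next frame: no human labels needed
        loss = nn.functional.mse_loss(model(context), target)
        opt.zero_grad(); loss.backward(); opt.step()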
True, but it's a clever step in a direction that doesn't obviously lend itself to making an autonomous vehicle, particularly at SAE Level 3 and above. It might prove useful, but the gap from that to a product is wide.
I believe it's also the case for language models that the pre-training step uses multiple orders of magnitude more data to train the foundation model than the fine-tuning step uses to bias the responses (or, in this case, the control).
Again true, but the task to be solved by the eventual chatbot (generating novel, relevant token sequences) is still nearly the same as the task solved in training the foundation model for an LLM. The fine-tuning is just that: it slightly adjusts the probability distributions, but the task remains the same.
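A toy numerical illustration of that point: the "pre-training" and the "fine-tuning" below use the exact same next-token objective, the second just sees a tiny amount of extra data, so it only nudges the learned distribution. The 4-word vocabulary, the corpus sizes, and the single-logit-vector "model" are all made up for illustration.

    import torch
    import torch.nn.functional as F

    vocab = ["go", "stop", "turn", "yield"]
    logits = torch.zeros(len(vocab), requires_grad=True)   # stand-in for a trained model's output head

    def next_token_training(targets, steps, lr=0.05):
        opt = torch.optim.SGD([logits], lr=lr)
        tgt = torch.tensor(targets)
        for _ in range(steps):
            loss = F.cross_entropy(logits.unsqueeze(0).expand(len(tgt), -1), tgt)
            opt.zero_grad(); loss.backward(); opt.step()

    # "Pre-training": a large corpus with token frequencies [0.4, 0.2, 0.2, 0.2].
    next_token_training(targets=[0, 0, 1, 2, 3] * 1000, steps=500)
    print(F.softmax(logits, dim=0))   # roughly matches the corpus frequencies

    # "Fine-tuning": a handful of curated examples favoring "stop".
    next_token_training(targets=[1] * 8, steps=20)
    print(F.softmax(logits, dim=0))   # same objective; the distribution is only nudged toward "stop"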
Training a video world model is still a long way from directed autonomous driving toward a human-given goal. Synthesizing video is interesting for making synthetic movies, like a TikTok generator algorithm. But you still haven't solved the hard part: driving toward a directed goal while being very safe against the extremely unlikely tail of the event distribution.
Look at what commercial pilots have to do. Their training is heavy on edge-case dangerous situations that are, one hopes, quite different from ordinary everyday flying.
Here's an example of the difficulty of training an end-to-end policy from observed data. Suppose you did the same for aircraft: an ML model that watched 10,000 commercial flights and synthesized everything that needed to be done. It would get really good at pushback from the gate, safety demonstrations, typical autopilot routes, and good-weather landings.
How many dangerous situations would it see? Say there are maybe 5 go-arounds because of a potential runway obstruction in the training set. The problem is that the ML model has no idea that the actual danger was a "runway obstruction". It only had a few examples, and ML models are notorious for latching onto irrelevant parts of the huge bitstream of data and learning spurious correlations, because all they have to do is solve the training examples. Maybe the go-arounds all happened at one or two airports that had lights and signs in a certain place, or it was always a particular cargo or military base where it happened.
The ML system might never learn that the trigger is any runway obstruction, or what a collision even is, something intuitive to humans, so it will happily slam into an obstruction at a different airport that doesn't have the correlates it picked up on.
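Here's a toy version of that go-around story, just to show the mechanism. The dataset is entirely invented: 200 landings, 5 go-arounds, all of them at one unusual airport; feature 0 is "obstruction visible to the sensors" (noisy, only caught in 3 of the 5 events) and feature 1 is "this is airport A" (its distinctive lights and signage).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.zeros((200, 2), dtype=int)      # columns: [obstruction_visible, is_airport_A]
    y = np.zeros(200, dtype=int)           # 1 = go-around, 0 = normal landing
    y[:5] = 1                              # the 5 go-arounds in the training data
    X[:5, 1] = 1                           # every one of them happened at airport A
    X[:3, 0] = 1                           # the obstruction itself was only clearly visible in 3

    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(clf.score(X, y))                 # 1.0: the training examples are "solved"
    print(clf.feature_importances_)        # all of the importance lands on the airport feature
    print(clf.predict([[1, 0]]))           # obstruction at a *different* airport -> [0], i.e. land anyway

The model gets a perfect score on its training data while never learning that the obstruction is what matters, which is exactly the failure described above.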
A human pilot, on the other hand, understands that concept without ever needing a training example, trains for it in simulation, and understands the dynamics of what might happen. Trained pilots have a strong mental model of aircraft aerodynamics and control systems, and with practice they intuitively understand and predict the aircraft's reactions; for example, they know that performance degrades and margins must grow on a hot day at a higher-altitude airport, something the ML system might never have seen correlated with a dangerous example in its training set. (And this is why the recent Boeing MCAS fiasco was so bad: they silently inserted something that violated the pilots' mental model of the aircraft's dynamics.)