I'm not an AI expert by any means, but based on my limited understanding, that's actually not encapsulated in #1, which was:
Obviously the simulations don't have enough data to cover all of the interesting edge cases that people are seeing, or else we wouldn't be seeing them. We don't know if they're even close; it could be five orders of magnitude too few.
More importantly, you can never assume that simulations generated by a GAN (generative adversarial network) will ever become a representative sample of real-world conditions. GANs, for folks who aren't familiar or need a refresher, generate new training data by imitating existing training data. In Tesla's case, this means creating new sequences of input video frames from multiple angles that could plausibly occur in the real world, using a large corpus of existing input video as examples of what the real world looks like.
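To make that concrete, here's a minimal sketch of the adversarial setup in PyTorch. Everything about it is a toy: the dimensions are made up, and a real video-generation model would be enormously larger, but the training dynamic is the same.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a real system would generate multi-camera video
# sequences, not 64-dim vectors. All dimensions here are made up.
LATENT, DATA = 16, 64

generator = nn.Sequential(nn.Linear(LATENT, 128), nn.ReLU(), nn.Linear(128, DATA))
discriminator = nn.Sequential(nn.Linear(DATA, 128), nn.ReLU(), nn.Linear(128, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch: torch.Tensor) -> None:
    n = real_batch.size(0)
    fake = generator(torch.randn(n, LATENT))

    # Discriminator: learn to tell real training samples from generated ones.
    d_opt.zero_grad()
    d_loss = (bce(discriminator(real_batch), torch.ones(n, 1))
              + bce(discriminator(fake.detach()), torch.zeros(n, 1)))
    d_loss.backward()
    d_opt.step()

    # Generator: learn to produce samples the discriminator accepts as real.
    g_opt.zero_grad()
    g_loss = bce(discriminator(fake), torch.ones(n, 1))
    g_loss.backward()
    g_opt.step()
```

The point that matters for this argument: the discriminator's notion of "real" comes entirely from the existing corpus, so the generator is rewarded only for staying inside that corpus's distribution. Nothing ever pushes it toward conditions the corpus doesn't contain.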
The problem is, you really can't assume that the underlying training data used to train the GAN is sufficiently diverse (or even within orders of magnitude of being sufficiently diverse). Thus, Tesla may never be able to find huge swaths of edge cases without more real-world training data, because GANs trained on the existing data will never rule those kinds of edge cases in as plausible.
And at that point, somebody has to tell the fleet to send back mountains of video clips with specific combinations of tags, e.g., pedestrians standing in a bike lane or whatever (yes, this example is a joke, but you get the point), and add them to the training sets for the GAN. For the problems they know about, simulation is a great way to iterate and improve that aspect of the model, but at some point it's like playing Whac-A-Mole: you're never going to find every possible unusual road condition that way, because realistically there are just far too many.
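Purely to illustrate the shape of such a fleet campaign (the tag names, the clip format, and the matching rule are all invented):

```python
# Hypothetical fleet data-collection trigger. Tag names, the frame
# structure, and the matching rule are invented for illustration.
WANTED_TAGS = {"pedestrian", "bike_lane"}

def should_upload(frame_tags: set[str]) -> bool:
    """Flag frames where the requested tag combination co-occurs."""
    return WANTED_TAGS.issubset(frame_tags)

def scan_clip(clip_frames: list[set[str]]) -> bool:
    # Upload the clip if any frame matches the campaign's tag combo.
    return any(should_upload(tags) for tags in clip_frames)
```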
Now here's where it gets interesting. If there are cases where a driving model is just marginally good enough, there's a decent chance it is marginal because there isn't much training data that covers those edge cases. If a model change negatively affects a large number of those marginal cases, and if they happen often enough, the average overall driving behavior could get worse even when the change massively improves some other problem that occurs less frequently than the sum total of all of those individually rare edge cases.
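To put toy numbers on that (every rate here is invented): suppose a release fixes a failure mode seen once per 10,000 miles but slightly regresses 500 distinct edge cases that each occur about once per 1,000,000 miles.

```python
# Toy numbers, purely illustrative; none of these rates are real.
fixed_failure_rate = 1 / 10_000      # failures/mile eliminated by the change
num_edge_cases = 500                 # distinct rare cases that got worse
edge_case_rate = 1 / 1_000_000       # per-mile frequency of each edge case
regression_fraction = 0.5            # share of those encounters that now fail

gain = fixed_failure_rate                                     # 1.0e-4 removed
loss = num_edge_cases * edge_case_rate * regression_fraction  # 2.5e-4 added

print(f"net change: {gain - loss:+.1e} failures/mile")  # negative: worse overall
```

The single fix looks great in isolation, but the long tail wins on sheer volume.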
Worse, because such edge cases are underrepresented in the training data, the simulation results won't necessarily tell you that the average behavior is going to get worse unless they're somehow compensating for that underrepresentation (and if they knew that those edge cases were underrepresented, they presumably wouldn't still be underrepresented in the training data, so that seems unlikely to actually be possible).
Thus, it is entirely possible for a release to seem better on average in simulated driving before the release and still be considerably worse on average in the real world, particularly if the vehicles chosen for the early rollouts are not a representative sample of the real world.
And streets in San Francisco, Palo Alto, Mountain View, Fremont, and other similar areas are likely massively overrepresented both in the rollouts (particularly in the early stages involving employees) and in captured data, simply because those areas are massively overrepresented in terms of the number of Teslas on the road. So unless the experiment design is quite deliberately nonrandom (massively biasing vehicle selection by geographical location in an effort to balance out the nonrandom geographical distribution of the vehicles themselves), we can probably safely say that neither the data that feeds into the simulations nor the early rollout vehicles are likely to be particularly representative samples of the real world.
So here's what I'm wondering: Why doesn't Tesla allow the MCU to upload a firmware supplement bundle to the FSD computer that adds a few new models that run in shadow mode for comparison purposes? If a fault occurs while running one of those models, ignore the fault, stop running the model, and report the failure. If they kept the models entirely in RAM to minimize flash wear, it seems likely to be mostly harmless.
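I'd picture something like the following, conceptually. To be clear, every name in it is hypothetical; it's a sketch of the fault-isolation idea, not anything Tesla actually ships.

```python
from typing import Any, Callable, Dict

# Hypothetical shadow-model runner. Models are plain callables held
# only in RAM (no flash writes). A crash in a shadow model never
# touches the prod path; the faulty model is dropped and reported.
ModelFn = Callable[[Any], Any]

def make_shadow_runner(prod_model: ModelFn,
                       shadow_models: Dict[str, ModelFn],
                       report: Callable[..., None]) -> ModelFn:
    def run_frame(sensor_input: Any) -> Any:
        prod_output = prod_model(sensor_input)  # always runs; drives the car

        for name in list(shadow_models):
            try:
                # Shadow output is recorded for comparison, never acted on.
                shadow_output = shadow_models[name](sensor_input)
                report("divergence", name=name,
                       prod=prod_output, shadow=shadow_output)
            except Exception as exc:
                # Ignore the fault, stop running the model, phone home.
                del shadow_models[name]
                report("shadow_model_fault", name=name, error=repr(exc))

        return prod_output
    return run_frame
```

The design point is that the prod path never waits on or trusts a shadow model; the worst a buggy shadow model can do is evict itself and generate a telemetry event.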
With that approach, Tesla could silently push bundles out to a large percentage of the fleet over Wi-Fi on a daily basis, giving them much more data about how each model change improves things or makes them worse. Assuming they have enough people to analyze the incoming telemetry, and to manually tag (or verify AI-based auto-tagging of) video captured whenever the driving decisions a shadow model would have made diverge too far from the decisions the actual prod model made (where feasible), they could potentially iterate on the models much more quickly, rather than waiting for a release push and hoping it actually makes things better.
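The "too different" test itself could start out embarrassingly simple, e.g., thresholding the distance between planned paths (the metric and the 0.5 m threshold here are invented placeholders):

```python
import math

# Hypothetical divergence test between prod and shadow planner outputs.
# Trajectories are lists of (x, y) waypoints; the metric and threshold
# are stand-ins for whatever would actually be used.
CAPTURE_THRESHOLD_M = 0.5  # flag clips where planned paths differ by > 0.5 m

def max_waypoint_distance(prod_path, shadow_path) -> float:
    return max(math.dist(p, s) for p, s in zip(prod_path, shadow_path))

def should_capture_clip(prod_path, shadow_path) -> bool:
    """Queue the surrounding video for upload and (auto-)tagging."""
    return max_waypoint_distance(prod_path, shadow_path) > CAPTURE_THRESHOLD_M
```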
Alternatively, why doesn't Tesla build NNs that are trained on their simulation data's metadata — things like how often certain combinations of tags occur in close proximity, how often particular tags move along certain vectors, etc. — and run those on every car in the fleet in an effort to identify road features, conditions, behaviors, etc. that are underrepresented in the simulation training data, and then capture more data to cover them? (I'm assuming they don't do this.)
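As a sketch of that idea, even a plain co-occurrence table over the simulation corpus's metadata would get part of the way there before anyone reached for an NN (the corpus format and tags are invented):

```python
from collections import Counter
from itertools import combinations

# Sketch of an underrepresentation detector. Count tag-pair frequencies
# in the simulation training data, then flag real-world scenes whose
# tag combinations are rare in that data.
def build_pair_counts(corpus_scenes: list[set[str]]) -> Counter:
    counts = Counter()
    for tags in corpus_scenes:
        counts.update(frozenset(p) for p in combinations(sorted(tags), 2))
    return counts

def rarity_score(scene_tags: set[str], counts: Counter) -> float:
    """Higher = this scene's tag combinations are scarcer in the corpus."""
    pairs = [frozenset(p) for p in combinations(sorted(scene_tags), 2)]
    if not pairs:
        return 0.0
    return sum(1.0 / (1 + counts[p]) for p in pairs) / len(pairs)

# A car would upload scenes scoring above some threshold, targeting data
# collection directly at the gaps in the simulation training data.
```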
Or both. My vote would be both.
Anyway, it seems to me that simulation is great, and absolutely critical as part of the QA process, but assuming the telemetry is good enough and there's enough spare horsepower to do it, daily-updated live A/B experiments at the model level seem like a better way to move fast and (pretend to) break things.