I'm very confident that V12 is not what Douma thinks it is in that video.
I'm not saying V12 is a monolithic neural network, but it is vastly different from V11 in that it gets rid of human semantics and heuristics in all parts of the architecture.
Likewise, V12 makes no explicit use of human concepts like lanes or stop signs.
That will be a major problem when trying to connect it to maps and the desired routing information. Humans are explicitly taught what a lane is, what a stop sign means, and all the other semantic content that safe driving is built on.
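For what it's worth, one known way to wire routing into an end-to-end net (conditional imitation learning; not anything Tesla has confirmed for V12) is to feed the planner's route command in as a conditioning input rather than as a semantic label. A minimal sketch, with every name and shape an assumption:

```python
# Hypothetical sketch of route conditioning for an e2e driving net.
# Nothing here is Tesla's actual code; names and shapes are assumed.
import torch
import torch.nn as nn

class RouteConditionedPolicy(nn.Module):
    def __init__(self, num_route_commands: int = 4, feat_dim: int = 256):
        super().__init__()
        # Vision backbone: raw pixels in, feature vector out.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # The route planner's output ("go straight", "turn left", ...)
        # enters only as a learned embedding, not as a lane/sign label.
        self.route_embed = nn.Embedding(num_route_commands, feat_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 2),  # steering, acceleration
        )

    def forward(self, image: torch.Tensor, route_cmd: torch.Tensor):
        feats = self.backbone(image)          # (B, feat_dim)
        cond = self.route_embed(route_cmd)    # (B, feat_dim)
        return self.head(torch.cat([feats, cond], dim=-1))
```

The point of the sketch: the map/route interface shrinks down to an opaque conditioning vector, so there's no place to bolt on the explicit lane and sign semantics V11 had.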
Going to V12 helps get rid of labelers, but it's going to make engineering the correct policy that much more difficult. A human driving instructor gives explicit feedback, in words, to students who already understand those concepts. In new situations, humans really do reason explicitly about stop signs, lanes, and signals, not with the intuitive gut feeling that a purely observationally trained perception-and-policy grey-goo stack would rely on.
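To make the "purely observational" point concrete: in behavior cloning, the only supervision is what the human driver actually did. Reusing the hypothetical policy above, a minimal loss looks like this; note that no lane, sign, or light label appears anywhere in it:

```python
# Minimal behavior-cloning sketch (assumed setup, not Tesla's pipeline).
# Supervision is just the logged human controls for each frame.
import torch
import torch.nn.functional as F

def bc_loss(policy, batch):
    # batch["image"]: raw camera frames
    # batch["route_cmd"]: route command indices
    # batch["controls"]: logged human steering/acceleration
    pred = policy(batch["image"], batch["route_cmd"])
    return F.mse_loss(pred, batch["controls"])
```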
Otherwise you're going to get the equivalent of a clever dog or chimpanzee that watches its humans drive but doesn't fully understand the requirements. It could do a decent job of mimicking behavior near its training set, but it would be an entirely inappropriate driver for a robotaxi.
Maybe they have some new trick, but switching to Yet Another Totally New Architecture means practical robodriving is many years away.
V11 was totally dependent on autolabeled and manually labeled human concepts like these.
You can't feed V12 "dirty" inputs like the V11 BEV and autolabeled world representation and expect good output. End-to-end training thrives on pure, raw data, and you have to massage the architecture and the data to get the outputs you want.
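As an illustration of what "massaging the data" tends to mean in e2e imitation: curation replaces labeling, so you filter which demonstrations the net sees rather than annotating them. A hypothetical filter, with all field names and the threshold assumed:

```python
# Illustrative only: keep smooth, intervention-free clips for training.
JERK_LIMIT = 2.0  # m/s^3, an assumed comfort threshold

def curate(clips: list[dict]) -> list[dict]:
    return [
        clip for clip in clips
        if not clip["had_intervention"] and clip["max_jerk"] < JERK_LIMIT
    ]
```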
They could still train ML mappings from V12's opaque internal representations back to the previously labeled concepts for visualization. The difference is that the labeled data isn't used in the primary loss function for training the net, so the visualization will be less reflective of the elements the net actually uses for decision-making. That will make it harder to craft human feedback when ChatGPT-style "read everything" training is insufficient.
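A hedged sketch of that visualization idea: train a separate probe that decodes the e2e net's internal features back into the legacy labels purely for display. The stop-gradient is the key detail: the labeled data never touches the driving loss, which is exactly why the picture can drift from what the policy actually uses. All names and shapes are assumptions:

```python
# Hypothetical probe that maps frozen e2e features to legacy-style
# labels (lanes, signs) for visualization only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LaneProbe(nn.Module):
    def __init__(self, feat_dim: int = 256, num_classes: int = 8):
        super().__init__()
        self.decode = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor):
        # Stop-gradient: probe training never updates the driving net,
        # so the labels never enter the primary loss.
        return self.decode(feats.detach())

def probe_loss(probe, feats, legacy_labels):
    return F.cross_entropy(probe(feats), legacy_labels)
```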
LLMs already have a problem producing plausible but wrong answers. The equivalent here would be confidently driving into the wrong lane, or driving correctly but with the wrong turn signal, or doing the wrong thing when a human directing traffic overrides the lights.