How do you know from raw sensor data what "the driving situation" is at any time?
Raw sensor data would be handled by a perception network that cranks out objects. This mirrors the way we ourselves identify the objects significant to any task. That then becomes the basic "driving situation". Higher-order semantics about the "driving situation" are then determined by other networks. The perception network might actually be multi-level, first identifying lines, circles, and shapes, then using a second network to spot wheels, bicycles, cars, legs, arms, and so on. There may be other decompositions that would provide greater mileage, especially if the same networks are supposed to support something like Optimus.
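A minimal sketch of that two-level decomposition, with stubs standing in for the trained networks (the primitive detector, the composition rule, and all thresholds here are hypothetical, chosen only to make the idea concrete):

```python
from dataclasses import dataclass

@dataclass
class Primitive:
    kind: str   # "line", "circle", ... -- output of the first-level network
    x: float
    y: float

def detect_primitives(frame):
    # Stand-in for the first-level network: here we just read pre-tagged
    # tuples; a real system would run a detector over actual pixels.
    return [Primitive(kind, x, y) for (kind, x, y) in frame]

def compose_objects(primitives):
    # Stand-in for the second-level network: two nearby circles
    # ("wheels") are taken to suggest a two-wheeled vehicle.
    circles = [p for p in primitives if p.kind == "circle"]
    objects = []
    for i, a in enumerate(circles):
        for b in circles[i + 1:]:
            if abs(a.x - b.x) < 5.0 and abs(a.y - b.y) < 1.0:
                objects.append(("two-wheeler", (a.x + b.x) / 2))
    return objects

# Two wheel-like circles plus a stray line: the pipeline reports one object.
frame = [("circle", 0.0, 0.0), ("circle", 3.0, 0.5), ("line", 10.0, 2.0)]
print(compose_objects(detect_primitives(frame)))  # [('two-wheeler', 1.5)]
```

The point of the layering is that each stage's output ("circle", "two-wheeler") is a named, inspectable semantic, so a later network consumes labels rather than pixels.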
Is the "driving situation" something human labelled?
Yes. That's the way we understand it, so that's the way we would teach it to a learning automaton. In doing so, we can hope to instrument the system to see what it is doing and why, because the semantic transitions between the networks are exposed and familiar.
How do you get that? What happens if there is more than one at the same time, say an unprotected left with a child in the crosswalk while someone swerves into your lane?
That would be the highest semantic level and would likely be learned entirely by training, because there are no semantics we have developed other than "Do the right thing". That training would be based on high-level notions such as "executing unprotected left", "child", and "crosswalk", as opposed to "Here are some pixels". If somebody looking hard at this found higher-level semantics that could be trained and used by other networks, then so much the better.
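To make that concrete: the top level consumes named facts, not pixels. Here is a toy sketch where a hand-coded priority table stands in for the trained policy; the fact names, priorities, and actions are all invented for illustration, not a proposed design:

```python
# Hypothetical hazard ranking -- a trained network would learn these
# weightings (and much subtler interactions) rather than hard-code them.
HAZARD_PRIORITY = {
    "child_in_crosswalk": 3,
    "vehicle_swerving_into_lane": 2,
    "unprotected_left": 1,
}

ACTION = {
    "child_in_crosswalk": "stop",
    "vehicle_swerving_into_lane": "evade",
    "unprotected_left": "yield",
}

def decide(situation_facts):
    # Act on the most urgent fact present; a learned policy would
    # weigh all facts jointly instead of picking a single worst one.
    if not situation_facts:
        return "proceed"
    worst = max(situation_facts, key=lambda f: HAZARD_PRIORITY.get(f, 0))
    return ACTION.get(worst, "proceed")

# The compound case from the question: the child dominates.
print(decide({"unprotected_left", "child_in_crosswalk"}))  # stop
```

Because the inputs are exposed semantic labels, you can read off exactly which fact drove the decision, which is the instrumentation benefit claimed above.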
I prefer this sort of controlled and structured approach because something like an LLM just screams hubris. "We don't understand everything about it, but it does amazing stuff, so we're gonna ship it." I have the same problem with modern software engineers using platforms, packages, virtual machines and so on; they don't know what it is that they're creating, but it does what they want, so they sell it. Later they find out that it does other things that they didn't want.