I think it's obvious that the abstraction of the world the brain builds internally is far richer than what FSD is doing. If you tried to reverse engineer the brain's internal representations while driving, they would be utterly incomprehensible, and they certainly wouldn't look anything like FSD's 3D world. That's because the abstractions you use to drive are learned, not explicitly chosen by Tesla engineers, so they aren't going to look pretty or fit into a bulleted list of objects. They probably aren't even in Cartesian space.
Tesla has dozens, maybe hundreds, of models that attempt to answer questions like: where are all the cars in X,Y,Z coordinates, and what are their corresponding width/height/depth (that's what a bounding box is!)? Where are the pedestrians in X,Y,Z? Where are the traffic cones, the signs, the traffic lights, the lane lines, the speed bumps, and so on? In other words, an explicit list of feature/object detectors. What they end up with is what you see rendered on the screen: a 3D scene of objects, like a video game. They then feed that into a quasi-learned/hand-coded C++ planner that produces the car's lateral and longitudinal control. Every release, Tesla keeps revving these models, "turning the data crank": they make the cone detector better, they make the velocity estimation better, they add more clips to the lane predictor, and so on. But ultimately, the list of tasks they've settled on is not learned; it was explicitly chosen by engineers. This is not how humans drive. As long as the feature detectors are explicitly defined by engineers, the system will never be as robust as a human. It might get pretty good, maybe even tolerable to use, but it will eventually hit some local maximum. And it's definitely not going to lead to a cooking-and-cleaning Tesla Bot.
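To make the shape of that pipeline concrete, here's a rough sketch of what a modular stack like that looks like. Everything here is invented for illustration: the detector names, the Box fields, and the toy planner rule are stand-ins, not Tesla's actual interfaces.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Box:
    # An object localized in Cartesian space: position plus width/height/depth.
    x: float
    y: float
    z: float
    w: float
    h: float
    d: float

def detect_cars(image) -> List[Box]:
    """Stand-in for one task-specific model; a real one runs a neural net."""
    return []

def detect_cones(image) -> List[Box]:
    return []

def detect_lane_lines(image) -> List[List[Tuple[float, float]]]:
    return []

def build_scene(image) -> dict:
    """The 'video game' scene: one entry per engineer-chosen object class."""
    return {
        "cars": detect_cars(image),
        "cones": detect_cones(image),
        "lane_lines": detect_lane_lines(image),
        # ... pedestrians, signs, traffic lights, speed bumps, etc.
    }

def plan(scene: dict) -> Tuple[float, float]:
    """Hand-coded/quasi-learned planner: explicit scene in, controls out."""
    steering = 0.0        # lateral control from lane geometry
    acceleration = 0.0    # longitudinal control from leading cars, lights
    if scene["cars"]:
        acceleration = -1.0  # e.g. a hand-written rule: slow down for obstacles
    return steering, acceleration

# One tick of the loop: camera frame -> explicit object list -> controls.
steering, acceleration = plan(build_scene(image=None))
```

The point isn't the details; it's that every key in that scene dictionary is a decision an engineer made about what the world consists of.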
It's just like the early days of computer vision, before deep learning: if you wanted to find a car in an image, you couldn't just train a model and ask it where the cars are; it wouldn't scale. You had to hand-select feature detectors for the parts of a car: a detector that looks for the shape of the wheels, maybe the tail lights, the geometry of the hood and trunk. Today, when we want a model that detects cars, we don't hand-choose features anymore; we train it on entire cars and then ask it where the cars are. If you peered into the network's layers, you might find it has automatically built detectors for tail lights and wheels, along with a lot of incomprehensible things: weird fragments of cars and edges and shapes. But it's far more robust than doing it by hand. In a way, systems like Tesla's FSD are repeating the same mistake of the past, just at a larger scale. They aren't asking the models, "where would you drive given this image?" They're asking for hand-selected features: where are the cars, the lanes, the signs, and so on. Ultimately I think that approach is doomed to be thrown out, and is only being built now to satisfy the business need to ship an intermediate product before Tesla tosses it for a more end-to-end approach that more closely imitates human driving behavior.
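By contrast, the end-to-end framing asks the network for driving outputs directly and lets it learn whatever intermediate features help. Here's a toy sketch in that spirit; the architecture, input size, and two-number output convention are all made up for illustration, and a real system would be trained on enormous amounts of human driving data.

```python
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    def __init__(self):
        super().__init__()
        # Whatever intermediate features help (wheels, tail lights, odd
        # fragments of edges and shapes) are learned, not enumerated by engineers.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Two outputs: steering and acceleration (lateral/longitudinal control).
        self.head = nn.Linear(32, 2)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(image))

model = EndToEndDriver()
frame = torch.zeros(1, 3, 224, 224)       # one camera frame, batch of 1
steering, acceleration = model(frame)[0]   # "where would you drive given this image?"
```

Here nothing in the code enumerates cars, cones, or lane lines; the question posed to the model is the driving question itself.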