Is it possible that it is just a delay in processing?
This was discussed in the ON keynote; you can see it does a good job of predicting object trajectories through occlusions, as well as across multicam views (versions before Tesla Vision even had cars blinking in and out when they straddled two camera views; you see none of that happening here). See the segment starting around the 18-minute mark for an example:
This point was also addressed when they first introduced Tesla Vision and talked about the RNN (Recurrent Neural Network). Before this, they used single frames for perception, and when an object got occluded it could disappear or end up with a very wrong position prediction (a toy sketch of this persistence idea follows the pictures below). See 1:09:
Blue is the video-based perception with persistence; orange is based on single frames. You can see that when the pickup truck is not blocking the two cars, the single-frame perception does OK (although even then, because the camera views were not synced, the orange results still show quite a bit of variation):
When the pickup truck blocks the two cars, they completely disappear from the single-frame output (while they still exist in the video-based one):
When partially occluded, the cars still hold their positions in the video-based output, while in the single-frame output they are all over the place and even the orientation is wrong:
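To make the persistence idea concrete, here is a minimal toy sketch of how carrying state across frames lets a track survive an occlusion instead of vanishing the way the single-frame results do. Everything in it (the PersistentTrack class, the constant-velocity coasting, the blend parameter) is my own illustration of the general idea, not Tesla's actual network:

```python
import numpy as np

class PersistentTrack:
    """Toy track that survives occlusions by coasting on its last velocity.

    Purely illustrative; not Tesla's actual RNN, just the persistence idea.
    """

    def __init__(self, position, velocity=(0.0, 0.0)):
        self.position = np.array(position, dtype=float)
        self.velocity = np.array(velocity, dtype=float)

    def step(self, detection=None, dt=0.1, blend=0.5):
        # Predict forward every frame, visible or not.
        predicted = self.position + self.velocity * dt
        if detection is None:
            # Occluded: coast on the prediction (single-frame perception
            # would simply lose the object here).
            self.position = predicted
        else:
            measured = np.array(detection, dtype=float)
            # Visible: refresh the velocity estimate and blend the prediction
            # with the new measurement.
            self.velocity = (measured - self.position) / dt
            self.position = blend * predicted + (1.0 - blend) * measured
        return self.position


# A car passes behind a truck: detections drop out for two frames,
# but the track keeps a plausible position instead of vanishing.
track = PersistentTrack(position=[0.0, 0.0], velocity=[1.0, 0.0])
for det in ([0.1, 0.0], [0.2, 0.0], None, None, [0.5, 0.0]):
    print(track.step(det, dt=0.1))
```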
Since my car is still on a non-Tesla Vision version, I see a lot of the single-frame artifacts in the visualization (cars blink in and out when straddling camera views, and when partially occluded I also see the weird orientation changes, with the car rotating like in the picture above).
The ON should further improve on the above, since it does general 3D recognition of occupancy, while the previous NNs were based on recognizing a specific object type, like a car or a person. From the above you can see they give a 2D bounding-box recognition and a top-down recognition, but nothing like the 3D model of the occupancy network (which even captures the rough shape of the object, and that shape comes essentially for "free").
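For intuition, here is a rough, hypothetical illustration of that difference: a per-class detector hands you a labeled box only for object types it was trained on, while an occupancy grid is just a dense voxel volume, so a rough 3D extent of any obstacle can be read straight off the occupied voxels. The grid dimensions and voxel_size_m below are made-up numbers, not Tesla's actual parameters:

```python
import numpy as np

# Hypothetical occupancy-grid output: a dense voxel volume around the car.
# Grid shape and voxel size are made-up numbers, not Tesla's parameters.
voxel_size_m = 0.4
grid = np.zeros((200, 200, 16), dtype=bool)   # x, y, z voxels

# Pretend a chunk of voxels came back occupied for some obstacle.
grid[110:118, 95:100, 0:5] = True

# A per-class detector would only give a labeled box for known classes;
# from an occupancy grid you can read a rough 3D extent of *any* obstacle
# directly off the occupied voxels.
occupied = np.argwhere(grid)
mins, maxs = occupied.min(axis=0), occupied.max(axis=0)
extent_m = (maxs - mins + 1) * voxel_size_m
print("rough obstacle extent (m):", extent_m)   # -> [3.2 2.  2. ]
```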
The ON also runs at a much higher rate (100 fps) and is very memory efficient (allowing Tesla to use longer persistence if they desire).
A Look at Tesla's Occupancy Networks
My guess is that when it switches to parking assist mode (the mode that shows a zoomed-in top-down view along with the USS pings and distance measurements), they can reduce the viewing distance of the ON (ignoring objects far away, which USS can't detect anyway given its 2-5 meter max range), which would let them boost persistence and resolution (using smaller voxels). This may come in a later version, however, if they want to get something working quickly first.
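As a back-of-the-envelope check on that guess (the voxel_count helper and all the numbers are my own illustration, not anything Tesla has published): the voxel count is roughly (2·range/voxel)² × (height/voxel), so pulling the range in lets them use much finer voxels at a similar memory budget.

```python
# Illustrative numbers only; nothing here comes from Tesla.
def voxel_count(range_m, voxel_m, height_m=4.0):
    side = int(2 * range_m / voxel_m)    # voxels along x and y
    layers = int(height_m / voxel_m)     # voxels along z
    return side * side * layers

print(voxel_count(range_m=80, voxel_m=0.4))   # long-range driving grid  -> 1,600,000
print(voxel_count(range_m=10, voxel_m=0.1))   # short-range parking grid -> 1,600,000
```

With these made-up numbers, the short-range parking grid gets 4x finer voxels for the same total voxel count, which is the kind of trade-off I'd expect them to exploit.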