You're assuming the occupancy network is trained like 2D segmentation, when we know that's not the case. It's trained on multiple, overlapping views with a defined time dimension. If what the network is ultimately detecting is 3D volume, it doesn't need a similar representation of an object in its training data to recognise that a 3D volume presents itself in a particular way across bi/trinocular views over multiple successive frames.
E.g. a garbage bag blowing across the street will be perceived by the occupancy network as a 3D volume because the parallax across the multiple camera views tells it those pixels are closer to the vehicle than the background, and because that parallax changes as the bag itself moves and as the vehicle's cameras move toward it.
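To make the parallax point concrete, here's a minimal sketch of the textbook stereo relation (depth = focal length × baseline / disparity). This is just the basic geometry, not Tesla's actual network, and the focal length, baseline, and pixel shifts below are made-up numbers for illustration:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Classic stereo relation: depth = f * B / d.

    disparity_px : horizontal pixel shift of the same point between
                   two overlapping camera views (the parallax)
    focal_px     : camera focal length in pixels (hypothetical value)
    baseline_m   : distance between the two camera centres in metres
    """
    disparity_px = np.asarray(disparity_px, dtype=float)
    return focal_px * baseline_m / np.maximum(disparity_px, 1e-6)

# A point on the bag shifts 25 px between overlapping views while the
# background behind it shifts only 3 px, so the bag's pixels must be
# much closer to the vehicle than the background.
print(depth_from_disparity(25.0, focal_px=1000.0, baseline_m=0.3))  # ~12 m
print(depth_from_disparity(3.0,  focal_px=1000.0, baseline_m=0.3))  # ~100 m
```

Track how those disparities change frame to frame and you also get the bag's motion relative to the vehicle, which is the time dimension mentioned above.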
I didn't say 'representation of the object'. I said 'similar representation of what it sees'. I will explain later (at work).
Got this tweet as a promotion from Mobileye today; I thought it was a really interesting point of comparison:
You can really see the limitations of traditional image-space segmentation, as highlighted by Ashok's CVPR presentation. They seem to be doing the vehicle and driveable space segmentation in individual 2D views.
You can clearly see:
1. The pixel density at the horizon causing a lot of noise in the distant parts of the VIDAR representation (see the sketch after this list)
2. The VIDAR representation exhibiting some fisheye warping, presumably due to the stitching together of individual 2D viewpoints
3. Driveable space not persisting behind vehicles and other obstacles
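On point 1, a back-of-the-envelope flat-ground (inverse perspective mapping) projection shows why per-view image-space segmentation gets so noisy near the horizon: a single pixel row covers centimetres of road close to the car but tens of metres near the horizon. This is a hypothetical sketch with made-up focal length and camera height, not Mobileye's actual pipeline:

```python
import numpy as np

def ground_distance(row_px, horizon_px, focal_px, cam_height_m):
    """Flat-ground back-projection: distance = f * h / (row - horizon).

    row_px       : image row of the segmented pixel (below the horizon)
    horizon_px   : image row of the horizon line
    focal_px     : focal length in pixels (hypothetical value)
    cam_height_m : camera height above the road in metres (hypothetical)
    """
    dv = np.asarray(row_px, dtype=float) - horizon_px
    return focal_px * cam_height_m / np.maximum(dv, 1e-6)

# How much road a one-pixel step covers, near the car vs near the horizon.
near = (ground_distance(900, 500, focal_px=1000.0, cam_height_m=1.5) -
        ground_distance(901, 500, focal_px=1000.0, cam_height_m=1.5))
far  = (ground_distance(505, 500, focal_px=1000.0, cam_height_m=1.5) -
        ground_distance(506, 500, focal_px=1000.0, cam_height_m=1.5))
print(near)  # ~0.01 m per pixel near the bottom of the image
print(far)   # ~50 m per pixel just below the horizon
```

So any per-pixel segmentation error near the horizon blows up into large range errors once it's projected into a top-down view, which would explain the noise in the distant parts of the VIDAR output.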
In addition to what I said earlier: in the context of this topic, it's worth pointing out that this is nowhere close to the resolution and recall you get with modern lidar.