It knows. Read up on the subject!
It works in current software, don't you think? This is how EAP and FSD work.
It does not "know". Advanced systems can sort of infer relative motion given the context of the global scene, but this is always less reliable than direct distance measurement from lidar or simultaneous stereo. You can't simply create information from scratch; you can only guess at it. Your guesses might be very good in many cases, but having the information directly is both more reliable and less computationally intensive.
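To make the stereo point concrete, here's a minimal sketch of the geometry that makes simultaneous stereo a direct measurement rather than a guess (the focal length, baseline, and disparity numbers are made up for illustration):

```python
# Minimal sketch: depth from simultaneous stereo.
# With two cameras a known baseline apart, depth falls out of simple
# geometry -- no learned guessing required. Numbers are illustrative.

def stereo_depth(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Depth (m) from stereo disparity: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# Example: 1000 px focal length, 30 cm baseline, 10 px measured disparity
print(stereo_depth(10.0, 1000.0, 0.30))  # 30.0 m
```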
As for how the current software works, I can see very well that it has great difficulty estimating the exact 3D position of neighboring cars. They're trying to do this and not doing nearly well enough. For example, a lot of the TACC brake checks I've experienced in V9 happen when a car slightly ahead of you in a neighboring lane is suddenly determined to be in your lane because the system gets its 3D position wrong -- some combination of an inaccurate bounding box and an inaccurate distance guess. I suspect their distance guesses are a combination of apparent size and temporal stereo from consecutive frames, probably just from throwing everything into the NN and letting it guess from whatever information was most useful in the training data. That means it will take both size and inter-frame disparity into account, plus all sorts of context you can't even put your finger on and which may actually be unhelpful (e.g., lighting conditions or the color of the car), because that's how deep learning works.
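To illustrate why apparent-size distance guesses are fragile, here's a toy monocular estimate (the numbers are assumptions, and the NN does nothing this explicit, but the same information limit applies):

```python
# Toy monocular range estimate from apparent size: Z ~ f * W / w_px.
# The catch: you must assume the car's true width W. Guess W wrong by
# 10% and the range is wrong by 10% -- and a wrong range shifts the
# car's estimated 3D position, possibly into your lane.

def mono_range(width_px: float, focal_px: float, assumed_width_m: float) -> float:
    return focal_px * assumed_width_m / width_px

focal = 1000.0   # illustrative focal length in pixels
width_px = 60.0  # apparent width of the car in the image
print(mono_range(width_px, focal, 1.8))  # 30.0 m if it's a 1.8 m sedan
print(mono_range(width_px, focal, 2.0))  # 33.3 m if it's a 2.0 m SUV
```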
Note that both 2D bounding boxes and depth information are always guesses in a system like this. Even if they get to pixel-level labeling with masks instead of bounding boxes, it's still a guess and will sometimes be wrong. (Note that masks would also require a lot more compute power -- presumably a lot of HW3's extra power will be used for this rather than for increasing frame rate.)
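A rough back-of-the-envelope on why masks cost more than boxes (the per-object mask resolution is an assumption for illustration):

```python
# Rough output-size comparison per detected object (illustrative sizes):
#   bounding box: 4 numbers (x, y, width, height)
#   pixel mask:   one label per pixel -- even at a modest 128x128
#                 per-object resolution, that's 16,384 values, and the
#                 network has to do per-pixel work to produce them.
bbox_values = 4
mask_values = 128 * 128
print(mask_values // bbox_values)  # 4096x more output per object
```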
With more time, and particularly with more computing power, I think their guesses will keep getting better, but direct measurement of 3D extents from lidar or stereo would clearly be superior. Tesla has handicapped themselves by not including any kind of practical direct rangefinding. (And obviously ultrasonic is a joke at highway speeds, even if you call it "sonar" and brag about its 360-degree coverage, so that does not help them much.)
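On the ultrasonic point, the arithmetic is brutal (the range figure is my assumption, roughly what automotive parking sensors manage):

```python
# Why ultrasonic is useless at highway speed: assume ~5 m effective
# range (typical for automotive parking sensors) and a 30 m/s (~67 mph)
# closing speed. The warning you get before contact:
sensor_range_m = 5.0      # assumed effective ultrasonic range
closing_speed_mps = 30.0  # ~67 mph closing speed
print(sensor_range_m / closing_speed_mps)  # ~0.17 s of warning
```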
Even better is multiple sensing modalities acting together, like camera + stereo + lidar + radar (+ massively more compute power), the way the big boys do it. The failure modes of one modality are balanced by the others. Tesla does not care to do this for real, though; they want to sell sexy cars to consumers right now, and a better sensor suite would kill their chances of doing that in any kind of volume. (Less sexy + more expensive = business fail in the short term.)
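As a sketch of why fused modalities beat any single one, here's the textbook inverse-variance fusion of two independent range estimates (the noise figures are assumptions, not real sensor specs):

```python
# Minimal sensor-fusion sketch: inverse-variance weighted average of two
# independent range measurements. The fused variance is always lower
# than either input's -- that's the whole point of redundant modalities.

def fuse(z1: float, var1: float, z2: float, var2: float):
    w1, w2 = 1.0 / var1, 1.0 / var2
    z = (w1 * z1 + w2 * z2) / (w1 + w2)
    var = 1.0 / (w1 + w2)
    return z, var

# lidar: 30.2 m, 0.05 m^2 variance; camera: 29.0 m, 4.0 m^2 variance
print(fuse(30.2, 0.05, 29.0, 4.0))  # ~(30.19 m, 0.049 m^2)
```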
Edit: Note also on the subject of inter-frame disparity (temporal stereo) -- the most important objects to estimate distance to very accurately are neighboring cars, which are generally moving very close to your own speed, so the inter-frame disparity is basically just noise. (In my normal commute conditions, adjacent lanes are often moving at nearly identical speeds. Even when there are speed differences, the relative speed is very small compared to your own speed, which gives temporal stereo a really hard time.) They really only have apparent size to go on in this case, and I think that's why TACC in V9 brake checks a lot more frequently -- you can see it in the dancing cars on the display.
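To put numbers on that (all figures are assumptions for illustration): a car in the adjacent lane at 20 m, closing at just 1 m/s, viewed by a camera at 36 fps with a 1000 px focal length:

```python
# Toy temporal-stereo numbers for a neighboring car (all assumed):
# adjacent lane (3 m lateral offset), 20 m ahead, closing at 1 m/s,
# camera at 36 fps with a 1000 px focal length.
# Image position of a point at lateral offset x and range Z: u = f * x / Z.

f_px, x_m, z_m = 1000.0, 3.0, 20.0
closing_mps, fps = 1.0, 36.0

dz = closing_mps / fps  # range change between consecutive frames
du = f_px * x_m * (1.0 / (z_m - dz) - 1.0 / z_m)
print(du)  # ~0.21 px per frame -- sub-pixel, i.e. buried in noise
```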