A vision system can make statistical depth estimates from the known sizes of commonly observed objects, supplemented by motion-based cues. But that only works for common things: the rare tail cases are the ones that matter, and they violate the statistical assumptions. Alternatively, it's spending a huge computational budget to get at something that could be obtained far more economically, freeing that compute for other important tasks.
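To make the size-prior point concrete, here's a minimal sketch using the pinhole camera model (all numbers illustrative): depth is inferred as Z = f·H/h from an assumed real-world height H, so a tail-case object whose size violates the prior gets a proportionally wrong distance.

```python
# Monocular depth from an assumed object size (pinhole camera model).
# Z = f * H_real / h_pixels -- a statistical size prior, not a measurement.
def depth_from_size_prior(focal_px, assumed_height_m, bbox_height_px):
    return focal_px * assumed_height_m / bbox_height_px

f = 1400.0   # focal length in pixels (illustrative)
h_px = 70.0  # detected bounding-box height in pixels

# Typical sedan roofline ~1.5 m tall: prior holds, estimate is reasonable.
print(depth_from_size_prior(f, 1.5, h_px))  # 30.0 m

# Tail case: an unusually low ~1.0 m object at the same pixel height --
# the same prior overestimates its distance by 50%.
print(depth_from_size_prior(f, 1.0, h_px))  # 20.0 m
```

The same 70-pixel detection maps to two very different true distances depending on whether the size assumption holds, which is exactly why rare objects break the statistics.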
Humans use stereoscopic vision as well as subtle accommodation (autofocus) cues; a one-eyed friend told me the latter is how he perceives depth. Neither is available on a Tesla. Subaru, though, has been doing stereo without lidar or radar for decades, and its system is generally regarded as performing well. With an artificial system you can space the cameras more widely than human eyes, giving superhuman performance, i.e. useful parallax out to longer distances.
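The baseline argument follows directly from the stereo depth relation Z = f·B/d (focal length f in pixels, baseline B in meters, disparity d in pixels). A quick sketch with illustrative numbers shows how a car-width baseline extends the usable range well past the human one:

```python
# Stereo parallax: disparity d = f * B / Z. The depth at which disparity
# drops below the measurement floor scales linearly with the baseline B.
def disparity_px(focal_px, baseline_m, depth_m):
    return focal_px * baseline_m / depth_m

f = 1400.0  # illustrative focal length in pixels

# Human-like ~6.5 cm baseline: at 100 m the disparity is sub-pixel.
print(disparity_px(f, 0.065, 100.0))  # 0.91 px -- barely measurable

# Car-width ~1.2 m baseline: same depth gives a robust signal.
print(disparity_px(f, 1.2, 100.0))    # 16.8 px
```

Roughly an 18x wider baseline buys 18x the range at any given disparity threshold, which is the "superhuman parallax" point.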
Subaru doesn't have Tesla's machine-learning capability, so its strong performance probably comes from having the right sensor hardware. With better ML on top of those sensors, Tesla could do better still. Tesla is excellent versus competitors at lane keeping (steering control), which is an ML-first problem, but performs less well at longitudinal speed control, where physical distances, velocities, and classification of obstructions are primary.
I agree with not using lidar on a consumer-purchased vehicle: it's too expensive and fragile, and draws too much power for adequate performance. But I think the lack of stereo (as well as the lack of full all-around camera coverage) is a significant deficiency, and fixing it would improve performance at low cost.
If every car had direct physical distance measurement at shorter ranges, that would also provide an enormous potential dataset for auto-labeling: rewind time and you know exactly which objects were at what distance some seconds earlier. Tesla does run some cars instrumented with radar or lidar, probably to generate training sets, but not across the full fleet.
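A hedged sketch of that rewind-and-label idea (hypothetical function names; assumes straight-line ego motion at known speed for simplicity, where a real pipeline would use full odometry): once an object's distance is measured directly at close range, earlier logged frames can be stamped with ground-truth distances for training the long-range vision model.

```python
# Auto-labeling sketch: when an object comes close enough for a direct
# distance measurement, rewind the log and attach a ground-truth range to
# every earlier camera-only frame of the same object.
def backfill_distance_labels(measured_dist_m, ego_speed_mps, frame_dt_s, n_frames):
    # Frame 0 is "now" (the moment of direct measurement); earlier frames
    # were farther from the object by the distance the car has since driven.
    return [measured_dist_m + ego_speed_mps * frame_dt_s * i
            for i in range(n_frames)]

# Measured 10 m now, driving 20 m/s, logging at 10 fps: label 5 frames
# going back in time.
labels = backfill_distance_labels(10.0, 20.0, 0.1, 5)
print(labels)  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

The appeal is that every close encounter yields free long-range training labels, with no human annotation, which is why even a partially instrumented fleet is valuable.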