But where has anyone shown that these additional systems can make up for the deficiencies (if any) in car vision? We KNOW, at least in theory, that vision+acceleration+touch is adequate for driving, since humans do it every day. On what do we base the assertion that vision+lidar+radar (or some other combination) can substitute for this? There have been MANY people shouting that we need radar/lidar/porridge etc. for a car to self-drive, but based on what evidence? Other than the "common sense" that more sensors are better ("common sense" being rarely common, and only infrequently sense, in my experience).
Camera vision has real deficiencies: it can misclassify objects, and it works less reliably in conditions like heavy rain, dense fog or total darkness. In fact, we see the same deficiencies in human vision: many human-caused accidents happen in rain, fog or darkness, precisely where visibility is reduced. Lidar and radar make up for these deficiencies. If your camera vision misclassifies an object, lidar will still detect the object's presence and let the car avoid a collision. Lidar can also classify objects from point-cloud shape, so you get more reliable classification with vision+lidar than with vision alone. Lidar also works reliably in total darkness, where vision degrades. HD radar works reliably in dense fog and heavy rain, where vision degrades. So yes, vision+lidar+radar will absolutely be more reliable than vision-only.
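To make the redundancy argument concrete, here's a toy late-fusion sketch. It is entirely my own illustration, not any real AV stack: the `Detection` fields and the simple "OR for presence, highest confidence for class" rule are assumptions, and real systems fuse with learned models at the feature level.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    present: bool          # did this sensor register an object at all?
    label: Optional[str]   # classification, if this sensor provides one
    confidence: float      # 0.0 to 1.0

def fuse(camera: Detection, lidar: Detection, radar: Detection) -> dict:
    """Toy late fusion: ANY sensor detecting an object is enough to treat it
    as an obstacle; classification falls back to whichever classifying
    sensor is most confident. Radar here contributes presence only."""
    obstacle = camera.present or lidar.present or radar.present
    classifiers = [d for d in (camera, lidar) if d.present and d.label]
    label = max(classifiers, key=lambda d: d.confidence).label if classifiers else "unknown"
    return {"obstacle": obstacle, "label": label}

# Dense fog: the camera misses entirely, but radar still reports a return
# and lidar still classifies from point-cloud shape.
print(fuse(camera=Detection(False, None, 0.0),
           lidar=Detection(True, "vehicle", 0.7),
           radar=Detection(True, None, 0.9)))
# -> {'obstacle': True, 'label': 'vehicle'}
```

The point of the OR is exactly the argument above: a camera miss or misclassification only becomes a collision if lidar and radar fail at the same time.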
I know one argument Waymo has given for why they use vision+lidar+radar is that the prediction and planning stacks rely on the perception data to work. If prediction and planning get incomplete or bad perception data, they will make more mistakes. By using vision+lidar+radar, Waymo wants to give their AV the best, most complete perception data possible, to give prediction and planning the best chance of making the right decisions. With vision-only, your prediction and planning stacks are entirely dependent on the vision data; if it is not good enough, they are handicapped.
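A sketch of why that serial dependency matters (again my own toy, with a made-up recall number): prediction and planning only ever see what perception emits, so an object dropped at the perception stage can never be planned around, no matter how good the planner is.

```python
import random

def perceive(scene, recall=0.97):
    """Stand-in for a detector: each real object is reported with
    probability `recall` (an invented number, purely for illustration)."""
    return [obj for obj in scene if random.random() < recall]

def plan(detected):
    # Planning can only react to objects perception actually reported.
    return "brake" if "pedestrian" in detected else "proceed"

scene = ["car_ahead", "pedestrian"]
outcomes = [plan(perceive(scene)) for _ in range(10_000)]
print(f"planned 'proceed' past the pedestrian {outcomes.count('proceed')} times in 10,000")
# With 97% recall, planning fails ~300 times out of 10,000,
# even though the planner's own logic is perfect.
```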
I think the debate between vision-only and sensor fusion is basically a debate about the march of 9's. Nobody denies that vision-only can drive a car. The question is whether vision-only can drive with 99.99999% reliability, and whether we can get there in a timely manner. Remember that most driving is relatively easy; it's the last bits that are hard. As you alluded to, the proponents of vision-only are counting on the theory that vision-only should be adequate. But what if vision-only is not quite good enough, solves something like 99% of FSD, and then gets stuck? The proponents of vision+lidar+radar don't want to take that chance; they want the best shot at 99.99999% reliability. And how many 9's are "good enough"? For consumer cars, 99% FSD might be perfectly adequate, since a driver is still there to take over. For robotaxis, with nobody behind the wheel, 99% is not good enough. So for consumer cars, vision-only makes a lot of sense IMO. For driverless robotaxis, vision-only is a non-starter IMO.
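Here's the 9's gap in plain numbers. Treating each reliability figure as the probability of covering one mile without a safety-critical error is my own framing, just to make the scale concrete:

```python
def miles_between_errors(p_clean_mile: float) -> float:
    """If p_clean_mile is the chance of driving one mile without a
    safety-critical error, errors arrive every 1 / (1 - p) miles
    on average (geometric distribution)."""
    return 1.0 / (1.0 - p_clean_mile)

for nines, p in [(2, 0.99), (4, 0.9999), (7, 0.9999999)]:
    print(f"{nines} nines -> one error every {miles_between_errors(p):>12,.0f} miles")
# 2 nines -> one error every          100 miles
# 4 nines -> one error every       10,000 miles
# 7 nines -> one error every   10,000,000 miles
```

Each extra 9 is another 10x, so "solves 99% and gets stuck" is a five-orders-of-magnitude problem, not a polish problem.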
I see two possibilities:
1) Vision+lidar+radar is safer than vision-only but stays too costly and remains limited to geofenced areas. Vision-only works everywhere and is 1.5x safer than humans. All things considered, vision-only is deemed "safe enough", so it wins.
2) Vision+lidar+radar reaches 50x safer than humans and the costs come down enough. Vision+lidar+radar wins out because society prefers 50x safer to just 1.5x safer. (Rough arithmetic on that gap below.)
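To put scenario 2's gap in perspective (the baseline crash rate and the fleet mileage are invented round numbers, purely for illustration):

```python
FLEET_MILES = 1_000_000_000    # e.g., one year of a large fleet (illustrative)
HUMAN_RATE = 2.0 / 1_000_000   # ~2 crashes per million miles (made-up baseline)

for name, safety_factor in [("human baseline", 1.0),
                            ("vision-only, 1.5x safer", 1.5),
                            ("vision+lidar+radar, 50x safer", 50.0)]:
    crashes = FLEET_MILES * HUMAN_RATE / safety_factor
    print(f"{name:31s} ~{crashes:,.0f} crashes")
# human baseline                  ~2,000 crashes
# vision-only, 1.5x safer         ~1,333 crashes
# vision+lidar+radar, 50x safer   ~40 crashes
```

At 1.5x you avoid about a third of the crashes; at 50x you avoid 98% of them, which is the kind of gap regulators and the public are likely to notice.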