Hi, I'm a new member here. I don't have a Tesla vehicle (I drive a BMW i3) but may get one soon. I work professionally in machine learning, though not on vision or robotics; my work is more classical prediction.
I think a significant problem with the vision-only approach is likely to be the absence of stereo cameras. People are praising Subaru's EyeSight (which seems to work great without radar, but with stereo). I looked into its history: it's been in development since 2008, with multiple revisions. Apparently there was a singular Japanese researcher with some great ideas back at its inception. As of 2018 this person has a new startup with an improved algorithm and chipset.
He discusses the problems with various approaches, which I think are relevant here. He supported stereo cameras over mono plus radar. (Mono cameras alone were never even considered by anybody!!) I think it's safe to say he's an expert.
Mono vision relies on neural networks using illumination and shape cues to detect objects, so it is limited to detecting known, pre-trained examples and will be confused by less clear ones. This could cause 'phantom braking': the net assigns a non-trivial probability of danger to a weakly classified object, and because of risk mitigation (a low-probability but highly dangerous outcome still has to be taken into account), you get phantom braking until the object is close enough to be classified as benign or as not an object at all.
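To make the risk-mitigation argument concrete, here's a toy expected-cost sketch (entirely my own illustration, not anyone's actual planner logic): even a weakly classified object can trigger braking when the cost of a miss is high enough.

```python
def should_brake(p_hazard: float, severity: float, brake_cost: float = 1.0) -> bool:
    """Brake if the expected cost of ignoring the object exceeds
    the (small, fixed) cost of an unnecessary brake event.
    All numbers are illustrative."""
    expected_cost = p_hazard * severity
    return expected_cost > brake_cost

# A distant blob the net is only 3% sure is a pedestrian:
print(should_brake(p_hazard=0.03, severity=100.0))  # True -> phantom brake
# The same 3%, once the object is close enough to look like a plastic bag:
print(should_brake(p_hazard=0.03, severity=0.1))    # False
```

The point is that the asymmetry of the costs, not a bug in the net, is what drives the braking.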
From the article: ITD Lab's Saneyoshi maintains that stereo-vision processing is ideal for detecting generic objects without training. In contrast, monocular vision can fall short of detecting objects when it encounters something it has not been trained on, he noted. "For example, Volvo's self-driving technology reportedly struggled to identify kangaroos in the road," he said. Kangaroos' movements in mid-jump confused the mono camera's vision processing.
On the other hand, safety could be better with stereo, since it gives a direct physical measurement of an obstruction even before it can be classified as to 'what it is'. The article gives the example of the fatal Uber accident: in a nutshell, a mono image-recognition neural net detected the crossing bicycle 1.29 seconds in advance (too late), whereas his new chipset on stereo would have done so 2.23 seconds in advance.
The chipsets appear to do classical, non-neural signal processing for direct distance calculation, which is then fed into the later phases of the driver-assistance stack.
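That direct distance calculation is just triangulation on the disparity between the two images. A minimal sketch of the standard pinhole-stereo formula (the focal length and baseline below are illustrative numbers, not ITD Lab's or Subaru's actual specs):

```python
def stereo_depth_m(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Classic pinhole stereo: depth Z = f * B / d, where d is the
    horizontal pixel offset of the same feature between the left
    and right images. No training or classification required."""
    if disparity_px <= 0:
        raise ValueError("feature must have positive disparity")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: 1000 px focal length, 35 cm camera baseline.
# A feature matched with 5 px of disparity is 70 m away:
print(stereo_depth_m(5.0, focal_px=1000.0, baseline_m=0.35))  # 70.0
```

Note the trade-off this formula exposes: depth resolution degrades quadratically with range, which is why baseline and pixel resolution both matter so much for long-distance detection.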
Remember that the existing mono camera sets were designed with the assumption that they would be supplemented by radar. Elon is out of his depth on machine learning and bullshitting here. If he were going by "first principles" and analogizing to human performance, the cameras would be stereo, well behind the windscreen (for better weather resistance), gimballed, and much higher resolution. (The existing cameras are 1280x960, not very high at all when dealing with far-off objects moving at high speed.)
The performance of the Subaru indicates that stereo cameras are a valid and successful approach, though personally I would want stereo vision plus high-resolution 77 GHz imaging radar. I think it's fine to ignore lidar.
I think Tesla has a very good neural-network perception stack, with the great advantage of being able to push updates out to the fleet and gather data, but it's being made to do too much work because of the camera limitations. (Their route-planning code isn't great as far as I can tell, but that matters more for FSD than for straight AP, and it hasn't yet been turned into a fundamentally machine-learning solution.)
If they were to deploy some thousands of cars with good stereo cameras and collect data for a retrain, I bet their solution would be excellent and the power of fleet-wide machine learning would shine.