So I have some of the relevant expertise (a BS in Computer Science) and have had some involvement in the area, although more conceptual than technical.
I agree with Musk that it is a software problem, and think that a heavy focus on hardware and sensor fusion is misplaced. The accepted wisdom seems to be that the more sensors you throw at the problem, the better the outcome; I don't believe that is entirely true.
Going back to the "human" example: a regular human with a reasonable IQ and regular senses (2 eyes, 2 ears, and the ability to extend the visual field by moving the head around) can drive safely in most situations, and is able to adapt to new situations pretty quickly with minimal instruction.
But it is possible for a human to drive a car perfectly well with much less than that. Just one eye will do. And that one eye could even be colour-blind without a significant reduction in the driver's ability to drive and adapt to new situations.
The trillion-dollar question is: how do humans do it?
Well, we know that there are a few skills that are employed, and a few tricks in the way that we deploy them. There is a good article on Visual Expert (link) describing what is going on when a driver reaction is needed, but in summary (a rough code sketch of this pipeline follows the list):
Sensation - there is a shape in the road
Perception/recognition - the shape ahead is a dog
Situational awareness - the dog is jaywalking and has not seen me coming
Response selection & programming - slam on the brakes
Each of these skills is a collection of abilities, and it is some (not necessarily all) of these abilities that need to be replicated in software in order to have fully autonomous L5 vehicles.
What we see the most of at the moment is Perception. The ability to recognise objects is a mature software capability that is fairly straightforward to implement with minimal hardware; face detection on a smartphone app, for example.
The AP display in front of the driver is a simplistic view of the current state of this "skill". We know that Tesla have internal builds that are far more perceptive, having the ability to recognise vehicle types and road signs, for example.
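As a rough illustration of how commoditised this skill is, the sketch below runs a pretrained object detector over a single frame. It assumes a recent torchvision (0.13+) is installed and uses a random tensor as a stand-in for a camera image; this is my own example of off-the-shelf perception, not a claim about what Tesla's stack does.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained detector
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for one camera frame (C, H, W), values in [0, 1]

with torch.no_grad():
    detections = model([image])[0]  # dict with 'boxes', 'labels', 'scores'

for label, score in zip(detections["labels"], detections["scores"]):
    if score > 0.5:
        print(int(label), float(score))  # COCO class id and confidence
```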
Much more challenging are the Situational awareness and Response selection pieces. As humans, we subconsciously apply the following abilities in a fraction of a second (sketched roughly in code below):
Threat determination - will I hit them?
Intuition/experience - they will not see/hear me until it is too late for them to jump out of the way
Path prediction/Intent of others - is the oncoming car on the other side of the road going to stop / is the car behind too close to stop / where will the dog be if they don't change speed or direction before I arrive in that part of the road
Determination of vehicle capability - can the car even stop in time or should I try to avoid
Self-preservation & social prejudice - hit the lamp post, the oncoming car, or the dog
Response - brake hard, sound horn
At the moment, support for these skills in AP is very simplistic, and I even doubt whether a NN is involved at all with TACC or AEB after the Perception stage. Perhaps this is actually OK - getting these abilities to human-equivalence in a NN will be a tremendous challenge, and there is no rule that says that a NN alone is the right approach.
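Here is the kind of non-NN, rule-based response-selection layer I mean: hand-written rules over perception output, covering threat determination, path prediction, and vehicle capability. The thresholds and the constant-velocity physics are my own assumptions, purely for illustration.

```python
# Toy rule-based situational awareness + response selection (no NN involved).

def time_to_collision(distance_m: float, closing_speed_ms: float) -> float:
    """Threat determination: seconds until we reach the object (inf if opening)."""
    return distance_m / closing_speed_ms if closing_speed_ms > 0 else float("inf")

def can_stop_in(distance_m: float, speed_ms: float, max_decel_ms2: float = 7.0) -> bool:
    """Vehicle capability: braking distance v^2 / (2a) vs. available distance."""
    return (speed_ms ** 2) / (2 * max_decel_ms2) < distance_m

def predicted_offset_m(lateral_speed_ms: float, ttc_s: float) -> float:
    """Path prediction: how far sideways the object will have moved when we arrive."""
    return lateral_speed_ms * ttc_s

def select_response(distance_m: float, ego_speed_ms: float, obj_lateral_speed_ms: float) -> str:
    ttc = time_to_collision(distance_m, ego_speed_ms)
    if ttc == float("inf") or abs(predicted_offset_m(obj_lateral_speed_ms, ttc)) > 3.0:
        return "continue"                # object will have cleared our lane
    if can_stop_in(distance_m, ego_speed_ms):
        return "brake hard, sound horn"
    return "brake and steer around"      # self-preservation / avoidance fallback

# Usage: dog 20 m ahead, car doing 15 m/s (~54 km/h), dog crossing at 1 m/s.
print(select_response(20.0, 15.0, 1.0))  # -> "brake hard, sound horn"
```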
TL;DR:
Yes, the hardware is capable, but I think this is the wrong question. The question that should be asked is: can the software ever be capable enough to safely control the vehicle in any given traffic scenario?