VII. CONCLUSION
The goal of this research was to assess between- and within-vehicle variation for an L2+ system, including driver monitoring, in three key scenarios. In these tests, three Tesla Model 3 vehicles displayed significant between- and within-vehicle variation on a number of metrics related to driver monitoring, alerting, and safe operation of the underlying autonomy.
These results suggest that the performance of the underlying artificial intelligence and computer vision systems was extremely variable, and this variation was likely responsible for many of the delays in alerting a driver whose hands were not on the steering wheel. Ironically, in some cases the cars seemed to perform best in the most challenging driving scenario (navigating a construction zone) but worst in seemingly simpler scenarios such as detecting a road departure.
This finding highlights a common misconception: what humans perceive to be hard in driving is not necessarily what an autonomous system finds difficult. It may be that the cones of the construction zone were more easily detected by a given software version than the road edges during the much more gradual drift of the road departure test. Another possibility is that engineers spend more effort on the more difficult problems and less time on seemingly easy ones. Whatever the reason for such variable and often unsafe behaviors, these results indicate that more testing is needed for these vehicles before such technology is allowed to operate without humans in direct control.
These results also suggest that more effort is needed to develop consistent and accurate alerts when L2+ systems are not performing as expected. These results should be interpreted in light of the discrepancies in the software/hardware configurations of the vehicles, which present a confound for assessing the nature of performance variation. Despite the very similar configurations of Cars 1 and 3, they completed the tests using different versions of software. Car 2 possessed the purported “full self-driving chip” and so, in theory, should have had the most advanced Autopilot system, yet this car objectively performed the worst.
Such results also indicate that the concept of over-the-air updates needs to be revisited when safety-critical functionalities may be changed. While agile software engineering techniques may be suitable for smartphones and similar devices, these techniques can cause significant problems in safety-critical systems. Unfortunately, these processes have never been formally studied or evaluated by a regulatory body. Indeed, these results highlight the need for more scrutiny of the cars and the software embedded in them, as well as the certification processes, or lack thereof, that allow these cars on the road.
Lastly, these results highlight that the post-deployment regulatory process that NHTSA uses (Fig. 1) to protect the public against unsafe vehicle technologies is ill-equipped to flag significant issues with L2+ or, in the future, self-driving cars. These results dramatically illustrate that testing a single car, or even a single version of deployed software, is not likely to reveal serious deficiencies. Waiting until after new autonomous software has been deployed to find flaws can be deadly, and such delays can be avoided by adaptable regulatory processes. The recent series of fatal Tesla crashes underscores this issue. It may be that any transportation system (or any safety-critical system) with embedded artificial intelligence should undergo a much more stringent certification process across numerous platforms and software versions before it is released for widespread deployment. To this end, our current derivative efforts are focused on developing risk models based on such results.