Large Fleet / Data
Now, if we were to judge AV superiority by fleet size and data alone, companies like Mobileye (a SuperVision fleet of 100k+ cars), NIO (~50-100k+ cars), Xpeng (~100k cars), and Huawei (~5k cars) would all be considered frontrunners ahead of Waymo and Cruise. They boast impressive sensor arrays, including 8 MP cameras (as opposed to Tesla's 1.2 MP), surround radars, and lidars, as well as more powerful compute, such as NIO's 1,000+ TOPS.
When it comes to training neural networks, data grounded with radar or lidar typically yields superior results compared to camera data alone. Moreover, one cannot retroactively add radar or lidar ground truth to pre-existing camera data. Consequently, the data collected by these other companies is far more valuable than Tesla's, because they can ground it more accurately. Tesla, by contrast, relied on an outdated ACC radar for ground truth, which was limited to forward tracking of moving objects (not static ones) and had a narrow field of view.
In contrast, other companies train their vision neural networks on rich camera data fused with high-quality HD radar information (and in some cases, even ultra-imaging radar data), as well as high-resolution lidar data. This comprehensive approach to data collection and fusion ultimately leads to more robust and reliable neural network models.
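To make the "grounding" idea concrete, here is a minimal sketch of how projected lidar returns can serve as sparse depth ground truth for a camera network. The function names and toy intrinsics are hypothetical illustrations, not any company's actual pipeline; it assumes NumPy and a standard pinhole camera model.

```python
import numpy as np

def project_lidar_to_depth_map(points_cam, K, h, w):
    """Project 3D lidar points (already in the camera frame) into a sparse
    per-pixel depth image that can supervise a camera-only depth network.
    points_cam: (N, 3) array with z = forward distance; K: 3x3 intrinsics."""
    depth = np.zeros((h, w), dtype=np.float32)
    pts = points_cam[points_cam[:, 2] > 0.1]          # keep points in front
    uv = (K @ pts.T).T                                # pinhole projection
    u = (uv[:, 0] / uv[:, 2]).astype(int)
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)  # drop off-image points
    depth[v[inside], u[inside]] = pts[inside, 2]      # depth in metres
    return depth

def masked_l1(pred_depth, lidar_depth):
    """Train the camera network only where a lidar return actually landed."""
    mask = lidar_depth > 0
    if not mask.any():
        return 0.0
    return float(np.abs(pred_depth[mask] - lidar_depth[mask]).mean())
```

A camera-only fleet has no such lidar depth target to regress against, which is the crux of the grounding argument above.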
ML / Neural Network Architecture
It's crucial to remember that the cutting-edge ML and NN architectures of today didn't materialize out of thin air. Waymo, for example, had been using transformers long before Tesla adopted them. Similarly, other AV companies have been employing multi-modal prediction networks, while Tesla initially relied on their C++ driving policy before eventually making the switch.
Tesla was running an instance of their C++ driving policy (planner) as a prediction of what other agents would do. They then ditched that and moved to actual prediction networks, which others had been using for years. Then they finally caught up and moved to a multi-modal prediction network, which others had also been using. Unlike Tesla fans, Tesla tells you exactly what they are and aren't doing in their tech talks, AI Day presentations, and software updates. It's the fans who invent mythical fables and attach them to Tesla.
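For readers unfamiliar with the term, here's a sketch of what a multi-modal prediction network's training objective typically looks like: the model outputs K candidate trajectories plus mode probabilities, and the loss regresses only the mode closest to the ground truth ("winner-takes-all"). This is a generic NumPy illustration of the technique, not Tesla's or Waymo's actual loss.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_wta_loss(pred_trajs, mode_logits, gt_traj):
    """Winner-takes-all loss common in multi-modal trajectory prediction.
    pred_trajs: (K, T, 2) candidate futures; gt_traj: (T, 2) observed future.
    Only the closest mode is regressed, so the K heads can specialize on
    different plausible outcomes (e.g. turn left vs. go straight)."""
    # Average displacement error (ADE) of each mode vs the ground truth.
    ade = np.linalg.norm(pred_trajs - gt_traj, axis=-1).mean(axis=-1)  # (K,)
    best = int(ade.argmin())
    probs = softmax(mode_logits)
    reg_loss = ade[best]             # regress only the best-matching mode
    cls_loss = -np.log(probs[best])  # push probability mass onto that mode
    return reg_loss + cls_loss, best
```

A single-mode regressor trained on the same data would average the left-turn and go-straight futures into one implausible middle path, which is exactly what the multi-modal formulation avoids.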
Heck, just days ago Elon admitted their pedestrian prediction is rudimentary.
(Compare that to Waymo, which has been doing this for a long time.
Paper: [2112.12141] Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving
Blog: Waypoint - The official Waymo blog: Utilizing key point and pose estimation for the task of autonomous driving)
When it comes to driving policy, many companies had been incorporating ML into their stacks long before Tesla. In fact, at AI Day 2, Tesla presented a network strikingly similar to Cruise's existing approach. Meanwhile, while Tesla was still toying with introducing ML into its planning system, Waymo was already using an ML planner and had deployed a next-gen version to its driverless fleet.
Simulation
As for simulation, Elon Musk dismissed its value back in 2019, calling it "doing your own homework" and essentially saying it was useless. This was when Waymo, Cruise, and others were knee-deep in simulation, using it for practically every part of the stack.
In one of his tech talks, Andrej Karpathy even said that they weren't focused on simulation but on their bread and butter: the real world.
In late 2018, The Information released a report saying that Tesla's simulation efforts were in their infancy.
Fast forward to AI Day 2021, and he was singing a different tune, declaring that "all of this would be impossible without simulation." Instead of building their own simulation tech, Tesla opted to modify and use Epic Games' procedural generation system for UE5.
Conclusion
So, if we compare Waymo and Tesla in detail, it's clear that Tesla lags behind in several key areas: ML networks, compute (Waymo's TPU v4 vs. Tesla's GPU-based training), sensors (Waymo's higher-quality cameras and sensor coverage), simulation tech, driving policy, and support for all dynamic driving tasks. And yet, somehow, this doesn't mean Tesla is trailing overall; one could still argue that they're 10 years ahead.
But the fact of the matter is, these facts remain valid regardless.
It's certainly tempting to lean on the data-advantage argument, but I'd rather we examine the details and ask logical questions than rely on buzzwords and vague claims. We should consider how data truly affects the perception, prediction, and planning stacks of AV architectures, and critically assess the extent to which data augmentation and simulation can compensate for any shortcomings.
- If billions of miles of real-world data are indeed essential, then why aren't the millions of tourists who visit San Francisco / Phoenix annually from all over the world endangered by Waymo vehicles? They are, after all, not part of the perception dataset.
- What about the countless tourists who drive into San Francisco / Phoenix and are not rear-ended by Waymo? Their presence, too, is absent from the perception dataset.
- Why doesn't Waymo mispredict pedestrians' actions and collide with them, or mispredict other vehicles' movements and sideswipe or crash head-on with them? Clearly, these tourist behaviors are not in the prediction dataset either.
These concerns pertain to the perception and prediction stack, but let's also examine the planning and driving policy stack. Within the roughly 200,000 miles of divided highways in the United States, how many miles are truly unique? How many miles are not well represented elsewhere?
From my perspective, a mere 1% of these highway miles are genuinely distinctive: for example, on-ramp interchanges, cloverleafs, short on/off ramps, closely spaced on/off ramps, on/off ramps requiring double lane merges, etc. But what do you think? Are 90%, 75%, 50%, or 25% of these miles unique?
This will help us figure out how much data is needed and how much can be covered through data augmentation & simulation at scale.
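To make that back-of-envelope question concrete, here is a trivial calculation using the figures above (the rough 200,000-mile total and the candidate uniqueness fractions):

```python
# Rough count of "genuinely unique" divided-highway miles under the
# different uniqueness assumptions discussed above.
TOTAL_DIVIDED_HIGHWAY_MILES = 200_000  # approximate US figure cited in the text

for unique_fraction in (0.01, 0.25, 0.50, 0.75, 0.90):
    unique_miles = TOTAL_DIVIDED_HIGHWAY_MILES * unique_fraction
    print(f"{unique_fraction:.0%} unique -> {unique_miles:,.0f} miles to cover")
```

At my 1% estimate, that is only about 2,000 miles of truly distinctive highway, a volume that targeted data collection, augmentation, and simulation could plausibly cover; at 90%, the data-scale argument looks far stronger.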