AI experts: true full self-driving cars could be decades away because AI is not good enough yet

Why the attack? Either the vision beta is good or it isn't. If it works, then we are at a state better than humans. It is not a high bar; it is a low, low bar, so low that 8.x is probably already there. The second statement admits uncertainty.

FSD Beta V8.2 was not better or safer than human driving. It was not even close to human safety. There were frequent safety disengagements. And even if vision-only V9 "works", it does not necessarily mean that it will be better or safer than humans. Remember that there is a big difference between being "good" and being safer than humans. It is actually a very very high bar. So even if we get videos of V9 doing X miles with no human intervention, that does not mean that FSD Beta is safer than humans. I hope you can understand the difference.

Don't get me wrong. I am not saying FSD Beta V9 will be bad. It might be amazing. I am just saying that we have no proof yet that it will be safer than human driving this year. To suggest unequivocally that it will be this year is silly, IMO.
 
Sorry, but this is simply an unrealistic objection strategy: proposing hypothetical, nefarious, technically sophisticated actors dedicated to sabotaging infrastructure that is actually much harder to disrupt than what exists today.

Honestly, I got two paragraphs into your response and realized you're just throwing around word salad. Not really worth much of a response other than saying you severely underestimate the complexity of what you're talking about, and you're attempting to complement observable reality with a delayed sensor network that can be attacked in countless ways.

What benefit does such a system offer, exactly? Why would I need more information than what I can see and hear outside my own windows? I'm able to drive my vehicle in nearly any condition you could present to me without the need for these extra external sensors, but for some reason we're considering them a benefit at the least, if not a requirement, for some future self-driving system? Seems like a solution begging to find a problem.

It would be easy to look at cars with cameras, radar and lidar from the 80's and from today and think that autonomous driving has not really changed all that much. But that would be wrong IMO.

It would be easy to conclude that, because the founders in this space are almost all still in it, doing the same work they were doing back then.

The software under the hood is radically more sophisticated.

In that the software has matured, yes. But all software improves over time. The basis for the work is still fundamentally the same: vision processing, using neural nets to detect objects, and traditional code to produce behaviors. Adding more "if" statements doesn't necessarily fundamentally change software. :)

The computer vision is better.

Well, we have more computing power, so it's certainly faster. And memory is cheaper so we can store more in various types of RAM. But again, there haven't been fundamental changes in how this work is happening. The biggest breakthrough in computer vision has been the ability to run neural networks faster.

Autonomous vehicles today can do a lot more.

You didn't read the history of NavLab, did you? NavLab 2 drove cross-country, completely unaided by human drivers for all but 50 miles. That's 50 miles of a nearly 2900-mile trip, in 1995. Nineteen. Ninety. Five. Now, if Elon's premise that "all we need is data" were true and "exponential growth" was in any way a reality in this space, how much better should that cross-country drive be some 26 years later?

I would also argue that the engineering of the sensors is probably more advanced as well. The cameras, radar, and lidar are probably better engineered and better quality today than they were in the 80's.

Obviously. And yet, all that time ago, while most of these digital sensors were still in their infancy, vehicles were able to complete these tasks and challenges. How many orders of magnitude better is a CCD sensor now compared to 1995, and yet the performance of autonomous vehicles hasn't exactly kept pace. So clearly the problem isn't data collection or purely down to sensors. And when you drive at a problem for almost four decades without making major progress, you need to reassess your approach. Instead, start-up culture has taken over and bilked tens of billions from investors who don't know the difference between marketing and speculation versus reality.
 
You didn't read the history of NavLab, did you? NavLab 2 drove cross-country, completely unaided by human drivers for all but 50 miles. That's 50 miles of a nearly 2900-mile trip, in 1995. Nineteen. Ninety. Five. Now, if Elon's premise that "all we need is data" were true and "exponential growth" was in any way a reality in this space, how much better should that cross-country drive be some 26 years later?

Are you referring to NavLab 5? If so, I believe that was basically just lane keeping on the highway using camera vision, so it was relatively basic driving. My point still stands that the autonomous vehicles we have today, like Waymo, can handle more driving tasks and are more sophisticated than what they had back in 1995.

Having said that, I do agree with your overall point about data. Solving autonomous driving is more than just data.
 
As I've noted in other discussion threads here, I work in a related field (computer-vision, object-detection, recognition) and I couldn't agree more with Dr. Mitchell and Dr. Cummings. A good related read that is linked to in the original article is this one: How do you teach a car that a snowman won’t walk across the road? – Melanie Mitchell | Aeon Ideas

There is a HUGE gap between where we are and where we need to get for truly autonomous self-driving and as alluded to in this article, that requires huge, fundamental leaps in the field of AI / ML and there currently isn't anything even remotely on the horizon that shows any promise of solving these problems. Currently, ML systems using CNNs, RNNs, or Vision Transformers or whatever the latest and greatest tech is, can all be fundamentally reduced to being dumb building blocks that do a basic job with no "smarts" to them. Need to detect STOP signs?... train on thousands of images of STOP signs and then your network will do okay. Need to still detect it when a person is holding one up, or when it is partially occluded by a tree? Well, collect and label a whole bunch of unique instances of those scenarios, and after all that effort you have a slightly more robust STOP sign detector. Oops, now your fantastic STOP sign detector is detecting a STOP sign painted on the back of a van. What do you do now?

The answer is not to train your network not to detect stop signs on vans. If you do that, you will also teach it to ignore real STOP signs that happen to look very similar to the painted ones you labeled as non-stop signs. What you really want is a "smarter" algorithm that can reason about the world, that understands the scene and knows that the STOP sign is painted on the back of a van and isn't a "real" stop sign. It is just another example of what Dr. Mitchell talks about in the article I linked above. Now sure, you can come up with heuristics and hacky logic to try and account for this one edge case, but the fact of the matter is that there will always be an infinitely long tail of scenarios where you need something "smarter", with common sense about how the world works, to actually reason about the perception of the scene and make reasonable decisions. Currently, there isn't even a hint of the field of AI being remotely close to achieving this. Hence the DARPA program to try and train up algorithms to achieve a level of cognitive capabilities to match an 18-month-old!
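For context, here is roughly what that "collect thousands of images and train" step looks like in code: a minimal transfer-learning sketch, where the folder layout is hypothetical and a whole-image classifier stands in for a real detector:

import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Hypothetical layout: stop_sign_data/train/{stop_sign, no_stop_sign}/*.jpg
tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("stop_sign_data/train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Start from a pretrained backbone and fine-tune a 2-class head.
model = models.resnet18(weights="DEFAULT")  # older torchvision: pretrained=True
model.fc = nn.Linear(model.fc.in_features, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()

And that is the whole point: every new failure mode (occlusion, hand-held signs, signs painted on vans) just means more labeled data fed through the same loop, with no scene-level reasoning anywhere in it.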

Waymo and others are trying to hack it by controlling things as best they can, relying on as many crutches as they can to avoid having to solve these harder problems, and I think they will and have gotten quite far with this approach, but I just don't see these ever working in a general sense for the very same reasons described above and in the articles. I stay away from the main Autonomous driving thread because some people are extremely opinionated and mostly seem to opine from a place of wishful thinking, rather than any basis in the facts of where the current state of ML and AI really is. Personally, I don't think L5 FSD will happen without huge leaps in the field of AI and ML. I don't even see any reason for optimism that these leaps could possibly happen in the next 5 years. As a practitioner in this field, I honestly have seen almost nothing interesting and of real value come out of research in the past few years. There is a lot of hype and a lot of crank turning and tweaking and performing incrementally better at some basic task on some basic dataset like COCO/Imagenet, but there has been nothing that gets us any closer to moving past training dumb as doorknobs detectors and classifiers.

Now personally, I just want a really good AP experience as an aid to me as a driver, and Tesla already has something pretty good, and I think Tesla and others can make fantastic AP aids using current sensor tech and ML/Computer-vision algorithms and capabilities. I do feel terrible for everyone who has been hoodwinked into spending money on FSD. It is just outrageous that this happened at all and while I want Tesla to succeed in the long-run, I absolutely want to see them raked over the coals for their FSD snake oil.

I think with the current tech on hand, and the current state of incremental advances, we would either need all cars to be on some sort of mesh-network for truly L5 autonomous self-driving (highly unlikely that ever happens), or we will need to make our peace with having less capable "self-driving", whether that be being restricted to well-controlled, geofenced areas, or highways, or only regions with up-to-date HD maps, or just always requiring an attentive driver at the wheel. Perhaps we will get to reliable L4 without too many restrictions, but even that seems pretty iffy to me at the moment.

Great discussion.

I think your points make sense - I see these as issues for all autonomous companies.

How much of an issue do you see with Tesla's 'pseudo-lidar' approach to mimicking lidar? This seems to be the main "bottleneck" keeping them from theoretically asymptoting to performance similar to their competitors'.
 
Again, you don't need AI to do everything. You can get far enough with regular programming, using AI for detection and some other things.
I strongly doubt you need to basically mimic the entire human brain to do self driving.
 
Honestly, I got two paragraphs into your response and realized you're just throwing around word salad. Not really worth much of a response other than saying you severely underestimate the complexity of what you're talking about, and you're attempting to complement observable reality with a delayed sensor network that can be attacked in countless ways.
OK, I get it, you want to be able to throw out a dismissal and more unsupported statements, as you did the first time, without actually responding to the points. Accusing me of word salad is shorthand for saying you didn't read it (and you said so), didn't care to understand it, and want to double down with more throwaway dismissal. Messages here on TMC can run long and be thoughtful and thoughtfully answered; that's why I'm here instead of the YouTube comment section.

You might at least try counting a few of the countless ways, and point out why they're so easy to attack compared to what's outside your room right now. And "delay"? Look-ahead is negative delay.

BTW to your separate interaction with diplomat33: I too have been an engineer since the 80s, and I know all about what cameras and image processing could do then vs. now. Kudos to you and any engineers that made a demo back then, but I hope I never become encrusted with the inability to recognize the delta over forty years of development, in every relevant field. Have a nice day.
 
How many orders of magnitude better is a CCD sensor now compared to 1995, and yet the performance of autonomous vehicles hasn't exactly kept pace.

I have to disagree. In 1995, NavLab 5 was just doing demos of "autosteer" on the highway. Today, we have robotaxis that can drive on their own on city streets, handling traffic lights, intersections and more. I would say the performance of AVs has increased dramatically since 1995.

And by the way, this is what computer vision looked like in 1995:

[image: computer vision output, 1995]


This is what computer vision looks like in 2021:

[image: Tesla FSD visualization, 2021]


That's a pretty drastic improvement, I would say.
 
Again, you don't need AI to do everything. You can get far enough with regular programming, using AI for detection and some other things.
I strongly doubt you need to basically mimic the entire human brain to do self driving.
I think that remains to be seen. I agree that you don't need to be able to create Skynet to achieve L4/L5 autonomy, but it also isn't clear at all that you can simply hack general-purpose L4/L5 FSD using existing technology, regular heuristics, and forests of if-else statements. I certainly think you can achieve excellent ADAS systems with that approach. I even think that with a robust enough and complementary enough sensor suite, coupled with good algorithms and sensor fusion, you can maybe start to achieve L4 FSD. But even in those scenarios, the concerning aspect is that without any sort of AI that can "reason" beyond the dumb-as-doorknobs type of reasoning, it is hard to make a system that can defend against a long tail of weird, real-world edge cases reliably and robustly. Humans can both recognize and solve these situations trivially because our minds work nothing like conventional "AI" and we have the gift of "common sense"... something algorithms are completely deficient in.

If you can aid your system with a host of cutting-edge sensors and a wide variety of sensor modalities that make it easier to reason about the world, you can start to overcome the extra complexities of a vision-only system in the absence of a far superior "AI" system. This is Waymo's approach, and clearly they are having quite a lot more success on that front, but still with a lot of limitations and nothing that indicates they can easily generalize to wide swaths of cities and the country without a lot of extra effort.

My personal opinion is that in a few years we will have excellent ADAS systems that will make driving a lot more pleasurable and less taxing. I am not however convinced at all that we will get to useful L4 autonomy that can work in a general sense without additional breakthroughs in the field of AI/ML/Computer-vision.
 
I have to disagree. In 1995, NavLab 5 was just doing demos of "autosteer" on the highway. Today, we have robotaxis that can drive on their own on city streets, handling traffic lights, intersections and more. I would say the performance of AVs has increased dramatically since 1995.

And by the way, this is what computer vision looked like in 1995:

[image: computer vision output, 1995]


This is what computer vision looks like in 2021:

[image: Tesla FSD visualization, 2021]


That's a pretty drastic improvement, I would say.

Those screenshots, IMO, really drive home my point for me. You're basically looking at improvements in rendering dots over a video feed, and that Tesla image looks pretty disappointing for 26 years of improvement. Still drawing bounding boxes and points to denote lane boundaries.

It's a shame that CMU has reorganized their site so much and removed so much content over the years, because they used to have some truly fantastic videos describing the work going into their vision systems, and showing off their self driving vehicles. I seriously recommend people go look for that content, because it will put everything into a great perspective.

It should also be pointed out that Waymo and MobilEye are focusing on traditional code to handle driving. Traditional code is one place where time in the game will make the biggest difference, because you're going to have to create, test, and code all of the scenarios that the vehicle will find itself in and properly handle. Simply put, lines of code take time. Several autonomous vehicle companies appear to be all in on the new-age "neural nets can do everything" school of thought, and I don't think we're going to see anything positive from them for a long time, if ever.
 
I have to disagree. In 1995, NavLab 5 was just doing demos of "autosteer" on the highway. Today, we have robotaxis that can drive on their own on city streets, handling traffic lights, intersections and more. I would say the performance of AVs has increased dramatically since 1995.

And by the way, this is what computer vision looked like in 1995:

[image: computer vision output, 1995]


This is what computer vision looks like in 2021:

[image: Tesla FSD visualization, 2021]


That's a pretty drastic improvement, I would say.
I agree, and in general, I think the entire premise that there hasn't been a lot of meaningful improvement in the field of computer vision when it comes to self-driving is very flawed. Only in the most superficial sense are systems from the 90s comparable to those in 2021. Let's just look at a single component: pedestrian detection systems. Those were basically horrible in the 90s. Until the advent of convolutional neural nets and the advances over the last decade, the best you could do was rely on some sort of algorithm that used SIFT or HOG features to determine if a person was in the image. These approaches were terribly inefficient, extremely finicky, and never really worked very well. You'd never be able to rely on something like that in an FSD system.
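For context, that pre-CNN style of detector still ships in OpenCV today as the classic HOG + linear SVM people detector; a rough sketch of running it (the image path is made up):

import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street_scene.jpg")  # hypothetical test image
# Sliding-window detection over an image pyramid: slow and fairly brittle.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

Anyone who has run something like this on real street footage knows how many misses and false positives it produces compared to modern detectors.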

Today, with the power of CNNs, and millions/billions of images to train them, you can basically train a pedestrian detector that is way faster and more reliable than a human for real-world examples. They will still break down in the case of adversarial examples because they lack "common sense", but they are good enough as a dumb component block in a self-driving system. The same goes for lane detection, road sign detection, traffic-light detection, etc. None of that existed back then, or if it did, it was pretty terrible. Those older algorithms were also running on huge banks of computing infrastructure rather than something like Tesla's integrated, low-power solution, which would have been incomprehensible in the 90s.
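By contrast, a modern off-the-shelf CNN detector takes a few lines; this is just a generic torchvision example for illustration, not anything resembling Tesla's actual stack (image path and score threshold are arbitrary):

import torch
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = Image.open("street_scene.jpg")  # hypothetical test image
x = transforms.ToTensor()(img)

with torch.no_grad():
    out = model([x])[0]

# COCO class 1 == person; keep reasonably confident detections.
for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
    if label == 1 and score > 0.7:
        print("pedestrian at", box.tolist(), "score", float(score))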

FSD is an extremely hard problem. All the hype from the last decade has not helped, but it just is a fundamentally hard problem and because we haven't solved it yet, it is easy to dismiss all the real progress that has happened over the past 2 decades in this field and related fields. Real progress has happened. We just need a lot more progress to achieve the holy grail imo.
 
Those screenshots, IMO, really drive home my point for me. You're basically looking at improvements in rendering dots over a video feed, and that Tesla image looks pretty disappointing for 26 years of improvement. Still drawing bounding boxes and points to denote lane boundaries.

Respectfully, if you think the only difference between those two screenshots is how it renders dots on the screen, you don't understand the technology.

It should also be pointed out that Waymo and MobilEye are focusing on traditional code to handle driving. Traditional code is one place where time in the game will make the biggest difference, because you're going to have to create, test, and code all of the scenarios that the vehicle will find itself in and properly handle. Simply put, lines of code take time. Several autonomous vehicle companies appear to be all in on the new-age "neural nets can do everything" school of thought, and I don't think we're going to see anything positive from them for a long time, if ever.

I don't think that is correct. Nobody is trying to solve FSD with just traditional code. Everybody uses Machine Learning and Neural Networks. Waymo does not use just traditional code to handle driving. AFAIK, Waymo uses a combination of NN and traditional code. In fact, Waymo has a lot of NN for perception, prediction and planning, which is not traditional code.


 
It should also be pointed out that Waymo and MobilEye are focusing on traditional code to handle driving. Traditional code is one place where time in the game will make the biggest difference, because you're going to have to create, test, and code all of the scenarios that the vehicle will find itself in and properly handle. Simply put, lines of code take time. Several autonomous vehicle companies appear to be all in on the new-age "neural nets can do everything" school of thought, and I don't think we're going to see anything positive from them for a long time, if ever.
Literally every autonomous vehicle company that knows what they are doing will be using several neural nets as component blocks in their self-driving solutions. It is silly to even suggest that anyone is going to get anywhere remotely close to a feasible product without using neural nets given where technology stands today. There isn't any other technology that comes remotely close to matching neural nets' capabilities at object detection and recognition with limited compute and at real-time speeds. Neural nets have their drawbacks, chief amongst them being their black-box nature and their inability to perform higher-level reasoning about a scene like a human. But that's like a first-world problem for a computer-vision algorithm. Computer vision algorithms of yore for recognition and detection could only dream of ever being as good as current-day neural nets. Maybe some day there will be a new class of algorithms that can supplant neural nets by improving on their limitations, but there is nothing out there at the moment that even hints at this possibility.
 
Need to detect STOP signs?... train on thousands of images of STOP signs and then your network will do okay. Need to still detect it when a person is holding one up, or when it is partially occluded by a tree? Well, collect and label a whole bunch of unique instances of those scenarios, and after all that effort you have a slightly more robust STOP sign detector. Oops, now your fantastic STOP sign detector is detecting a STOP sign painted on the back of a van. What do you do now?

The answer is not to train your network not to detect stop signs on vans. If you do that, you will also teach it to ignore real STOP signs that happen to look very similar to the painted ones you labeled as non-stop signs. What you really want is a "smarter" algorithm that can reason about the world, that understands the scene and knows that the STOP sign is painted on the back of a van and isn't a "real" stop sign.

Alternatively, estimate the speed of each object relative to the ground, and if a stop sign is in motion, it isn't a stop sign. You don't need to know whether the stop sign is on the back of a van or not, and if you try that, there's a good chance that you'll fail to detect the stop sign on the side of a school bus. What matters is whether the stop sign is stationary or moving. And this approach can be done entirely with trivial procedural code — no AI needed.
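As a minimal sketch of that heuristic, assuming the perception stack already gives you a per-frame position estimate for each detected sign in a ground-fixed (ego-motion compensated) frame; the function name and threshold are made up:

def is_actionable_stop_sign(prev_pos_m, curr_pos_m, dt_s, speed_threshold_mps=0.5):
    """Treat a detected stop sign as real only if it is (nearly) stationary
    relative to the ground. Positions are (x, y) in a ground-fixed frame,
    i.e. ego motion has already been compensated for."""
    dx = curr_pos_m[0] - prev_pos_m[0]
    dy = curr_pos_m[1] - prev_pos_m[1]
    ground_speed = (dx * dx + dy * dy) ** 0.5 / dt_s
    return ground_speed < speed_threshold_mps

# e.g. a sign painted on a van doing 15 m/s is rejected,
# while a roadside sign (ground speed ~0) is kept.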
 
Alternatively, estimate the speed of each object relative to the ground, and if a stop sign is in motion, it isn't a stop sign. You don't need to know whether the stop sign is on the back of a van or not, and if you try that, there's a good chance that you'll fail to detect the stop sign on the side of a school bus. What matters is whether the stop sign is stationary or moving. And this approach can be done entirely with trivial procedural code — no AI needed.
Don't forget about hand-waved stop signs, oh and swinging stop signs in high wind.
Also, a moving stop sign on a school bus can become a stationary stop sign, so don't ignore it altogether.
 
Alternatively, estimate the speed of each object relative to the ground, and if a stop sign is in motion, it isn't a stop sign. You don't need to know whether the stop sign is on the back of a van or not, and if you try that, there's a good chance that you'll fail to detect the stop sign on the side of a school bus. What matters is whether the stop sign is stationary or moving. And this approach can be done entirely with trivial procedural code — no AI needed.
I mean sure, but that's missing the point a bit. All it takes is for that van to be parked by the side of a road and you're back to square one again. Like I said, you can always create a huge nest of if-then-else logic and heuristics to tackle different edge cases, and in some situations that can work well, but ultimately you are still going to be held back by this without another big improvement on the algorithm/reasoning side of things.
 
Don't forget about hand-waved stop signs, oh and swinging stop signs in high wind.

Movement in the direction of travel is the only direction that is realistically detectable with enough accuracy to use it for classification purposes. Motion in any other direction would get false positives because of bumps or turning the wheel slightly.

Hand-held stop signs likely aren't moving fast enough in the direction of travel to register as moving in a pure-camera system. You'd need RADAR or LIDAR for sufficiently precise speed measurement. The same is true for the wind example. So in a perverse way, the pure vision approach works better by not working as well. :D

Also, a moving stop sign on a school bus can become a stationary stop sign, so don't ignore it altogether.

If the bus is moving, the stop sign doesn't matter. When it stops moving, it matters. So again, whether it is moving or not is likely adequate.
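One cheap way to handle the school-bus case is to track each detected sign over time and only act on it once it has been stationary for a while. A toy sketch, with arbitrary thresholds:

class StopSignTracker:
    """Treat a tracked stop sign as actionable only after it has been
    (nearly) stationary for a minimum number of consecutive frames."""

    def __init__(self, speed_threshold_mps=0.5, frames_required=10):
        self.speed_threshold = speed_threshold_mps
        self.frames_required = frames_required
        self.stationary_frames = 0

    def update(self, ground_speed_mps):
        if ground_speed_mps < self.speed_threshold:
            self.stationary_frames += 1
        else:
            self.stationary_frames = 0  # started moving again, reset
        return self.stationary_frames >= self.frames_required

So the bus's deployed stop arm becomes actionable a fraction of a second after the bus stops, while a sign on a van cruising past never does.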
 
I mean sure, but that's missing the point a bit. All it takes is for that van to be parked by the side of a road and you're back to square one again. Like I said, you can always create a huge nest of if-then-else logic and heuristics to tackle different edge cases, and in some situations that can work well, but ultimately you are still going to be held back by this without another big improvement on the algorithm/reasoning side of things.
And if you paint a green traffic light on the side of a building, the same thing can happen. At some point, you really shouldn't allow people to do things like that, because a human who isn't paying careful enough attention could make the same mistake and become confused.

That said, there is an ideal solution that is completely robust against that sort of attack, but it requires decent depth measurement. A real traffic light isn't flat, and a real stop sign isn't part of a larger flat surface. But that level of depth accuracy may or may not be practical using pure vision right now.
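For illustration, assuming you have a depth map or point cloud, that check could amount to fitting a plane to the surface around the detection and testing whether the "sign" points lie in the same plane; the tolerance and names here are made up:

import numpy as np

def looks_painted_on(sign_pts, surround_pts, tol_m=0.03):
    """sign_pts, surround_pts: (N, 3) arrays of 3D points (metres) from inside
    the detection box and from the surface immediately around it. Returns True
    if the sign is coplanar with its surroundings, i.e. likely painted on a
    flat surface rather than being a physical sign."""
    centroid = surround_pts.mean(axis=0)
    # Plane normal = direction of least variance of the surrounding surface.
    _, _, vt = np.linalg.svd(surround_pts - centroid)
    normal = vt[-1]
    # Distance of each sign point from that plane.
    dist = np.abs((sign_pts - centroid) @ normal)
    return bool(np.all(dist < tol_m))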
 
Movement in the direction of travel is the only direction that is realistically detectable with enough accuracy to use it for classification purposes. Motion in any other direction would get false positives because of bumps or turning the wheel slightly.

Hand-held stop signs likely aren't moving fast enough in the direction of travel to register as moving in a pure-camera system. You'd need RADAR or LIDAR for sufficiently precise speed measurement. The same is true for the wind example. So in a perverse way, the pure vision approach works better by not working as well. :D

If the bus is moving, the stop sign doesn't matter. When it stops moving, it matters. So again, whether it is moving or not is likely adequate.
To a degree. Nothing is absolute. What about a highway flagger sitting on the back of a slow-moving vehicle, or running very quickly towards you waving the stop sign? Is a painted stop sign that isn't octagonal any less valid if some local community fair decides to make one?
 
Great discussion.

I think your points make sense - I see these as issues for all autonomous companies.

How much of an issue do you see with Tesla's 'pseudo-lidar' approach to mimicking lidar? This seems to be the main "bottleneck" keeping them from theoretically asymptoting to performance similar to their competitors'.

I believe Tesla's 'pseudo-lidar' approach is what is termed "monocular depth estimation" in the literature, as I believe they aren't relying on stereo imagery for depth estimation but rather are training a deep network on a huge wealth of monocular camera imagery, with lidar-based ground truth providing the range value associated with each pixel in the scene.
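In training terms, that amounts to regressing a dense depth map against sparse lidar returns. A bare-bones sketch of what that supervision could look like (the model, tensor shapes, and dataloader are assumptions for illustration, not Tesla's actual setup):

import torch
import torch.nn.functional as F

def sparse_depth_loss(pred_depth, lidar_depth, valid_mask):
    # Supervise only the pixels where lidar actually returned a point.
    return F.l1_loss(pred_depth[valid_mask], lidar_depth[valid_mask])

# One training step, assuming `model` maps images (B, 3, H, W) to depth (B, 1, H, W)
# and the dataloader yields images with projected lidar depth plus a validity mask.
def train_step(model, optimizer, images, lidar_depth, valid_mask):
    optimizer.zero_grad()
    pred = model(images)
    loss = sparse_depth_loss(pred, lidar_depth, valid_mask)
    loss.backward()
    optimizer.step()
    return loss.item()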

While this approach can work very well if you have a lot of varied data with good ground-truth to train against, the "problem" is that this "measurement" is ultimately not based in any physics grounded in the reality of the real-world. What I mean by that is that Lidar measures distances based on the measured time-of-flight for a laser pulse between when it was emitted and received back after reflection. It enables exceptionally accurate range-maps and these measurements are based on the fundamental physics of electromagnetic transmission/reflection that are indisputable.
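The underlying range arithmetic really is that simple, which is a big part of why lidar measurements are so trustworthy; for illustration:

C_MPS = 299_792_458.0  # speed of light in m/s

def lidar_range_m(round_trip_time_s):
    # The pulse travels out and back, so halve the round-trip distance.
    return C_MPS * round_trip_time_s / 2.0

# e.g. a ~0.67 microsecond round trip corresponds to roughly 100 m.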

On the other hand, pseudo-lidar range "measurements" at the end of the day are just made-up values from a black box. Most of the time, this black box is quite accurate, but it is still a black box making up answers based on what is present in the input imagery. There is nothing that fundamentally grounds it in reality or ensures that the range estimates for any given pixel will always be accurate. So it suffers from the same challenges as all other deep-learning/neural-network based models and algorithms... namely, they can generalize in completely unknown ways on data that falls outside the distribution of the training data the model was presented with. So if you now have a kid in a huge T-rex costume walking across a road on Halloween, a lidar system will easily provide an accurate depth map for that child in a T-rex costume. A 'pseudo-lidar', passive-ranging network, on the other hand, could make up any sort of arbitrary range values in that case if it wasn't trained against these types of scenarios. So you ultimately have no assurance that it will always get things right. This is where lidar provides a huge amount of assurance that a vision-only approach is going to have a lot more difficulty providing. This is also why Waymo is fundamentally ahead at the moment when it comes to FSD demonstrations, IMO. Now, if you use stereo vision to help with ranging (some folks do this), that could certainly help, though it still won't be nearly as accurate as lidar, which is pretty much the gold standard.

I really wish Tesla would pivot and include cheaper lidar hardware to augment their vision-based system. I feel like they could make much faster progress that way, but they've kinda backed themselves into a corner by selling FSD to tons of customers and promising that they will make it work with the sensors they currently have. Yes, in theory, vision-only should be sufficient. But that ignores the fact that vision-only is still a much harder proposition to get working reliably and without glitches on edge-cases that could prove catastrophic. Lidar + vision on the other hand complement each other quite well in many ways and are also a lot easier when it comes to sensor fusion.