
AI experts: true full self-driving cars could be decades away because AI is not good enough yet

Alternatively, estimate the speed of each object relative to the ground, and if a stop sign is in motion, it isn't a stop sign. You don't need to know whether the stop sign is on the back of a van or not, and if you try that, there's a good chance that you'll fail to detect the stop sign on the side of a school bus. What matters is whether the stop sign is stationary or moving. And this approach can be done entirely with trivial procedural code — no AI needed.
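Something like this, purely as an illustration (the field names and thresholds below are invented, and in a real stack the velocities would come from the object tracker):

```python
# Illustrative only: honor a detected stop sign only if it is (near-)stationary
# in the ground frame. Names and thresholds are invented for this sketch.
from dataclasses import dataclass

@dataclass
class TrackedSign:
    label: str
    vel_in_ego_frame_mps: float  # a stationary roadside object appears to move
                                 # at roughly -ego_speed in the ego frame

def is_real_stop_sign(sign: TrackedSign, ego_speed_mps: float,
                      tolerance_mps: float = 1.0) -> bool:
    if sign.label != "stop_sign":
        return False
    # Ground-frame speed = ego speed + object speed in the ego frame (1-D approximation)
    ground_speed = ego_speed_mps + sign.vel_in_ego_frame_mps
    return abs(ground_speed) < tolerance_mps

# A roadside sign while driving at 20 m/s appears to approach at -20 m/s: real.
print(is_real_stop_sign(TrackedSign("stop_sign", -20.0), 20.0))  # True
# A sign bolted to the van ahead moves with us (~0 m/s in the ego frame): ignore it.
print(is_real_stop_sign(TrackedSign("stop_sign", -0.3), 20.0))   # False
```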

That would be an improvement for sure, and I am astonished that they don't have such basic checks yet.
It will pop back into existence if you both stop, though.

  • is it moving along with you?
  • is it likely to be attached to another bigger object classified as vehicle (does it move along with and stop with it)?
  • is it in an unnatural position (very low in front of you)?
These simple checks would eliminate most of the issues.
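A rough sketch of those three checks, where every field name, threshold and the overlap test is invented purely for illustration:

```python
# Illustrative plausibility filter for a detected stop sign.
from dataclasses import dataclass

@dataclass
class Detection:
    rel_speed_mps: float          # longitudinal speed relative to the ego car
    bbox: tuple                   # (x1, y1, x2, y2) image-plane box
    height_above_road_m: float = 2.0
    distance_ahead_m: float = 50.0

def boxes_touch(a: tuple, b: tuple) -> bool:
    """True if two image-plane boxes intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def plausible_stop_sign(sign: Detection, vehicles: list[Detection],
                        ego_speed_mps: float) -> bool:
    # 1. Is it moving along with you? If we're driving and the sign keeps pace
    #    with us (near-zero relative speed), it's almost certainly on a vehicle.
    if ego_speed_mps > 2.0 and abs(sign.rel_speed_mps) < 1.0:
        return False
    # 2. Is it likely attached to a bigger object classified as a vehicle,
    #    i.e. does it move (and stop) together with it and overlap it?
    for veh in vehicles:
        if boxes_touch(sign.bbox, veh.bbox) and \
           abs(sign.rel_speed_mps - veh.rel_speed_mps) < 0.5:
            return False
    # 3. Is it in an unnatural position (very low, right in front of you)?
    if sign.height_above_road_m < 1.0 and sign.distance_ahead_m < 10.0:
        return False
    return True
```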

I believe Tesla's 'pseudo-lidar' approach is what is termed as "monocular depth estimation" in the literature as I believe that they aren't relying on stereo imagery for depth-estimation, but rather, are training on a huge wealth of monocular camera imagery with lidar-based ground-truth to obtain range values associated with each pixel in the scene to train a deep-network to estimate range.
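For what it's worth, the general recipe described there, supervised per-pixel depth regression against projected lidar returns, looks roughly like the toy sketch below. This is not Tesla's network; the architecture and loss are placeholders to show the shape of the training loop:

```python
# Toy monocular depth-estimation training step: predict a per-pixel depth map
# from a single camera image, supervised by sparse lidar-derived ground truth.
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 3, padding=1), nn.Softplus(),  # depth must be positive
        )

    def forward(self, img):                      # img: (B, 3, H, W)
        return self.decoder(self.encoder(img))   # (B, 1, H, W) depth in meters

model = TinyDepthNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(image, lidar_depth, valid_mask):
    """lidar_depth: lidar returns projected into the image plane;
    valid_mask: 1 where a return exists, 0 elsewhere (lidar is sparse)."""
    pred = model(image)
    loss = ((pred - lidar_depth).abs() * valid_mask).sum() / valid_mask.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```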

It's also called structure from motion (SfM), and it's an old technique.
 
Wall Street Journal has an interesting article today about self-driving cars:


Basically, some AI experts are arguing that AI is not good enough yet for true full self-driving cars. They point out that our best self-driving cars still need some help, like with HD maps and remote operators. So they think we will see limited self-driving cars, like we are seeing now with Waymo and others, but true full self-driving cars, that can drive anywhere with no human assistance, are still decades away. They explain that current AI is good at seeing patterns but is not good at extrapolation:



Other experts, including at Waymo and Aurora, argue that you don't need to "solve AI" in order to have true full self-driving:



-----------

My take: I think the article is right about the current challenges with AI and that we probably won't see true L5 in the short term. However, I think the article is probably wrong about it taking decades to "solve FSD". We tend to underestimate the speed of technological progress. Just look at how quickly computers have evolved. I think it is very possible that we could see some big AI breakthroughs in, say, 5 years that help us achieve better FSD sooner than "decades". We might also find clever engineering ways to "solve FSD" without solving AI, as Rajkumar suggests. After all, we've solved a lot of tough engineering problems already without "solving AI". So I think self-driving tech will get better and better and we will see more self-driving cars on the roads in the years to come. I am optimistic that it won't take decades to "solve FSD".

I would also argue that limited L4 self-driving may be good enough for now, at least for the short term. Sure, true generalized L5 self-driving cars, with human-like intelligence, would be the holy grail, but I don't think that is necessary. After all, the goal is to achieve self-driving cars that are safe and reliable and serve a useful application like ride-hailing. Does it really matter how we achieve that goal, as long as we achieve it? If it takes some geofencing, HD maps, etc. to achieve the goal, so what?
To sum it up:
Experts who are critical have so far been correct.
Experts who are optimistic have so far been wrong.
Why would we suddenly believe the ones who have been wrong? If Musk said "3 months" today, who would believe him?

I don't believe true FSD will happen until every car on the road communicates with every other car. That will be decades.
 
To sum it up:
Experts who are critical have so far been correct.
Experts who are optimistic have so far been wrong.
Why would we suddenly believe the ones who have been wrong? If Musk said "3 months" today, who would believe him?

Yes, and that is why we should be very skeptical when Elon claims that L5 will happen this year. But we should not lump everybody in the same boat. They are not all making the same claims. Elon is the only one claiming Tesla will have L5 this year. Others, like Waymo and Cruise, are not making those claims. In fact, they are arguing you don't need to solve AI in order to do self-driving. And they are focused squarely on L4, not L5.
 
The entire concept of X2V and V2X is dead on arrival. Anything that requires a signal from a roadside device is just a complete non-starter. If your car takes behavior cues from some sensor network rather than the information presented to it by the scene it is in, then it is susceptible to trivial attack. I can sit on the roadside with a backpack and either jam the signal telling your car to stop, or send messages to all of the cars around me telling them all to stop. I mean, just look at all of the nefarious stuff happening on the internet, and now ask yourself whether you're willing to put your life in the hands of some network maintained by the lowest bidder.
V2X will have security challenges, but it's not that hard, really. We know how to do security (we mostly just choose to ignore our own best practices and also choose not to PAY for people to do proper security and code audits).
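For example, the core primitive is just message signing. Here's a toy sketch using the Python `cryptography` package; real V2X security stacks (e.g. IEEE 1609.2) layer certificates, pseudonym rotation and revocation on top, none of which is shown, and signing does nothing against jamming:

```python
# Toy sketch: every broadcast is signed, and receivers drop anything that
# doesn't verify. Spoofed or tampered messages fail; jamming is a separate problem.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.exceptions import InvalidSignature

# The sender's key pair would normally be provisioned by a certificate authority.
private_key = ec.generate_private_key(ec.SECP256R1())
public_key = private_key.public_key()

message = b'{"type": "stop_ahead", "lat": 49.2827, "lon": -123.1207}'
signature = private_key.sign(message, ec.ECDSA(hashes.SHA256()))

# Receiver side: accept only messages whose signature verifies.
try:
    public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
    print("message accepted")
except InvalidSignature:
    print("message rejected")
```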

I would bet on V2X tech. I don't know how it will roll out with all the vendors, but the chipsets are coming along; the concept is still moving forward, though adoption from vendors is admittedly slow.

And it will be an evolution. The internet started small and simple and totally without ANY security. We learned and we made it better. It took decades, but we started small and made it into what it is tod-

Oh hell. Maybe it won't work after all.
 
I believe Tesla's 'pseudo-lidar' approach is what is termed as "monocular depth estimation" in the literature as I believe that they aren't relying on stereo imagery for depth-estimation, but rather, are training on a huge wealth of monocular camera imagery with lidar-based ground-truth to obtain range values associated with each pixel in the scene to train a deep-network to estimate range.

While this approach can work very well if you have a lot of varied data with good ground-truth to train against, the "problem" is that this "measurement" is ultimately not grounded in the physics of the real world. What I mean by that is that lidar measures distance from the time-of-flight of a laser pulse between its emission and its return after reflection. It produces exceptionally accurate range maps, and those measurements rest on the fundamental, indisputable physics of electromagnetic transmission and reflection.
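The time-of-flight arithmetic itself is a one-liner:

```python
# Range from lidar time-of-flight: the pulse travels out and back,
# so range = (speed of light * round-trip time) / 2.
C = 299_792_458.0  # m/s

def lidar_range_m(round_trip_seconds: float) -> float:
    return C * round_trip_seconds / 2.0

print(lidar_range_m(200e-9))  # a 200 ns round trip corresponds to ~30 m
```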

On the other hand, pseudo-lidar range "measurements" at the end of the day are just made-up values from a black box. Most of the time, this black box is quite accurate, but it is still a black box making up answers based on what is present in the input imagery. There is nothing that fundamentally grounds it in reality or ensures that the range estimates for any given pixel will always be accurate. So it suffers from the same challenges as all other deep-learning/neural-network based models and algorithms... namely, they can generalize in completely unknown ways on data that falls outside the distribution of the training data the model was presented with.

So if you now have a kid in a huge T-rex costume walking across a road on Halloween, a lidar system will easily provide an accurate depth map for this child in a T-rex costume. A 'pseudo-lidar', passive-ranging network, on the other hand, could make up any sort of arbitrary range values in that case if it wasn't trained against these types of scenarios. So you ultimately have no assurances that it will always get things right. This is where lidar provides a huge amount of assurance that a vision-only approach is going to have a lot more difficulty providing. This is also why Waymo is fundamentally ahead at the moment when it comes to FSD demonstrations, imo. Now, if you use stereo vision to help with ranging (some folks do this), that could certainly help, though it still won't be nearly as accurate as lidar, which is pretty much the gold standard.

I really wish Tesla would pivot and include cheaper lidar hardware to augment their vision-based system. I feel like they could make much faster progress that way, but they've kinda backed themselves into a corner by selling FSD to tons of customers and promising that they will make it work with the sensors they currently have. Yes, in theory, vision-only should be sufficient. But that ignores the fact that vision-only is still a much harder proposition to get working reliably and without glitches on edge-cases that could prove catastrophic. Lidar + vision on the other hand complement each other quite well in many ways and are also a lot easier when it comes to sensor fusion.

Hmm I think you are assuming they are only making an estimate at each point in time independently. I *think* they are including a temporal series of images in a transformer (or CNN) to make the depth estimate. Which, then, is rooted in physics (at least partially).
 
To a degree. Nothing is absolute. What about a highway flagger sitting on the back of a slow-moving vehicle, or running very quickly towards you waving the stop sign?

AVs have to have special rules for determining whether specific types of slow-moving vehicles can be passed anyway, and those rules vary from state to state. For example, it is usually (but not always) legal to pass a city bus stopped in a bus stop, but it is usually illegal to pass a school bus even if it is traveling in the other direction (but it may or may not be legal to do so on a divided street). And in fact, it is usually illegal to go after stopping, i.e. that stop sign is not like a normal stop sign — more like a traffic light.

Not sure what to do about someone running towards you with a stop sign. That's not a situation that should happen (assuming the person wasn't asleep on the job), but dealing with people directing traffic is entirely separate from road sign handling, and has to deal with lots of cases, including people pointing in a specific direction with flashlights, etc. That's likely the sort of situation where the car would have to stop and ask a human to tell it what to do, because I doubt it will be practical to automate it in the foreseeable future.


Is a painted Stop Sign that isn't octagonal any less valid if some local community fair decides to make one?

A stop sign that isn't octagonal isn't a legal stop sign from a highway code perspective, so I would argue that it's not valid at all. But if that is allowed in a particular state, then it's a special case that would have to be explicitly allowed and trained on (probably in a geofenced way).
 
Hmm I think you are assuming they are only making an estimate at each point in time independently. I *think* they are including a temporal series of images in a transformer (or CNN) to make the depth estimate. Which, then, is rooted in physics (at least partially).
Ah, that's possible. I haven't followed the specifics of Tesla's implementation closely, as I don't think they publish much of their work. Assuming they are using some sort of RNN-like architecture or some other means to distill temporal information and feed it back into the network, that can certainly help it do a better job, but the point in my original post still stands. The network is still not rooted in any physics at all. It is just an approximation machine, and it can still break down in completely unpredictable ways on samples that are not similar enough to the training-set distribution. It doesn't, for example, "learn" classical structure-from-motion algorithms, and it wouldn't be able to generalize well on data that looks very different, even though classical algorithms rooted in physics would.
 
I think with the current tech on hand, and the current state of incremental advances, we would either need all cars to be on some sort of mesh network for true L5 autonomous self-driving (highly unlikely that ever happens), or we will need to make our peace with less capable "self-driving", whether that means being restricted to well-controlled, geofenced areas, or highways, or only regions with up-to-date HD maps, or just always requiring an attentive driver at the wheel. Perhaps we will get to reliable L4 without too many restrictions, but even that seems pretty iffy to me at the moment.
But we do have evidence that we can train computers to play videogames at superhuman levels. Why is driving fundamentally different from playing a video game? Yes, the world is messy and complex, but Tesla is in the unique position of being able to observe its entire fleet of drivers interacting with that messy, unique world. No different from letting an AI watch players play a video game to figure out the rules of the game.

Once the fleet has observed real humans driving behind trucks with stop signs on their back, it can learn from these experiences and react in a humanlike way.

Will the AI still be dumb about circumstances it's never learned about before? Absolutely, but those will be rare, and hopefully the AI can hand off the driving to a human, either in the car or remote.
 
But we do have evidence that we can train computers to play videogames at superhuman levels. Why is driving fundamentally different from playing a video game? Yes, the world is messy and complex, but Tesla is in the unique position of being able to observe its entire fleet of drivers interacting with that messy, unique world. No different from letting an AI watch players play a video game to figure out the rules of the game.

Once the fleet has observed real humans driving behind trucks with stop signs on their back, it can learn from these experiences and react in a humanlike way.

Will the AI still be dumb about circumstances it's never learned about before? Absolutely, but those will be rare, and hopefully the AI can hand off the driving to a human, either in the car or remote.

A video game has a constrained set of rules, and the AI has a pixel-perfect view, and even input "under the hood".
So it can train on that very limited and specific data for specific scenarios (finishing a level that never changes).

Reality is extremely complex and ever-changing, so it needs to adapt. That's why self-driving is hard. You could probably train for a street in your neighborhood or a block, by running the car crashing over and over again till it gets it, given that you don't change the environment too much :D

That's right - in games they fail over and over again. But in reality you would cause damage for millions of dollars and endanger people, sending the car crashing in random spots over and over again till it manages to get through :D It will also be good ONLY for that block.
 
When I first moved to Vancouver, we had residential streets with four-way intersections with no signs or lines whatsoever; you had to approach each one carefully. I've seen power cuts where the lights go out at a major intersection and people don't stop or slow at all in any direction. Every day here I still encounter pedestrians crossing six lanes of heavy moving traffic with no warning or sign of self-preservation. The infrastructure is now mostly well-managed, but odd things happen every single drive.

It's to the point in this city that I drive, bike, and walk with the expectation of insane behaviour. I don't know how a NN would cope. We humans barely do.
 
I believe Tesla's 'pseudo-lidar' approach is what is termed as "monocular depth estimation" in the literature as I believe that they aren't relying on stereo imagery for depth-estimation, but rather, are training on a huge wealth of monocular camera imagery with lidar-based ground-truth to obtain range values associated with each pixel in the scene to train a deep-network to estimate range.

In the forward direction, there are three cameras with a horizontal distance between them, plus overlap between the wide field camera and the B-pillar cameras. So I can't imagine that they aren't doing a true parallax-based depth map.

Most other directions rarely matter. If someone hits you from behind, it's that other person's fault, and if someone broadsides you, it's the fault of whichever person ran the traffic light or stop sign.
 
In the forward direction, there are three cameras with a horizontal distance between them, plus overlap between the wide field camera and the B-pillar cameras. So I can't imagine that they aren't doing a true parallax-based depth map.

Most other directions rarely matter. If someone hits you from behind, it's that other person's fault, and if someone broadsides you, it's the fault of whichever person ran the traffic light or stop sign.
The baseline of the cameras is only about 8 cm, close to that of a human. This provides minimal binocular effect at useful driving distances. Even humans really don't do binocular depth perception beyond about 10 meters. Plus, none of the cameras are the same: they all have very different FOVs, which makes binocular processing very hard and limits you to the performance of your worst camera. With a 120-degree FOV, the wide camera can't resolve an 8 cm difference more than about 10 m out, just like a human. The narrow 35-degree camera can't even see the lane next to you until 10 meters in front of you...
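You can put rough numbers on that. Using the standard stereo approximation that depth error per pixel of disparity error is about Z^2 / (f * B), with f the focal length in pixels and B the baseline, and guessing plausible camera parameters (these are illustrative numbers, not Tesla's actual specs):

```python
# Rough stereo depth-resolution estimate; camera numbers are illustrative guesses.
import math

def focal_length_px(image_width_px: float, hfov_deg: float) -> float:
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def depth_error_m(range_m: float, baseline_m: float, f_px: float,
                  disparity_error_px: float = 1.0) -> float:
    return range_m ** 2 * disparity_error_px / (f_px * baseline_m)

f = focal_length_px(1280, 120)   # guess: 1280 px wide image, 120-degree FOV
for z in (5, 10, 25, 50):
    print(f"{z:>3} m: +/- {depth_error_m(z, 0.08, f):.1f} m per pixel of disparity error")
# With an 8 cm baseline, the error is already meters of uncertainty at ~10 m range,
# which is the point about binocular vision running out quickly.
```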

The idea that it isn't your fault when you change lanes on a highway as a car is coming up behind you, forcing them to slow down, is kind of hilarious. And you need the other cameras when backing up, pulling out onto a road, etc.

Tesla pretty clearly believes they can solve autonomy with monocular vision.
 
Some of these responses are pure gold.

pedestrian detection systems. Those were basically horrible in the 90s.

Get this: pedestrian detection systems successfully used radar and sonar for a long time because visible-light cameras weren't necessary. So, yeah, camera-based pedestrian detection sucked in the 1990s. Because nobody in their right mind would have attempted to use it back then, and even today it's unnecessary.

Till the advent of convolutional neural nets

You are confused on a LOT of points here, not least of which is when CNNs were invented. In case you didn't bother looking it up, CNNs in their modern sense are a product of the 1980s, and the video I provided that mentioned NavLab 1 talked about using CNNs. Like, legitimately I don't know where you're getting these ideas from but history disagrees with your timeline completely.

you don't understand the technology.

Alternatively, there are a lot of people here that think marketing and advertisements convey facts when they're just promotional material. If you drink that brand of beer, people won't find you more or less attractive.

as component blocks

Yes. Very likely this. Neural networks will be used to detect and classify some types of objects, without a doubt. But they certainly cannot detect or classify everything, because we can't give them an infinite training set.

but its not that hard, really.

laughs in decades of SSL and TLS vulnerabilities

Every time somebody says "It's as easy as..." or "it's not that hard", a developer breaks a finger. Or whatever the opposite of an angel getting its wings would be.
 
You are confused on a LOT of points here, not least of which is when CNNs were invented. In case you didn't bother looking it up, CNNs in their modern sense are a product of the 1980s, and the video I provided that mentioned NavLab 1 talked about using CNNs. Like, legitimately I don't know where you're getting these ideas from but history disagrees with your timeline completely.
Okay, if you are going to be pedantic about it... "advent of modern-day CNNs and deep learning". Yes, I know my history, but it is irrelevant here. CNNs couldn't realize anything close to their potential when they were hamstrung by limited compute. You keep trying to pretend like things were just as capable "back in my day", but they weren't. It doesn't matter that the theoretical underpinnings were around, or that Yann LeCun and company proposed CNNs back in the 80s. They just weren't all that useful or capable for real-world applications until much more recently, with the widespread availability of GPU compute, CUDA acceleration, and frameworks for effectively training such models in a relevant and useful manner.

That doesn't mean that people weren't researching them or trying to put them to use in applications, but look up any review of the state of CNN-based computer vision and you will see a step change in the capabilities of modern CNNs over the last decade, starting with networks like AlexNet, VGG16, etc.

Honestly, you keep stating things that are factually inaccurate (e.g., that Mobileye and Waymo are using "traditional code" for driving) and then go off on random tangents and pretend like they make some sort of a point. It's a very "old man yells at cloud" vibe, and you seem completely disinterested in a good-faith debate. I guess you have your soapbox now, and some poor soul here is probably interested in learning more about how SSL and TLS vulnerabilities have anything relevant to contribute to this discussion.
 
but it is irrelevant here.

History is a great predictor of the future. History tells us that revolutionary changes are the result of a slow, steady march. Anybody expecting anything but a slow, steady march is making a mistake. I have seen zero research papers, and zero products in the real world that lead me to believe that a generalized L5 driving system is coming any time soon, or that the current solutions for L2-L4 are going to be applicable.

It's a very "Old man yells at cloud" vibe and you seem completely disinterested in a good-faith debate.

I'm very interested in good faith debate, but you people are repeating the same things you've been saying since 2016. At some point, repeating myself to you became a chore rather than a debate. As I've said, we've been working on this problem with the current strategy since the 1980s. Before then, we were working on this problem with all kinds of other strategies since the 1950s.

Consider this: How long has it taken to improve BEVs and all of their constituent parts? How much have they improved since the 1980s? Now compare that to the promises of nuclear fusion energy, and the decade-after-decade refrain that we're "just a couple years away". But we aren't even close to a couple years away. Meanwhile, entire new battery materials have gone from concept to research, into the lab, and finally into production and improvement cycles. In less than 80 years, we went from the initial Hewlett-Packard computer to the Internet and smartphones. After 60 years, neural networks are still in their infancy: easily tricked, brittle black boxes that rapidly produce lawsuits as people misunderstand what they're holding and apply it to increasingly complex problem sets. If we can't reliably use "AI" in radiology settings, how exactly are we expecting to use it reliably in uncontrolled environments?

CMU has been the preeminent researcher in this space for over 40 years now, and they are the originators of the modern solution to autonomous driving. So you can and should certainly debate what I've said. But when the most advanced and experienced research facility isn't confident in the solution, and they don't believe autonomy will be solved for decades still, you're going to have to argue with more than PR slides from a company seeking investors. You can all sit in this thread and try to tear down the incredible work that CMU has done, and the leading research they have produced, by comparing screenshots of Tesla's rendered UI to their UI from 1995, but that honestly just looks foolish. Their solution back then IS the parent of the solutions being used now. All founded on the same concepts, improved excruciatingly slowly over time. Much slower than any successful technology we've used in modern life.
 
The reason is the moron behind the wheel WHO WAS SOLELY RESPONSIBLE FOR OVERSEEING AP/FSD, wasn't.

What part of that don't you understand? I am so freakin' sick of this kind of bovine excrement that blames AP/FSD but ignores the fact the driver was given clear warnings that he/she is RESPONSIBLE. PERIOD.
Wow, you called me bovine excrement? Is that allowed in this forum? Cute how Internet tough guys appear across all forums.
 
... dealing with people directing traffic is entirely separate from road sign handling, and has to deal with lots of cases, including people pointing in a specific direction with flashlights, etc. That's likely the sort of situation where the car would have to stop and ask a human to tell it what to do...
This sub-thread, the discussion of unusual stop signs and multiple edge-case examples of humans directing traffic etc, underlines a relevant generalized question:

Does L4/L5 driving require the ability to understand and react to every possible scenario in the way that experienced humans do?

As in:
  • Every odd object appearing in the road (What is it? How bad is it to hit it? Is it worse to hit it than to run off the road or sideswipe the adjacent car or slam the brakes and get hit?)
  • Every odd sign (Is it important? Is it official? Is it a joke or an ad or a prank? How bad is it to guess wrong?)
  • Every interaction with humans through voice or hand signals (Are they talking to me? Are they officially directing, or impromptu assisting, or just angry? What are they trying to say? Is it someone in trouble, a hitchhiker, a panhandler, a protester, a carjacker?)
Probably most everyone here agrees that these judgments, difficult enough for humans, are beyond any AI that can't pass the Turing test 24/7. (If it could, I'd say that self-driving would be the least of its accomplishments.) And indeed some members predict widespread L4+ is at least decades away - based on AI being unlikely to mimic, much less surpass, human judgment in these cases.

Many others here (myself included) think L4 is doable fairly soon, and yet we continually get mired in discussions of challenging edge-case examples - which can never end if AV response must be human-like in nature yet unquestionably super-human in safety performance.

It's a logical trap, and the way out is to change the problem's boundary conditions. That's why I think it's clear that AVs must be given a leg up over past human-only systems, in some aspect(s) of their operational tasks.

Forgive me if the list is a little off, but it's usually understood that a self-driving system is some variant of:
Perception → Localization → Decision → Planning → Control
I don't hear too much debate about the feasibility of Localization*; the debate is more about the relative necessity of detailed pre-mapping. Robotically planning the trajectory and controlling its execution (once the decision has been made) are still imperfect, judging from Tesla FSD videos, but that isn't a matter of feasibility or of compute and machine-design limitations, and in my view they need no technical help beyond better-quality driving expertise encoded into the software.

So this leaves Perception and Decision as the most challenging Intelligence-based aspects, and the argument about AI's chance of success is mostly around edge cases there. These are the areas that could really use a leg up over humans, because we're very doubtful that AI is on the trajectory to solve them as well as adaptable humans, especially not to the desired mistake-free level that everyone wants.
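To keep the terminology straight, here's a bare-bones skeleton of that Perception / Localization / Decision / Planning / Control decomposition. Every stage is a placeholder stub, not any particular vendor's design:

```python
# Placeholder skeleton of the classic AV pipeline; every stage is a stub.
class SelfDrivingStack:
    def step(self, sensor_data, hd_map=None, v2x_messages=None):
        world = self.perceive(sensor_data)               # Perception: what is around me?
        pose = self.localize(sensor_data, hd_map)        # Localization: where am I?
        intent = self.decide(world, pose, v2x_messages)  # Decision: what should I do?
        trajectory = self.plan(intent, world, pose)      # Planning: exactly how do I do it?
        return self.control(trajectory)                  # Control: steering/throttle/brake commands

    # Perception and Decision are the stages singled out above as the hard,
    # edge-case-laden ones; the rest is comparatively well understood.
    def perceive(self, sensor_data): ...
    def localize(self, sensor_data, hd_map): ...
    def decide(self, world, pose, v2x_messages): ...
    def plan(self, intent, world, pose): ...
    def control(self, trajectory): ...
```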

Now, what could we do to advantage the machine in a way that could make up for its acknowledged AI deficiencies? Well, what if we could show it not just what's around it, but what's outside the view and/or hard-to-interpret perceptually, what's coming up or what to do right here at these cones?

I'm saying that a key leg up, quite realizable today, is V2X. (I'm certainly not the only one, but there's surprisingly low discussion and a lot of scoffing and that's why I'm posting again.)

Are you not sure in 0.01% of cases that thing there is a stop sign or not, or whether it applies to you at the moment? Well then 99.x% of those remaining uncertainties will be solved because it will tell you itself, and so will the other cars nearby. Not sure you can blaze through that shadowed overpass or tunnel? It can tell you and so can the cars that went through just now. How do I know if there's a little kid hidden between parked cars? Various other cars, including the parked ones, and maybe also the little safety beacon her mom clipped onto her jacket. Not by legal mandate please, but by common sense - the same as why the child is wearing the jacket without legal requirement.

And finally, I reiterate that this will not only serve to mitigate the AV edge-case challenges, but will, over a few years' time, dramatically reduce non-AV accidents as well, because cars, whether in AV or manual mode, will increasingly be equipped. Even if yours isn't, some and later many and later most will be, and that's huge. X2V on fixed infrastructure would happen even faster, and that's also huge.

*By the way, we already give AVs a leg up with GPS and mapping. But it's not controversial, simply because its use for driving started before any serious AV deployment. Despite the fact that it's not essential and we could and did drive without it for a century, it's a big help, an existing leg up that humans can use. So AFAIK no one on here is claiming that GPS and computer-mapping technology are unnecessary, imperfect, jammable, hackable or sometimes unavailable, even though it's arguably all of those.
 
A video game has a constrained set of rules, and the AI has a pixel-perfect view, and even input "under the hood".
So it can train on that very limited and specific data for specific scenarios (finishing a level that never changes).

Reality is extremely complex and ever-changing, so it needs to adapt. That's why self-driving is hard. You could probably train for a street in your neighborhood or a block, by running the car crashing over and over again till it gets it, given that you don't change the environment too much :D

That's right - in games they fail over and over again. But in reality you would cause damage for millions of dollars and endanger people, sending the car crashing in random spots over and over again till it manages to get through :D It will also be good ONLY for that block.

I also want to add:
Yes, Starcraft AI is impressive!

The reason why this works, which I did not make clear, is that the simulation is the same as the "reality". Literally 1:1.
I guess that Dojo will do something similar?