Firmware 9 in August will start rolling out full self-driving features!!!

I don't think that it's at all obvious that changing the input to a neural network will work all that easily. There's a large body of research around fooling neural networks with pretty small changes to images that in most cases humans don't notice, e.g. mistaking a turtle for a rifle.

Those images are intentionally crafted to fool the networks in that way. The researchers analyzed the network's weights and activations and basically reverse-engineered image perturbations that fool it.
 
Besides, there's not necessarily a need to create a point cloud or build a model. The whole point of using a massive neural network, as I understand it, is that it can learn that the relationship between where objects appear within different cameras' views gives an indication of distance, without needing to actually compute the depth for each pixel (which would be way more detail than is needed anyway).
That's correct. People are confused by the fact that we reconstruct a 3D image in our brains from our two eyes (normally), but that's not at all how the neural network in EAP/FSD works. It learns the association between the different views of the same object and what it is, how big it is, how far away it is, and where it's going, without doing any kind of "reconstruction" to create a 3D map.
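Roughly in that spirit, here is a toy sketch (purely illustrative, not Tesla's actual network; the camera count and layer sizes are made up) of a network that regresses distance directly from several camera views with no explicit 3D reconstruction step:

```
# Illustrative sketch only: a toy multi-camera distance regressor.
# It fuses features from several camera views and regresses distance directly,
# without ever building an explicit point cloud or depth map.
import torch
import torch.nn as nn

class MultiViewDistanceNet(nn.Module):
    def __init__(self, num_cameras=3):
        super().__init__()
        # Shared per-camera feature extractor (tiny CNN for illustration).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Fusion head: learns how the relationship between views encodes distance.
        self.head = nn.Sequential(
            nn.Linear(32 * num_cameras, 64), nn.ReLU(),
            nn.Linear(64, 1),  # predicted distance to the object, in meters
        )

    def forward(self, views):  # views: list of (B, 3, H, W) tensors, one per camera
        feats = [self.backbone(v).flatten(1) for v in views]
        return self.head(torch.cat(feats, dim=1))

# Usage: three hypothetical camera crops of the same object.
net = MultiViewDistanceNet(num_cameras=3)
crops = [torch.randn(1, 3, 64, 64) for _ in range(3)]
print(net(crops).shape)  # torch.Size([1, 1])
```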
 
You need to know precisely what it is so you can respond to it accurately. It's vital that you neither overreact nor underreact.

I'm just highlighting this because not over-reacting is an aspect of the problem that's routinely under-appreciated. This is really at the core of what it takes to go from an impressive demo to a real live L3+ system. It's not just "can you recognize objects and plan a path?". That's what gets you to a demo. What gets you to the real thing is, in the case of recognition, high precision and high recall (meaning very close to 0% false negatives while also close to 0% false positives -- very difficult to achieve both at the same time). And in the case of path planning, it means being able to navigate unusual situations, such as emergency workers directing traffic, objects falling off the vehicle in front of you, poor traction, etc.
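For concreteness, here are the standard definitions computed from raw detection counts (the numbers are made up):

```
# Precision/recall from detection counts (illustrative numbers only).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Getting both near 1.0 means false positives AND false negatives near zero;
# tightening a detection threshold usually trades one for the other.
print(precision(tp=990, fp=10))  # 0.99
print(recall(tp=990, fn=10))     # 0.99
```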

Tesla's current capabilities are in the realm of impressive demo, and very, very far from what I describe above. You can see this in the high rate of phantom braking, which has gotten worse rather than better as they have added capabilities to the system. This phenomenon of getting higher false positives as you try to increase your capabilities is very common and very difficult to deal with. And they're not even working on the truly hard things -- they're still struggling with stopped cars in the road.
 
  • Like
Reactions: OPRCE
But for cameras, speed doesn’t decrease resolution.

Speed determines how far ahead you need to look. If you're going 5 m/s you don't need to see more than 10 m away because you can stop on a dime. If you're going 35 m/s (roughly 80 mph / 125 km/h) you need to be looking 100 m ahead at a bare minimum, preferably something closer to 200 m. (Note that the relationship between speed and braking distance is non-linear, because kinetic energy is proportional to the square of velocity.)
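As a rough back-of-the-envelope check (the deceleration and reaction-time numbers below are assumptions for illustration, not measured values):

```
# Stopping distance = reaction distance + braking distance.
# Assumed values: 1.0 s reaction time, 5 m/s^2 fairly hard deceleration.
def stopping_distance(v, reaction_time=1.0, decel=5.0):
    return v * reaction_time + v**2 / (2 * decel)

print(stopping_distance(5.0))   # ~7.5 m at 5 m/s
print(stopping_distance(35.0))  # ~157.5 m at 35 m/s (~80 mph)
```

The quadratic braking term dominates at highway speed, which is why the required look-ahead grows so much faster than the speed itself.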

The whole point of using a massive neural network, as I understand it, is that it can learn that the relationship between where objects appear within different cameras' views gives an indication of distance, without needing to actually compute the depth for each pixel (which would be way more detail than is needed anyway).

The problem with deep learning used this way is that the system will only learn things that are in your data. If it encounters something not present in your data, the way it reacts will be essentially random, or at least not remotely guaranteed to be safe. Practical L4+ systems must have some kind of depth map, independent of object classification, to be able to avoid hitting objects that either (a) look nothing like anything in their training set, or worse yet (b) look like things in their training set from certain angles but are actually very different in scale or 3D shape, or even (c) are things in their training set that appear in a novel context (novel w.r.t. the training set; context could mean a strange angle or a strange place, e.g. a fire hydrant lying on its side in the middle of the road rather than standing along the side of it). All of these can lead to very unpredictable behavior from a deep learning system. This is the biggest challenge of end-to-end deep learning approaches.

On the other hand, if you have a depth map from lidar or stereo, you can build a system that says "I have no idea what that is, but I know exactly where it is and I'm not going to hit it". This handles the cases where deep learning classification fails.
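A minimal sketch of that idea, assuming you already have a point cloud in the vehicle frame (x forward, y left, z up); the corridor dimensions are arbitrary placeholders:

```
import numpy as np

# "Don't hit it even if you can't classify it": flag any returns inside a
# corridor ahead of the vehicle, regardless of what the object is.
def obstacle_in_corridor(points, length=60.0, half_width=1.5, min_height=0.3):
    """points: (N, 3) array in vehicle frame, x forward, y left, z up (meters)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    in_corridor = (x > 0) & (x < length) & (np.abs(y) < half_width) & (z > min_height)
    return bool(np.any(in_corridor))

# Hypothetical cloud: a single unclassified object 25 m ahead, 1 m tall.
cloud = np.array([[25.0, 0.2, 1.0]])
print(obstacle_in_corridor(cloud))  # True -> brake or steer, whatever it is
```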

If that's wrong, then I'm pretty sure Tesla's FSD plan is pretty much doomed.

Yeah, exactly.
 
The technique, yes. But real-time monocular SLAM has never been used in anything even approaching self-driving cars by anyone. The technique is still very experimental for that purpose, and it also relies on things like having a huge database of recognized objects to guess how rigid objects are, which makes it presumptively unsafe in the context of self-driving cars that have to make the right decision every time, instantly.

Doing it that way would be beyond crazy.


Everything is experimental with self-driving cars. If you watch the EAP videos of the camera feed with bounding boxes, the technique they use works pretty well. And they use deep learning.

As a counterexample, consider the dual-camera setup on an iPhone. That uses two cameras with different focal lengths to generate a depth map every time you take a picture. There's no requirement that stereoscopic cameras have similar focal lengths.
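For reference, the geometry behind a two-camera depth map is just the classic disparity relation; the focal length and baseline below are made-up illustrative numbers, not actual iPhone or Tesla specs:

```
# Depth from stereo disparity: Z = f * B / d
# f: focal length in pixels, B: baseline between the cameras in meters,
# d: disparity in pixels of the same point in both images. (Assumed numbers.)
def depth_from_disparity(disparity_px, focal_px=1000.0, baseline_m=0.012):
    return focal_px * baseline_m / disparity_px

print(depth_from_disparity(4.0))  # 3.0 m: a tiny baseline is only useful up close
```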

It is, indeed, computationally expensive, but remember that you don't need a complete, precise map of the entire scene — only a rough approximation, and only in areas of interest (read "things that are not obviously the road surface"). Besides, there's not necessarily a need to create a point cloud or build a model. The whole point of using a massive neural network, as I understand it, is that it can learn that the relationship between where objects appear within different cameras' views gives an indication of distance, without needing to actually compute the depth for each pixel (which would be way more detail than is needed anyway).

If that's wrong, then I'm pretty sure Tesla's FSD plan is pretty much doomed.

So... then you don't really need stereo cameras. Would be nice with extra cameras for redundancy though.
 

Strictly, you are correct. But in practice, if you don't have images from multiple angles at all times, the neural network would need some sort of additional hardware to give it depth information, either in the form of hardware to compare subsequent frames in a computationally expensive way or in the form of LIDAR.
 
You compare consecutive frames: extract features from each frame, match them, and then build the 3D projection.
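That pipeline (detect features, match them across consecutive frames, recover the relative camera motion, triangulate) can be sketched with standard OpenCV calls. This is a generic two-frame structure-from-motion sketch under a rigid-scene assumption, not anyone's production system, and the intrinsics matrix K is assumed to be known:

```
import cv2
import numpy as np

def sparse_points_from_consecutive_frames(img1, img2, K):
    """Generic two-frame structure-from-motion sketch (assumes a rigid, static scene).
    img1, img2: consecutive grayscale frames; K: 3x3 camera intrinsics."""
    # 1. Detect and describe features in each frame.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # 2. Match features between the frames.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # 3. Recover relative camera motion (only up to an unknown scale -- monocular).
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # 4. Triangulate matched points into a sparse 3D reconstruction.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    return (pts4d[:3] / pts4d[3]).T  # (N, 3) points, scale unknown
```

Note the two big caveats baked into the sketch: the motion is only recovered up to an unknown scale, and the whole thing assumes the scene is static, which is exactly what the next reply objects to.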

 
This only works if the objects you're measuring distance to are not themselves moving. If they're moving, and you don't know their exact trajectory already, you can't separate your motion from theirs. You can't even say for sure whether they're moving or not.

Seriously, this stuff is hard enough as it is. Why handicap yourself by trying to do without the best tools available?
 
It knows. Read up on the subject!
It works in current software, don't you think? This is how EAP and FSD work.
 
No, EAP works because of radar as a second validation, as proven by the case where the car ahead swerves out of the lane to avoid a stationary object: AP will ram right into the back of it, because the radar is blind to stationary objects. And what FSD? ;)

Hehe no, that's not how it works. How does it see and classify cars coming alongside, outside the ultrasonic sensors' range? There are no radars on the sides or behind the car ;)

 
Seeing something, reacting to it, and "trusting" it are not the same thing to me. Yes, Tesla is clearly on the path, but they clearly don't trust it, and it's certainly not 100%; watch the shopping-cart footage and the other missed objects. There has to be a reason they still disable ALC on local roads when AP1 has been doing that for years. EAP really only lets you move forward; you still need to trigger left/right moves yourself, so we still don't have the E in EAP as far as I'm concerned. Oh, and if their cameras are so good and they trust them so much, how come you get zero AP functionality when it snows and the radar gets caked? To me it's because they need that radar as a second source to "drive" with.
 
  • Disagree
  • Like
Reactions: 1375mlm and emmz0r
It knows. Read up on the subject!
It works in current software, don't you think? This is how EAP and FSD work.

It does not "know". Advanced systems can sort of infer relative motion given the context of the global scene, but this is always less reliable than direct distance measurement from lidar or simultaneous stereo. You can't simply create information from scratch, you can only guess at it. Your guesses might be very good in many cases, but having the information directly is both more reliable and less computationally intensive.

As for how the current software works, I can see very well that it has great difficulty estimating the exact 3d position of neighboring cars. They're trying to do this and not doing nearly well enough. For example, a lot of TACC brake checks that I've experienced in V9 result from a car slightly ahead of you in a neighboring lane suddenly being determined to be in your lane because it is getting the 3d position wrong -- some combination of inaccurate bounding box and inaccurate distance guess. I suspect their distance guesses are a combination of apparent size and temporal stereo from consecutive frames, probably just by throwing everything into the NN and letting it guess from whatever information was found to be most useful in the training data, meaning it will take both size and inter-frame disparity into account, plus all sorts of context that you can't even put your finger on and may actually be unhelpful (e.g., lighting conditions or the color of the car), because that's how deep learning works.

Note that both 2d bounding boxes and depth information are always guesses in a system like this. Even if they get to pixel-level labeling with masks instead of bounding boxes, it's still a guess and will sometimes be wrong. (Note that masks would also require a lot more compute power -- presumably a lot of HW3's extra power will be used for this rather than increasing frame rate.)

With more time and particularly with more computing power I think their guesses will get better over time, but direct measurement of 3D extents from lidar or stereo would clearly be superior. Tesla has handicapped themselves by not including any kind of practical direct rangefinding. (And obviously ultrasonic is a joke at highway speeds, even if you call it "sonar" and brag about its 360-deg coverage, so that does not help them much.)

Even better is multiple sensing modalities acting together, like camera + stereo + lidar + radar (+ massively more compute power), like the big boys have. Failure modes of one modality are balanced by the others. Tesla does not care to do this for real, though, they want to sell sexy cars to consumers right now, and using a better sensor suite would kill their chances of doing that in any kind of volume. (Less sexy + more expensive = business fail in the short term.)

Edit: Note also, on the subject of inter-frame disparity (temporal stereo): the most important objects to estimate distance to very accurately are neighboring cars, which will generally be moving very close to your speed, so the inter-frame disparity will basically just be noise. (In my normal commute conditions, adjacent lanes are often moving at nearly identical speeds. When there are speed differences, they are very small relative to your own speed, which gives temporal stereo a really hard time.) They really only have apparent size to go on in this case, and I think that's why TACC in V9 brake-checks a lot more frequently; you can see it in the dancing cars on the display.
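To put rough numbers on that (all of these values are assumptions for illustration: 1 m/s relative speed, 30 fps, a focal length of about 1000 pixels, a 1.8 m wide car 30 m ahead):

```
# How much does a neighboring car's apparent size change between frames?
fps, rel_speed, focal_px, distance, car_width = 30.0, 1.0, 1000.0, 30.0, 1.8

width_now  = focal_px * car_width / distance                      # ~60 px
width_next = focal_px * car_width / (distance - rel_speed / fps)  # one frame later
print(width_now, width_next - width_now)  # ~60 px now, change of ~0.07 px per frame
```

A sub-pixel change per frame is easily swamped by bounding-box jitter, which is why inter-frame disparity is mostly noise for cars moving near your own speed.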
 
It does not "know". Advanced systems can sort of infer relative motion given the context of the global scene, but this is always less reliable than direct distance measurement from lidar or simultaneous stereo. You can't simply create information from scratch, you can only guess at it. Your guesses might be very good in many cases, but having the information directly is both more reliable and less computationally intensive.

As for how the current software works, I can see very well that it has great difficulty estimating the exact 3d position of neighboring cars. They're trying to do this and not doing nearly well enough. For example, a lot of TACC brake checks that I've experienced in V9 result from a car slightly ahead of you in a neighboring lane suddenly being determined to be in your lane because it is getting the 3d position wrong -- some combination of inaccurate bounding box and inaccurate distance guess. I suspect their distance guesses are a combination of apparent size and temporal stereo from consecutive frames, probably just by throwing everything into the NN and letting it guess from whatever information was found to be most useful in the training data, meaning it will take both size and inter-frame disparity into account, plus all sorts of context that you can't even put your finger on and may actually be unhelpful (e.g., lighting conditions or the color of the car), because that's how deep learning works.

Note that both 2d bounding boxes and depth information are always guesses in a system like this. Even if they get to pixel-level labeling with masks instead of bounding boxes, it's still a guess and will sometimes be wrong. (Note that masks would also require a lot more compute power -- presumably a lot of HW3's extra power will be used for this rather than increasing frame rate.)

With more time and particularly with more computing power I think their guesses will get better over time, but direct measurement of 3D extents from lidar or stereo would clearly be superior. Tesla has handicapped themselves by not including any kind of practical direct rangefinding. (And obviously ultrasonic is a joke at highway speeds, even if you call it "sonar" and brag about its 360-deg coverage, so that does not help them much.)

Even better is multiple sensing modalities acting together, like camera + stereo + lidar + radar (+ massively more compute power), like the big boys have. Failure modes of one modality are balanced by the others. Tesla does not care to do this for real, though, they want to sell sexy cars to consumers right now, and using a better sensor suite would kill their chances of doing that in any kind of volume. (Less sexy + more expensive = business fail in the short term.)

You don't need millimeter accuracy either, or else we couldn't drive cars ourselves. If we humans worked like a lidar, our heads would spin 360 degrees above the sunroof every second :D Imagine how that would look.

That "jerkiness" comes from a lack of adequate filtering, and will get better.
 
  • Like
Reactions: 1375mlm and MarkS22
Saying that I claim that you need millimeter accuracy (which not even lidar gives you in the real world) is a strawman argument. I never said that. You need to know whether a car is in your lane (or rapidly moving toward your lane) or not. V9 cannot do this reliably, a fact my car reminds me of at least once every day during my daily commute.

I never said anything about how humans work. Another strawman argument. Humans work very differently than Teslas and have vastly more computing power available, and a much more sophisticated, refined visual system and reasoning capability.

When you add filtering to address "jerkiness" you inevitably reduce reaction time. This is the nature of filtering. Sometimes the rapid changes are real, and if "side collision avoidance" is a real feature (it's not), then they cannot filter aggressively, because they need that low reaction time.
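A tiny sketch of that trade-off with a plain exponential moving average (the alpha values and input signal are made up):

```
# Exponential moving average: smaller alpha = smoother output, but slower reaction.
def ema(signal, alpha):
    out, y = [], signal[0]
    for x in signal:
        y = alpha * x + (1 - alpha) * y
        out.append(y)
    return out

# A "car suddenly cuts in" step: estimated lateral offset jumps from 0 to 1.
step = [0.0] * 5 + [1.0] * 10
print([round(v, 2) for v in ema(step, alpha=0.8)])  # tracks the jump within ~2 frames
print([round(v, 2) for v in ema(step, alpha=0.2)])  # smooth, but lags many frames behind
```

The smoother curve is also the slower one; that lag is the lost reaction time.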

Note that from my experience, this is exactly what AP is currently doing -- filtering the heck out of everything other than radar. It reacts quickly to radar and very very slowly to any information coming from the cameras. This is not going to be an easy problem for them to fix -- I think they're already filtering more than is wise leading to frequent takeovers due to slow reaction time to, e.g., merging vehicles. If this were an L3 system it would not be able to rely on the driver for rapid takeover and therefore would need to do much less aggressive filtering.

Turns out FSD is harder than driver assistance. Who knew?
 
  • Like
Reactions: kavyboy
Saying that I claim that you need millimeter accuracy (which not even lidar gives you in the real world) is a strawman argument. I never said that. You need to know whether a car is in your lane (or rapidly moving toward your lane) or not. V9 cannot do this reliably, a fact my car reminds me of at least once every day during my daily commute.

Not strawman, just a figure of speech.

I never said anything about how humans work. Another strawman argument. Humans work very differently than Teslas and have vastly more computing power available, and a much more sophisticated, refined visual system and reasoning capability.

Well, you dragged LIDAR into the discussion, and the philosophy behind Tesla's design is that "a human can see, so cameras should be enough", roughly speaking.

When you add filtering to address "jerkiness" you inevitably reduce reaction time. This is the nature of filtering. Sometimes the rapid changes are real, and if "side collision avoidance" is a real feature (it's not), then they cannot filter aggressively, because they need that low reaction time.

C'est la vie if it's physically impossible to react. You need filtering. I think it's possible to have both good enough reaction time and adequate filtering.

Note that from my experience, this is exactly what AP is currently doing -- filtering the heck out of everything other than radar. It reacts quickly to radar and very very slowly to any information coming from the cameras. This is not going to be an easy problem for them to fix -- I think they're already filtering more than is wise leading to frequent takeovers due to slow reaction time to, e.g., merging vehicles. If this were an L3 system it would not be able to rely on the driver for rapid takeover and therefore would need to do much less aggressive filtering.

Turns out FSD is harder than driver assistance. Who knew?

That might be the case. We'll see what they do, but I wouldn't dismiss the entire scheme.
 
  • Like
Reactions: 1375mlm
The cameras-only scheme will work eventually. Humans are proof that it can work. But it won't work in this decade. I'm speculating wildly, but next decade there's a good chance of it working, particularly toward the end of next decade. But it will inevitably come after the non-handicapped approaches based on better sensor suites and more compute power are released. If Tesla insists on sticking to a camera-based system, they will be very late to the L3/L4 game. They will no longer be relevant, as there will already be fleets of more reliable L4 vehicles from other manufacturers operating in major markets. Remember that they can do everything Tesla can do -- temporal stereo, structure from motion, optical flow, whatever -- and a lot of things Tesla cannot. (They are not likely to bother with temporal stereo or optical flow since they have simultaneous stereo and lidar.)

My general point is that this decision (no lidar, only forward radar, rather poor camera suite with small lenses and non-ideal positioning, no stereo) is driven primarily by Tesla's short-term constraints: The vehicle must be both sexy (no bulky lidar or large camera lenses) and inexpensive to produce. It is not driven by the best engineering approach to solving L4 autonomy problems. The best engineering approach, absent these constraints, is clearly as many sensors and as much compute power as you can squeeze onto the car -- several lidars, many radars, not just one but several stereo pairs, nice cameras with large lenses that capture a lot of light with low distortion (incl. thermally-induced distortion, and drift over time), etc.

There is simply no way Tesla could possibly do all that and still sell cars right now. And the consequences of the decision to move forward with this handicap are very clear -- EAP still incomplete and no FSD at all 2 years after they started selling these features. "FSD" now looks likely to be rebranded as an L2+ system or maybe a very limited highway L3 system (after the HW3 upgrade) and L2 on local roads. Certainly not L5.
 
  • Like
Reactions: OPRCE and emmz0r