
Neural Networks

I didn't just find it interesting; it's also valuable. What's most important from my view is the potential of aknet_v9 and how that potential can actually come to fruition with respect to Tesla's promised EAP/FSD features. For some time now I've been convinced that Tesla didn't truly have a path forward to realize those capabilities and was going to capitulate (especially after removing FSD from the purchasable options on new cars). Your analysis tells quite a different and encouraging story.
 
The two frames in each pair are likely time-offset by some small delay - 10ms to 100ms I’d guess - allowing each processed camera input to see motion. Motion can give you depth, separate objects from the background, help identify objects, predict object trajectories, and provide information about the vehicle’s own motion. It's a pretty fundamental improvement to the basic perception of the system.
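To make that concrete, here's a toy sketch (my own illustration, not anything pulled from Tesla's code - the shapes and layer sizes are made up) of what feeding a time-offset frame pair into the first layer of a CNN could look like:

```python
import torch
import torch.nn as nn

# Hypothetical illustration only: two time-offset frames from the same camera,
# stacked along the channel axis so the first conv layer can "see" motion.
frame_t      = torch.randn(1, 3, 416, 640)   # frame at time t      (N, C, H, W)
frame_t_prev = torch.randn(1, 3, 416, 640)   # frame at t - ~30 ms

pair = torch.cat([frame_t_prev, frame_t], dim=1)   # shape (1, 6, 416, 640)

# A first-layer convolution that accepts the 6-channel pair instead of a single
# 3-channel image; each kernel can now mix pixels across time as well as space.
stem = nn.Conv2d(in_channels=6, out_channels=64, kernel_size=7, stride=2, padding=3)
features = stem(pair)
print(features.shape)   # torch.Size([1, 64, 208, 320])
```

The point is just that once both frames pass through the same convolution, the very first layer of filters can respond to change between them, not only to appearance within one frame.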

Sorry for the lateness of the question, just getting caught up with this thread.

I have seen it theorized elsewhere that a sampling scheme like this could be used for stereoscopic vision. I find that extremely dangerous, with a high likelihood of failure. The part here that interests me is the inference of motion.

I am no neural network specialist but it seems to me that using such an approach would create a needlessly bloated CNN. Why would this be used instead of a post-CNN RNN for the time series component? The only thing I can think of is ease of training at the expense of compute, both training and inference. It certainly would not be the first time compute is sacrificed for quicker results.
 
Sorry for the lateness of the question, just getting caught up with this thread.

I have seen it theorized elsewhere that a sampling scheme like this could be used for stereoscopic vision. I find that extremely dangerous, with a high likelihood of failure. The part here that interests me is the inference of motion.

I am no neural network specialist but it seems to me that using such an approach would create a needlessly bloated CNN. Why would this be used instead of a post-CNN RNN for the time series component? The only thing I can think of is ease of training at the expense of compute, both training and inference. It certainly would not be the first time compute is sacrificed for quicker results.

Well, stereoscopic vision would require two cameras set up for it, which Tesla doesn't have. Maybe if you were really clever you could use two of the front cams, cropping the wider-angle of the two, adjusting for lens geometry differences, etc., and try to stereoscope that, but... yeah... Clearly that's not what is happening if they're running two frames on every camera...
 
Sorry for the lateness of the question, just getting caught up with this thread.

I have seen it theorized elsewhere that a sampling scheme like this could be used for stereoscopic vision. I find that extremely dangerous, with a high likelihood of failure. The part here that interests me is the inference of motion.

I am no neural network specialist but it seems to me that using such an approach would create a needlessly bloated CNN. Why would this be used instead of a post-CNN RNN for the time series component? The only thing I can think of is ease of training at the expense of compute, both training and inference. It certainly would not be the first time compute is sacrificed for quicker results.

The rationale for processing two frames in the first level vision neural network is based on much more than just stereoscopic vision. Based on @jimmy_d 's excellent podcast interview, big benefits are improved object recognition (determining what the object is), segmentation (determining which part of the 2D image is part of the object), and bounding box estimation (3D dimensions of object). As an analogy, a camouflaged animal can be nearly impossible for a human to see while stationary, let alone identify and draw a box around. But the same animal is very easy to see when moving relative to the background. Because this information helps identify the object in the first place, it has to be incorporated in the first processing step (vision neural net). Once you are capturing two frames in your neural net, of course you might as well also extract relative velocity information in the x,y plane (right, left, up, down), and possibly some rough velocity in the z axis (towards or away from you, based on change in apparent size). If these vision performance boosts are significant, then the increase in neural net size and processing requirements would not be "needless bloat".

Finally, getting to binocular vision, I agree with your point that it's dangerous and unreliable to rely just on relative motion for depth estimation. But that's exactly why this estimation is most useful when done as part of a neural network that is incorporating an array of visual information (and maybe other things like vehicle velocity and steering wheel angle). It appears that Tesla's vision system was already estimating distance based on a single frame, the same way we could estimate the distance of an object in a photo. If the training dataset is sufficient, adding motion (two frames) certainly shouldn't degrade this distance estimation, and should significantly improve it in certain situations. The clearest example would be stationary objects that are not directly in front of the vehicle when the vehicle is moving. Relative movement between foreground and background objects should provide very good distance estimates.
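To put a rough number on why depth from motion favors off-axis objects, here's a back-of-the-envelope sketch (my own toy geometry, not anything extracted from the network):

```python
import math

def depth_from_motion_parallax(speed_mps, bearing_rad, bearing_rate_rad_s):
    """Rough range estimate for a STATIONARY object seen from a moving camera.

    Illustrative geometry, not Tesla's algorithm: if the camera translates at
    speed v and a fixed point sits at bearing theta off the motion axis, the
    bearing changes at rate d(theta)/dt = v*sin(theta)/Z, so Z = v*sin(theta)/rate.
    """
    if abs(bearing_rate_rad_s) < 1e-6:
        return float("inf")          # no measurable parallax -> no depth
    return speed_mps * math.sin(bearing_rad) / bearing_rate_rad_s

v = 20.0                                                          # ~45 mph
print(depth_from_motion_parallax(v, math.radians(30.0), 0.40))    # ~25 m, well conditioned
print(depth_from_motion_parallax(v, math.radians(1.0), 0.014))    # ~25 m too, but the bearing
# rate is so tiny that sensor noise dominates - depth from motion is weak straight ahead.
```

The sin(theta) term is the whole story: parallax is rich off to the sides and vanishes straight ahead, which is exactly where you'd want stereo or radar to pick up the slack.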

Well, stereoscopic vision would require two cameras set up for it, which Tesla doesn't have. Maybe if you were really clever you could use two of the front cams, cropping the wider-angle of the two, adjusting for lens geometry differences, etc., and try to stereoscope that, but... yeah... Clearly that's not what is happening if they're running two frames on every camera...

I think your description of how Tesla could get stereoscopic vision from the main and narrow cameras is exactly what they're doing! I would guess that Tesla is using forward binocular vision right now in their AP2 hardware, although it may happen at a level after the visual neural net. If an object identified by the narrow camera is also identified by the main camera, it is relatively straightforward math to estimate distance knowing the separation between cameras, without needing a neural net, as long as the cameras are properly calibrated to get a precise angular direction from a location in the 2D image. If they are also getting distance from movement (2 frames on every camera), that wouldn't replace binocular forward depth perception; it should complement it quite well! The stereoscopic forward vision you describe would work only in the field of view of the narrow forward camera, whereas depth from motion works best off the axis of vehicle motion (i.e. everywhere except directly ahead and behind).
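For anyone curious what that "straightforward math" looks like, here's the textbook pinhole-stereo relation in a few lines (the baseline and focal length below are invented for illustration; I have no idea what Tesla's actual optics are):

```python
def stereo_depth(baseline_m, focal_px, disparity_px):
    """Classic pinhole-stereo range estimate: Z = f * B / d.

    baseline_m   -- separation between the two camera centers
    focal_px     -- focal length expressed in pixels
    disparity_px -- horizontal shift of the object between the two images
    All numbers below are made up for illustration; they are not Tesla's optics.
    """
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px

# e.g. a ~2000 px focal length and cameras a few cm apart behind the windshield:
for d in (40.0, 10.0, 2.0):
    print(f"disparity {d:5.1f} px -> range {stereo_depth(0.10, 2000.0, d):6.1f} m")
# Range resolution collapses at small disparities (distant objects), which is one
# reason a short baseline limits how far out stereo depth stays useful.
```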
 
Just realized something. Unless you are running a really new VR rig, all video games are played without the benefit of stereoscopic vision. The TV/monitor is flat, and all depth cues are interpreted via perspective and occlusion...

I wonder if the increase in graphics power/resolution is what allowed the migration from third-person 2D side-scrollers and top-down views to first-person shooters.
 
It could just be the fit and finish of the computational hardware, or the brackets, trim, and instructions necessary to replace it, since this is going to be a properly instructed mass-replacement effort that should be given the best chance of good results at scale. The fine level of detail needed for such instructions often demands more precision than the words themselves can carry. (One would hope these aren't Ikea-style instructions written by translators, marketing, lawyers, a corporate committee, and ESL students, in backwards sequential order, and thus also out of date, incomprehensible, and confusing.)
It could also just be an "instrumented" regular car that is used to determine whether electrical power consumption, thermal cooling, etc. all work as expected. "Test rigs" are built all the time to see if a component works when integrated into other systems...
 
Sorry for the lateness of the question, just getting caught up with this thread.

I have seen it theorized elsewhere that a sampling scheme like this could be used for stereoscopic vision. I find that extremely dangerous, with a high likelihood of failure. The part here that interests me is the inference of motion.

I am no neural network specialist but it seems to me that using such an approach would create a needlessly bloated CNN. Why would this be used instead of a post-CNN RNN for the time series component? The only thing I can think of is ease of training at the expense of compute, both training and inference. It certainly would not be the first time compute is sacrificed for quicker results.

Of course with nothing more to go on than the vehicle's overall behavior and some architectural details it's hard to know exactly what advantages dual frame inputs are providing to the system.

Neural networks are full of surprises right now. They work unreasonably well on some things and then proceed to fail for obscure reasons on other things. A lot of development today is trial, discovery, and optimization with theory being quite weak. That means it's possible to look at a novel system and, having no knowledge of the experience which led the development team to it, completely misunderstand the point of a particular feature.

That said - aknet_v9 does include dual frame inputs for each camera. Each of these frames is a complete output from the camera - they are not color or dynamic range variants. Could one of the frames be the result of some kind of algorithmic processing that provides a functional transformation of the camera output that would be difficult for a CNN to discover on its own? That's not an unreasonable speculation, but it's very hard to come up with a good candidate that would make sense to convolve against the pixels of the camera input (which is what the low levels of the aknet_v9 CNN are doing). By far the most likely kind of frame pair that would make sense to process via pixel convolution is a pair of frames with a small time offset or a small FOV offset. Since these pairs come from the same fixed-lens, single-sensor camera it can't be the latter, so I'm going with the former. That's the basis of my speculation.

But it makes a lot of sense to go that route because there's a lot of value to be had from a time-offset pair if your NN has enough capacity (and aknet_v9 certainly seems to have a lot of capacity). The accuracy of all the existing outputs should improve in a meaningful way, and new outputs become possible as well - things like instantaneous relative motion. These new outputs are exactly the kind of things which are very useful for determining state in a dynamic environment.

As well, by bringing some of the motion-sensing capability down to the lowest levels of the perception system, it should become possible to reduce the latency of decision making, since the need for slower, higher-level processing to accumulate data over time is reduced.

A human can respond almost instantly to motion detected in peripheral vision because we have motion sensing embedded directly into the retinas of our eyes. There is no need for a signal to spend hundreds of milliseconds propagating all the way up the brain's visual hierarchy for us to perceive that an object is moving relative to other visually adjacent objects. It may be that aknet_v9 is bringing a similar capability to autopilot.
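A toy example of why having the frame pair available at the lowest layer matters (purely illustrative, not aknet_v9's actual filters): even the dumbest possible "kernel", a +1/-1 weighting across the two frames, already yields a per-pixel motion map, much like the retina's local change detectors.

```python
import numpy as np

# Toy illustration (not Tesla's network): with a two-frame input, even a fixed
# 1x1 "kernel" with weights [+1, -1] across the time axis computes a per-pixel
# motion map, the same way retinal circuits flag local change.
h, w = 120, 160
frame_prev = np.zeros((h, w), dtype=np.float32)
frame_curr = np.zeros((h, w), dtype=np.float32)
frame_prev[40:60, 40:60] = 1.0     # a bright square...
frame_curr[40:60, 44:64] = 1.0     # ...that has shifted 4 px to the right

motion = frame_curr - frame_prev   # weights (+1, -1) applied across the two frames
print("pixels flagged as moving:", int(np.count_nonzero(motion)))
# A learned CNN can of course do far more than differencing, but this shows why
# motion information becomes available at the very first layer once frame pairs
# are the input, instead of only after slower higher-level tracking.
```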
 
It's also important to understand which audience Jimmy knew he was speaking to - laypeople and Tesla fans. Just listen to the interviewer's questions/comments; this guy didn't even know what Jimmy was talking about when he brought up @DamianXVI and @verygreen 's videos. (Not holding it against him, I just think it's telling of what audience Jimmy had for his talk.)

Jimmy deserves huge cred for daring to take on the challenge of letting himself be interviewed in this context. I think he did an awesome job, especially considering he's one dude and not some billion-dollar company executive with his own PR staff.

You're making me blush here...
 
Each of these frames is a complete output from the camera - they are not color or dynamic range variants. Could one of the frames be the result of some kind of algorithmic processing that provides a functional transformation of the camera output that would be difficult for a CNN to discover on its own?

Based on the fact that there are several generations of camera in AP2, it wouldn't surprise me if there is an intermediate step to "normalize" the feed for the NN, rather than sending it a raw feed.

Having a second frame with transformation data in it would mean the NN would have to be trained in two very different skills... so I am wondering if having the NN process frames in pairs could be a technique to increase confidence rather than for spatial positioning? As you suggest, it must be easier to determine the spatial data outside of the NN. But, with Elon's famous "first principles" approach, nothing is off the table...!
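As a purely speculative sketch of what such a "normalization" step might look like (the target resolution and sensor bit depth here are assumptions of mine, not anything known about AP2):

```python
import numpy as np

def normalize_frame(frame, target_hw=(416, 640)):
    """Speculative sketch of a "normalization" step between camera and NN.

    Different AP2 camera revisions could differ in resolution and exposure; a
    simple pre-processing pass that resizes to a common size and standardizes
    intensities would let one network serve them all. Shapes and constants are
    assumptions for illustration only.
    """
    h, w = frame.shape[:2]
    th, tw = target_hw
    # nearest-neighbour resize via index sampling (no external dependencies)
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    resized = frame[rows][:, cols].astype(np.float32)
    # per-frame standardization so exposure differences wash out
    return (resized - resized.mean()) / (resized.std() + 1e-6)

raw = np.random.randint(0, 4096, size=(960, 1280), dtype=np.uint16)  # e.g. a 12-bit sensor
net_input = normalize_frame(raw)
print(net_input.shape, round(float(net_input.mean()), 3), round(float(net_input.std()), 3))
```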
 
@jimmy_d

Thank you for your analysis. I just listened to your podcast with Rob (Tesla Daily) and have been re-reading this thread. Your easy-to-understand explanations have begun (baby steps) to pique my interest and make me want to dive in more deeply.

Can you roughly compare the capabilities of AKNET_V9 vs humans?
How do the two compare as far as:
1. amount of raw visual input, 9+ sensors vs two eyes (or even just one!)
2. the level of recognition/labeling
3. decision making capability
4. failure rate

I would imagine:
- the machine is way ahead for 1. vision and 2. recognition, by 20x.
- decision making capability is roughly equal for the next few years (in the limited L2 autonomy cases).
- the machine has more (2x?) success, primarily because it stays alert 100% of the time (DDD).

The fact that folks can drive through fog on a snowy road at night with one eye is just astounding.
On the other hand, 40K auto fatalities per year rest primarily on us.
The road to Level 5 autonomy, over the next thirty years, is going to be very fun to witness!

That's a good question and I agree with all of your statements. I wish I had a good answer for you. Instead here's my bad answer:

I don't have failure rate numbers but you can get useful information by driving the car and getting experience with how it behaves. What you learn won't be numbers but it will probably be more useful to you than numbers if what you want to understand is, how good is AP and is it getting better. You'll find things that are impressively good and you'll find stuff that it can't do well.

You can also get some sense of what the camera networks themselves produce by observing @verygreen 's excellent video:


In the video you get some sense of labels, continuity, error types and so forth.

There are obvious parallels between what the AP2 NN does and what a human does when driving a car. And because the AP2 NN is being built and trained by humans a lot of human intuition about how driving happens will be in the system. For instance, humans understand the world in terms of objects and motion over time so we build those notions into what we ask the NN to do by training it to identify objects and segment the world into different categories of stuff and then doing it repeatedly to see how things are changing. That makes it easier for us to understand what the NN is doing when it comes time to debug it, or to enhance it.

But the fundamental way that an NN processes information differs in really important ways from the way that a human does. That makes comparison really difficult except in vague terms that might not actually tell you which one is a better driver. There are going to be ways that the car just wins. For instance, the car can look in all directions simultaneously so merging into complex traffic flows is eventually going to be a lot easier for the car than it is for a human who has to move their attention between multiple important elements. But there are ways the humans are going to keep having the advantage for a long time - humans have a sophisticated ability to predict the behavior of other humans in unusual situations, which NNs are probably not going to do well for quite a while yet.

Even doing something like comparing raw processing capability is pretty hard because the metrics aren't directly convertible between biological brains and NN processors. Machines are fast but simple, so complex tasks have to be broken down into sequences of processing steps. Biological brains are slow but maintain enormous capabilities on standby, in parallel, all the time. And of course, machines don't get distracted, and their behavior is more consistent than humans'. All of these factors affect the complicated task of driving a car in different ways.

At the end of the day we will probably have to pick some really general metric - like economic value or number of accidents - in order to have a comparison. Anything else will depend on who is talking and not on what is actually true.

Tesla's published numbers say that cars using AP get into substantially fewer accidents than cars that are not using AP. I think that probably means that in the situations where it's being used today human plus AP is better than human without AP. That's all the meaningful hard data that we have. Since AP can't drive without a human right now we can't compare AP without human to human without AP.
 
I think your description of how Tesla could get stereoscopic vision from the main and narrow cameras is exactly what they're doing! I would guess that Tesla is using forward binocular vision right now in their AP2 hardware, although it may happen at a level after the visual neural net. If an object identified by the narrow camera is also identified by the main camera, it is relatively straightforward math to estimate distance knowing the separation between cameras, without needing a neural net, as long as the cameras are properly calibrated to get a precise angular direction from a location in the 2D image. If they are also getting distance from movement (2 frames on every camera), that wouldn't replace binocular forward depth perception; it should complement it quite well! The stereoscopic forward vision you describe would work only in the field of view of the narrow forward camera, whereas depth from motion works best off the axis of vehicle motion (i.e. everywhere except directly ahead and behind).

They already need to derive depth from mono vision for the other cameras (they're doing this by looking at multiple frames over time), so why implement and train two different methods? From what jimmy_d has described, nothing different is happening for the front cams vs. the side cams, and the two frame inputs always come from the same camera. So clearly, they are not doing stereoscopic vision.
 
They already need to derive depth from mono vision for the other cameras (they're doing this by looking at multiple frames over time), so why implement and train two different methods? From what jimmy_d has described, nothing different is happening for the front cams vs. the side cams, and the two frame inputs always come from the same camera. So clearly, they are not doing stereoscopic vision.

I think that stereoscopic vision can add more precision and/or redundancy to depth in the forward direction, where it is especially critical. Unless a single neural network is combining all camera inputs at once, they are probably adding this stereoscopic depth information after the visual neural net processing using the processed outputs of main and narrow neural nets. This is what they are already doing with radar. You could make the same argument: "they already get distance and speed from vision, so they must not be using radar for the same thing." (or vice versa). But they seem to be using both radar and vision in tandem to make decisions, for example about braking for stopped cars. When two, or three systems agree, the car can act with more confidence. If single camera vision and radar disagree, a third depth estimate could help resolve the ambiguity, reducing the false positives/negatives that cause either shadow braking or failure to brake for stopped cars. Their use of stereo vision is still a matter of speculation on my part, but I think it is plausible, if not likely, that they are using this information in some way. In addition, as I tried (and perhaps failed) to articulate in my previous post, depth from relative movement is inaccurate in the forward direction, and inaccurate (or impossible) when the car is stopped. Seems like it would be great to have forward stereo vision at some level of processing for both reasons.
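For what it's worth, the standard textbook way to combine redundant range estimates is inverse-variance weighting; the sketch below is only meant to illustrate why a third estimate helps break ties. The numbers, and the idea that Tesla fuses depth this way at all, are my assumptions:

```python
def fuse_depth_estimates(estimates):
    """Inverse-variance fusion of independent range estimates.

    `estimates` is a list of (range_m, sigma_m) pairs, e.g. from radar, stereo
    disparity, and depth-from-motion. This is a textbook way to combine
    redundant sensors, offered only to illustrate why a third estimate helps
    arbitrate when two others disagree - not as Tesla's actual method.
    """
    weights = [1.0 / (sigma ** 2) for _, sigma in estimates]
    fused = sum(w * r for w, (r, _) in zip(weights, estimates)) / sum(weights)
    fused_sigma = (1.0 / sum(weights)) ** 0.5
    return fused, fused_sigma

radar  = (48.0, 1.0)    # confident, but can be fooled by stationary clutter
stereo = (50.0, 3.0)    # forward-only, degrades with distance
motion = (65.0, 8.0)    # weak straight ahead, hence the large sigma here
print(fuse_depth_estimates([radar, stereo, motion]))
```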
 
Elon Musk: The Recode interview

it's very interesting that Musk is still so confident about the "FSD in 2019" thing:

Which one of them, do you think, is the furthest ahead or closest to you all?

Self-driving, maybe Google, Waymo? I don’t think anyone is close to Tesla in terms of achieving a general solution for working on —

Overall solution.


Yeah. Yeah. You can definitely make things work like in one particular city or something like that by special-casing it, but in order to work, you know, all around the world in all these different countries where there’s, like, different road signs, different traffic behavior, there’s like every weird corner case you can imagine. You really have to have a generalized solution. And best to my knowledge, no one has a good generalized solution except ... and I think no one is likely to achieve a generalized solution to self-driving before Tesla. I could be surprised, but...

So none of the car companies. None of the car companies.

No.

Do you ever look and go, “Okay, that’s interesting what they’re doing there.”

The other car companies ... I don’t wanna sound overconfident, but I would be very surprised if any of the car companies exceeded Tesla in self-driving, in getting to full self-driving.

You know, I think we’ll get to full self-driving next year. As a generalized solution, I think. But that’s a ... Like, we’re on track to do that next year. So I don’t know. I don’t think anyone else is on track to do it next year.
 
Elon Musk: The Recode interview

it's very interesting that Musk is still so confident about the "FSD in 2019" thing:

Which one of them, do you think, is the furthest ahead or closest to you all?

Self-driving, maybe Google, Waymo? I don’t think anyone is close to Tesla in terms of achieving a general solution for working on —

Overall solution.


Yeah. Yeah. You can definitely make things work like in one particular city or something like that by special-casing it, but in order to work, you know, all around the world in all these different countries where there’s, like, different road signs, different traffic behavior, there’s like every weird corner case you can imagine. You really have to have a generalized solution. And best to my knowledge, no one has a good generalized solution except ... and I think no one is likely to achieve a generalized solution to self-driving before Tesla. I could be surprised, but...

So none of the car companies. None of the car companies.

No.

Do you ever look and go, “Okay, that’s interesting what they’re doing there.”

The other car companies ... I don’t wanna sound overconfident, but I would be very surprised if any of the car companies exceeded Tesla in self-driving, in getting to full self-driving.

You know, I think we’ll get to full self-driving next year. As a generalized solution, I think. But that’s a ... Like, we’re on track to do that next year. So I don’t know. I don’t think anyone else is on track to do it next year.

I was super excited to read this stuff. Of course Elon always sounds confident, even when he's talking about really hard deadlines. And people have been wrong about self driving cars for a long time so there's plenty of precedent for being overconfident. Still, I'm really happy to hear this level of confidence about this level of capability becoming possible next year. That might mean we consumers get it a year later but it also says that five years away is probably really pessimistic.

I was thinking recently that, according to the 2018 Q2 conference call, Tesla has been driving HW3 for a while already: 6 months, maybe more. And before they had the NN chip prototypes they probably were building HW3-equivalent mockups for the cars to test big NNs, in addition to big NNs in simulated driving - you need to do something like that just to inform what you put into the chip. So that might have been happening 2 years ago or longer. In other words, Tesla has had a good idea of what HW3 with a much bigger NN would be able to do for quite a while now. But they can't ship it until HW3 is available - which is a multi-year effort that won't come to fruition for another 6 months yet.

If they've been sitting on this knowledge for 2 years, and if the results look really good - well that could explain a lot of the statements and actions of the company WRT FSD.

There's also this:

Yeah, I mean ... you need a specialized inference engine. Like the Tesla hardware 3 Autopilot computer, that will start rolling into production early next year, is 10 times better than the next best system out there at the same price, volume and power consumption. And it’s really because it’s got a dedicated neural net chip. Which basically, it sounds complicated, but it’s really like a matrix multiplier with a local memory.

This description is a perfect match for the "TPUv1 style systolic matrix multiplier coprocessor" approach. I still think that's probably what's going into Tesla's NN chip.
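For anyone who hasn't seen the "matrix multiplier with a local memory" idea before, here's a tiny cycle-by-cycle simulation of a weight-stationary systolic array in the TPUv1 style (a teaching sketch of the general architecture, not a claim about what's actually in Tesla's chip):

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-by-cycle simulation of a weight-stationary systolic array.

    Computes y = x @ W for an activation vector x (length K) and weight matrix
    W (K x N), the way a TPUv1-style "matrix multiplier with local memory"
    does: each cell keeps one weight resident, activations stream in from the
    left, and partial sums flow downward. A teaching sketch only.
    """
    K, N = W.shape
    a = np.zeros((K, N))             # activation register inside each cell
    p = np.zeros((K, N))             # partial-sum register inside each cell
    y = np.zeros(N)
    for t in range(K + N + K):       # enough cycles to drain the pipeline
        # move data one step: partial sums go down, activations go right
        p_new = np.zeros_like(p)
        p_new[1:, :] = p[:-1, :]                  # sums shift down a row
        y_out = p[-1, :].copy()                   # bottom row exits the array
        a_new = np.zeros_like(a)
        a_new[:, 1:] = a[:, :-1]                  # activations shift right
        # skewed injection: x[i] enters row i at cycle i (classic systolic timing)
        for i in range(K):
            if t == i:
                a_new[i, 0] = x[i]
        # every cell multiplies its resident weight by its current activation
        # and accumulates into the partial sum passing through it
        p = p_new + a_new * W
        a = a_new
        y += y_out                                # collect whatever drains out
    return y

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0])
print(systolic_matvec(W, x))
print(x @ W)                                      # should match
```

Each cell does one multiply-accumulate per cycle with its weight held in local registers, which is why an array like this gets so many operations per watt out of a dense NN layer.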
 
It's also important to understand which audience Jimmy knew he was speaking to - laypeople and Tesla fans. Just listen to the interviewer's questions/comments; this guy didn't even know what Jimmy was talking about when he brought up @DamianXVI and @verygreen 's videos. (Not holding it against him, I just think it's telling of what audience Jimmy had for his talk.)

Jimmy deserves huge cred for daring to take on the challenge of letting himself be interviewed in this context. I think he did an awesome job, especially considering he's one dude and not some billion-dollar company executive with his own PR staff.

In the world of Youtube, Instagram and Blogs, is it really a challenge to get behind a microphone?

And no, I don't think you should mislead a demographic by spreading false narratives and information to please what that demographic wants to hear. Jimmy's podcast is a Tesla fan's wetest dream! (yes, I know "wetest" is not a word, but in this case it is!)

Propagating the notion that "neural network" implies a camera-only approach.
Propagating the idea that only Tesla uses NNs.
Propagating wrong info about how NNs are used.

That's some mighty reality distortion field. I absolutely think he did more harm than good.
You of all people know that 3 years ago the buzzing threads around here were "I can't decide if I want my Model 3 to deliver itself; 1,000 miles is too much". Don't you miss those Tesla Mythology days?

9 out of 10 FSD posts on TMC, /r/teslamotors and Electrek were "Tesla, unlike others, is using a neural network".
It took every ounce of energy in my body to drain that swamp. Now the narrative is "Tesla is using a big, complex neural network, unlike others' simple NNs", which is better than the former, although still hilariously wrong, since their networks can't even detect general objects, debris, traffic lights, traffic signs, overhead road signs, road markings, barriers/guardrails, curbs or cones, and are woefully inaccurate and inefficient.

But now all of a sudden Jimmy wants to plunge us back into the myth age.
This is why from now on I will keep my TV channel stuck on verygreen; I goofed changing it.
At least he won't try to waterboard me with kool-aid.

Eww... all the mouthwash in the world still can't get the taste of kool-aid out of my mouth, even after just 30 minutes of forced drinking.

Heck, give me the v9 model and I will provide you with the most unbiased, in-depth analysis.
 