
Neural Networks

I have a feeling people use "shadow mode" interchangeably with "magic", and I hope people will explain what they mean by shadow mode and how it actually works ;)

Sure, an NN can categorize your image stream, but how are you going to validate for false positives (i.e. something mislabeled) and false negatives (something that should have been labeled and was not)?
Sure, you can have a verifier NN to double-check the main one, but if it's perfect, why not use it instead of the main one? Also, we see no evidence of checker NNs anyway, and if you ship every frame to the mothership for verification, that's not going to scale all that well (and we have no evidence of that either).

I certainly agree that too much is made of shadow mode. But I wouldn't diminish the importance of it.

It's extremely useful for AEB. The purpose of AEB is to act as a crash mitigation system, and as such they're much more concerned with false positives than false negatives.

Plus, for some false negatives the radar can be used as a cross-check: the car can send the image stream when the radar detects a moving car right there and the vision NN says "I see nothing".

Of course, AEB is the only case where we KNOW shadow mode is used. It would be interesting to see if it's used in other cases.

Personally, I want shadow mode to be used for lane steering because I don't like AP's lane placement. So it's not so much about false positives or false negatives, but more of a reinforcement learning approach where it compares itself to how I would drive.
 
So basically everything is written in C? What about Swift, Java, Go, PHP?

I'm just a pharmacist who messes around with C++, Qt, and the Linux shell. So I can easily export the libraries, cool.

You can build a toy NN in anything; the basics are really simple, and modern computers are fast enough that you can solve simple problems even in slow languages. Here's a great NN demo tool written in JavaScript:

Tensorflow — Neural Network Playground

I've seen NNs implemented in Java and Go, and also Lisp, Haskell, BASIC, Objective-C, Pascal, Scala, Scheme... you name it.

But for doing anything where performance matters even a little, everyone builds on top of one of a few highly optimized matrix math libraries, which are all written in C. NNs are still considered extremely compute-intensive as algorithms go, and accuracy is often a function of network size and latency, so there's a tendency to use the biggest network you can afford for any deployed application, and you can always fit a bigger and faster one if you write all of it in C.
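
To illustrate just how simple the basics are, here is a toy two-layer network learning XOR in plain Python with numpy (numpy itself hands the matrix math off to optimized C/BLAS code, which is exactly the point above). This is only a sketch for illustration, nothing like what runs in the car:

```python
import numpy as np

# Toy 2-layer network learning XOR with plain gradient descent.
# numpy delegates the matrix math to optimized C/BLAS underneath.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # input -> hidden
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass for a mean squared error loss
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient descent update
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(3))  # should approach [0, 1, 1, 0]
```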

But you can write in Java and call C libraries just fine, so people who like to code in Java and want to write an NN can.

Speaking of Python and so forth: deep learning frameworks tend to be written as object-oriented programming libraries, and they are so highly integrated and have such specialized call dependencies that, effectively, they become a computer language themselves. TensorFlow's low-level core is written entirely in C++ and the high-level API is written in Python, but when you build a network in TensorFlow you're really kind of writing in TensorFlow language. Torch's high-level language is Lua (don't ask), but you don't really write in Lua, you write in Torch.
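
For a flavor of what "writing in TensorFlow language" means in practice, here is a minimal sketch using the modern tf.keras API; the Python you type is really just a description of a graph of ops that the C++ core executes:

```python
import tensorflow as tf

# A tiny classifier described in "TensorFlow language": the Python code
# only declares the network, the optimized C++ core does the real work.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```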
 

Some additional examples that I know of.

This shows how to do inference in C++ and, more importantly, has a great tutorial on one way to train networks.
GitHub - dusty-nv/jetson-inference: Guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and Jetson TX1/TX2.

Reinforcement learning, with examples in C++, Python, and Lua:
GitHub - dusty-nv/jetson-reinforcement: Deep reinforcement learning libraries for Jetson with pyTorch and openAI Gym

Both code repositories are really meant to be used on a Jetson TX2 development kit. It's a fairly cost-effective tool used by people who want to deploy deep learning in things like drones, robots, etc.
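
For comparison, plain PyTorch inference (not the jetson-inference API, which wraps TensorRT) looks roughly like this; the pretrained torchvision GoogLeNet and the image path are just placeholders for illustration:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Minimal inference sketch with a pretrained GoogLeNet from torchvision.
# "dog.jpg" is a placeholder path used purely for illustration.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.googlenet(pretrained=True)
model.eval()

img = preprocess(Image.open("dog.jpg")).unsqueeze(0)  # add batch dimension
with torch.no_grad():
    logits = model(img)
print(logits.argmax(dim=1))  # index of the predicted ImageNet class
```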
 
How do we know shadow mode is used for AEB? How is it used? How does it manifest itself?
 

I only know of what's written about it.

This is where I was getting my information.

Tesla adds ‘calibration period’ for vehicles with new ‘Autopilot 2.5 hardware’

The problem with interpreting how it works is that I can't say I know what the goal of AEB is. It used to be strictly a crash mitigation system that wasn't designed to activate unless an accident was deemed unavoidable, taking action at the last possible moment if the driver didn't take evasive action (braking, turning, etc.). That was a long time ago, with AP1.

If that's still the case, it makes it fairly easy to run in shadow mode. Since its actions are binary (hit the brakes as hard as I can, or do nothing), it's easy to determine when there was a false positive or a false negative.

If it says hit the brakes, but the user doesn't, then there is a disagreement. The truth is whether an accident happened or not.

If it says don't hit the brakes, but the radar detects a situation that warrants braking, then there is a disagreement. Especially if the user hit the brakes.

With AEB they are also a lot more concerned about false positives, so they probably don't mind the false negative results not being 100%.
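
Purely as speculation about what that comparison could look like, here is a sketch of the disagreement logic described above; the signal names and the logic itself are invented for illustration and are not taken from any Tesla code:

```python
# Speculative sketch of a shadow-mode AEB comparison. Everything here
# (names, conditions) is invented for illustration only.
def shadow_aeb_disagreement(aeb_would_brake, driver_braked, crash_detected):
    """Return a reason string when shadow AEB and reality disagree, else None."""
    if aeb_would_brake and not driver_braked and not crash_detected:
        return "possible_false_positive"   # it wanted to brake, nothing happened
    if not aeb_would_brake and crash_detected:
        return "possible_false_negative"   # it stayed quiet, a crash happened anyway
    return None                            # agreement, nothing worth reporting

# Example: shadow AEB wanted to brake but the drive ended uneventfully.
reason = shadow_aeb_disagreement(aeb_would_brake=True,
                                 driver_braked=False,
                                 crash_detected=False)
if reason is not None:
    print("flag snapshot for upload:", reason)
```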

One question I have for you is whether jamming on the brakes suddenly will upload a video stream to the mothership. If I were at the mothership, this is a situation I would want reported from the fleet, so I could figure out why a driver braked at full power for no apparent reason.
 
Jimmy_d-

I love this write up. It's exactly what I crave to read on these forums. Sure, it's just educated guesswork, but it's invaluable to an enthusiast like me. Thank you.

It's nothing compared to the teardown you did on your ape. I was rereading all that stuff recently when I was trying to figure out some hardware stuff relative to the NN. It reminded me of how much we learned when you did that.
 
I only know of what's written about it. This is where I was getting my information.

Tesla adds ‘calibration period’ for vehicles with new ‘Autopilot 2.5 hardware’

Well, it was a very nebulous statement that said:
During that process, Automatic Emergency Braking will temporarily be inactive and will instead be in shadow mode, which means it will register how the feature would perform if it were activated, without taking any action

The way I read it is: "The AEB logic would run but not provide any actual braking, and we would not know what would have happened if braking had actually been provided."
This aligns well with a snapshot request I saw from around the same time that literally said: "if there's something like an AEB event happening, take snapshot, send it to mothership".
Then presumably they analyzed the submitted snapshots and adjusted the logic to weed out obvious false positives. See the flaws here?

1. Every false negative where AEB did not trigger but the driver managed to intervene in time would still be completely missed.
2. Every false negative where the driver did not intervene resulted in a crash (generating another snapshot); at least they got some debug data out of those (assuming any of them happened, but still far from ideal).

So they basically only got to weed out false positives, with unknown impact on false negatives. I know AEB is not promised to work 100% of the time, of course, so eventually false negatives would (hopefully) be weeded out via the collected crash data.

Another important thing to realize is that they probably had way too many false positives, so they disabled the feature completely to debug it further, at which point the whole "shadow mode" thing turns into a bit of a PR exercise, since "we are disabling AEB because the new hardware made the safety features dangerous. We'll re-enable AEB when we feel it's safe again" does not sound very positive ;)
 
Actually, it's a recent development that the NNs get changed in every firmware revision.
Before, a single NN would be reused across multiple releases. E.g. the NN introduced in 17.24 was with us, with zero modifications, until at least 17.36 and possibly until 17.38 (I just don't have a copy of 17.38 to confirm).

That reminds me that I meant to ask if you compared the network weights in earlier versions. It would be interesting if the file contents were changing while the file size stayed the same, because it would mean they were retraining the network between versions. And from my recent analysis we could tell what parts were being retrained, since the fbs file has the layer weights separately labeled.

Hmm, come to think of it I can probably do that with 40 and 42 for the parts that didn't change structure. Have to look at that.
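
If two drops are the same size, a crude first pass is just to diff the blobs section by section using the labeled layer offsets. A rough sketch; the file names and offsets below are completely made up:

```python
import numpy as np

# Crude byte-level comparison of two same-sized weight blobs, per layer.
# File names and layer offsets are hypothetical placeholders.
layers = {                 # name: (byte offset, byte length)
    "conv1":      (0,          37_632),
    "inception3": (37_632,     1_200_000),
    "deconv_out": (1_237_632,  480_000),
}

old = np.fromfile("net_17.40.bin", dtype=np.uint8)
new = np.fromfile("net_17.42.bin", dtype=np.uint8)
assert old.size == new.size, "sizes differ: structure changed, not just weights"

for name, (off, length) in layers.items():
    a, b = old[off:off + length], new[off:off + length]
    changed = np.count_nonzero(a != b) / length
    print(f"{name}: {changed:.1%} of bytes differ")
```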

The calibration is an ongoing process. It's just that Autopilot does not enable before you have some initial calibration, but once you have it, it is still constantly adjusted (and the mothership gets updated calibration data like every 5 minutes of driving!). Hence, when they work on a camera assembly or change a windshield or whatnot, they typically do not reset the calibration.

That's really interesting. The accuracy of depth estimates is pretty sensitive to being able to compensate for tiny variations in the relative locations of the two cameras, so I guess it makes sense to check and update the calibration continuously - especially if it's being used for any critical decisions. Microns matter at the limits of sensitivity. Conceivably even thermal expansion could affect it. Of course potholes or hitting a curb could change the alignment too. If the relative rotation of the cameras is affected then the whole calibration probably has to be redone, but small translations would only require tweaking the calibration parameters. For objects within a couple of car lengths the effect probably isn't huge, but something a couple of hundred feet away could get really finicky.
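
To put rough numbers on that sensitivity: with the standard relation depth = focal_length x baseline / disparity, a small residual disparity error from imperfect calibration blows up quadratically with range. The camera parameters below are guesses for illustration, not actual values:

```python
# Back-of-the-envelope stereo depth error. All parameters are assumptions.
focal_px = 1200.0       # focal length in pixels (guess)
baseline_m = 0.15       # spacing between main and narrow cameras (guess)
cal_error_px = 0.25     # residual disparity error from imperfect calibration

for depth_m in (10.0, 30.0, 60.0):            # ~2 car lengths out to ~200 ft
    disparity_px = focal_px * baseline_m / depth_m
    # for a fixed pixel error, depth error grows with the square of range
    depth_error_m = depth_m ** 2 * cal_error_px / (focal_px * baseline_m)
    print(f"{depth_m:5.1f} m: disparity {disparity_px:5.1f} px, "
          f"error ~{depth_error_m:4.2f} m from {cal_error_px} px of miscalibration")
```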

Oh wow, I just thought of something else. You know how they were talking about generating a point cloud with the radar after they announced that it was going to become a primary sensor (after the fatal accident)? Well with stereo cameras they can generate a point cloud using vision. This is something that could be very high resolution - higher than lidar under some circumstances, although it wouldn't be quite the same since accuracy falls linearly with distance and you only get really good estimates for surfaces that have a nice visual texture to them. But it could still be a really interesting way to supplement any point cloud you were getting with the radar.
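
For what it's worth, turning a rectified stereo pair into a point cloud is a standard OpenCV recipe. A sketch, assuming you already have rectified left/right frames and the Q reprojection matrix from calibration (the file names are placeholders):

```python
import cv2
import numpy as np

# Sketch: dense point cloud from an already-rectified stereo pair.
# left.png / right.png / Q.npy are placeholder inputs from a prior calibration.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=96, blockSize=7)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # fixed point -> px

Q = np.load("Q.npy")                          # 4x4 reprojection matrix from stereoRectify
points = cv2.reprojectImageTo3D(disparity, Q) # HxWx3 array of XYZ coordinates
valid = disparity > 0                         # drop pixels with no disparity match
cloud = points[valid]                         # N x 3 point cloud
print(cloud.shape)
```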
 
That reminds me that I meant to ask if you compared the network weights in earlier versions. It would be interesting if the file contents were changing while the file size stayed the same, because it would mean they were retraining the network between versions. And from my recent analysis we could tell what parts were being retrained, since the fbs file has the layer weights separately labeled.
It's quite clear that the NNs are provided from some other project as "drops" to be used by the ape "firmware", and it's quite evident when the NN "drop" changed. The whole trained model (the big blob) and the descriptions did not change between 17.24 and 17.36.

Hmm, come to think of it I can probably do that with 40 and 42 for the parts that didn't change structure. Have to look at that.
Here it's quite clear there was some retraining, since the description of the model changed some as well, so I imagine everything changed in the binary blob.

Oh wow, I just thought of something else. You know how they were talking about generating a point cloud with the radar after they announced that it was going to become a primary sensor (after the fatal accident)? Well with stereo cameras they can generate a point cloud using vision. This is something that could be very high resolution - higher than lidar under some circumstances, although it wouldn't be quite the same since accuracy falls linearly with distance and you only get really good estimates for surfaces that have a nice visual texture to them. But it could still be a really interesting way to supplement any point cloud you were getting with the radar.
While I don't see any references to a point cloud, there's something called "particle filter" and you can even get a snapshot of it.
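
For anyone unfamiliar with the term: a particle filter just tracks a probability distribution over some state (position, velocity, etc.) with a cloud of weighted samples that are repeatedly predicted, re-weighted against measurements, and resampled. A generic 1-D toy, with no relation to whatever Tesla's implementation actually does:

```python
import numpy as np

# Generic 1-D bootstrap particle filter toy: track a position from noisy readings.
rng = np.random.default_rng(1)
N = 1000
particles = rng.uniform(0, 100, size=N)   # initial guesses of the position
true_pos = 40.0

for step in range(20):
    true_pos += 1.0                                  # the object moves forward
    measurement = true_pos + rng.normal(0, 2.0)      # noisy sensor reading

    particles += 1.0 + rng.normal(0, 0.5, size=N)    # predict with a motion model
    # weight particles by how well they explain the measurement
    weights = np.exp(-0.5 * ((particles - measurement) / 2.0) ** 2)
    weights /= weights.sum()
    # resample: keep particles in proportion to their weights
    particles = rng.choice(particles, size=N, p=weights)

print(f"estimate {particles.mean():.1f} vs truth {true_pos:.1f}")
```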
 
Every false negative where AEB did not trigger but the driver managed to intervene in time would still be completely missed.

The problem with AEB is we don't know the design parameters.

If it's still a crash mitigation system (like AP1) it's only designed to reduce the severity of the crash. So the example case you described was not a false negative at all.

I wouldn't have activated + crash = Woops
I wouldn't have activated + no crash = working great
I would have activated + crash = working great
I would have activated + no crash = Woops

My argument is that AEB is the perfect test case for using "shadow mode" because false positives are the concern of paramount importance.

Now that doesn't mean there aren't ways to determine whether a false negative would have occurred had the user not taken action. I imagine the outputs of the radar/vision and the user inputs (braking, steering, and hopefully swearing) are compared at some point, with triggers set up to send data to the mothership. Now, they probably don't have swearing, but it would likely be helpful. :)

I should add that the official reason Tesla gave for needing this calibration/shadow mode stage was that the HW changed in 2.5, but the only significant change was the radar.

So in this specific example it's not about validating an NN, but validating a new radar. Tesla doesn't have much in the way of validating new hardware, and instead pushed that task onto its owners.
 
I can't compete with the internet in terms of educational material. There's so much great stuff out there that explains deep learning it's just crazy. Of course if there's any particular thing you would like pointers on I'll try to help.

I haven't been posting much stuff on the AP2 NN lately because real facts are hard to come by and I'm not sure how much people want to listen to me speculate. Certainly there seems to be a cohort that is allergic to any speculation that doesn't confirm their biases, and the blowback gets old. Our mutual friend gave me access to copies of recent and some older data on the NNs and I took them apart to see what I could learn from them. I also tried building and analyzing various parts of the network description on the frameworks that they seem to have been created on, which yielded some ideas but mostly it just killed off some theories that I had.

But here's what I've got as of today:

Facts:

1) There's a vision NN running in AP2 that takes images from the main and the narrow cameras and processes them to extract feature maps. There's only one network, but it runs as two instances in two threads that process each camera independently.
2) The front half of the NN in AP2 is basically Googlenet with a few notable differences:
- The input is 416x640 (original Googlenet was 224x224)
- The working frame size in Googlenet is reduced by 1/2 in each dimension between each of the 5 major blocks. The AP network omits the reduction between blocks 4 and 5 so that the final set of features is 2x2 times larger than in Googlenet.
3) The final set of output features from the googlenet "preprocessor" is digested in several different ways to generate output. All of these outputs are floating point tensor fields with 3 dimensions, and they all have frame sizes that either match the input frame size of 416x640 or a reduced version of it at 104x160 (see the shape sketch after this list).
4) All of these output tensors are constructed by deconvolution from the output of the googlenet preprocessor.
5) There are no scalar or vector outputs from the NN, so it is not end-to-end in the usual sense. Its output must be interpreted by some downstream process to make driving decisions.
6) Between 40 and 42, new output categories were added to the NN.
7) Sometime in the last four months major changes were made to the deconvolution portions of the NN that included the removal of network sections that are normally associated with improving the accuracy of segmentation maps. This happened after 26 but before 40, an interval in which users reported substantial improvements in operation at highway speed.
8) Between 40 and 42 two additional NNs were added to the code. Neither has been seen in operation.
9) One of the new NNs is named fisheye_wiper. The network itself is a substantially simplified googlenet implementation. The output from this network is a five-way classifier, indicating that it outputs exactly one of 5 detected categories. This is exactly the output you would expect for a rain intensity detector intended to control the wiper blades.
10) The other new NN is named repeater. This network includes a googlenet preprocessor almost identical to the one used for the main and narrow cameras. The outputs are similar to a subset of the main/narrow outputs.
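
To make the shape arithmetic in facts 2-4 concrete, here is a rough PyTorch sketch of the sizes involved. It is my reconstruction of the shapes only; the layer types and channel counts are stand-ins, not Tesla's actual network:

```python
import torch
import torch.nn as nn

# Shape sketch only: 4 stride-2 reductions instead of googlenet's usual 5,
# then a deconvolution back up to the 104x160 output frame size.
x = torch.zeros(1, 3, 416, 640)                  # camera frame (N, C, H, W)

preprocessor = nn.Sequential(                    # stand-in for the googlenet front end
    nn.Conv2d(3, 64, 3, stride=2, padding=1),    # 416x640 -> 208x320
    nn.Conv2d(64, 128, 3, stride=2, padding=1),  # -> 104x160
    nn.Conv2d(128, 256, 3, stride=2, padding=1), # -> 52x80
    nn.Conv2d(256, 512, 3, stride=2, padding=1), # -> 26x40
)
features = preprocessor(x)
print(features.shape)                            # torch.Size([1, 512, 26, 40])

# transposed convolution ("deconvolution") back up to the 104x160 maps;
# a 16x upsample would instead give the full-resolution 416x640 outputs
deconv = nn.ConvTranspose2d(512, 8, kernel_size=4, stride=4)
print(deconv(features).shape)                    # torch.Size([1, 8, 104, 160])
```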


Stuff that's probably true:

1) Forward vision is probably binocular, using cropped, scaled, and undistorted segments of main and narrow that are trimmed to provide identical FOV. This allows the car to determine depth from vision independently of whatever depth information it gets from other sensors (radar). The depth extraction is not being done in the NN, but there are function names in the code that refer to stereo functions and frame undistortion. Depth extraction can be done efficiently using conventional vision processing (that is, non-NN) techniques, so that's probably why it's not included in the NN. I haven't seen or heard about any AP2 behavior that can only be explained by the existence of binocular vision, so it's possible this is a shadow feature. No real idea.

2) Output from the forward vision NN probably includes at least one full resolution semantic segmentation map with bounding boxes for that map. It might include several maps including some at lower resolution. Output from the repeater NN is similar though it eliminates some classes that are present in the main/narrow NN. It is extremely likely that both the main/narrow and repeater networks are detecting objects in multiple classes and generating bounding boxes for those objects.

3) The NN is probably getting a lot of development. Not only does it change for every single firmware revision, some of the changes are drastic. Some look like experiments, others look like diagnostic features. Some look like transient bug fixes that are later corrected by fixing the underlying problem (eliminating the need for the hack).

4) The network is probably being trained in part by using simulated driving data. Generating labeled training data for semantic segmentation outputs of the kind the vision NN seems to be outputting is extremely labor intensive if done manually and simulation is a common approach to augmenting training data. My review of Tesla's open positions in their Autopilot division found that they were recruiting simulation experts and simulation artists with a bias towards driving environments and the Unreal 4 engine. I found no positions of the type that I would expect to see if they were manually labeling large volumes of real world data. Of course, this latter is the kind of thing you might outsource, so it's not a very strong datapoint except that it fails to show they are manually labeling lots of data.

5) Tesla is probably developing fairly advanced customizations to the tools used for testing, training, developing, and deploying neural networks. I found evidence for custom libraries and custom tools all over the place as I was trying to track down clues to how they were doing their development. There isn't anything I could find in their system that you could just drop into publicly available tools and make sense of it, but then I find references to parts of publicly available tools and libraries all over the place. It seems like they are pulling from a lot of different sources but then building their own tools.

6) After analyzing the options, I think the calibration phase for the cameras is to allow matching the main and narrow cameras to a high enough accuracy to enable stereo vision. When I look at manufacturing variances for cameras and compare that to the operational variance that the system has to be able to deal with in use, the only thing I can come up with that can't be compensated on the fly is support for stereo vision. In order to support rectified aligned image stereo processing you need to pre-calculate the alignment transformations for the two cameras to sub-pixel accuracy. I don't think this can be factory calibrated because the calibration probably wouldn't survive transport of the vehicle to its delivery destination. That's not true of any other vision process that I can come up with, and it's certainly not true of NN vision processes that are appropriate to vehicle applications. Calibration probably has to be re-done frequently by the car because even normal operation will lead to the alignment drifting enough for it to become a problem. And it probably needs to go through the cal process anytime there is maintenance performed that requires manipulation of the forward camera assembly. If this is true then AP2 must have been using stereo since its first incarnation, as the calibration period has always been present, and it must be using it for driving decisions since you can't use the AP features until the cameras have been calibrated.
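
As a rough illustration of what "pre-calculating the alignment transformations" amounts to, this is the standard OpenCV stereo rectification recipe. Everything here is generic, with placeholder calibration inputs; none of it is extracted from the car:

```python
import cv2
import numpy as np

# Generic stereo rectification recipe; K1/D1/K2/D2/R/T are placeholders
# standing in for whatever the car's calibration process actually estimates.
K1 = np.load("K_main.npy");   D1 = np.load("dist_main.npy")
K2 = np.load("K_narrow.npy"); D2 = np.load("dist_narrow.npy")
R = np.load("R.npy")          # rotation between the two cameras
T = np.load("T.npy")          # translation between the two cameras
size = (640, 416)             # (width, height) of the working frame

R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)

# per-camera remap tables: this is the part that has to be accurate to
# sub-pixel level, and the part that drifts when the cameras move
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)

left = cv2.remap(cv2.imread("main.png"), map1x, map1y, cv2.INTER_LINEAR)
right = cv2.remap(cv2.imread("narrow.png"), map2x, map2y, cv2.INTER_LINEAR)
# left/right now share row-aligned epipolar lines, ready for disparity matching
```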

Pure speculation:

After my first look at the version 40 NN I was surprised at how simple it was, conceptually, and how 'old' the circa 2015 architectural concepts were and speculated that perhaps this version of EAP was not getting much effort. (In the deep learning world 2 years is an eternity). I thought this might make sense if the company were pushing to release a separate and much more sophisticated package and needed this current NN to be just a placeholder substitute for the missing AP1 vision chip while they readied a much more ambitious system, which might be FSD. After reviewing older versions of the network I found that the output varied from version to version so this system can't be merely a substitute for a missing module - if that were the case the I/O wouldn't be changing. Also, stereo vision is clearly a feature that wasn't possible in AP1. And now we see NNs being added for cameras not present in the AP1 hardware - the repeaters. Finally, I am finding small features being included in the NN that did not appear in the public literature until fairly recently, implying that the EAP team is actively trying out cutting edge ideas in limited domains. So at this point I think that EAP is getting the kind of development that suggests it's not a placeholder.

I recall that at some point one of the differentiating features of EAP and FSD was the number of cameras, with EAP having 4 in use and FSD the full suite of 8. With the addition of the repeaters to main and narrow we would see 4 driving cameras in use assuming the fisheye is just for the wipers in the EAP use case. That would be an interesting match up and it might indicate that these four are the candidates for on-ramp to off-ramp level of functionality in EAP.

Thanks for this. Regarding the binocular approach and stereo references: could Tesla use the same approach with the rear camera and the rearward-looking side cameras to create depth in the driver's blind spots? If so, this could be the reason Tesla did not add more radar: they are creating depth in the blind spots with those 3 cameras.

Is there a limitation on the positioning of the two cameras in order to create depth?
 
And to add to the "stereo" question: when AP2 first went live, there was a fair amount of evidence that only the main camera was active. But the cars still needed calibration.

Do we know what else is being calibrated? Assuming this step is pre-NN?
 
Apologies if it has already been discussed before. Is Tesla using the CUDA platform/API to interface with the NVIDIA GPUs for Autopilot? As an investor, I find this a very crucial piece of info, because the use of CUDA means Tesla is married to NVIDIA hardware with little or no bargaining power. Thanks.
 
Thanks for that link. It seems that simplifying the network actually improves the speed of learning. Perhaps Tesla has simplified its vision NN in the more recent releases?

That playground is worth playing around with. It takes a really long time to learn all the lessons it has to teach. The one you just mentioned is an important one: a smaller, tailored network will learn a lot faster and become a lot more accurate than a bigger, unconstrained network. But the devil is in the details, and there are a lot of things you can tune in a neural network, so every lesson you learn comes with lots of caveats.
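
If you want to rerun that lesson outside the browser, the same kind of comparison is easy to set up with scikit-learn on a toy dataset. A small sketch, not a rigorous experiment; which size wins depends heavily on the data and the hyperparameters, which is rather the point about caveats:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy comparison of network sizes on the classic two-moons dataset.
X, y = make_moons(n_samples=2000, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for hidden in [(4,), (16,), (256, 256)]:
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    clf.fit(X_train, y_train)
    print(hidden, f"test accuracy: {clf.score(X_test, y_test):.3f}")
```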
 
If it's still a crash mitigation system (like AP1) it's only designed to reduce the severity of the crash. So the example case you described was not a false negative at all.
Well, it depends on what the input was. If the driver jerked the steering wheel and just barely avoided the obstacle, it's still pretty much a false negative, I suspect.

But anyway, all of this is beside the point. I was specifically wondering about "shadow mode" as it relates to NN training, where it's not as clear-cut.