
HW2.5 capabilities

So I got a chance to look at the network specification for the AP2 neural network in 40.1. As @verygreen previously reported, the input is a single 416x640 image with two color channels - probably red and grey. Internally the network processes 104x160 reduced frames as quantized 8 bit values. The network itself is a tailored version of the original GoogLeNet inception network plus a set of deconvolution layers that present the output. Output is a collection of 16 single color frames, some at full and some at quarter resolution. The network is probably a bit less than 15 million parameters given the file size.
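
To make that shape concrete, here's a toy PyTorch sketch of my own construction, not the actual network: the layer counts, channel widths, and the single deconvolution head are all invented. It just illustrates the pipeline described above: a 2-channel 416x640 input, a stem that reduces it to the 104x160 internal resolution, a couple of Inception-style blocks, and a deconvolution head that emits a stack of 16 single-channel maps.

Code:
# Toy sketch only, not Tesla's network. Assumes PyTorch is installed.
import torch
import torch.nn as nn

class ToyInceptionBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        branch_ch = out_ch // 4
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, branch_ch, kernel_size=1))

    def forward(self, x):
        # Concatenate the parallel branches, GoogLeNet-style.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

class ToyVisionNet(nn.Module):
    def __init__(self, n_outputs=16):
        super().__init__()
        # Stem downsamples 416x640 -> 104x160, like the reduced internal frames.
        self.stem = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.backbone = nn.Sequential(ToyInceptionBlock(64, 128),
                                      ToyInceptionBlock(128, 128))
        # Deconvolution head produces the stack of single-channel output maps.
        self.head = nn.ConvTranspose2d(128, n_outputs, kernel_size=4, stride=4)

    def forward(self, x):
        return self.head(self.backbone(self.stem(x)))

net = ToyVisionNet()
maps = net(torch.zeros(1, 2, 416, 640))
print(maps.shape)   # -> torch.Size([1, 16, 416, 640])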

So what does this mean? Images go into it, and for every frame the network produces a set of 16 interpretations, which also come in the form of grayscale images. Some external process takes those processed frames and makes control decisions based on them, probably after incorporating radar and other sensors. This is not an end-to-end network: it doesn't have any scalar outputs that could be used directly as controls.
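
Purely as an illustration of what that external process might do with one of those grayscale maps, here's a tiny sketch using numpy. The map is synthetic and the approach is a guess, not how Tesla's downstream code actually works.

Code:
# Hedged illustration: consume one single-channel output map on the CPU side.
import numpy as np

lane_map = np.zeros((104, 160), dtype=np.uint8)   # stand-in for one NN output map
lane_map[60:104, 70:74] = 255                     # fake "lane marking" response

ys, xs = np.nonzero(lane_map > 128)               # threshold the per-pixel map
coeffs = np.polyfit(ys, xs, deg=2)                # fit x = f(y) through the marking
print("lane polynomial coefficients:", coeffs)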

Also, the kernel library includes a number of items with intriguing names that are unused in the current network. At a minimum this must mean that there are variations of this network that have more features than they are exploiting in the current version, or that they have other networks with enhanced functionality which share the same set of kernels.
 
As a followup - this doesn't feel like the way you'd build the NN if you were doing it from a clean start. The output isn't in the most digestible format so this approach is going to rely heavily on further processing in a subsequent subsystem. And I am told that this is the only block running on the GPU, so the 'subsequent subsystem' is a much lower-power CPU.

Two reasons occur to me for why you would want to rely on a cpu this way. One is that it's easier to program and debug on CPUs than GPUs. Generally you want to keep complicated, non-performance sensitive code on a CPU because it's much easier to manage that way. The other reason would be if you already had all the CPU code working and were just using the GPU stuff to replace something else, like the Mobileye subsystem from AP1.
 
Also, the kernel library includes a number of items with intriguing names that are unused in the current network. At a minimum this must mean that there are variations of this network that have more features than they are exploiting in the current version, or that they have other networks with enhanced functionality which share the same set of kernels.

Could you expand on this a bit? Perhaps share the "intriguing names?"

It'd be extremely interesting if we can get the proper inputs to this NN and run it through to see the outputs. For example, can this NN recognize street signs? which ones? what kind of cars? what angle of cars? etc.
 
Interesting @jimmy_d. Why would it process 104x160 frames when it has enormously more data available?

Image recognition systems all do this. With four times as many pixels the larger image will take four times as long to process. Even if your eventual system has power to burn (as the AP2 GPU does for a network of this size), it can take weeks or months to train a network on a high-speed machine, which limits how quickly engineers can test improvements to the system.
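
In this particular case the reduction is even bigger than 4x; a quick sanity check on the pixel counts:

Code:
# Per-pixel work in a conv layer is roughly constant, so compute scales with pixel count.
full_px    = 416 * 640   # camera-sized frame
reduced_px = 104 * 160   # internal frame size seen in the 40.1 network
print(full_px / reduced_px)   # -> 16.0, i.e. ~16x less conv work per layer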

People who develop these networks generally start out with a study of different image sizes to see what impacts performance and then select the smallest one that doesn't have any negative effect on accuracy. Apparently 104x160 is adequate to meet the objectives of the AP2 40.1 application.

To a person looking at an image it's much easier to recognize an object at higher resolution because of the way the brain processes images, but NNs can work at much lower resolution without losing accuracy. An NN isn't aided by two pixels when one will do, but the human visual cortex prefers redundancy.
 
Great answer, thanks. Seems like a tedious and/or risky job to figure out how small an image size you can go with before it hurts accuracy. I mean, how do they actually test this? Say 104x160 is good enough for certain highways, but they can't test all scenarios, can they? Do they run this through sim?
 
Could you expand on this a bit? Perhaps share the "intriguing names?"

It'd be extremely interesting if we can get the proper inputs to this NN and run it through to see the outputs. For example, can this NN recognize street signs? which ones? what kind of cars? what angle of cars? etc.

Other names include 'shoulder', 'object', and 'vl_class'.

The network architecture will place certain limits on what it can process efficiently, which is how I know it's only processing a single image at a time. But what particular items it detects will depend on the training data. By looking at the gross capabilities of GoogLeNet (which won the ImageNet competition a few years ago) I can state that this network, were it trained to such a purpose, could recognize signs, cars, trucks, motorcycles, pedestrians, trees, and gutters. You could train it to tell breeds of dogs apart, or whether it's looking at mountains, a beach, or a sunset. Smaller networks than this one are routinely used for those kinds of things. ImageNet winners have to accurately recognize 1000 different categories of images with a single network, and GoogLeNet can do that with fairly high accuracy.
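
For anyone who wants to poke at what a stock GoogLeNet can do as a 1000-class ImageNet classifier, torchvision ships a public re-implementation. This is just that public model, not the one in the car, and it downloads pretrained weights on first use.

Code:
# Off-the-shelf GoogLeNet as an ImageNet classifier (torchvision re-implementation).
import torch
from torchvision import models

net = models.googlenet(weights="DEFAULT")   # downloads pretrained ImageNet weights
net.eval()
with torch.no_grad():
    logits = net(torch.zeros(1, 3, 224, 224))   # dummy 3-channel 224x224 image
print(logits.shape)                             # -> torch.Size([1, 1000])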

For AP2 it's more likely that they are training it on a couple of dozen categories of objects, and that the output frames each present a map of where particular things occur in an image: lane markings, pavement boundaries, street signs, and other vehicles would be included, I think. It should be possible to look at the kernels and get an idea of what it's been trained to look for, but it would take a bit of work to figure out the storage format and write some software that can extract bitmaps of the kernels.
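
If someone did reverse the storage format, the bitmap extraction itself would be trivial. A hedged sketch, where the filename, the int8 dtype, and the kernel shape are all assumptions rather than the real layout:

Code:
# Sketch only: dump first-layer kernels as bitmaps, assuming a known layout.
import numpy as np
from PIL import Image

raw = np.fromfile("first_layer_weights.bin", dtype=np.int8)   # hypothetical file
kernels = raw[:64 * 2 * 7 * 7].reshape(64, 2, 7, 7)           # assumed 64 kernels, 2ch, 7x7

for i, k in enumerate(kernels):
    k0 = k[0].astype(np.float32)                               # first input channel only
    img = ((k0 - k0.min()) / (np.ptp(k0) + 1e-9) * 255).astype(np.uint8)
    # Upscale the 7x7 kernel so it is visible, then save it as a PNG.
    Image.fromarray(img).resize((70, 70), Image.NEAREST).save(f"kernel_{i:02d}.png")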
 
Great answer, thanks. Seems like a tedious and/or risky job to figure out how small an image size you can go with before it hurts accuracy. I mean, how do they actually test this? Say 104x160 is good enough for certain highways, but they can't test all scenarios, can they? Do they run this through sim?

It's pretty easy actually, just kind of time-consuming. The intermediate frame size is just a variable, so you can change it easily, and the training and testing are all done automatically. So you compile a sequence of test networks with different internal frame sizes, train them, and see how they perform. You end up with a curve of accuracy that rises with increasing frame resolution and then levels out at some image size. Generally you go with the size just above the knee of that curve.
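
A minimal sketch of that sweep, with made-up candidate sizes and a fake accuracy curve standing in for real training runs, just to show the "knee of the curve" selection:

Code:
# Sketch of a resolution sweep; train_and_evaluate() is a stand-in, not real training.
def train_and_evaluate(internal_size):
    # Fake accuracy that rises with resolution and then saturates,
    # purely to make the example runnable.
    h, w = internal_size
    return min(0.95, 0.60 + 0.000025 * h * w)

candidate_sizes = [(52, 80), (78, 120), (104, 160), (156, 240), (208, 320)]
results = [(size, train_and_evaluate(size)) for size in candidate_sizes]

# Pick the smallest size whose accuracy is within 0.5% of the best observed,
# i.e. the size just past the knee of the accuracy-vs-resolution curve.
best = max(acc for _, acc in results)
chosen = min((size for size, acc in results if acc >= best - 0.005),
             key=lambda s: s[0] * s[1])
print("chosen internal frame size:", chosen)   # -> (104, 160) with this toy curve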

As for what goes into the training images, well that's a long discussion but the short answer is usually everything you can get your hands on that is a good fit to the application. Of course Tesla might have plenty of data since they've got a large fleet out there driving around every day. Mobileye's Shashua has stated in the past that they (Mobileye) have so much data that they don't worry about overfitting (which can happen when you don't have enough).

It is possible to do a lot of this in simulation, though there are some risks there that you have to be careful about because simulation can differ from actual use in very subtle ways. I think simulation is becoming an important part of training these networks though, because it's a very efficient way to generate large amounts of accurately labeled data.
 
Thanks for examining!
How would you interpret Karpathy's month-old tweet with your current knowledge of the NN?

"To get neural nets to work one must be super-OCD about details. With bugs nets will train (they "want" to work), but work silently worse.

The "move fast & fix stuff until it compiles then it's probably fine" approach is inadequate when each bug silently subtracts 2% accuracy.

When my loss goes down it's not "cool, it's working!", it is "hmm it should be going down faster, something must be wrong". Okay, </rant> "
 
As a followup - this doesn't feel like the way you'd build the NN if you were doing it from a clean start. The output isn't in the most digestible format so this approach is going to rely heavily on further processing in a subsequent subsystem. And I am told that this is the only block running on the GPU, so the 'subsequent subsystem' is a much lower-power CPU.

Two reasons occur to me for why you would want to rely on a cpu this way. One is that it's easier to program and debug on CPUs than GPUs. Generally you want to keep complicated, non-performance sensitive code on a CPU because it's much easier to manage that way. The other reason would be if you already had all the CPU code working and were just using the GPU stuff to replace something else, like the Mobileye subsystem from AP1.

It looks more probable that there is another 'clean slate' implementation of the NN in the works. It makes no sense for the EAP / FSD codebase to be forever bogged down by the ME architecture. Also, if this is all their entire Autopilot team is doing, that looks fairly inefficient, which I highly doubt.

Maybe that is what JonMc was cautiously optimistic about.
 
Thanks for examining!
How would you interpret Karpathy's month-old tweet with your current knowledge of the NN?

"To get neural nets to work one must be super-OCD about details. With bugs nets will train (they "want" to work), but work silently worse.

The "move fast & fix stuff until it compiles then it's probably fine" approach is inadequate when each bug silently subtracts 2% accuracy.

When my loss goes down it's not "cool, it's working!", it is "hmm it should be going down faster, something must be wrong". Okay, </rant> "

That comment by Karpathy is more or less common sense in DLNN development as I understand it. I believe the comment is probably directed at a general audience and is an attempt to explain something that might not be obvious to someone without experience in this area.

Cause and effect in NNs is extraordinarily complex so tracking down small bugs is an exercise in futility. In order to avoid having to do that you build the system very carefully and test all the components thoroughly before you run the system as a whole. Generally you want to have extremely high confidence that the code is doing what you intend for it to do before you ever start trying to find the source of a problem with a complete, running system. This makes it different from a lot of other engineering disciplines where you have more traceability between cause and effect and can 'debug into functionality' if you have to.
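
A toy example of the kind of component-level check I mean, assuming PyTorch: verify that a single layer does exactly what you intend (output geometry, finite gradients) before it ever goes into the full system.

Code:
# Component-level sanity test for one layer, in isolation from the full network.
import torch
import torch.nn as nn

def test_deconv_upsamples_by_4():
    layer = nn.ConvTranspose2d(128, 16, kernel_size=4, stride=4)
    x = torch.randn(1, 128, 104, 160, requires_grad=True)
    y = layer(x)
    assert y.shape == (1, 16, 416, 640), y.shape   # output geometry is as intended
    y.sum().backward()
    assert torch.isfinite(x.grad).all()            # gradients flow and are finite

test_deconv_upsamples_by_4()
print("ok")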
 
I wonder if what they are doing is using the NN to recognize features, then using those to build a local virtual world which the car then drives through, rather than having the NN directly make the complete driving decisions. Not having programmed anything like that, I wouldn't be surprised if the driving path strategy might be better done algorithmically than in an NN, since setting up the training sounds pretty difficult.
 
Even if your eventual system has power to burn (as the AP2 GPU does for a network of this size)
This is all fascinating info, and it raises so many questions about what really has gone down over the last year. (It seems really unlikely to me that they've spent a year training a net to differentiate between cars and motorcycles, for instance.) But as for something that maybe has an answer, can you estimate what % of GPU time is being consumed by the NN as-is?
 
This is all fascinating info, and it raises so many questions about what really has gone down over the last year. (It seems really unlikely to me that they've spent a year training a net to differentiate between cars and motorcycles, for instance.) But as for something that maybe has an answer, can you estimate what % of GPU time is being consumed by the NN as-is?

I could do an estimate, but maybe we can ask @verygreen to run nvidia-smi or something similar and see what the actual usage is while it's running. It would produce a better number with less effort.
 
The problem with nvidia-smi is it's not shipped with the autopilot firmware. Nvidia SDK does not seem to be available for aarch64 (or at least I cannot readily see it) and I am not sure the tool itself is available (with the necessary libs) in a form that I can compile it myself.

We know that the cameras are sampled at 30fps so it's not any higher than that.
 
I'm assuming we'd need to know the desired frequency of evaluation. Is that known?

I've been told that it is, but that would be immaterial if we did a direct measurement with the right tool. The right profiling tool will average the activity over a useful window of time and return a duty cycle number that will tell you with pretty high confidence what fraction of the capacity is being used. The biggest confounding factor would be if the GPU throttles up and down according to workload, which can make the estimate a lot more complicated.
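
Back-of-the-envelope version of that duty cycle, with a made-up inference time just to show the arithmetic:

Code:
# At 30 fps the frame period is ~33 ms; the 10 ms inference time below is a
# hypothetical placeholder, not a measurement.
frame_period_ms = 1000.0 / 30.0   # ~33.3 ms between camera frames
inference_ms = 10.0               # hypothetical per-frame network runtime
print(f"duty cycle ~ {inference_ms / frame_period_ms:.0%}")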
 
Hm, I found some interesting internal stats periodically output into the logs.
But the units used are not clear:

Code:
16:44:09.77952-0700 t_target_association | gpu load | 5664 | cpu load | 5664 | class instance load| 4368
16:44:09.77954-0700 lk harris core | gpu load | 38610048 | cpu load | 5248 | class instance load| 2160
16:44:09.77956-0700 tracker | gpu load | 1675328 | cpu load | 1758528 | class instance load| 21225544
19:58:49.12804-0700 lk harris core | gpu load | 9856128 | cpu load | 5248 | class instance load| 2160
19:58:49.12811-0700 tracker | gpu load | 1675328 | cpu load | 1758528 | class instance load| 21225544
19:58:49.12812-0700 t_pole_extraction | gpu load | 378176 | cpu load | 0 | class instance load| 304
19:58:49.12814-0700 t_seg | gpu load | 1331200 | cpu load | 0 | class instance load| 168
19:58:49.12819-0700 t_likelihood | gpu load | 1064960 | cpu load | 0 | class instance load| 128
I wonder if there's something in /sys that would tell us something.
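
I believe the Jetson dev kits expose GPU utilization at /sys/devices/gpu.0/load (an integer in tenths of a percent); that path is from memory, and whether the Autopilot Tegra exposes the same node is an open question. If it does, something like this would give an average load:

Code:
# Hedged sketch: sample a sysfs GPU load node and average it over a window.
import time

def sample_gpu_load(path="/sys/devices/gpu.0/load", seconds=10, interval=0.1):
    samples = []
    end = time.time() + seconds
    while time.time() < end:
        with open(path) as f:
            samples.append(int(f.read().strip()) / 10.0)   # convert to percent
        time.sleep(interval)
    return sum(samples) / len(samples)

print("average GPU load: %.1f%%" % sample_gpu_load())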
 
The problem with nvidia-smi is it's not shipped with the autopilot firmware. Nvidia SDK does not seem to be available for aarch64 (or at least I cannot readily see it) and I am not sure the tool itself is available (with the necessary libs) in a form that I can compile it myself.

We know that the cameras are sampled at 30fps so it's not any higher than that.

I guess they wouldn't need that in the shipping system, would they? I'll see if I can find it. Nvidia uses the Cortex-A57 with 64-bit Linux on their Jetson development kit. Seems like they would have nvidia-smi or something similar.
 