Neural Networks

@jimmy_d

 
A new output called super_lanes has been substituted for a previously unnamed output. Super_lanes summarizes the image into a 1000 dimensional vector output, which is interesting because it probably means that the output of super_lanes is being fed into another neural network.

If I understand this right, the implication here is that some of the FSD decision-making will be done by a NN which we haven't seen yet.

If true, then we have been correct on assuming that FSD development has been going on in parallel to EAP. Very cool post, thanks!
 
@jimmy_d awesome stuff! Do you happen to know if they are using vision for the auto high-beam? And if so, have there been any changes? It's pretty terrible right now, and I would love to know if they are actively working on it.

They have to be using vision. It could be an NN, but it might be a different NN than the one that I'm seeing, or it could be the same one.

I haven't seen any functions that seem particularly tailored to dealing with the high beams but it wouldn't be hard to mix that into the other capabilities of the vision NN in a way that would be invisible to me. For instance, the super_lanes output is a high dimensional vector that could be used to capture the gestalt of what the camera sees. Embedded into that vector could be all kinds of things like "night, country road, parked cars, tail lights" and so forth. To extract that information a downstream process could pull the vector apart and evaluate it for some criteria - like "are there headlights coming towards me within 300 ft". If they use that approach then I'd never see evidence of it in the stuff I have access to.
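
To make that idea concrete, here's a minimal sketch (PyTorch, purely illustrative) of a downstream head that consumes a hypothetical 1000-dimensional scene embedding and answers one narrow question. The names, sizes, and criterion are invented, not anything extracted from Tesla's code.

```python
# Minimal sketch: a small downstream head that "pulls apart" a hypothetical
# 1000-dimensional scene embedding to answer one narrow question, e.g.
# "are there oncoming headlights nearby?". Names and sizes are illustrative.
import torch
import torch.nn as nn

class HeadlightProbe(nn.Module):
    def __init__(self, embedding_dim: int = 1000):
        super().__init__()
        # A tiny classifier head evaluating one criterion from the embedding.
        self.head = nn.Sequential(
            nn.Linear(embedding_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, scene_embedding: torch.Tensor) -> torch.Tensor:
        # Probability that the criterion is met for each frame.
        return torch.sigmoid(self.head(scene_embedding))

probe = HeadlightProbe()
fake_embedding = torch.randn(1, 1000)   # stand-in for a super_lanes-style vector
print(probe(fake_embedding))            # e.g. tensor([[0.47]])
```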

OTOH it's been around for a while (high beams) so it's conceivable that they used a simple heuristic approach to get it out the door and maybe they haven't revisited it since then. It's the kind of thing that's easy to get working at a basic level but hard to get it to work really well. In the long run they'll want it in the NN but maybe right now they have more pressing concerns on the vision front? Hard to say.
 
If I understand this right, the implication here is that some of the FSD decision-making will be done by a NN which we haven't seen yet.

If true, then we have been correct on assuming that FSD development has been going on in parallel to EAP. Very cool post, thanks!

The decision making is not getting done in these networks. These are pure perception networks that only look at an instant of time and only process the input from a camera.

I can't come up with a compelling argument either that 1) FSD will be an extension of EAP or, alternately, that 2) FSD is not being developed in common with EAP. There's a ton of stuff that must be present in their FSD vision code which isn't showing up in what I've seen so far. For instance, visual flow needs to be processed from full-frame analysis of sequential frames. The FSD demo video that they released clearly shows visual flow being extracted, but the vision networks I've seen so far are not capable of comparing sequential frames, and downstream processing of the output of these networks is very unlikely to be capable of extracting flow. That means that, at a minimum, different vision networks are being used for FSD from what I see here for AP.
 
They have to be using vision. It could be an NN, but it might be a different NN than the one that I'm seeing, or it could be the same one.

I haven't seen any functions that seem particularly tailored to dealing with the high beams but it wouldn't be hard to mix that into the other capabilities of the vision NN in a way that would be invisible to me. For instance, the super_lanes output is a high dimensional vector that could be used to capture the gestalt of what the camera sees. Embedded into that vector could be all kinds of things like "night, country road, parked cars, tail lights" and so forth. To extract that information a downstream process could pull the vector apart and evaluate it for some criteria - like "are there headlights coming towards me within 300 ft". If they use that approach then I'd never see evidence of it in the stuff I have access to.

OTOH it's been around for a while (high beams) so it's conceivable that they used a simple heuristic approach to get it out the door and maybe they haven't revisited it since then. It's the kind of thing that's easy to get working at a basic level but hard to get it to work really well. In the long run they'll want it in the NN but maybe right now they have more pressing concerns on the vision front? Hard to say.

Thanks for the reply.

That makes total sense as to why you would not be able to see it. I just wanted to check whether they were using a specific NN for it or not.
 
The decision making is not getting done in these networks. These are pure perception networks that only look at an instant of time and only process the input from a camera.

I can't come up with a compelling argument either that 1) FSD will be an extension of EAP or, alternately, that 2) FSD is not being developed in common with EAP. There's a ton of stuff that must be present in their FSD vision code which isn't showing up in what I've seen so far. For instance, visual flow needs to be processed from full-frame analysis of sequential frames. The FSD demo video that they released clearly shows visual flow being extracted, but the vision networks I've seen so far are not capable of comparing sequential frames, and downstream processing of the output of these networks is very unlikely to be capable of extracting flow. That means that, at a minimum, different vision networks are being used for FSD from what I see here for AP.

If they get object recognition working well with EAP and they are using a ConvNet, then can they use the heat map output from each frame to track object movements for the higher level control routines to use?

Hypothetical: lanes are a set of path vectors to group objects and their movements. For example, head-on (parallel), crossing (perpendicular), or other angles. Mapping would be denser up close and sparser for objects further away, where the objects don't matter as soon and the path is less predictable.
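
As a toy illustration of that kind of frame-to-frame association, the sketch below greedily matches per-frame detection centroids (e.g. from a ConvNet's bounding-box head) to existing tracks. Everything here is invented for illustration; it is not Tesla's tracker.

```python
# Toy sketch: associate per-frame detections (bounding-box centroids) across
# frames with greedy nearest-neighbor matching, giving crude object tracks that
# a higher-level control routine could consume. Purely illustrative.
import math

def track(prev_tracks, detections, max_dist=50.0):
    """prev_tracks: {track_id: (x, y)}, detections: list of (x, y) centroids."""
    tracks = {}
    unmatched = list(detections)
    for tid, (px, py) in prev_tracks.items():
        if not unmatched:
            break
        # Pick the closest new detection for this existing track.
        best = min(unmatched, key=lambda d: math.hypot(d[0] - px, d[1] - py))
        if math.hypot(best[0] - px, best[1] - py) <= max_dist:
            tracks[tid] = best
            unmatched.remove(best)
    # Any leftover detections start new tracks.
    next_id = max(prev_tracks, default=-1) + 1
    for d in unmatched:
        tracks[next_id] = d
        next_id += 1
    return tracks

frame1 = track({}, [(100, 200), (400, 220)])       # two new objects
frame2 = track(frame1, [(108, 203), (395, 226)])   # same objects, moved slightly
print(frame1, frame2, sep="\n")
```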
 
If they get object recognition working well with EAP and they are using a ConvNet, then can they use the heat map output from each frame to track object movements for the higher level control routines to use?

And I guess it would be easier to do this and get predictable and repeatable results based on a series of "sterilised" snapshots from super_lanes than from the raw feed.
 
If they get object recognition working well with EAP and they are using a ConvNet, then can they use the heat map output from each frame to track object movements for the higher level control routines to use?

Hypothetical: lanes are a set of path vectors to group objects and their movements. For example, head-on (parallel), crossing (perpendicular), or other angles. Mapping would be denser up close and sparser for objects further away, where the objects don't matter as soon and the path is less predictable.

Quite true - that can be done. I don't mean to imply that there's no way to understand the movement of objects in the world from the output of the current AP vision. The method you suggest should work for that and I wouldn't be surprised if Tesla is already using that sort of thing to predict the trajectories of moving objects.

My comment about flow is referring to this particular technique: [1504.06852] FlowNet: Learning Optical Flow with Convolutional Networks. In the FSD video you can see the side view cameras are showing motion flow as green line overlays most noticeable when passing foliage and so forth. This isn't the detection of motion in identified objects, it's the detection of background motion through texture translation. The linked paper is an example of how this is done using NNs today. Of course there are other methods but they are computationally intense enough that they will be run on a discrete GPU. And the requisite infrastructure isn't present in the NNs that I have seen so far.
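
For concreteness, here's a bare-bones sketch of the FlowNetSimple idea from that paper: two consecutive frames are stacked channel-wise at the input, which is exactly the kind of input plumbing that would be visible in a network definition file. Layer counts and sizes are illustrative only.

```python
# Bare-bones sketch of the FlowNetSimple idea: two consecutive RGB frames are
# stacked channel-wise into a 6-channel input so the network can learn
# per-pixel motion. The key point is that flow estimation needs two frames at
# the input. Layer sizes are illustrative, not anything from Tesla's networks.
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3),   # frames t and t+1 stacked
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Predict a 2-channel flow field (dx, dy) at reduced resolution.
        self.flow = nn.Conv2d(128, 2, kernel_size=3, padding=1)

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame_t, frame_t1], dim=1)   # (N, 6, H, W)
        return self.flow(self.encoder(x))           # (N, 2, H/4, W/4)

net = TinyFlowNet()
f0, f1 = torch.randn(1, 3, 128, 256), torch.randn(1, 3, 128, 256)
print(net(f0, f1).shape)   # torch.Size([1, 2, 32, 64])
```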

Sorry if I wasn't clear.
 
Quite true - that can be done. I don't mean to imply that there's no way to understand the movement of objects in the world from the output of the current AP vision. The method you suggest should work for that and I wouldn't be surprised if Tesla is already using that sort of thing to predict the trajectories of moving objects.

My comment about flow is referring to this particular technique: [1504.06852] FlowNet: Learning Optical Flow with Convolutional Networks. In the FSD video you can see the side view cameras are showing motion flow as green line overlays most noticeable when passing foliage and so forth. This isn't the detection of motion in identified objects, it's the detection of background motion through texture translation. The linked paper is an example of how this is done using NNs today. Of course there are other methods but they are computationally intense enough that they will be run on a discrete GPU. And the requisite infrastructure isn't present in the NNs that I have seen so far.

Sorry if I wasn't clear.

Oh no, I was the unclear one. I was just posing an idea; it wasn't a pushback against your post.

I've messed around a bit with the motion detection examples from Nvidia. It seemed to me that blobifying objects and then tracking them, and/or identifying pavement and applying a perspective-based distance mapping, might be effective for preemptive collision detection.
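
For the distance-mapping part, a small sketch of the usual ground-plane homography approach is below; the calibration correspondences are made up and would need to be measured for a real camera.

```python
# Small sketch of perspective-based distance mapping: a homography maps image
# pixels on the road surface to ground-plane coordinates (meters), so the
# bottom edge of a detected blob gives an approximate forward distance.
# The four calibration correspondences are invented for illustration.
import numpy as np
import cv2

# Image points (px) of four road markings and their known ground positions (m).
image_pts  = np.float32([[600, 700], [680, 700], [560, 500], [720, 500]])
ground_pts = np.float32([[-1.8, 5.0], [1.8, 5.0], [-1.8, 30.0], [1.8, 30.0]])
H = cv2.getPerspectiveTransform(image_pts, ground_pts)

def distance_to_blob(bottom_center_px):
    """Approximate forward distance (m) to a blob touching the road surface."""
    pt = np.float32([[bottom_center_px]])        # shape (1, 1, 2)
    x, y = cv2.perspectiveTransform(pt, H)[0, 0]
    return y                                      # forward distance in meters

print(round(distance_to_blob((640, 600)), 1))     # somewhere between 5 m and 30 m
```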
 
A new output called super_lanes has been substituted for a previously unnamed output. Super_lanes summarizes the image into a 1000 dimensional vector output, which is interesting because it probably means that the output of super_lanes is being fed into another neural network.

So is super_lanes the only vector that is an output from the main/narrow NN? Are there other output vectors too? Do they have names?

The main network now uses about 50% larger data flows between layers, which will increase the number of parameters by more than 2x and will substantially increase the representational power of the network and the amount of data required to train it. All other things being equal this network will have a more ‘nuanced’ set of perceptions. The inputs are the same and the outputs are the same with one exception

Wait a moment...
you are saying the only changes in the main/narrow neural network are that they increased the data flow and changed the name of an output vector?

I feel like I missed something. The rest of the network architecture (layer size/type/order, etc.) is all still the same?
 
Changes to repeater network

The repeater network in 10.4 has been truncated to 4 inception layers where the previous repeater network was a full 9 inception layers. The outputs are the same as before - a six class segmentation map (labels each pixel in the camera view as one of 6 categories) plus bounding boxes for objects.

Ooooh. Sorry if this has been mentioned before... does the main/narrow network also output a six class segmentation map and bounding boxes??

Changes to the fisheye_wiper network

Additionally, segmentation and bounding box outputs have been added for the fisheye, so it seems like the fisheye is also getting trained to recognize things other than rain. Which might mean that it’s also going to be scanning the field of view for cars and pedestrians, or it could mean that it’s specifically sensing stuff like bird poo and dead bugs so that it can respond appropriately.

Oooo! Good catch. I figured the wide angle camera would eventually have to be used for scanning a wider FOV for cars/pedestrians... but figured they are a ways off from that point in development.
 
I got a chance to look at definition files for a new set of vision NNs which I understand to be the ones which are going out in 2018.10.4. I’m going to summarize the differences here. For background on what I found in earlier networks (2017.28, 2017.34, and 2017.44) please see this post from last November: Neural Networks

Cameras

I’ve seen three new networks which I’m going to refer to as main, fisheye, and repeater. These names come from filenames used for the network definitions as well as from variable names used inside the networks. I believe main is used for both the main and narrow forward facing cameras, that fisheye is used for the wide angle forward facing camera, and that repeater is used for both of the repeater cameras.

Overview

These network definition files that I’m talking about are used to describe how the ‘neurons’ in a neural network are arranged and interconnected - the network architecture. This architecture defines the inputs and outputs for the network and the shape of the tensors (the data) that flow through the network. By comparing the architecture to other well known networks it’s possible to understand what kind of processing is occurring, how the network is being trained, what kind of performance is possible, how much data of what kind is needed to train the network, and how much computer power is required to run the network.
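
As a toy example of that kind of inspection, and assuming the definition files are Caffe-style prototxt (an assumption - the real format may differ, and the filename below is hypothetical), a quick scan might look like this:

```python
# Toy sketch: list named layers and count layer types in a Caffe-style
# prototxt network definition. The file format and filename are assumptions
# made for illustration; a real analysis would parse the actual format.
import re
from collections import Counter

def summarize_prototxt(path):
    text = open(path).read()
    names = re.findall(r'name:\s*"([^"]+)"', text)
    types = re.findall(r'type:\s*"([^"]+)"', text)
    print(f"{len(names)} named layers")
    for layer_type, count in Counter(types).most_common():
        print(f"  {layer_type}: {count}")

summarize_prototxt("aknet_main.prototxt")   # hypothetical filename
```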

Review of previous network

2017.44 was an inception network closely modeled on GoogLeNet - an award-winning vision processing network design that was invented by Google about 4 years ago. GoogLeNet (GLN) is probably the single most popular high performance vision network in use today because it combines high accuracy with good computational efficiency. It can be slow to train but it runs very fast when deployed. The architecture is well understood and flexible - it can easily be adapted to different kinds of imaging data. The foundation of 2017.44’s main, repeater, and pillar networks (actually, introduced in 2017.42) was almost identical to GLN with only the most minimal changes required to adapt to the camera type and to provide the particular kinds of outputs that AP2 needed. The fisheye_wiper (introduced with 17.44) was based on a truncated GLN with 3 inception layers instead of the normal 9.
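
For readers unfamiliar with GLN, a minimal inception module looks roughly like the sketch below (PyTorch, with the classic "3a" channel counts - illustrative, not Tesla's):

```python
# Minimal sketch of a GoogLeNet-style inception module: four parallel branches
# (1x1, 1x1->3x3, 1x1->5x5, pool->1x1) whose outputs are concatenated along the
# channel axis. Channel counts are the textbook "3a" sizes, chosen only as an
# example.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3r, 1), nn.ReLU(),
                                nn.Conv2d(c3r, c3, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5r, 1), nn.ReLU(),
                                nn.Conv2d(c5r, c5, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
x = torch.randn(1, 192, 28, 28)
print(m(x).shape)   # torch.Size([1, 256, 28, 28])
```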

All of these networks had custom output stages that took the high level abstractions generated by GLN and interpreted them in various ways that would be useful for downstream processing. The fisheye_wiper network only put out a simple value - presumably an indicator of how hard it was raining. The repeater and pillar networks identified and located six classes of objects (note that objects here can include not just discrete items like pedestrians and vehicles but also, for instance, areas of pavement). The main network (used twice for both main forward camera and narrow forward camera) had generic object outputs as well as some more specialized outputs (for instance, for identifying the presence of adjacent lanes and the road shoulder).
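
Roughly, a custom output stage of this kind amounts to several task heads hanging off one shared backbone feature map. The sketch below is illustrative only, with invented head names, channel counts, and shapes:

```python
# Rough sketch of a multi-head output stage: a per-pixel segmentation head,
# a box-regression head, and a single scalar (rain-style) head, all sitting on
# one shared backbone feature map. Head shapes and names are illustrative.
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    def __init__(self, feat_ch=480, num_classes=6, boxes_per_cell=2):
        super().__init__()
        self.segmentation = nn.Conv2d(feat_ch, num_classes, 1)      # per-pixel class logits
        self.boxes = nn.Conv2d(feat_ch, boxes_per_cell * 4, 1)      # (x, y, w, h) per cell
        self.scalar = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(feat_ch, 1))          # e.g. "how hard is it raining"

    def forward(self, features):
        return {
            "segmentation": self.segmentation(features),
            "boxes": self.boxes(features),
            "scalar": self.scalar(features),
        }

heads = OutputHeads()
feats = torch.randn(1, 480, 30, 40)     # stand-in for a backbone feature map
outs = heads(feats)
print({k: tuple(v.shape) for k, v in outs.items()})
# {'segmentation': (1, 6, 30, 40), 'boxes': (1, 8, 30, 40), 'scalar': (1, 1)}
```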

Changes in 2018.10.4

As of 2017.44 - the most recent network I’ve seen that was a substantial departure from earlier versions - there were versions of main, fisheye, and repeater networks in use and also another network referred to as ‘pillar’, which was probably used for the b-pillar cameras. I understand that pillar is not present in 2018.10.4. This could mean that the b-pillar cameras were used in 44 but are not being used in 18.10.4, or it might not. In 44 the networks for the pillar and repeater cameras were identical in structure but had different parameters. It’s possible that they could be merged functionally, with a single network being used for both repeaters and for pillars. Merging them would reduce their accuracy but it could lead to procedural and computational efficiency gains.

Changes to the network for main and narrow cameras

The main network now uses about 50% larger data flows between layers, which will increase the number of parameters by more than 2x and will substantially increase the representational power of the network and the amount of data required to train it. All other things being equal this network will have a more ‘nuanced’ set of perceptions. The inputs are the same and the outputs are the same with one exception. A new output called super_lanes has been substituted for a previously unnamed output. Super_lanes summarizes the image into a 1000 dimensional vector output, which is interesting because it probably means that the output of super_lanes is being fed into another neural network.

(BTW - the internal name on this main network is now “aknet”. Andrej Karpathy Net?)

Changes to repeater network

The repeater network in 10.4 has been truncated to 4 inception layers where the previous repeater network was a full 9 inception layers. The outputs are the same as before - a six class segmentation map (labels each pixel in the camera view as one of 6 categories) plus bounding boxes for objects.

Changes to the fisheye_wiper network

This network remains a truncated GLN. It appears to have been rewritten in a syntax that is now similar to the other networks. The previous fisheye network was in a different syntax and seemed to have been a holdover from some earlier generation of development tools. The new fisheye has some small changes introduced to the earlier layers but it still has just 2 inception layers and it still outputs a single class value for rain (one of 5 choices which are probably various types/degrees of rain). (It recently occurred to me that snow or ice might be included in these categories.) The new version seems to break the field of view into 4 quadrants and output a class for each one where the old network did not subdivide the field of view. Maybe rain looks different against the road rather than the sky. Additionally, segmentation and bounding box outputs have been added for the fisheye, so it seems like the fisheye is also getting trained to recognize things other than rain. Which might mean that it’s also going to be scanning the field of view for cars and pedestrians, or it could mean that it’s specifically sensing stuff like bird poo and dead bugs so that it can respond appropriately.
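
A per-quadrant classifier of that sort could be as simple as the sketch below (entirely illustrative - channel counts and class meanings are assumptions, not the actual fisheye_wiper design):

```python
# Hedged sketch of a per-quadrant classifier: pool the backbone feature map
# down to a 2x2 grid and predict one of 5 rain-style classes for each quadrant.
# Channel counts and class meanings are invented for illustration.
import torch
import torch.nn as nn

class QuadrantRainHead(nn.Module):
    def __init__(self, feat_ch=256, num_classes=5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(2)              # 2x2 = four quadrants
        self.classify = nn.Conv2d(feat_ch, num_classes, 1)

    def forward(self, features):
        logits = self.classify(self.pool(features))      # (N, 5, 2, 2)
        return logits.argmax(dim=1)                      # class index per quadrant

head = QuadrantRainHead()
feats = torch.randn(1, 256, 24, 32)
print(head(feats))   # e.g. tensor([[[3, 0], [1, 4]]])
```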

Summary

So the main and narrow camera network is getting quite a bit more powerful, the repeater has been simplified, and fisheye has been remodeled with possibly some non-rain functions being included.

As a reminder - these networks are only for processing camera inputs. They take single frame images and interpret them individually. Downstream processing has to take these camera outputs and interpret them as a sequence, combine them with perception from other sensors, and make driving decisions. This is only a small part of the overall driving software suite, but vision is an important part of the vehicle’s perception capacity and changes to these networks might be a good indicator of progress in the development of AP.

Thank you for the awesome detailed info... I can’t wait to see what 2-3 large updates look like when they actually use all cameras together. I feel like we are so close to being chauffeured (at least on the freeway).

Thank you for your insightful work again, please keep it up
 
You all convinced me to take a drive last night to see how improved things really were (they are; I won’t go into detail here since so many others have already commented and demonstrated and posted video...) but I did want to mention that I felt the auto high beams were working phenomenally last night.

Could totally be placebo, but it was something I noticed.

I also believe some adjustments had to be made to the car-control side of things. Previously, the car seemed aware it wasn’t in the lane - it KNEW it was over the lane lines, but last night, it actually steered correctly to stay IN the lanes that it knew were there. No doubt the NN updates probably play a very large role, but actual car control seems vastly improved too. (Maybe just a case of “less garbage in / less garbage out?”)
 
Ooooh. Sorry if this has been mentioned before... does the main/narrow network also output a six class segmentation map and bounding boxes??

The output branches of main/narrow are more complicated than just generic categories, but there are enough 'object category' type outputs that it could be a superset of what the other cameras are looking for.

Oooo! Good catch. I figured the wide angle camera would eventually have to be used for scanning a wider FOV for cars/pedestrians... but figured they are a ways off from that point in development.

Yeah, in FSD the fisheye will have to do a lot of things besides operate the wipers.
 
jimmy_d - great review. Your last paragraph about "combining these camera outputs with perception from other sensors and making driving decisions" describes what will be the most critical step if Tesla is to roll out FSD in the near future.
You all convinced me to take a drive last night to see how improved things really were (they are; I won’t go into detail here since so many others have already commented and demonstrated and posted video...) but I did want to mention that I felt the auto high beams were working phenomenally last night.

Could totally be placebo, but it was something I noticed.

I also believe some adjustments had to be made to the car-control side of things. Previously, the car seemed aware it wasn’t in the lane - it KNEW it was over the lane lines, but last night, it actually steered correctly to stay IN the lanes that it knew were there. No doubt the NN updates probably play a very large role, but actual car control seems vastly improved too. (Maybe just a case of “less garbage in / less garbage out?”)

Yes. In recent builds the car clearly knew when it was out of its lane because it would almost always have correct lane lines showing in the display even as it wandered across the lane boundaries. There had to have been some kind of control limitation that was preventing the vehicle from moving back to the center of the lane even when it knew it was in the 'wrong' place. Another poster relayed information from Tesla tech support to the effect that the car was only 'allowed' to turn a certain amount depending on speed, which nicely explains this weird mismatch between its perception and its actions. If lane holding is substantially improved now (I haven't received the update, myself) then Tesla must be confident enough in the new lane detection and steering planning processes that they are comfortable with relaxing those steering limitations.
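
For illustration, a speed-dependent steering limit of the kind described might look like the toy sketch below; the numbers are invented, not Tesla's actual limits.

```python
# Toy sketch of a speed-dependent steering limit: the commanded steering angle
# is clamped to a maximum that shrinks with speed, so at highway speed the
# controller cannot make a sharp correction even if perception knows the car is
# out of its lane. All numbers are invented for illustration.
def clamp_steering(requested_deg: float, speed_mph: float) -> float:
    # Hypothetical limit: generous at parking-lot speeds, tight at highway speeds.
    max_deg = max(2.0, 30.0 - 0.4 * speed_mph)
    return max(-max_deg, min(max_deg, requested_deg))

for speed in (15, 40, 70):
    print(speed, clamp_steering(12.0, speed))
# 15 -> 12.0 (limit 24.0), 40 -> 12.0 (limit 14.0), 70 -> 2.0 (limit 2.0)
```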
 
Yes. In recent builds the car clearly knew when it was out of its lane because it would almost always have correct lane lines showing in the display even as it wandered across the lane boundaries. There had to have been some kind of control limitation that was preventing the vehicle from moving back to the center of the lane even when it knew it was in the 'wrong' place. Another poster relayed information from Tesla tech support to the effect that the car was only 'allowed' to turn a certain amount depending on speed, which nicely explains this weird mismatch between its perception and its actions. If lane holding is substantially improved now (I haven't received the update, myself) then Tesla must be confident enough in the new lane detection and steering planning processes that they are comfortable with relaxing those steering limitations.

I have a local winding road I test AP updates on and I can say that at high enough speed it still veers out of its lane across the center line on turns. It does correct back into the middle but if there were oncoming cars it would be quite dangerous.

And this isn't a freakishly high speed; it's a 35 mph road where folks typically drive 40 to 45, and I set it at 40 mph (due to local road Autosteer restrictions) and it will still veer.

At 35 mph, though, the turn is negotiated perfectly, which it never did before. So it is improving, but it's still not perfect there.

What I'm interested to test is freeway curves at 90 mph on Autosteer with no trailing car. That's always been incredibly sketchy as it gets really close to the wall. It seems to be a combination of detecting the curve too late and not turning enough. I haven't tried with 10.4 though!