
Neural Networks

NN Changes in V9 (2018.39.7)

Have not had much time to look at V9 yet, but I thought I’d share some interesting preliminary analysis. Please note that network size estimates here are spreadsheet calculations derived from a large number of raw kernel specifications. I think they’re about right and I’ve checked them over quite carefully, but it’s a lot of math and there might be some errors.

First, some observations:

Like V8, the V9 NN (neural net) system seems to consist of a set of what I call ‘camera networks’, which process camera output directly, and a separate set of what I call ‘post processing’ networks, which take output from the camera networks and turn it into higher-level actionable abstractions. So far I’ve only looked at the camera networks for V9, but it’s already apparent that V9 is a pretty big change from V8.

---------------
One unified camera network handles all 8 cameras

Same weight file being used for all cameras (this has pretty interesting implications and previously V8 main/narrow seems to have had separate weights for each camera)

Processed resolution of 3 front cameras and back camera: 1280x960 (full camera resolution)

Processed resolution of pillar and repeater cameras: 640x480 (1/2x1/2 of camera’s true resolution)

all cameras: 3 color channels, 2 frames (2 frames also has very interesting implications)

(was 640x416, 2 color channels, 1 frame, only main and narrow in V8)
------------

Various V8 versions included networks for pillar and repeater cameras in the binaries but AFAIK nobody outside Tesla ever saw those networks in operation. Normal AP use on V8 seemed to only include the use of main and narrow for driving and the wide angle forward camera for rain sensing. In V9 it’s very clear that all cameras are being put to use for all the AP2 cars.

The basic camera NN (neural network) arrangement is an Inception V1 type CNN with L1/L2/L3ab/L4abcdefg layer arrangement (architecturally similar to V8 main/narrow camera up to end of inception blocks but much larger)
  • about 5x as many weights as comparable portion of V8 net
  • about 18x as much processing per camera (front/back)
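
For anyone who hasn't seen one, here's a minimal sketch of a single Inception-V1-style block in PyTorch. This is a generic textbook module, not Tesla's code: the channel counts are invented, and the real L1/L2/L3ab/L4abcdefg stack is only inferred from the binary analysis described here.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Generic GoogLeNet-style block: four parallel branches, concatenated."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Run the branches in parallel and stack their outputs along channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Made-up channel counts; a 2-frame RGB input has 6 channels (see below).
block = InceptionBlock(6, 16, 24, 32, 4, 8, 8)
y = block(torch.randn(1, 6, 96, 128))   # -> torch.Size([1, 64, 96, 128])
```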

For perspective, the V9 camera network is 10x larger and requires 200x more computation when compared to Google’s Inception V1 network, from which V9 gets its underlying architectural concept. That’s processing *per camera* for the 4 front and back cameras. Side cameras are 1/4 the processing due to having 1/4 as many total pixels. With all 8 cameras being processed in this fashion it’s likely that V9 is straining the compute capability of the APE. The V8 network, by comparison, probably had lots of margin.

network outputs:
  • V360 object decoder (multi level, processed only)
  • back lane decoder (back camera plus final processed)
  • side lane decoder (pillar/repeater cameras plus final processed)
  • path prediction pp decoder (main/narrow/fisheye cameras plus final processed)
  • “super lane” decoder (main/narrow/fisheye cameras plus final processed)

Previous V8 aknet included a lot of processing after the inception blocks - about half of the camera network processing was taken up by non-inception weights. V9 only includes inception components in the camera network and instead passes the inception processed outputs, raw camera frames, and lots of intermediate results to the post processing subsystem. I have not yet examined the post processing subsystem.

And now for some speculation:

Input changes:

The V9 network takes 1280x960 images with 3 color channels and 2 frames per camera from, for example, the main camera. That’s 1280x960x3x2 as an input, or 7.3MB. The V8 main camera processing frame was 640x416x2 or 0.5MB - 13x less data. The extra resolution means that V9 has access to smaller and more subtle detail from the camera, but the more interesting aspect of the change to the camera interface is that camera frames are being processed in pairs. The two frames are likely time-offset by some small delay - 10ms to 100ms I’d guess - allowing each processed camera input to see motion. Motion can give you depth, separate objects from the background, help identify objects, predict object trajectories, and provide information about the vehicle’s own motion. It’s a pretty fundamental improvement to the basic perception of the system.
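
As a rough illustration, one plausible way to feed a pair of time-offset frames is to stack them along the channel dimension. The actual layout inside the network is unknown, so treat this as a sketch of the data volume rather than the real format:

```python
import numpy as np

def make_two_frame_input(frame_now, frame_prev):
    """frame_now / frame_prev: 960x1280x3 uint8 arrays from the same camera."""
    # 3 color channels x 2 time steps -> 6 channels per pixel.
    return np.concatenate([frame_prev, frame_now], axis=-1)

frame_prev = np.zeros((960, 1280, 3), dtype=np.uint8)   # frame from ~10-100 ms ago
frame_now  = np.zeros((960, 1280, 3), dtype=np.uint8)   # current frame
x = make_two_frame_input(frame_now, frame_prev)

print(x.shape, x.size)   # (960, 1280, 6)  7372800  -- the ~7.3M figure above
```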

Camera agnostic:

The V8 main/narrow network used the same architecture for both cameras, but by my calculation it was probably using different weights for each camera (probably 26M each for a total of about 52M). This makes sense because main/narrow have very different FOVs, which means the precise shape of objects they see varies quite a bit - especially towards the edges of frames. Training each camera separately is going to dramatically simplify the problem of recognizing objects since the variation goes down a lot. That means it’s easier to get decent performance with a smaller network and less training. But it also means you have to build separate training data sets, evaluate them separately, and load two different networks alternately during operation. It also means that your network can learn some bad habits because it always sees the world in the same way.

Building a camera agnostic network relaxes these problems and simultaneously makes the network more robust when used on any individual camera. Being camera agnostic means the network has to have a better sense of what an object looks like under all kinds of camera distortions. That’s a great thing, but it’s very, *very* expensive to achieve because it requires a lot of training, a lot of training data and, probably, a really big network. Nobody builds them so it’s hard to say for sure, but these are probably safe assumptions.

Well, the V9 network appears to be camera agnostic. It can process the output from any camera on the car using the same weight file.

It also has the fringe benefit of improved computational efficiency. Since you just have the one set of weights you don’t have to constantly be swapping weight sets in and out of your GPU memory and, even more importantly, you can batch up blocks of images from all the cameras together and run them through the NN as a set. This can give you a multiple of performance from the same hardware.
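
A toy illustration of that batching win, with a single shared conv layer standing in for the camera network and the shapes scaled down from 1280x960 / 640x480 to keep the example light:

```python
import torch
import torch.nn as nn

# One shared set of weights standing in for the camera network.
shared_net = nn.Conv2d(6, 64, 3, padding=1)

# Full-res cameras (main, narrow, fisheye, rear) batch together...
front_back = torch.randn(4, 6, 96, 128)
# ...and the half-res pillar/repeater cameras batch together.
sides = torch.randn(4, 6, 48, 64)

features_fb   = shared_net(front_back)   # one forward pass covers 4 cameras
features_side = shared_net(sides)        # one more pass covers the other 4
# With per-camera weights you would instead need 8 passes and 8 weight swaps.
```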

I didn’t expect to see a camera agnostic network for a long time. It’s kind of shocking.

Considering network size:

This V9 network is a monster, and that’s not the half of it. When you increase the number of parameters (weights) in an NN by a factor of 5 you don’t just get 5 times the capacity and need 5 times as much training data. In terms of expressive capacity increase it’s more akin to a number with 5 times as many digits. So if V8’s expressive capacity was 10, V9’s capacity is more like 100,000. It’s a mind boggling expansion of raw capacity. And likewise the amount of training data doesn’t go up by a mere 5x. It probably takes at least thousands and perhaps millions of times more data to fully utilize a network that has 5x as many parameters.

This network is far larger than any vision NN I’ve seen publicly disclosed and I’m just reeling at the thought of how much data it must take to train it. I sat on this estimate for a long time because I thought that I must have made a mistake. But going over it again and again I find that it’s not my calculations that were off, it’s my expectations that were off.

Is Tesla using semi-supervised training for V9? They've gotta be using more than just labeled data - there aren't enough humans to label this much data. I think all those simulation designers they hired must have built a machine that generates labeled data for them, but even so.

And where are they getting the datacenter to train this thing? Did Larry give Elon a warehouse full of TPUs?

I mean, seriously...

I look at this thing and I think - oh yeah, HW3. We’re gonna need that. Soon, I think.

Omnidirectionality (V360 object decoder):

With these new changes the NN should be able to identify every object in every direction at distances up to hundreds of meters, and also provide approximate instantaneous relative movement for all of those objects. If you consider the FOV overlap of the cameras, virtually all objects will be seen by at least two cameras. That provides the opportunity for downstream processing to use multiple perspectives on an object to more precisely localize and identify it.
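
As a toy illustration of why the overlap matters (Tesla's actual fusion method is unknown, and the positions and angles below are made up): once two cameras at known positions both report a bearing to the same object, simple triangulation pins down its location.

```python
import math

def triangulate(cam1, bearing1, cam2, bearing2):
    """cam1/cam2: (x, y) camera positions in meters; bearings: absolute angles in radians."""
    d1 = (math.cos(bearing1), math.sin(bearing1))
    d2 = (math.cos(bearing2), math.sin(bearing2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        return None                         # parallel rays: no fix
    dx, dy = cam2[0] - cam1[0], cam2[1] - cam1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom   # distance along camera 1's ray
    return (cam1[0] + t * d1[0], cam1[1] + t * d1[1])

# Same object seen by a windshield camera and a pillar camera (positions made up).
print(triangulate((0.0, 0.0), math.radians(10), (0.5, -0.8), math.radians(25)))
```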

General thoughts:

I’ve been driving V9 AP2 for a few days now and I find the dynamics to be much improved over recent V8. Lateral control is tighter and it’s been able to beat all the V8 failure scenarios I’ve collected over the last 6 months. Longitudinal control is much smoother, traffic handling is much more comfortable. V9’s ability to prospectively do a visual evaluation on a target lane prior to making a change makes the auto lane change feature a lot more versatile. I suspect detection errors are way down compared to V8 but I also see that a few new failure scenarios have popped up (offramp / onramp speed control seem to have some bugs). I’m excited to see how this looks in a couple of months after they’ve cleaned out the kinks that come with any big change.

Being an avid observer of progress in deep neural networks my primary motivation for looking at AP2 is that it’s one of the few bleeding edge commercial applications that I can get my hands on and I use it as a barometer of how commercial (as opposed to research) applications are progressing. Researchers push the boundaries in search of new knowledge, but commercial applications explore the practical ramifications of new techniques. Given rapid progress in algorithms I had expected near future applications might hinge on the great leaps in efficiency that are coming from new techniques. But that’s not what seems to be happening right now - probably because companies can do a lot just by scaling up NN techniques we already have.

In V9 we see Tesla pushing in this direction. Inception V1 is a four-year-old architecture that Tesla is scaling to a degree that I imagine Inception’s creators could not have expected. Indeed, I would guess that four years ago most people in the field would not have expected that scaling would work this well. Scaling computational power, training data, and industrial resources plays to Tesla’s strengths and involves less uncertainty than potentially more powerful but less mature techniques. At the same time Tesla is doubling down on their ‘vision first / all neural networks’ approach and, as far as I can tell, it seems to be going well.

As a neural network dork I couldn’t be more pleased.
 
Traffic lights and localization should be really simple problems to solve, if done right.

End-to-end neural networks, or human imitation: forget it. There are too many local variations for it to work. You'd need an equal amount of training data from every locale in the world, and somehow feed a geo-location code into the network too so the network knows which legal rules apply. This is not scalable at all, nor easy to troubleshoot, assuming you even have a big enough model and enough GPU capacity to take in all this information without compromising too much.

What you need to do is make a darn good visual interpretation network that is of rather limited size and changes seldom, and is thus highly reliable. Among many other things, this network will recognize traffic lights, their shapes, and which lights are illuminated within each shape. This info is transformed into a 3D world along with the metadata.

All driving policy can be implemented with relatively "simple" software 1.0 code as long as you have the metadata interpreted. Apply location parameters to this driving policy module, and it will know that a green arrow to the right means "you're good to go" in Norway.
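
A toy example of that idea (nobody's actual code; the types and rules are simplified placeholders): the vision network emits traffic-light metadata, and plain software 1.0 rules plus a locale parameter decide what it means.

```python
from dataclasses import dataclass

@dataclass
class TrafficLight:
    shape: str   # "circle", "right_arrow", ...
    color: str   # "red", "yellow", "green"

def may_proceed_right(light: TrafficLight, locale: str) -> bool:
    """Grossly simplified: may we continue to the right at this light?"""
    if light.shape == "right_arrow" and light.color == "green":
        return True                      # green right arrow: good to go (e.g. Norway)
    if light.shape == "circle" and light.color == "red":
        # Right turn on red after stopping is allowed in much of the US, not in Norway.
        return locale == "US"
    return light.color == "green"

print(may_proceed_right(TrafficLight("right_arrow", "green"), locale="NO"))  # True
print(may_proceed_right(TrafficLight("circle", "red"), locale="NO"))         # False
```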
 
All driving policy can be implemented with relatively "simple" software 1.0 code as long as you have the metadata interpreted. Apply location parameters to this driving policy module, and it will know that a green arrow to the right means "you're good to go" in Norway.

Disagree.
SW 1.0 can't deal with actions that break the law (too brittle). Broken-down car in front of you on a 2-lane road in a no-passing zone? Stuck forever.
4-way stop where the person with right of way doesn't know it? Stuck.
Passing a bicyclist? How wide a gap? How fast? Is the shoulder getting worse?
Construction zone with traffic flow on wrong side?
Snow plows only made one lane?
 
Disagree.
SW 1.0 can't deal with actions that break the law (too brittle). Broken-down car in front of you on a 2-lane road in a no-passing zone? Stuck forever.
4-way stop where the person with right of way doesn't know it? Stuck.
Passing a bicyclist? How wide a gap? How fast? Is the shoulder getting worse?
Construction zone with traffic flow on wrong side?
Snow plows only made one lane?

Maybe as a first step, it's better to let a human intervene/supervise those scenarios. I think Tesla will be doing driving policy as L2. The car tries to do stuff but the driver is there to supervise.

It would really surprise me if Tesla somehow rolled out something before 2021 that actually drove by itself without driver input and could handle any of the scenarios you've laid out.
 
Maybe as a first step, it's better to let a human intervene/supervise those scenarios. I think Tesla will be doing driving policy as L2. The car tries to do stuff but the driver is there to supervise.

It would really surprise me if Tesla somehow rolled out something before 2021 that actually drove by itself without driver input and could handle any of the scenarios you've laid out.

We've already seen cars depart the lane to avoid other cars, even if it meant impinging on an improper one. Same deal with shoulder usage: not legally drivable, yet usable.

Driving inputs are relative values, not absolutes (well, some values are really, really high), so straight if/case statements become impossible to manage. Regulators wouldn't like it, but the physical reality of the situation trumps any law of the road. About to be rear-ended at a red light with no cross traffic? Run it.
 
We've already seen cars depart the lane to avoid other cars, even if it meant impinging on an improper one. Same deal with shoulder usage: not legally drivable, yet usable.

Driving inputs are relative values, not absolutes (well, some values are really, really high), so straight if/case statements become impossible to manage. Regulators wouldn't like it, but the physical reality of the situation trumps any law of the road. About to be rear-ended at a red light with no cross traffic? Run it.

But aren't those kinds of contextual decisions difficult for an NN? Even human judgment is fraught, but together a machine and a human might produce a better outcome.

I am not aware of Tesla AP actually being aware of its surroundings enough to depart a lane to avoid other cars. My car, once when v9 first deployed, shifted lanes when it encountered a parked car, but only because the lane lines were clearly marked. If there are no lane lines, it just sits there and will not commit to going past the car unless there is more than enough room before it encounters double yellows. So I'm not sure the NN is even empowered to do much, if anything, that isn't perfectly normal and legal. There are countless other scenarios of legal maneuvers that the car will refuse to do to get past or around obstacles.

Tesla clearly is trying to get to a point where driving is being handled by the NN (Karpathy heavily hints at it), but I think they will need to implement v10 + HD maps to even have a chance at determining legal driving paths (current software is good but not great at precisely identifying lanes vs. shoulders vs. exit lanes). Sometimes it will briefly display a shoulder as a lane, especially while exiting on NavAP, and when lanes divide on ramps it always hugs lines when it should go down the center of the lane. Humans are very perplexed and think the car is pulling into the shoulder. It's just too raw a product to envision something that seems capable of the kinds of decisions needed to successfully drive without firm rules.
 
We've already seen cars depart the lane to avoid other cars, even if it meant impinging on an improper one.

Source? I've never seen convincing evidence of this in Teslas (aside from the Tesla exiting the lane unintentionally due to being confused about where the lanes are -- which happens a lot), nor any convincing evidence that "side collision avoidance" actually exists.
 
Source? I've never seen convincing evidence of this in Teslas (aside from the Tesla exiting the lane unintentionally due to being confused about where the lanes are -- which happens a lot), nor any convincing evidence that "side collision avoidance" actually exists.

I may be having false memories, but I thought there were cases of cars getting lane-changed into. I could definitely be wrong though, since I don't have a reference handy.
 
Source? I've never seen convincing evidence of this in Teslas (aside from the Tesla exiting the lane unintentionally due to being confused about where the lanes are -- which happens a lot), nor any convincing evidence that "side collision avoidance" actually exists.

I have always wondered this. I see the occasional video on Electrek, yet never see the hands of the driver or the IC during the event. Makes me question who made the move: car or driver.

Been close enough to kiss an 18-wheeler tire on the side, yet nothing but red ultrasonic sensors around the car and nothing audible, nor the car taking control, while using AP.
 
@mongo is not wrong, ultrasonics-based side collision avoidance exists — of sorts.

In Autopilot 2.x, it manifests itself as lane assist listening to the ultrasonics and making adjustments in the opposing direction within the lane if something gets close. It is described in the Model 3 manual as well, so it is not an Autopilot 1.0 remnant in that case.
 
What was your speed? It only happens within a certain speed range. Also, it requires clear lane markings because it will not leave the lane.

Highway speed. 60-80ish. Lane incursion by the truck into my lane. Not any place I like to hang out often or test on a repetitive basis. Fine lane markings.

Don't understand what you mean when you say it won't leave the lane. As in, the Tesla won't leave the lane to avoid the incursion, just apply the brakes to avoid it?
 
Highway speed. 60-80ish. Lane incursion by the truck into my lane. Not any place I like to hang out often or test on a repetitive basis. Fine lane markings.

Don't understand what you mean when you say it won't leave the lane. As in, the Tesla won't leave the lane to avoid the incursion, just apply the brakes to avoid it?

You are right that it should have worked in those circumstances. I have no idea why it did not. Was autosteer on by the way?

As for not leaving the lane, what I mean is that it will steer clear of the approaching vehicle as best it can, but within the lane.
 
Disagree.
SW 1.0 can't deal with actions that break the law (too brittle). Broken-down car in front of you on a 2-lane road in a no-passing zone? Stuck forever.
4-way stop where the person with right of way doesn't know it? Stuck.
Passing a bicyclist? How wide a gap? How fast? Is the shoulder getting worse?
Construction zone with traffic flow on wrong side?
Snow plows only made one lane?
These are harder situations; we should focus on getting the everyday scenarios to work first and then iterate until these scenarios work too.

You don't need neural networks to solve this. Easily done with software 1.0. What you need is a path planner based on rules. Rules that you can easily iterate on as you discover more. You COULD "help" the rules by providing an input from a separate contextual neural network that knows the 3D environment, but I'm not sure it's needed, and it's certainly not smart to add it until there is no other way.

Path planner works with these steps:
  • Your own car, the drivable area (+ eventual lane marks), and objects with their metadata (type, orientation, speed, etc.) are inserted into a 3D virtual world from the vision data.
  • Extrapolate each object into the future using physics, its object type and direction, weighted with an uncertainty circle. The longer into the future, the more uncertainty. E.g. a "kid" object is mostly a circle of uncertainty growing bigger each second, but a car on a road is likely to follow the edge of the road and the lanes, with some uncertainty of changing lanes without blinking.
  • Algorithmically find every place within your line of sight that is blocked by an object. At some later point you can add data from other vehicles and static cameras to remove these blind spots. For each blind spot, add a "fake" object to your database with the largest uncertainty (e.g. a bike at pretty high speed behind every corner).
  • Plan out every physically possible driving path you can take in your virtual 3D world and rank them according to risk, legality, comfort, etc.
  • Exclude the paths that are above your risk threshold, or too illegal. If somehow you get into a really bad situation, choose the option with the lowest risk times number of lives.
  • Use geometry-class math to convert this path to a steering wheel angle and watt-pedal position.
  • Update/recalculate the model as often as you can, perhaps 20 times a second.

This algorithm would:
  • Slowly pass the stopped car in front if it can see far enough ahead that it's safe (fake objects), or if the passengers tell the car it's worth the elevated risk.
  • Stopped cars at a 4-way will likely start to slowly move forward because the vehicle to the right has stopped. Usually the last vehicle to arrive drives, but every car can inch forward because the likelihood of collision is 0 at low speeds. You need a little bit of randomness in the equation to make sure not all vehicles inch at exactly the same time if they run the same algo.
  • Passing a bicycle is all handled by the uncertainty and path planning of the bicycle object. You pass at the threshold between the bicycle's uncertainty for swaying AND the distance the fake oncoming car needs to pass, IF the total risk is below your threshold. Or it would wait until the path ahead can be seen (the fake object gets deleted).
  • A construction site is a bit dependent on visual hints and poses bigger problems for not crossing the legal limit, but it could work based on how you set the thresholds and how the signs factor into the equation.
  • Even the 3-second rule doesn't need to be hardcoded. Your car knows it's physically possible for the car ahead to suddenly brake fully, and this is the border of that car's uncertainty circle (back end). Your car will automatically avoid that area, and increase it if the surface is slippery.

Neural networks don't magically solve everything; you need a really solid, iterable algorithm like this, and then you use neural networks for certain parts. That holds until someone invents a general AI engine, at which point you can just send the computer to driver's school and tell it how to drive, but I've got a feeling that will be a while longer.
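
To make the loop above concrete, here's a toy sketch in Python. The object types, uncertainty growth rates, and risk scoring are all invented placeholders, not a real planner:

```python
import math
from dataclasses import dataclass

@dataclass
class Track:
    kind: str    # "car", "kid", "bike", "fake" (occlusion placeholder)
    x: float     # position in the ego frame, meters
    y: float
    vx: float    # velocity, m/s
    vy: float

# How fast each object type's uncertainty circle grows, in m/s (made up).
GROWTH = {"kid": 1.5, "bike": 1.0, "car": 0.5, "fake": 2.0}

def uncertainty_radius(obj: Track, t: float) -> float:
    # The further ahead we predict, the larger the area the object might occupy.
    return GROWTH.get(obj.kind, 1.0) * t

def path_risk(path, objects, horizon=3.0, dt=0.25) -> float:
    """Score a candidate path by how far it intrudes into predicted uncertainty circles."""
    risk = 0.0
    for i in range(int(horizon / dt)):
        t = i * dt
        ex, ey = path(t)                              # ego position at time t
        for o in objects:
            ox, oy = o.x + o.vx * t, o.y + o.vy * t   # simple physics extrapolation
            margin = math.hypot(ex - ox, ey - oy) - uncertainty_radius(o, t)
            risk += max(0.0, -margin)                 # penalize being inside the circle
    return risk

def choose_path(candidates, objects, risk_threshold=5.0):
    scored = [(path_risk(p, objects), p) for p in candidates]
    acceptable = [sp for sp in scored if sp[0] <= risk_threshold]
    # If nothing is below the threshold, fall back to the least risky option.
    return min(acceptable or scored, key=lambda sp: sp[0])[1]

# Example: keep the lane or offset around a pedestrian standing near our line.
kid    = Track("kid", x=20.0, y=0.5, vx=0.0, vy=0.0)
stay   = lambda t: (10.0 * t, 0.0)      # drive straight at 10 m/s
offset = lambda t: (10.0 * t, -2.0)     # same speed, shifted 2 m within the road
best = choose_path([stay, offset], [kid])
print("chose offset path" if best is offset else "chose stay path")
```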
 
@ChrML . Holy wow. Reminds me why I keep coming back to read all this stuff. My 12-year-old is an NN thread reader. He chuckles from time to time.

You are right that it should have worked in those circumstances. I have no idea why it did not. Was autosteer on by the way?

As for not leaving the lane, what I mean is that it will steer clear of the approaching vehicle as best it can, but within the lane.

Yes on AP. Interested in why it won't leave the lane. Entering the world of unintended consequences. Choices. Object out of lane. Side-of-road issues.
I just wonder how many of the episodes we see are ones where the driver is asked to take over because of an impending.....which is great, by the way.
 
@ChrML .
Yes on AP. Interested in why it won't leave the lane. Entering the world of unintended consequences. Choices. Object out of lane. Side-of-road issues.
I just wonder how many of the episodes we see are ones where the driver is asked to take over because of an impending.....which is great, by the way.

The behavior on AP might be different with regards to this. I've definitely had my car brake when another car started to drift into my lane (instead of just shifting over), though I'm not sure if that was phantom braking that just so happened at the same time. I don't think the car was far enough over the lane line to consider the car to be changing lanes. IMO this may be the better response anyway. Never seen it adjust its position in the lane in response to items around it, but I wouldn't be surprised if this is enabled at some point. Not sure why it didn't react in your case - it's probably using vision more in AP mode rather than the ultrasonics.
 
These are harder situations; we should focus on getting the everyday scenarios to work first and then iterate until these scenarios work too.
[...]
Neural networks don't magically solve everything; you need a really solid, iterable algorithm like this, and then you use neural networks for certain parts.

I’ve always wondered about the road debris case. E.g. anvil in the road is unsafe to hit, rubber tread from a truck tire is safe, styrofoam container???

Thoughts?