Neural Networks

The Aptiv PR employee told me flat out that the camera was not used for driving. This was only about a month ago. While I trust your judgement/knowledge (and especially that of Bladerskb), I've got to believe what I was told, both because of the source of the info and because of how recent it is.
 
Is Tesla using much NN for Advanced Summon?

Wondering why it is taking them so much time to get a decent version out.

It could be any number of things.

The lack of any down-facing cameras to detect curbs or kids
The lack of any rear cross-traffic radar to help detect cars as it reverses
The driving policy itself. This would include anything from how it reacts to people blocking its path to route planning.
Problems with the neural network. Dancing cars are kinda funny when you're sitting idle in a parking lot, but a little less funny to a car trying to decide where to drive.

Lastly, how good should it be? There are a few use cases for Summon as it exists now, but it would be so much more useful if it could at least hit 5 mph. If it could do the reverse and park itself, it would be a heck of a lot more useful.

I've certainly noticed that expectations for Advanced Summon are rising as the months go on. A few months ago I think people would have been content with what Advanced Summon now does (it's faster and better than it was).
 
He actually said "similar" and "approximate", which is telling. Also, the demo showed a still scene. Obviously you can recover a scene using offline computation (i.e. in non-realtime). One of the most well-known examples of this is Apple's "Flyover" maps, which are recovered from 2D aerial photography. The trick is to do this in realtime at an accuracy and resolution comparable to Lidar, which I have never seen demonstrated (and I follow this field quite closely). If Tesla were able to do this, they wouldn't crash into parked fire trucks.
Whether SLAM is possible in real time should depend quite a bit on the level of detail that needs to be captured.
Here's a real-time SLAM calculation from 5 years ago, but only based on tracking contrast points, similar to what some people saw on a leaked dev MCU shot sometime in 2017 (or '18):

SLAM can get accuracy of around 10 cm / 4" with appropriate cameras and distances, which is better than I could manage myself.
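
As a rough back-of-the-envelope check on that 10 cm figure, and on how quickly it degrades with range: depth error from triangulating between two views grows with the square of distance and shrinks with baseline and focal length. The sketch below is a minimal illustration with made-up numbers; nothing here comes from the video or from any actual car.

```python
# Rough depth-error estimate for triangulation from two views (stereo or a
# motion baseline). All numbers are illustrative assumptions, not real specs.

def depth_error(z_m, baseline_m, focal_px, disparity_err_px=0.5):
    """Approximate depth uncertainty: dZ = Z^2 * d_disp / (f * B)."""
    return (z_m ** 2) * disparity_err_px / (focal_px * baseline_m)

# 10 m range, 0.5 m baseline between viewpoints, ~1000 px focal length
print(depth_error(10.0, 0.5, 1000.0))   # ~0.1 m, i.e. on the order of 10 cm
print(depth_error(50.0, 0.5, 1000.0))   # ~2.5 m: error grows with range squared
```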

In the end, LIDAR needs a neural net trained on 3D point groups to perceive that set as a certain object.
Vision does the same thing with a 2D pixel group.

I'd venture to say that just as vision NNs require human-labeled training data, so does LIDAR, and I'd bet the humans doing the labeling need video to label the edge cases.

The question is whether the higher precision of LIDAR is the essential differentiator needed to reach the number of nines required for the AV to have an accurate perception of its surroundings.

Personally I don't think so.

But I also think Elon has been wrong so many times in his timeframe predictions that stating everything was "planned years ago" is obviously wrong. Just looking back at the past 2¾ years since the introduction of AP2, the pace of improvement, while quite amazing to experience, is nowhere near where it should have been given Elon's announcements (pick any since October '16).
Especially given the worsening of Autopilot with 2019.16 and subsequent versions (in Europe, nota bene!), I don't get my hopes up. NoA lane change suggestions here are completely unusable; rather, they're outright dangerous. It will likely take much more than 8 months until unconfirmed ones might become available, which would place it firmly in the first quarter of 2020.

And, as I've said numerous times, the autonomy day video was a programmed route on an undisclosed software build, which is essentially the same as the infamous October '16 video. Enhanced Summon isn't anywhere to be seen on consumer cars; the wait is now 300% longer than Elon's prediction (mid-April).
I'll believe it when I see it, but all I see arrive in cars are more dashboard games.

And trust me, I can't wait for feature complete FSD.
 
And, as I've said numerous times, the autonomy day video was a programmed route on an undisclosed software build, which is essentially the same as the infamous October'16 video.
This is clearly wrong. See the hacked Tesla on 2.5 with all the features we saw on demo day. Of course they made sure it would work well on that route, but it was not a hack.
 
Thanks, I wasn't aware of those. The videos are from early June though, not from investor autonomy day. Still, it's a first glimpse of progress, even though AP needs overrides in many spots that the autonomy day video didn't. Judging from the video it looks like it may be ready in the US this year. Europe - years away.
 
NN Changes in V9 (2018.39.7)

Have not had much time to look at V9 yet, but I thought I’d share some interesting preliminary analysis. Please note that network size estimates here are spreadsheet calculations derived from a large number of raw kernel specifications. I think they’re about right and I’ve checked them over quite carefully, but it’s a lot of math and there might be some errors.

First, some observations:

Like V8 the V9 NN (neural net) system seems to consist of a set of what I call ‘camera networks’ which process camera output directly and a separate set of what I call ‘post processing’ networks that take output from the camera networks and turn it into higher level actionable abstractions. So far I’ve only looked at the camera networks for V9 but it’s already apparent that V9 is a pretty big change from V8.

---------------
One unified camera network handles all 8 cameras

Same weight file being used for all cameras (this has pretty interesting implications and previously V8 main/narrow seems to have had separate weights for each camera)

Processed resolution of 3 front cameras and back camera: 1280x960 (full camera resolution)

Processed resolution of pillar and repeater cameras: 640x480 (1/2x1/2 of camera’s true resolution)

all cameras: 3 color channels, 2 frames (2 frames also has very interesting implications)

(was 640x416, 2 color channels, 1 frame, only main and narrow in V8)
------------
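
To make those input sizes concrete, here’s a quick sanity check of what the listed resolutions imply per camera. The (frames, channels, height, width) layout is just my assumption for illustration; only the resolutions, channel counts, and frame counts come from the specs above.

```python
# Sanity check of per-camera input sizes implied by the specs above.
# The (frames, channels, height, width) layout is an assumption for illustration.
import numpy as np

front_or_back = np.zeros((2, 3, 960, 1280), dtype=np.uint8)      # 2 frames, RGB, full res
pillar_or_repeater = np.zeros((2, 3, 480, 640), dtype=np.uint8)  # half res on each axis
v8_main = np.zeros((1, 2, 416, 640), dtype=np.uint8)             # old V8 main camera input

print(front_or_back.size)       # 7,372,800 values (~7.3M)
print(pillar_or_repeater.size)  # 1,843,200 values (1/4 of the above)
print(v8_main.size)             # 532,480 values (~0.5M), ~13x less than V9 front/back
```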

Various V8 versions included networks for pillar and repeater cameras in the binaries but AFAIK nobody outside Tesla ever saw those networks in operation. Normal AP use on V8 seemed to only include the use of main and narrow for driving and the wide angle forward camera for rain sensing. In V9 it’s very clear that all cameras are being put to use for all the AP2 cars.

The basic camera NN (neural network) arrangement is an Inception V1 type CNN with L1/L2/L3ab/L4abcdefg layer arrangement (architecturally similar to V8 main/narrow camera up to end of inception blocks but much larger)
  • about 5x as many weights as comparable portion of V8 net
  • about 18x as much processing per camera (front/back)
The V9 network takes 1280x960 images with 3 color channels and 2 frames per camera from, for example, the main camera. That’s 1280x960x3x2 as an input, or 7.3M. The V8 main camera was 640x416x2 or 0.5M - 13x less data.

For perspective, the V9 camera network is 10x larger and requires 200x more computation when compared to Google’s Inception V1 network, from which V9 gets its underlying architectural concept. That’s processing *per camera* for the 4 front and back cameras. Side cameras are 1/4 the processing due to having 1/4 as many total pixels. With all 8 cameras being processed in this fashion it’s likely that V9 is straining the compute capability of the APE. The V8 network, by comparison, probably had lots of margin.
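
For anyone who hasn’t looked at Inception V1 before, here’s a minimal sketch of the kind of block that gets stacked to build this family of networks. The channel counts below are the classic GoogLeNet “3a” values used purely as placeholders - they are not Tesla’s.

```python
# Minimal sketch of an Inception V1 style block (the architecture family being
# scaled up here). Channel counts are GoogLeNet placeholders, not Tesla's values.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c3r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3r, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, c5r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5r, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.bp = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Parallel 1x1 / 3x3 / 5x5 / pooled branches, concatenated on channels
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

block = InceptionBlock(192, 64, 96, 128, 16, 32, 32)  # GoogLeNet "3a"-like sizes
out = block(torch.zeros(1, 192, 60, 80))              # -> (1, 256, 60, 80)
```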

network outputs:
  • V360 object decoder (multi level, processed only)
  • back lane decoder (back camera plus final processed)
  • side lane decoder (pillar/repeater cameras plus final processed)
  • path prediction pp decoder (main/narrow/fisheye cameras plus final processed)
  • “super lane” decoder (main/narrow/fisheye cameras plus final processed)
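
Structurally, that output list amounts to one shared camera backbone feeding several task-specific decoder heads. Below is a hedged toy sketch of that shape; the head names and channel counts are placeholders inferred from the list above, not anything read out of the binaries.

```python
# Toy sketch of "one shared backbone, several task-specific decoders", mirroring
# the output list above. Head names and channel counts are placeholders.
import torch
import torch.nn as nn

class MultiHeadPerception(nn.Module):
    def __init__(self, backbone, feat_ch=256):
        super().__init__()
        self.backbone = backbone  # e.g. a stack of Inception-style blocks
        self.heads = nn.ModuleDict({
            "v360_objects": nn.Conv2d(feat_ch, 32, 1),  # V360 object decoder
            "back_lanes":   nn.Conv2d(feat_ch, 8, 1),   # back lane decoder
            "side_lanes":   nn.Conv2d(feat_ch, 8, 1),   # pillar/repeater lane decoder
            "path_pred":    nn.Conv2d(feat_ch, 4, 1),   # path prediction decoder
            "super_lane":   nn.Conv2d(feat_ch, 8, 1),   # "super lane" decoder
        })

    def forward(self, frames):
        feats = self.backbone(frames)
        return {name: head(feats) for name, head in self.heads.items()}

model = MultiHeadPerception(nn.Conv2d(6, 256, 3, padding=1))  # toy backbone: 3 colors x 2 frames in
outputs = model(torch.zeros(1, 6, 480, 640))                  # dict of per-task feature maps
```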

Previous V8 aknet included a lot of processing after the inception blocks - about half of the camera network processing was taken up by non-inception weights. V9 only includes inception components in the camera network and instead passes the inception processed outputs, raw camera frames, and lots of intermediate results to the post processing subsystem. I have not yet examined the post processing subsystem.

And now for some speculation:

Input changes:

The V9 network takes 1280x960 images with 3 color channels and 2 frames per camera from, for example, the main camera. That’s 1280x960x3x2 as an input, or 7.3MB. The V8 main camera processing frame was 640x416x2 or 0.5MB - 13x less data. The extra resolution means that V9 has access to smaller and more subtle detail from the camera, but the more interesting aspect of the change to the camera interface is that camera frames are being processed in pairs. The two frames in a pair are likely time-offset by some small delay - 10ms to 100ms I’d guess - allowing each processed camera input to see motion. Motion can give you depth, separate objects from the background, help identify objects, predict object trajectories, and provide information about the vehicle’s own motion. It's a pretty fundamental improvement to the basic perception of the system.
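
To illustrate why a time-offset pair is so valuable, here’s what falls out of two frames with a completely off-the-shelf dense optical flow routine. This uses OpenCV’s Farneback method purely as a stand-in - it is not how the network does it, and the file names are made up.

```python
# Why two time-offset frames help: dense motion between them can be recovered.
# OpenCV's Farneback flow is only a stand-in illustration; frame_t0/t1 are made up.
import cv2
import numpy as np

prev = cv2.cvtColor(cv2.imread("frame_t0.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_t1.png"), cv2.COLOR_BGR2GRAY)

flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
motion = np.linalg.norm(flow, axis=2)  # per-pixel motion magnitude (px per frame)

# Flow discontinuities separate moving objects from the static background, and
# for a moving camera the flow field also encodes rough depth and ego-motion.
```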

Camera agnostic:

The V8 main/narrow network used the same architecture for both cameras, but by my calculation it was probably using different weights for each camera (probably 26M each for a total of about 52M). This makes sense because main/narrow have very different FOVs, which means the precise shape of the objects they see varies quite a bit - especially towards the edges of frames. Training each camera separately is going to dramatically simplify the problem of recognizing objects since the variation goes down a lot. That means it’s easier to get decent performance with a smaller network and less training. But it also means you have to build separate training data sets, evaluate them separately, and load two different networks alternately during operation. It also means that your network can learn some bad habits because it always sees the world in the same way.

Building a camera agnostic network relaxes these problems and simultaneously makes the network more robust when used on any individual camera. Being camera agnostic means the network has to have a better sense of what an object looks like under all kinds of camera distortions. That’s a great thing, but it’s very, *very* expensive to achieve because it requires a lot of training, a lot of training data and, probably, a really big network. Nobody builds them so it’s hard to say for sure, but these are probably safe assumptions.

Well, the V9 network appears to be camera agnostic. It can process the output from any camera on the car using the same weight file.

It also has the fringe benefit of improved computational efficiency. Since you just have the one set of weights you don’t have to constantly be swapping weight sets in and out of your GPU memory and, even more importantly, you can batch up blocks of images from all the cameras together and run them through the NN as a set. This can give you a multiple of performance from the same hardware.
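
Here’s a toy sketch of that efficiency point: one shared weight set and a single batched forward pass over all 8 cameras, versus eight per-camera networks run one at a time. The shapes and the throwaway backbone are assumptions purely for illustration.

```python
# One shared weight set lets frames from all cameras go through a single batched
# forward pass. Shapes and the toy backbone below are illustrative assumptions.
import torch
import torch.nn as nn

shared_net = nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.ReLU())

# 8 cameras, each contributing a 2-frame x 3-channel stack (resized to a common
# size here just to keep the toy example simple)
cams = torch.zeros(8, 6, 480, 640)

batched = shared_net(cams)  # one pass, one weight set, all 8 cameras at once

# The per-camera alternative: distinct weights and one pass per camera, plus the
# cost of keeping (or swapping) 8 separate weight sets in GPU memory.
per_cam_nets = nn.ModuleList(
    [nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.ReLU()) for _ in range(8)])
unbatched = [net(cams[i:i + 1]) for i, net in enumerate(per_cam_nets)]
```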

I didn’t expect to see a camera agnostic network for a long time. It’s kind of shocking.

Considering network size:

This V9 network is a monster, and that’s not the half of it. When you increase the number of parameters (weights) in an NN by a factor of 5 you don’t just get 5 times the capacity and need 5 times as much training data. In terms of expressive capacity increase it’s more akin to a number with 5 times as many digits. So if V8’s expressive capacity was 10, V9’s capacity is more like 100,000. It’s a mind boggling expansion of raw capacity. And likewise the amount of training data doesn’t go up by a mere 5x. It probably takes at least thousands and perhaps millions of times more data to fully utilize a network that has 5x as many parameters.

This network is far larger than any vision NN I’ve seen publicly disclosed and I’m just reeling at the thought of how much data it must take to train it. I sat on this estimate for a long time because I thought that I must have made a mistake. But going over it again and again I find that it’s not my calculations that were off, it’s my expectations that were off.

Is Tesla using semi-supervised training for V9? They've gotta be using more than just labeled data - there aren't enough humans to label this much data. I think all those simulation designers they hired must have built a machine that generates labeled data for them, but even so.
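
For what it’s worth, one common way to stretch a limited pool of human labels is pseudo-labeling: train on the labeled set, run the model over unlabeled data, and keep only the high-confidence predictions as extra training labels. The sketch below is just that generic recipe - it is not a claim about Tesla’s actual pipeline, and the model and data loader are hypothetical.

```python
# Generic pseudo-labeling recipe (one form of semi-supervised training). This is
# NOT a claim about Tesla's pipeline; `model` and `unlabeled_loader` are hypothetical.
import torch

def generate_pseudo_labels(model, unlabeled_loader, threshold=0.9):
    """Keep only confident predictions on unlabeled images as extra labels."""
    model.eval()
    pseudo = []
    with torch.no_grad():
        for images in unlabeled_loader:
            probs = torch.softmax(model(images), dim=1)
            conf, labels = probs.max(dim=1)
            keep = conf > threshold           # confidence filter
            if keep.any():
                pseudo.append((images[keep], labels[keep]))
    return pseudo  # fed back into training alongside the human-labeled set
```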

And where are they getting the datacenter to train this thing? Did Larry give Elon a warehouse full of TPUs?

I mean, seriously...

I look at this thing and I think - oh yeah, HW3. We’re gonna need that. Soon, I think.

Omnidirectionality (V360 object decoder):

With these new changes the NN should be able to identify every object in every direction at distances up to hundreds of meters and also provide approximate instantaneous relative movement for all of those objects. If you consider the FOV overlap of the cameras, virtually all objects will be seen by at least two cameras. That provides the opportunity for downstream processing to use multiple perspectives on an object to more precisely localize and identify it.
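
As a toy illustration of what two overlapping views buy you: once an object is seen by two cameras with known mounting positions, its 3D position can be recovered by intersecting the two viewing rays. The camera positions and ray directions below are made up for the example; nothing here comes from the actual system.

```python
# Toy multi-camera localization: least-squares intersection of viewing rays from
# two cameras that both see the same object. All poses/rays are made-up examples.
import numpy as np

def triangulate(origins, directions):
    """Closest point (least squares) to a set of 3D rays (origin + direction)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)  # projects onto the plane orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

# Object seen by a windshield camera and a B-pillar camera (vehicle frame, meters)
origins = [np.array([2.0, 0.0, 1.3]), np.array([1.5, 0.9, 1.1])]
directions = [np.array([1.0, 0.1, -0.05]), np.array([1.0, -0.4, -0.03])]
print(triangulate(origins, directions))  # approximate 3D position of the object
```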

General thoughts:

I’ve been driving V9 AP2 for a few days now and I find the dynamics to be much improved over recent V8. Lateral control is tighter and it’s been able to beat all the V8 failure scenarios I’ve collected over the last 6 months. Longitudinal control is much smoother, traffic handling is much more comfortable. V9’s ability to prospectively do a visual evaluation on a target lane prior to making a change makes the auto lane change feature a lot more versatile. I suspect detection errors are way down compared to V8 but I also see that a few new failure scenarios have popped up (offramp / onramp speed control seem to have some bugs). I’m excited to see how this looks in a couple of months after they’ve cleaned out the kinks that come with any big change.

Being an avid observer of progress in deep neural networks my primary motivation for looking at AP2 is that it’s one of the few bleeding edge commercial applications that I can get my hands on and I use it as a barometer of how commercial (as opposed to research) applications are progressing. Researchers push the boundaries in search of new knowledge, but commercial applications explore the practical ramifications of new techniques. Given rapid progress in algorithms I had expected near future applications might hinge on the great leaps in efficiency that are coming from new techniques. But that’s not what seems to be happening right now - probably because companies can do a lot just by scaling up NN techniques we already have.

In V9 we see Tesla pushing in this direction. Inception V1 is a four year old architecture that Tesla is scaling to a degree that I imagine Inception’s creators could not have expected. Indeed, I would guess that four years ago most people in the field would not have expected that scaling would work this well. Scaling computational power, training data, and industrial resources plays to Tesla’s strengths and involves less uncertainty than potentially more powerful but less mature techniques. At the same time Tesla is doubling down on their ‘vision first / all neural networks’ approach and, as far as I can tell, it seems to be going well.

As a neural network dork I couldn’t be more pleased.
 
Yeah, finally we got to see the architecture of the AKNET_V9 network. And it was very similar to what green had guessed. Not sure if it is totally camera agnostic, but at least some layers are shared, then there are layers for groups of sensors, and then task-specific layers.
 
Does anyone know how far Tesla has gotten on the following NN's? I've marked in parentheses what I think the status is based on public information. Is this accurate?

Lane Markings (complete)
3D Vehicles (complete)
Hazards (?)
Free Space (complete)
Road Marking (?)
Path Prediction (complete)
Road Signs (?)
General Objects (?)
Pedestrians (complete)
Traffic Lights (in development)
Road Edges (in development)
Traffic Signs (in development)
Path Delimiters (?)

The ? are the ones I am not sure about.

Thanks.
 
Does anyone know how far Tesla has gotten on the following NN's? I've marked in parentheses what I think the status is based on public information. Is this accurate?

Lane Markings (complete)
3D Vehicles (complete)
Hazards (?)
Free Space (complete)
Road Marking (?)
Path Prediction (complete)
Road Signs (?)
General Objects (?)
Pedestrians (complete)
Traffic Lights (in development)
Road Edges (in development)
Traffic Signs (in development)
Path Delimiters (?)

The ? are the ones I am not sure about.

Thanks.
I don't think we can say any of them are complete for city driving. For example, lane markings - what about faded lane markings, or lane markings on the road the car needs to turn onto? Similarly, path prediction of other cars in an intersection. Same for free space - they are now distinguishing between road and lawn. BTW, what do you mean by 3D vehicles?

Basically, they had done all these for freeway NOA - and they work on some city roads incidentally. They have to make all these work on city roads by training on city-specific data. I think that is the main work going on now.

PS: The way agile teams normally work is to concentrate on getting the feature working. If the feature is freeway NOA, they won't do work on things that won't be encountered on the freeway (like path prediction of vehicles turning, or roundabouts, or traffic lights, or unmarked roads). In other words, the task is not to solve "path prediction" generically, but only within the confines of the feature being delivered.

They have probably been working on city NOA for 6 months now. So they seem to have completed some things like pedestrians and bicycles. But because the production s/w got hacked, I'm not sure how much of the dev code is making it into the production code any more, even the parts that are complete.
 
I don't think we can say any of them are complete for city driving. For example, lane markings - what about faded lane markings, or lane markings on the road the car needs to turn onto? Similarly, path prediction of other cars in an intersection. Same for free space - they are now distinguishing between road and lawn. BTW, what do you mean by 3D vehicles?

Basically, they had done all these for freeway NOA - and they work on some city roads incidentally. They have to make all these work on city roads by training on city-specific data. I think that is the main work going on now.

PS: The way agile teams normally work is to concentrate on getting the feature working. If the feature is freeway NOA, they won't do work on things that won't be encountered on the freeway (like path prediction of vehicles turning, or roundabouts, or traffic lights, or unmarked roads). In other words, the task is not to solve "path prediction" generically, but only within the confines of the feature being delivered.

They have probably been working on city NOA for 6 months now. So they seem to have completed some things like pedestrians and bicycles. But because the production s/w got hacked, I'm not sure how much of the dev code is making it into the production code any more, even the parts that are complete.

Thanks. That is helpful.

I assume "3D Vehicles" just means recognizing vehicles.
 
I would not consider 3D vehicles complete. The car still only sees in 2D, as evidenced by the fact that it only displays cars facing in a single direction unless they are spinning like tops. I think they are trying to use 2D turned vehicles to simulate a 3D space, but they need to get a move on here. No chance there is anything remotely close to feature complete by year end if basic image recognition is still so far behind.
 
I would not consider 3D vehicles complete. The car still only sees in 2D, as evidenced by the fact that it only displays cars facing in a single direction unless they are spinning like tops. I think they are trying to use 2D turned vehicles to simulate a 3D space, but they need to get a move on here. No chance there is anything remotely close to feature complete by year end if basic image recognition is still so far behind.
@verygreen 's videos have shown 3D bounding boxes around vehicles for a while.
 
I’ve seen that, but it just doesn’t feel like it based on what the car is actually doing and displaying. There’s a big disconnect somewhere. Maybe the technical capability exists but the accuracy rate is very, very low. Like sub-10% accuracy.
IIRC someone (might have been green) said some time ago that the on-screen visualization uses data output from somewhere in the middle of the NN process, not the final output. If so, that is presumably why it's so terrible with spastic cars. However, that was stated a long time ago (might be more than a year now) and I haven't heard anything since, so I don't know if that's still the case. And of course, since it came from a hacker rather than someone at Tesla, there has never been any confirmation that this was the case in the first place.
 
IIRC someone (might have been green) said some time ago that the on-screen visualization uses data output from somewhere in the middle of the NN process, not the final output. If so, that is presumably why it's so terrible with spastic cars. However, that was stated a long time ago (might be more than a year now) and I haven't heard anything since, so I don't know if that's still the case. And of course, since it came from a hacker rather than someone at Tesla, there has never been any confirmation that this was the case in the first place.
Does anyone know how the visualization worked in the autonomy day demo cars? Were the objects stable?