
Tesla Autopilot HW3

Well, of course there's no activity, there's no new material, and we were hoping you would add some!

Also, on an unrelated note, I see you were not driving your Model 3 on that fateful day? Why not?

One of the things I was going to talk about included debunking the "three years ahead in autonomous hardware" myth repeated by almost every Tesla fan I talk to. Some even say 10 years... It's mind-boggling. Of course, I'd do this by using facts and showcasing the immense list of neural net processors on the market.

A List of Chip/IP for Deep Learning – Shan Tang – Medium (this is an outdated list as of 2017)

Another is a walk-through of building a CNN to debunk the "camera agnostic" myth created by @jimmy_d.
I'd also explain how easy the math is behind deducing details such as memory requirements and the amount of images/data used to train a net, just by looking at details like hyper-parameters, neurons, layers, filters, and input sizes. All the basic stuff that @jimmy_d gets worshiped for.
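As a taste of the kind of back-of-envelope math I mean (all layer dimensions below are made up for illustration, not taken from any Tesla network):

```python
# Back-of-envelope parameter and memory math for a single conv layer.
# All dimensions here are hypothetical, just to show the method.

def conv_layer_stats(in_ch, out_ch, k, h, w, bytes_per_weight=1):
    """Parameter count, weight memory, and MACs for one KxK conv layer."""
    params = in_ch * out_ch * k * k            # ignoring biases
    weight_bytes = params * bytes_per_weight   # 1 byte per weight for int8
    macs = params * h * w                      # one MAC per weight per output pixel
    return params, weight_bytes, macs

# Example: a 3x3 conv, 256 -> 256 channels, on a 160x120 feature map
params, weight_bytes, macs = conv_layer_stats(256, 256, 3, 160, 120)
print(f"{params:,} weights, {weight_bytes/1e6:.1f} MB, {macs/1e9:.1f} GMAC per frame")
# -> 589,824 weights, 0.6 MB, 11.3 GMAC per frame
```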

Also, on an unrelated note, I see you were not driving your Model 3 on that fateful day? Why not?
I had a Model 3 reservation back in 2016, which I dropped once I realized the $35k car wouldn't be made and that FSD was a fundraising scheme.

My next car is either the NIO Eve in 2020 (which will come with an EyeQ5) or the BMW iNext in 2021. But I will be making tons of videos and comparisons with a friend who has a Tesla, so look forward to that in 2020.
 
One of the things I was going to talk about included debunking the "three years ahead in autonomous hardware" myth repeated by almost every Tesla fan I talk to. Some even say 10 years... It's mind-boggling. Of course, I'd do this by using facts and showcasing the immense list of neural net processors on the market.

Another is a walk-through of building a CNN to debunk the "camera agnostic" myth created by @jimmy_d.
I'd also explain how easy the math is behind deducing details such as memory requirements and the amount of images/data used to train a net, just by looking at details like hyper-parameters, neurons, layers, filters, and input sizes. All the basic stuff that @jimmy_d gets worshiped for.
That would all still be relevant, especially now that Tesla's patents on the thing have surfaced. If you have better data, why keep it to yourself?

I had a Model 3 reservation back in 2016, which I dropped once I realized the $35k car wouldn't be made and that FSD was a fundraising scheme.
Hm, I think you were listed as a Model 3 owner on the teslamotors reddit?
 
I had a look through the recent Tesla patent release.

There's one main patent, which was filed in Sept 2017, and three extensions filed six months later. The main one is titled "Accelerated Mathematical Engine" - I'll attach the PDF here if I can. The others are supplemental and offer little additional insight into the main elements.

The 'Accelerated Mathematical Engine' is a large matrix multiplier which is fed by pipelined data formatting engines which are intended to keep it busy. The patent explicitly calls out the design as focused on performing convolutions for neural networks - especially for image processing. Details of the management system imply that multiple of these engines would be employed as a group for enhanced flexibility, which we can infer as meaning there would be multiple of them on a single IC. There are a lot of details about how the AME is organized internally and some details about how the formatting engines work, supplemented by more detail in the other 3 patents. I'll save you all the grief of a long discussion here and just focus on the highlights.

First - the AME description is a perfect match for what @DamianXVI found in the firmware provided by @verygreen, right down to the multiplier granularity and the control scheme, so I think this patent probably covers what Tesla actually put into the HW3 TRIP and not some loosely related early version of the design concept. So that's nice.

Second - the AME design is optimized for the kind of calculation that is used exclusively in all the camera networks we've seen so far - they are 100% CNN with fine granularity, high input channel count and high output channel count. The AME should run these networks with extremely high utilization rates (over 90%).

Third - while the basic idea of a big matrix MAC (multiply-accumulate) engine that operates on 8-bit fixed point values is the same as what the TPU V1 uses, the AME is not quite the same as a TPU. The TPU is explicitly systolic, keeps weights resident in the array, and feeds input channels and reads output channels synchronously. The AME accumulates output channel results in the matrix and feeds weights and input channels synchronously. So while the goal and concept are similar, the implementations of the TPU and the AME are quite different in their details. My read here is that the TPU went the route that optimizes for power efficiency and the AME is more about ease of use and flexibility. That makes a lot of sense when you consider that, to Google, power and heat management are primary utility criteria. For an in-vehicle application you care less about unit power consumption and more about future-proofing and fewer code bugs.
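To make that dataflow distinction concrete, here's a toy sketch of the two scheduling styles. This is my own illustration of the general idea, not either company's actual microarchitecture:

```python
import numpy as np

# Toy contrast of the two dataflows on a matrix multiply C = A @ B.
# TPU-style weight-stationary: weights (B) sit in the array while
# activations stream through it. Output-stationary, as the AME appears
# to be: partial sums for C stay resident in the array while weights
# and input channels stream in together.

def weight_stationary(A, B):
    C = np.zeros((A.shape[0], B.shape[1]))
    # B is held fixed; each activation row streams across the weights
    for i in range(A.shape[0]):
        C[i, :] = A[i, :] @ B
    return C

def output_stationary(A, B):
    C = np.zeros((A.shape[0], B.shape[1]))
    # accumulators for C stay put; one rank-1 update per streamed pair
    # of (input-channel slice of A, weight row of B)
    for k in range(A.shape[1]):
        C += np.outer(A[:, k], B[k, :])
    return C

A, B = np.random.rand(4, 6), np.random.rand(6, 5)
assert np.allclose(weight_stationary(A, B), output_stationary(A, B))
```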

The AME's engine size seems to be 96x96, which is close to the 96x128 optimum granularity that AKNET_V9 would want on a weight-resident systolic system like the TPU. The AME is more flexible and may not have those kinds of constraints - I haven't redone the analysis for the AME's dataflow yet (patent language tends to obscure the kind of detail that I need). In any case, the AME appears to be an excellent fit for running the biggest net we've seen in the firmware - the only one we've seen which is probably HW3 specific.

A single AME will perform 2*96^2 operations per clock - about 20K. The maximum clock rate for the AME is probably in the 500MHz to 2GHz range, with the lower end of that being more likely. If we go with 500MHz you get about 10 TOPS per AME. I think AKNET_V9 in actual vehicle use probably needs at least a half dozen AMEs at that speed. It would be fewer if Tesla manages a higher internal clock speed, but I think they probably won't go over 1GHz even if they can. The efficiency hit that you get at the higher clock rates probably adds more risk and system cost than they would save at the IC level from the reduced die area. Additionally, they are not real estate constrained (unlike a cellphone or a datacenter), so a bigger, slower IC designed at a mature process node might well be a good tradeoff for them. This chip should be economical at 28nm or lower.
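Running those numbers (the 96x96 array size is from the patent; the clock rates are just my guesses from above):

```python
# Sanity check on the throughput estimate. Only the 96x96 array size
# comes from the patent; the clock rates are assumptions.
array_dim = 96
ops_per_clock = 2 * array_dim**2        # one multiply + one accumulate per cell
print(ops_per_clock)                    # 18,432, i.e. "about 20K"

for clock_ghz in (0.5, 1.0, 2.0):
    tops = ops_per_clock * clock_ghz * 1e9 / 1e12
    print(f"{clock_ghz} GHz -> {tops:.1f} TOPS per AME")
# At 500 MHz that's ~9.2 TOPS per AME, so a half dozen AMEs give
# ~55 TOPS, consistent with the 50-100 TOPS ballpark below.
```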

Anyway - I think this all points toward HW3 being able to realize 50 to 100 TOPS. That's about 10x what HW2 has. It's more than enough to allow for a substantial expansion of what the neural networks in AP2 can do.

 
Noice, looking forward to it.

This won't be happening tonight as planned. If you haven't noticed already... I have disrobed myself.
Everyone can now gaze at my magnificence, if you know where to look.
I'm now thinking of starting a full-featured, production-level video series on SDCs rather than waiting till 2020, and this will be the pilot.
 
This won't be happening tonight as planned. If you haven't noticed already... I have disrobed myself.
Everyone can now gaze at my magnificence, if you know where to look.
I'm now thinking of starting a full-featured, production-level video series on SDCs rather than waiting till 2020, and this will be the pilot.

At this point you've offered and backed down so many times, and given nothing but snarky criticism of the attention other people are getting. I'm just going to assume you're trolling and ignore anything you say until you offer something of substance.
 
The vector computational unit looks interesting too - it looks to be similar to the vector unit that accompanies the AME in the patent, and is something assumed to have been added to Google's TPUv2 to add flexibility.

The vector unit in the patents is an accessory to the AME which expands the range of neural network layer operations that can be efficiently implemented with the AME. The AME design probably incorporates this vector unit in its basic conception, as well as the various control systems and formatters and whatnot that are covered in the supporting patents. I read the three extension patents as a follow-on exercise that identifies more elements of the original NN processor design as being potentially amenable to IP protection, rather than as independent innovations.

A large matrix processor by itself doesn't allow for efficient computation of small convolution operations (3x3, 5x5 or so) - the convolution operations have to be unrolled and the data rearranged before feeding the matrix multiplier, so that its simple cell-adjacent operations can manage the work while still keeping the full multiplier array busy. Input data has to be sliced into array-sized (96xN) chunks, run through the processor, and output will come in waves as the last rows of input data enter the matrix. Shadow registers allow the matrix processor to overlap the end of one operation with the start of the next to improve utilization, but all this data rearrangement, I/O staging, and collating of results requires some pretty complicated choreography if maximum performance is to be extracted.
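For anyone who hasn't seen the trick, here's a minimal sketch of the unrolling idea (often called im2col). This is just the textbook technique, not Tesla's actual formatter logic, and all dimensions are illustrative:

```python
import numpy as np

# Unrolling a KxK convolution into a plain matrix multiply so a big
# MAC array can run it. No padding or striding, to keep it minimal.

def im2col(x, k):
    """x: (in_ch, H, W) -> (H_out*W_out, in_ch*k*k) patch matrix."""
    c, h, w = x.shape
    h_out, w_out = h - k + 1, w - k + 1
    cols = np.empty((h_out * w_out, c * k * k))
    for i in range(h_out):
        for j in range(w_out):
            cols[i * w_out + j] = x[:, i:i+k, j:j+k].ravel()
    return cols

x = np.random.rand(8, 10, 10)            # 8 input channels, 10x10 map
weights = np.random.rand(16, 8, 3, 3)    # 16 output channels, 3x3 kernel
cols = im2col(x, 3)                       # (64, 72) patch matrix
out = cols @ weights.reshape(16, -1).T    # (64, 16): one matrix multiply
```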

Depending on the dimensions of the tensor being processed (number of input channels, number of output channels, input vector length, kernel size and type) and the ordering of the input data that resulted from the preceding layer operations, a variety of different packing arrangements are needed, and some of those benefit from inter-column operations that can be provided at the output of the AME by a vector processor. In CPUs and GPUs these data handling operations are managed in software. NN processors like the AME will opt for doing the data handling and formatting with dedicated hardware controllers on the chip to reduce latency and power consumption and to allow for increased clock rates and reduced instruction cycles. This can be tricky to get right, and doing everything in hardware risks the possibility of not being able to run future algorithms. So adding elements like the vector processor increases flexibility while keeping the most expensive parts of data management in hardware.

Tesla's NN processor needs to be fast and efficient, but it also needs a certain amount of future proofing and the ability to handle novel NN architectures. These supporting innovations provide a balance between fast hardware and flexibility for addressing future needs.
 
Based on what we know so far, HW3 is deployed in all Model 3s built from Jan 9th, 2019 onward. But we'll see if the earnings call today confirms it or not.

I highly doubt this. HW3 (which presumably is a significant cost increase over HW2) would be a completely wasted expense on cars without pre-purchased FSD, which is the vast majority of them. Where are you getting this info?