
Tesla autopilot HW3

Discussion in 'Autopilot & Autonomous/FSD' started by verygreen, Jan 4, 2019.

  1. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    For the past few months Tesla has been slowly sharing details of its upcoming “Hardware 3” (HW3) changes, soon to be introduced into its S/X/3 lineup. Tesla has stated that cars will begin to be built with the new computer sometime in the first half of 2019, and that this is a simple computer swap, with all vehicle sensors (radar, ultrasonics, cameras) staying the same.

    Today we have some information about what HW3 actually will (and won’t) be:

    What do we know about Tesla’s upcoming HW3? Quite a bit now, actually, thanks to Tesla’s latest firmware. The codename of the new HW3 computer is “TURBO”.

    Hardware:

    We believe the new hardware is built around a Samsung Exynos 7xxx SoC, judging by the presence of ARM Cortex-A72 cores (this would not be a particularly new SoC; that Exynos generation is of roughly October 2015 vintage). The HW3 CPU cores are clocked at 1.6GHz, with a Mali GPU at 250MHz and memory clocked at 533MHz.

    HW3 architecture is similar to HW2.5 in that there are two separate compute nodes (called “sides”): the “A” side that does all the work and the “B” side that currently does not do anything.

    It also appears there are some devices attached to this SoC. Obviously there is some eMMC storage, but more importantly there’s a Tesla PCIe device named “TRIP” that works as the NN accelerator. The name might be an acronym for “Tensor <something> Inference Processor”. In fact, there are at least two such “TRIP” devices, possibly two per “side”.

    As of mid-December, this early firmware was still in relatively early bring-up. No actual Autopilot functionality appears to be included yet, with most of the code simply copied over from the existing HW2.5 infrastructure. So far all the cameras seem to be the same.

    It is running Linux kernel 4.14 outside of the usual BuildRoot 2 environment.

    In reviewing the firmware, we find descriptions of quite a few HW3 board revisions already (eight of them, in fact), and the Model 3 and S/X hardware are separate versions too (understandably).

    The “TRIP” device is obviously the most interesting one. A special firmware image containing binary NN (neural net) data is loaded onto it and then queried by the car’s vision code. The device runs at 400MHz. Both “TRIP” devices currently load the same NNs, but possibly only a subset is executed on each?

    With the Exynos SoC being of 2015 vintage, and considering Peter Bannon’s comments on the Q2 2018 earnings call (he said “three years ago when I joined Tesla we did a survey of all of the solutions”, which puts that survey in the second half of 2015), does it look like the current HW2/HW2.5 NVIDIA Autopilot units were always viewed as a stop-gap, and that the lack of computing power everybody was accusing Tesla of at the time of the AP2 release was therefore never seen as important by Tesla?

    SOFTWARE:

    In reviewing the binaries in this new firmware, @DamianXVI was able to work out a pretty good idea of what the “TRIP” coprocessor does on HW3 (he has an outstanding ability to look at and interpret binary data!):

    The “TRIP” software seems to be a straight list of instructions aligned to 32 bytes (256 bits). Programs operate on two types of memory, one for input/output and one for working memory. The former is likely system DRAM and the latter internal SRAM.
    Memory operations include data loading, weight loading, and writing output. Program operations are pipelined, with data loads and computations interleaved and weight fetching happening well upstream of the instructions that actually use those weights. Weights seem to be compressed: they get copied to an internal region that is substantially larger than the source region, with decompression/unpacking happening as part of the weight-loading operation. Intermediate results are kept in working memory, with only final results being written out to shared memory.
    Weights are loaded from shared memory into working memory and maintained in a reserved slot, which is referenced by number in processing instructions. Individual processing instructions reference input, output, and weights in working memory. Some processing instructions do not reference weights; these seem to be pooling operations.
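
    To make the inferred flow concrete, here is a minimal sketch of how such an instruction stream might be interpreted. Everything here (the opcode names, the instruction fields, the weight-slot model, the decompression factor) is an illustrative guess based on the observations above, not something recovered from the firmware:

        from dataclasses import dataclass

        @dataclass
        class Insn:
            op: str         # LOAD_DATA | LOAD_WEIGHTS | COMPUTE | POOL | STORE (hypothetical names)
            src: int = 0    # source offset in shared (DRAM) or working (SRAM) memory
            dst: int = 0    # destination offset
            slot: int = -1  # reserved weight slot referenced by COMPUTE; -1 = no weights

        def decompress(blob):
            # stand-in for the observed unpacking: the internal copy ends up
            # substantially larger than the source region
            return blob * 4

        def layer(data, weights):
            return [d * weights[0] for d in data]   # stand-in for a conv/FC layer

        def pool(data):
            return data[::2]                        # stand-in for a pooling operation

        def run(program, shared):
            """Walk a straight-line program: no branches, just load / compute /
            store, with weights held in numbered reserved slots."""
            working, slots = {}, {}
            for i in program:
                if i.op == "LOAD_DATA":       # shared DRAM -> working SRAM
                    working[i.dst] = shared[i.src]
                elif i.op == "LOAD_WEIGHTS":  # fetched early, unpacked on the way in
                    slots[i.slot] = decompress(shared[i.src])
                elif i.op == "COMPUTE":       # processing instruction with weights
                    working[i.dst] = layer(working[i.src], slots[i.slot])
                elif i.op == "POOL":          # processing instruction without weights
                    working[i.dst] = pool(working[i.src])
                elif i.op == "STORE":         # only final results go back to shared memory
                    shared[i.dst] = working[i.src]

        shared = {0: [1, 2, 3, 4], 1: [2]}  # 0: input activations, 1: packed weights
        run([Insn("LOAD_WEIGHTS", src=1, slot=0),
             Insn("LOAD_DATA", src=0, dst=0),
             Insn("COMPUTE", src=0, dst=1, slot=0),
             Insn("POOL", src=1, dst=2),
             Insn("STORE", src=2, dst=2)], shared)
        print(shared[2])  # -> [2, 6]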

    @DamianXVI created graphical visualizations of this data flow for some of the networks observed in the binaries. This is not a visualization of the network architecture; it is a visualization of instructions and their data dependencies. In these visualizations, green boxes are data loads/stores and white boxes are weight loads. Blue boxes are computation instructions with weights; red and orange boxes are computation blocks without weights. Black links show output/input overlap between associated processing operations, and blue links connect associated weight data. These visualizations represent a rough and cursory understanding of the data flow: many links are likely missing and some might be wrong. Regardless, you can see the complexity being introduced with these networks.

    [Attached visualizations: Network0_Main.png, Network1_Narrow.png, Network3_Pillar.png, Network4_Repeater.png]

    What is very interesting is that @DamianXVI concluded these visualizations look like GoogleNet. He did not set out to check whether Tesla’s architecture was similar to GoogleNet; he hadn’t even seen GoogleNet before, but as he assembled the visualizations the similarities appeared.

    After understanding the new hardware and NN architecture a bit, we asked @jimmy_d to comment, and here’s what he had to say:

    “Damian’s analysis describes exactly what you’d want in an NN processor. A small number of operations that distill the essence of processing a neural network: load input from shared memory / load weights from shared memory / process a layer and save results to on-chip memory / process the next layer … / write the output to shared memory. It does the maximum amount of work in hardware but leaves enough flexibility to efficiently execute any kind of neural network.

    And thanks to Damian’s heroic file-format analysis, I was able to take a look at some neural network dataflow diagrams and make some estimates of what the associated HW3 networks are doing. Unfortunately, I didn’t find anything to get excited about. The networks I looked at are probably a HW3-compatible port of the networks that are currently running on HW2.

    What I see is a set of networks that are somewhat refined compared to earlier versions, but with basically the same inputs and outputs, and small enough that they can run on the GPU in HW2. So still no further sightings of “AKNET_V9”, the unified, multi-frame, camera-agnostic architecture that I got a glimpse of last year. Karpathy mentioned on the previous earnings call that Tesla already had bigger networks with better performance that require HW3 to run. What I’ve seen so far in this new HW3 firmware is not those networks.

    What we know about the HW3 NN processor right now is pretty limited. Apparently there are two “TRIP” units which seem to be organized as big matrix multipliers with integrated accumulators, nonlinear operators, and substantial integrated memory for storing layer activations. Additionally it looks like weight decompression is implemented in hardware. This is what I get from looking at the primitives in the dataflow and considering what it would take to implement them in hardware. Two big unknowns at the moment are the matrix multiplier size and the onboard memory size. That, plus the DRAM I/O bus width, would let us estimate the performance envelope. We can do a rough estimate as follows:

    Damian’s analysis shows a preference for 256-byte block sizes in the load/store instructions. If the matrix multiplier input bus is that width, then it suggests the multiplier is 256xN in size. There are certain architectural advantages to being approximately square, so let’s assume 256x256 for the multiplier size and that it performs one operation per clock at @verygreen’s identified clock rate of 400MHz. That gives us 26 TMACs per second, which is 52 Tops (a MAC is one multiply and one add, which equals two operations). So one TRIP would give us 52 Tops and two of them would give us 104 Tops. This assumes perfect utilization; actual utilization is unlikely to be higher than 95% and is probably closer to 75%. Still, it’s a formidable amount of processing for neural network applications. Let’s go with 75% utilization, which gives us 40 Tops per TRIP or 80 Tops total.

    As a point of reference, Google’s TPU V1, which is the one that Google uses to actually *run* neural networks (the other versions are optimized for training), is very similar to the specs I’ve outlined above. From Google’s published data on that part we can tell that the estimates above are reasonable, probably even conservative. Google’s part runs at 700MHz and benchmarks at 92 Tops peak in actual use processing convolutional neural networks, the same kind of neural network used by Tesla in Autopilot. One likely difference is going to be onboard memory: Google’s TPU has 27MB, but Tesla would likely want a lot more than that because they want to run much heavier layers than the ones the TPU was optimized for. I’d guess they need at least 75MB to run AKNET_V9. All my estimates assume they have budgeted enough onboard SRAM to avoid having to dump intermediate results back to DRAM, which is probably a safe bet.

    With that performance level, the HW3 neural nets that I see in this firmware could be run at 1000 frames per second (all cameras simultaneously). This is massive overkill; there’s little reason to run much faster than 40fps for a driving application. The previously noted AKNET_V9 “monster” neural network requires something like 600 billion MACs to process one frame. So a single “TRIP”, using the estimated performance above, could run AKNET_V9 at 66 frames per second. This is closer to the sort of performance that would make sense, and AKNET_V9 would be about the size of network one would expect to see running on the TRIP given the above assumptions.”
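
    For reference, the arithmetic in the estimate above can be transcribed directly. Every input here (the multiplier size, the clock, the utilization figure, the AKNET_V9 per-frame workload) is an assumption from the discussion, not a measurement:

        # Back-of-the-envelope transcription of the estimate quoted above.
        mult_rows = mult_cols = 256     # assumed multiplier array size
        clock_hz = 400e6                # TRIP clock identified in the firmware
        macs_per_s = mult_rows * mult_cols * clock_hz   # ~2.6e13 = ~26 TMACs/s
        peak_tops = 2 * macs_per_s / 1e12               # 1 MAC = 2 ops -> ~52 Tops
        eff_tops = 0.75 * peak_tops                     # 75% utilization -> ~39 (quote rounds to 40)
        print(f"one TRIP: ~{peak_tops:.0f} Tops peak, ~{eff_tops:.0f} Tops effective")
        print(f"two TRIPs: ~{2 * eff_tops:.0f} Tops effective")

        # Frame-rate check against the quoted AKNET_V9 workload of roughly
        # 600 billion MACs (~1.2 trillion ops) per frame:
        ops_per_frame = 2 * 600e9
        print(f"AKNET_V9: ~{eff_tops * 1e12 / ops_per_frame:.0f} fps per TRIP, "
              f"~{2 * eff_tops * 1e12 / ops_per_frame:.0f} fps with both")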
     
    • Informative x 74
    • Like x 13
    • Love x 12
    • Helpful x 1
  2. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    Duh, of course there had to be a typo in the title that I totally missed, and apparently I cannot edit it?
     
    • Funny x 14
    • Like x 2
  3. Anner J. Bonilla

    Joined:
    Oct 10, 2014
    Messages:
    107
    Location:
    Miami FL
    Great writeup.

    Quick question.

    Is this TRIP firmware similar to an FPGA image being sent to the TRIP device over PCIe? Is it in any particular recognized format, e.g. a bitstream for a Xilinx or Altera FPGA?
     
    • Like x 1
  4. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    It's not an FPGA image. It's really NN "code" + "data", if you will.
     
    • Helpful x 1
    • Informative x 1
    • Like x 1
  5. BigD0g

    BigD0g Active Member

    Joined:
    Jan 12, 2017
    Messages:
    1,922
    Location:
    Somewhere
    You can report the post and ask to have the subject fixed. Great writeup, very informative!
     
    • Helpful x 2
    • Like x 2
  6. lunitiks

    lunitiks Cool James & Black Teacher

    Joined:
    Nov 19, 2016
    Messages:
    2,708
    Location:
    Prawn Island, VC
    Those charts by D-man!

    [image: mind-blown GIF]
     
    • Like x 5
    • Love x 2
    • Funny x 1
  7. mongo

    mongo Well-Known Member

    Joined:
    May 3, 2017
    Messages:
    9,909
    Location:
    Michigan
    Could this be running multiple copies of the kernel in parallel? 16x16 = 256 weights, repeated multiple times to do a few regions at once? So 1024 locations would run the same kernel 4 times.

    The duplicate processor may be for safety redundancy.
     
    • Like x 4
  8. BigD0g

    BigD0g Active Member

    Joined:
    Jan 12, 2017
    Messages:
    1,922
    Location:
    Somewhere
    There appears to be an A side and a B side, à la HW2.5, so they could run both if they wanted, but it appears from the post that it's just like 2.5, where the A side does everything and the B side just sits there.
     
    • Helpful x 1
    • Like x 1
  9. mongo

    mongo Well-Known Member

    Joined:
    May 3, 2017
    Messages:
    9,909
    Location:
    Michigan
    Right, so it's sized for one half, without spending development effort on the redundancy side of the code. For FSD, run the same code on both sides; if they disagree, fault and fail safe.
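
    A minimal sketch of that cross-check idea (purely illustrative; the function and fault path here are hypothetical, not derived from the firmware):

        def redundant_step(compute_a, compute_b, inputs):
            """Run the same computation on both sides; fault and fail safe
            on any disagreement between the A and B nodes."""
            out_a = compute_a(inputs)
            out_b = compute_b(inputs)
            if out_a != out_b:  # disagreement between the two sides
                raise RuntimeError("fault: redundant nodes disagree -> fail-safe")
            return out_a

        # e.g. both sides run the same step; a mismatch trips the fault path
        plan = redundant_step(lambda x: x + 1, lambda x: x + 1, 41)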
     
    • Like x 3
  10. MorrisonHiker

    MorrisonHiker S 100D 2019.32.2.2

    Joined:
    Mar 8, 2015
    Messages:
    6,918
    Location:
    Colorado
    Does that mean the "Tesla chip" is really from Samsung or is this a different processor altogether? I thought Tesla was working on their own chip design. :confused:
     
    • Like x 1
    • Funny x 1
  11. chillaban

    chillaban Active Member

    Joined:
    May 5, 2016
    Messages:
    3,610
    Location:
    Bay Area
    The Tesla PCIe accelerator is probably their own design. They likely did not design their own ARM SoCs for the part that runs Linux and the control algorithm / controls the NN accelerator, which makes sense. It takes a lot of resources to design your own CPUs and SoC (and then bring up an OS on it), and I don't see a clear reason why Tesla benefits from that versus just taking the lowest competitive bid on some sort of SoC that runs Linux.
     
    • Informative x 6
    • Like x 6
    • Helpful x 2
  12. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    The TRIP chip is the Tesla chip (they still contract some fab to make it, likely Samsung, since they switched to their SoC too?). It's not a general-purpose CPU; it's more like a Google TPU, a very special-purpose device.
     
    • Informative x 11
    • Helpful x 1
    • Like x 1
  13. kdday

    kdday Member

    Joined:
    Oct 29, 2016
    Messages:
    768
    Location:
    AZ
    Very interesting that the latest firmware files are a port of EAP onto HW3 and don't yet yield any insight into the mysterious "AKNET_V9" all-seeing eye stuff. I wonder when/if we'll start to see those NNs deployed.
     
    • Like x 3
  14. Bladerskb

    Bladerskb Senior Software Engineer

    Joined:
    Oct 24, 2016
    Messages:
    1,676
    Location:
    Michigan
    #14 Bladerskb, Jan 4, 2019
    Last edited: Jan 4, 2019
    Without reading the post at all and only looking at the picture, I just wanted to say that non-stacked convolution layers are pretty standard in the world of CNNs nowadays. It all started with the first Inception network, with which Google really showed that you don't have to just stack conv layers with pooling and dropout on top of each other; you can get clever with it. That, plus the use of smaller conv filters, is nothing mind-blowing today; it's pretty standard.

    Here is Inception v1 from 2014

    [image: Inception v1 architecture diagram]

    EDIT: After reading the post, my comments don't change. The only thing I would add is that I'm surprised people still take anything jimmy_d says seriously.
     
    • Disagree x 9
    • Like x 1
  15. jimmy_d

    jimmy_d Deep Learning Dork

    Joined:
    Jan 26, 2016
    Messages:
    413
    Location:
    San Francisco, CA
    How about "Tesla Redundant Inference Processor" for trip?
     
    • Like x 5
  16. Engr

    Engr Member

    Joined:
    Nov 8, 2018
    Messages:
    28
    Location:
    UK
    Yeah, I like your sneaky comment of "too arrogant to read the post that mentioned the exact comparison."

    Keep it up.
     
    • Like x 2
  17. shiny.sky.00

    shiny.sky.00 New Member

    Joined:
    Jan 4, 2019
    Messages:
    2
    Location:
    Earth
    How did you conclude it was an Exynos SoC? There is no Exynos SoC with the Cortex A72 core.
     
  18. Bladerskb

    Bladerskb Senior Software Engineer

    Joined:
    Oct 24, 2016
    Messages:
    1,676
    Location:
    Michigan
    I always like @verygreen's posts because they are filled with actual evidence. I'm currently at work, so I couldn't read it, but I wanted to make a quick reply as a placeholder, so I just glanced at the pictures. It turned out they were talking about the similarities with the Inception network. When I initially reviewed the post I saw the mind-blown gif and thought @verygreen had posted it, which led to my response.

    But after reading the post, my only conclusion is that I wish he hadn't included jimmy's comments, and I will not thumb it up because of that, as I'm surprised people still take anything jimmy_d says seriously.
     
    • Disagree x 2
  19. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    I'd go with "Totally Redunkulous Inference Processor" in that case ;)
     
    • Funny x 9
    • Like x 1
    • Disagree x 1
  20. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    • Informative x 3
