
Tesla autopilot HW3

Discussion in 'Autopilot & Autonomous/FSD' started by verygreen, Jan 4, 2019.

  1. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    For the past few months Tesla has been slowly sharing details of its upcoming “Hardware 3” (HW3) changes, soon to be introduced into its S/X/3 lineup. Tesla has stated that cars will begin to be built with the new computer sometime in the first half of 2019, and that this is a simple computer swap, with all vehicle sensors (radar, ultrasonics, cameras) staying the same.

    Today we have some information about what HW3 actually will (and won’t) be:

    What do we know about Tesla’s upcoming HW3? Quite a bit now, actually, thanks to Tesla’s latest firmware. The codename of the new HW3 computer is “TURBO”.

    Hardware:

    We believe the new hardware is built around a Samsung Exynos 7xxx SoC, judging by the presence of ARM Cortex-A72 cores (this would not be a particularly new SoC; that Exynos generation is of roughly October 2015 vintage). The HW3 CPU cores are clocked at 1.6GHz, with a Mali GPU at 250MHz and memory clocked at 533MHz.

    HW3 architecture is similar to HW2.5 in that there are two separate compute nodes (called “sides”): the “A” side that does all the work and the “B” side that currently does not do anything.

    It also appears there are some devices attached to this SoC. Obviously there is some eMMC storage, but more importantly there’s a Tesla PCIe device named “TRIP” that works as the NN accelerator. The name might be an acronym for “Tensor <something> Inference Processor”. In fact, there are at least two such “TRIP” devices, possibly two per “side”.

    As of mid-December, this early firmware was still in relatively early bring-up. No actual Autopilot functionality appears to be included yet, with most of the code simply copied over from the existing HW2.5 infrastructure. So far all the cameras seem to be the same.

    It is running Linux kernel 4.14 outside of the usual BuildRoot 2 environment.

    In reviewing the firmware, we find descriptions of quite a few HW3 board revisions already (eight of them, in fact), and the Model 3 and S/X hardware are separate versions too (understandably).

    The “TRIP” device is obviously the most interesting one. A special firmware image containing binary NN (neural net) data is loaded onto it and then queried by the car’s vision code. The device runs at 400MHz. Both “TRIP” devices currently load the same NNs, but possibly only a subset is executed on each?

    With the Exynos SoC being of 2015 vintage, and considering Peter Bannon’s comments on the Q2 2018 earnings call (he said “three years ago when I joined Tesla we did a survey of all of the solutions”, which puts that survey in the second half of 2015), does it look like the current HW2/HW2.5 NVIDIA Autopilot units were always viewed as a stop-gap, and that the lack of computing power everybody was accusing Tesla of at the time of the AP2 release was therefore never seen as important by Tesla?

    SOFTWARE:

    In reviewing the binaries in this new firmware, @DamianXVI was able to work out a pretty good idea of what the “TRIP” coprocessor does on HW3 (he has an outstanding ability to look at and interpret binary data!):

    The “TRIP” software seems to be a straight list of instructions aligned to 32 bytes (256 bits). Programs operate on two types of memory, one for input/output and one for working memory. The former is likely system DRAM and the latter internal SRAM.
    Memory operations include data loading, weight loading, and writing output. Program operations are pipelined, with data loads and computations interleaved and weight fetching happening well upstream of the instructions that actually use those weights. Weights seem to be compressed: they get copied to an internal region that is substantially larger than the source region, with decompression/unpacking happening as part of the weight-loading operation. Intermediate results are kept in working memory, with only final results being written out to shared memory.
    Weights are loaded from shared memory into working memory and maintained in a reserved slot, which is referenced by number in processing instructions. Individual processing instructions reference input, output, and weights in working memory. Some processing instructions do not reference weights; these seem to be pooling operations.
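
    To make the inferred flow concrete, here is a minimal sketch of how such an instruction stream might be interpreted. Everything here (the opcode names, the instruction fields, the weight-slot model, the decompression factor) is an illustrative guess based on the observations above, not something recovered from the firmware:

        from dataclasses import dataclass

        @dataclass
        class Insn:
            op: str         # LOAD_DATA | LOAD_WEIGHTS | COMPUTE | POOL | STORE (hypothetical names)
            src: int = 0    # source offset in shared (DRAM) or working (SRAM) memory
            dst: int = 0    # destination offset
            slot: int = -1  # reserved weight slot referenced by COMPUTE; -1 = no weights

        def decompress(blob):
            # stand-in for the observed unpacking: the internal copy ends up
            # substantially larger than the source region
            return blob * 4

        def layer(data, weights):
            return [d * weights[0] for d in data]   # stand-in for a conv/FC layer

        def pool(data):
            return data[::2]                        # stand-in for a pooling operation

        def run(program, shared):
            """Walk a straight-line program: no branches, just load / compute /
            store, with weights held in numbered reserved slots."""
            working, slots = {}, {}
            for i in program:
                if i.op == "LOAD_DATA":       # shared DRAM -> working SRAM
                    working[i.dst] = shared[i.src]
                elif i.op == "LOAD_WEIGHTS":  # fetched early, unpacked on the way in
                    slots[i.slot] = decompress(shared[i.src])
                elif i.op == "COMPUTE":       # processing instruction with weights
                    working[i.dst] = layer(working[i.src], slots[i.slot])
                elif i.op == "POOL":          # processing instruction without weights
                    working[i.dst] = pool(working[i.src])
                elif i.op == "STORE":         # only final results go back to shared memory
                    shared[i.dst] = working[i.src]

        shared = {0: [1, 2, 3, 4], 1: [2]}  # 0: input activations, 1: packed weights
        run([Insn("LOAD_WEIGHTS", src=1, slot=0),
             Insn("LOAD_DATA", src=0, dst=0),
             Insn("COMPUTE", src=0, dst=1, slot=0),
             Insn("POOL", src=1, dst=2),
             Insn("STORE", src=2, dst=2)], shared)
        print(shared[2])  # -> [2, 6]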

    @DamianXVI created graphical visualizations of this data flow for some of the networks observed in the binaries. This is not a visualization of the network architecture; it is a visualization of instructions and their data dependencies. In these visualizations, green boxes are data loads/stores and white boxes are weight loads. Blue boxes are computation instructions with weights; red and orange boxes are computation blocks without weights. Black links show output/input overlap between associated processing operations, and blue links connect associated weight data. These visualizations represent a rough and cursory understanding of the data flow: many links are likely missing and some might be wrong. Regardless, you can see the complexity being introduced with these networks.

    [Attached visualizations: Network0_Main.png, Network1_Narrow.png, Network3_Pillar.png, Network4_Repeater.png]

    What is very interesting is that @DamianXVI concluded these visualizations look like GoogleNet. He did not set out to check whether Tesla’s architecture was similar to GoogleNet; he hadn’t even seen GoogleNet before, but as he assembled the visualizations the similarities appeared.

    After understanding the new hardware and NN architecture a bit, we asked @jimmy_d to comment, and here’s what he had to say:

    “Damian’s analysis describes exactly what you’d want in an NN processor. A small number of operations that distill the essence of processing a neural network: load input from shared memory / load weights from shared memory / process a layer and save results to on-chip memory / process the next layer … / write the output to shared memory. It does the maximum amount of work in hardware but leaves enough flexibility to efficiently execute any kind of neural network.

    And thanks to Damian’s heroic file-format analysis, I was able to take a look at some neural network dataflow diagrams and make some estimates of what the associated HW3 networks are doing. Unfortunately, I didn’t find anything to get excited about. The networks I looked at are probably a HW3-compatible port of the networks that are currently running on HW2.

    What I see is a set of networks that are somewhat refined compared to earlier versions, but with basically the same inputs and outputs, and small enough that they can run on the GPU in HW2. So still no further sightings of “AKNET_V9”, the unified, multi-frame, camera-agnostic architecture that I got a glimpse of last year. Karpathy mentioned on the previous earnings call that Tesla already had bigger networks with better performance that require HW3 to run. What I’ve seen so far in this new HW3 firmware is not those networks.

    What we know about the HW3 NN processor right now is pretty limited. Apparently there are two “TRIP” units which seem to be organized as big matrix multipliers with integrated accumulators, nonlinear operators, and substantial integrated memory for storing layer activations. Additionally it looks like weight decompression is implemented in hardware. This is what I get from looking at the primitives in the dataflow and considering what it would take to implement them in hardware. Two big unknowns at the moment are the matrix multiplier size and the onboard memory size. That, plus the DRAM I/O bus width, would let us estimate the performance envelope. We can do a rough estimate as follows:

    Damian’s analysis shows a preference for 256-byte block sizes in the load/store instructions. If the matrix multiplier input bus is that width, then it suggests the multiplier is 256xN in size. There are certain architectural advantages to being approximately square, so let’s assume 256x256 for the multiplier size and that it performs one operation per clock at @verygreen’s identified clock rate of 400MHz. That gives us 26 TMACs per second, which is 52 Tops (a MAC is one multiply and one add, which equals two operations). So one TRIP would give us 52 Tops and two of them would give us 104 Tops. This assumes perfect utilization; actual utilization is unlikely to be higher than 95% and is probably closer to 75%. Still, it’s a formidable amount of processing for neural network applications. Let’s go with 75% utilization, which gives us 40 Tops per TRIP or 80 Tops total.

    As a point of reference, Google’s TPU V1, which is the one that Google uses to actually *run* neural networks (the other versions are optimized for training), is very similar to the specs I’ve outlined above. From Google’s published data on that part we can tell that the estimates above are reasonable, probably even conservative. Google’s part runs at 700MHz and benchmarks at 92 Tops peak in actual use processing convolutional neural networks, the same kind of neural network used by Tesla in Autopilot. One likely difference is going to be onboard memory: Google’s TPU has 27MB, but Tesla would likely want a lot more than that because they want to run much heavier layers than the ones the TPU was optimized for. I’d guess they need at least 75MB to run AKNET_V9. All my estimates assume they have budgeted enough onboard SRAM to avoid having to dump intermediate results back to DRAM, which is probably a safe bet.

    With that performance level, the HW3 neural nets that I see in this firmware could be run at 1000 frames per second (all cameras simultaneously). This is massive overkill; there’s little reason to run much faster than 40fps for a driving application. The previously noted AKNET_V9 “monster” neural network requires something like 600 billion MACs to process one frame. So a single “TRIP”, using the estimated performance above, could run AKNET_V9 at 66 frames per second. This is closer to the sort of performance that would make sense, and AKNET_V9 would be about the size of network one would expect to see running on the TRIP given the above assumptions.”
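
    For reference, the arithmetic in the estimate above can be transcribed directly. Every input here (the multiplier size, the clock, the utilization figure, the AKNET_V9 per-frame workload) is an assumption from the discussion, not a measurement:

        # Back-of-the-envelope transcription of the estimate quoted above.
        mult_rows = mult_cols = 256     # assumed multiplier array size
        clock_hz = 400e6                # TRIP clock identified in the firmware
        macs_per_s = mult_rows * mult_cols * clock_hz   # ~2.6e13 = ~26 TMACs/s
        peak_tops = 2 * macs_per_s / 1e12               # 1 MAC = 2 ops -> ~52 Tops
        eff_tops = 0.75 * peak_tops                     # 75% utilization -> ~39 (quote rounds to 40)
        print(f"one TRIP: ~{peak_tops:.0f} Tops peak, ~{eff_tops:.0f} Tops effective")
        print(f"two TRIPs: ~{2 * eff_tops:.0f} Tops effective")

        # Frame-rate check against the quoted AKNET_V9 workload of roughly
        # 600 billion MACs (~1.2 trillion ops) per frame:
        ops_per_frame = 2 * 600e9
        print(f"AKNET_V9: ~{eff_tops * 1e12 / ops_per_frame:.0f} fps per TRIP, "
              f"~{2 * eff_tops * 1e12 / ops_per_frame:.0f} fps with both")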
     
    • Informative x 74
    • Like x 13
    • Love x 12
    • Helpful x 1
  2. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    Duh, of course there had to be a typo in the title that I totally missed, and apparently I cannot edit it?
     
    • Funny x 14
    • Like x 2
  3. Anner J. Bonilla

    Joined:
    Oct 10, 2014
    Messages:
    107
    Location:
    Miami FL
    Great writeup.

    Quick question.

    Is this TRIP firmware similar to an FPGA image being sent to the TRIP device over PCIe? Is it in any particular recognized format, e.g. a bitstream for a Xilinx or Altera FPGA?
     
    • Like x 1
  4. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    It's not an FPGA image. It's really NN "code" + "data", if you will.
     
    • Helpful x 1
    • Informative x 1
    • Like x 1
  5. BigD0g

    BigD0g Active Member

    Joined:
    Jan 12, 2017
    Messages:
    1,922
    Location:
    Somewhere
    You can report the post and ask to have the subject fixed. Great writeup, very informative!
     
    • Helpful x 2
    • Like x 2
  6. lunitiks

    lunitiks Cool James & Black Teacher

    Joined:
    Nov 19, 2016
    Messages:
    2,708
    Location:
    Prawn Island, VC
    Those charts by D-man!

    [image: mind-blown GIF]
     
    • Like x 5
    • Love x 2
    • Funny x 1
  7. mongo

    mongo Well-Known Member

    Joined:
    May 3, 2017
    Messages:
    9,909
    Location:
    Michigan
    Could this be running multiple copies of the kernel in parallel? 16x16 = 256 weights, repeated multiple times to do a few regions at once? So 1024 locations would run the same kernel 4 times.

    The duplicate processor may be for safety redundancy.
     
    • Like x 4
  8. BigD0g

    BigD0g Active Member

    Joined:
    Jan 12, 2017
    Messages:
    1,922
    Location:
    Somewhere
    There appears to be an A side and a B side, à la HW2.5, so they could run both if they wanted, but it appears from the post that it's just like 2.5, where the A side does everything and the B side just sits there.
     
    • Helpful x 1
    • Like x 1
  9. mongo

    mongo Well-Known Member

    Joined:
    May 3, 2017
    Messages:
    9,909
    Location:
    Michigan
    Right, so it's sized for one half, without spending development effort on the redundancy side of the code. For FSD, run the same code on both sides; if they disagree, fault and fail safe.
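
    A minimal sketch of that cross-check idea (purely illustrative; the function and fault path here are hypothetical, not derived from the firmware):

        def redundant_step(compute_a, compute_b, inputs):
            """Run the same computation on both sides; fault and fail safe
            on any disagreement between the A and B nodes."""
            out_a = compute_a(inputs)
            out_b = compute_b(inputs)
            if out_a != out_b:  # disagreement between the two sides
                raise RuntimeError("fault: redundant nodes disagree -> fail-safe")
            return out_a

        # e.g. both sides run the same step; a mismatch trips the fault path
        plan = redundant_step(lambda x: x + 1, lambda x: x + 1, 41)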
     
    • Like x 3
  10. MorrisonHiker

    MorrisonHiker S 100D 2019.32.2.2

    Joined:
    Mar 8, 2015
    Messages:
    6,918
    Location:
    Colorado
    Does that mean the "Tesla chip" is really from Samsung or is this a different processor altogether? I thought Tesla was working on their own chip design. :confused:
     
    • Like x 1
    • Funny x 1
  11. chillaban

    chillaban Active Member

    Joined:
    May 5, 2016
    Messages:
    3,610
    Location:
    Bay Area
    The Tesla PCIe accelerator is probably their own design. They likely did not design their own ARM SoCs for the part that runs Linux and the control algorithm / controls the NN accelerator, which makes sense. It takes a lot of resources to design your own CPUs and SoC (and then bring up an OS on it), and I don't see a clear reason why Tesla benefits from that versus just taking the lowest competitive bid on some sort of SoC that runs Linux.
     
    • Informative x 6
    • Like x 6
    • Helpful x 2
  12. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    The TRIP chip is the Tesla chip (they still contract some fab to make it, likely Samsung, since they switched to their SoC too?). It's not a general-purpose CPU; it's more like a Google TPU, a very special-purpose device.
     
    • Informative x 11
    • Helpful x 1
    • Like x 1
  13. kdday

    kdday Member

    Joined:
    Oct 29, 2016
    Messages:
    768
    Location:
    AZ
    Very interesting that the latest firmware files are a port of EAP onto HW3 and don't yet yield any insight into the mysterious "AKNET_V9" all-seeing eye stuff. I wonder when/if we'll start to see those NNs deployed.
     
    • Like x 3
  14. Bladerskb

    Bladerskb Senior Software Engineer

    Joined:
    Oct 24, 2016
    Messages:
    1,676
    Location:
    Michigan
    #14 Bladerskb, Jan 4, 2019
    Last edited: Jan 4, 2019
    Without reading the post at all and only looking at the picture, I just wanted to say that non-stacked convolution layers are pretty standard in the world of CNNs nowadays. It all started with the first Inception network, with which Google really showed that you don't have to just stack conv layers with pooling and dropout on top of each other; you can get clever with it. That, plus the use of smaller conv filters, is nothing mind-blowing today; it's pretty standard.

    Here is Inception v1 from 2014

    [image: Inception v1 architecture diagram]

    EDIT: After reading the post, my comments don't change. The only thing I would add is that I'm surprised people still take anything jimmy_d says seriously.
     
    • Disagree x 9
    • Like x 1
  15. jimmy_d

    jimmy_d Deep Learning Dork

    Joined:
    Jan 26, 2016
    Messages:
    413
    Location:
    San Francisco, CA
    How about "Tesla Redundant Inference Processor" for trip?
     
    • Like x 5
  16. Engr

    Engr Member

    Joined:
    Nov 8, 2018
    Messages:
    28
    Location:
    UK
    Yeah, I like your sneaky comment of "too arrogant to read the post that mentioned the exact comparison."

    Keep it up.
     
    • Like x 2
  17. shiny.sky.00

    shiny.sky.00 New Member

    Joined:
    Jan 4, 2019
    Messages:
    2
    Location:
    Earth
    How did you conclude it was an Exynos SoC? There is no Exynos SoC with the Cortex A72 core.
     
  18. Bladerskb

    Bladerskb Senior Software Engineer

    Joined:
    Oct 24, 2016
    Messages:
    1,676
    Location:
    Michigan
    I always like @verygreen's posts because they are filled with actual evidence. I'm currently at work, so I couldn't read it, but I wanted to make a quick reply as a placeholder, so I just glanced at the pictures. It turned out they were talking about the similarities with the Inception network. When I initially reviewed the post I saw the mind-blown gif and thought @verygreen had posted it, which led to my response.

    But after reading the post, my only conclusion is that I wish he hadn't included jimmy's comments, and I will not thumb it up because of that, as I'm surprised people still take anything jimmy_d says seriously.
     
    • Disagree x 2
  19. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    I'd go with "Totally Redunkulous Inference Processor" in that case ;)
     
    • Funny x 9
    • Like x 1
    • Disagree x 1
  20. verygreen

    verygreen Curious member

    Joined:
    Jan 16, 2017
    Messages:
    2,418
    Location:
    TN
    • Informative x 3
