I had a look through the recent Tesla patent release.
There's one main patent, filed in Sept 2017, and three extensions filed six months later. The main one is titled "Accelerated Mathematical Engine" - I'll attach the PDF here if I can. The others are supplemental and offer little additional insight into the main elements.
The 'Accelerated Mathematical Engine' is a large matrix multiplier fed by pipelined data formatting engines which are intended to keep it busy. The patent explicitly calls out the design as focused on performing convolutions for neural networks - especially for image processing. Details of the management system imply that multiple of these engines would be employed as a group for enhanced flexibility, which suggests there would be several of them on a single IC. There are a lot of details about how the AME is organized internally and some details about how the formatting engines work, supplemented by more detail in the other 3 patents. I'll save you all the grief of a long discussion here and just focus on the highlights.
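To make the "formatter feeds a matrix multiplier" idea concrete, here's a rough sketch of the standard im2col lowering that turns a convolution into one big matrix multiply - the kind of rearrangement a data formatting engine would do in hardware. The function name and shapes are mine for illustration, not from the patent.

```python
import numpy as np

def im2col_conv(x, w):
    """Lower a 2D convolution to a single matrix multiply, the way a
    data formatter would feed a matrix engine (valid padding, stride 1).
    x: input activations, shape (H, W, C_in)
    w: weights, shape (K, K, C_in, C_out)
    """
    H, W, C_in = x.shape
    K, _, _, C_out = w.shape
    Ho, Wo = H - K + 1, W - K + 1
    # "Format" the input: each output pixel becomes one row of K*K*C_in values
    rows = np.stack([x[i:i+K, j:j+K, :].ravel()
                     for i in range(Ho) for j in range(Wo)])
    # One big matrix multiply then does the whole convolution at once
    out = rows @ w.reshape(K * K * C_in, C_out)
    return out.reshape(Ho, Wo, C_out)

x = np.random.rand(8, 8, 3)      # toy 8x8 image, 3 input channels
w = np.random.rand(3, 3, 3, 16)  # 3x3 kernel, 16 output channels
y = im2col_conv(x, w)
print(y.shape)  # (6, 6, 16)
```

The high input and output channel counts in Tesla's camera networks matter here: they make the matrix operands tall and wide, which is exactly what keeps a big multiplier array full.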
First - the AME description is a perfect match for what @DamianXVI found in the firmware provided by @verygreen, right down to the multiplier granularity and the control scheme, so I think this patent probably covers what Tesla actually put into the HW3 TRIP and not some loosely related early version of the design concept. So that's nice.
Second - the AME design is optimized for exactly the kind of calculation used in all the camera networks we've seen so far - they are 100% CNN with fine granularity, high input channel counts and high output channel counts. AME should run these networks with extremely high utilization rates (over 90%).
Third - while the basic idea of a big matrix MAC (multiply-accumulate) engine that operates on 8 bit fixed point values is the same as what the TPU V1 uses, the AME is not quite the same as a TPU. The TPU is explicitly systolic, keeps weights resident in the array, and feeds input channels and reads output channels synchronously. The AME accumulates output channel results in the matrix and feeds weights and input channels synchronously. So while the goal and concept are similar, the implementations of the TPU and the AME are quite different in their details. My read here is that the TPU went the route that optimized for power efficiency and the AME is more about ease of use and flexibility. That makes a lot of sense when you consider that, to Google, power and heat management are primary utility criteria. For an in-vehicle application you care less about unit power consumption and more about future-proofing and fewer code bugs.
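The dataflow distinction is easier to see in code. Here's a minimal sketch of the two accumulation orders - "weight-stationary" for the TPU style and "output-stationary" for what the patent describes - showing they compute the same matrix product but stream different things through the array. Function names and loop structure are mine; real hardware pipelines these steps.

```python
import numpy as np

def weight_stationary(a, w):
    """TPU-style: weights stay resident in the array; activations
    stream through and finished results flow out."""
    n, k = a.shape
    _, m = w.shape
    out = np.zeros((n, m))
    for i in range(n):        # stream one input row per step
        out[i] = a[i] @ w     # resident weights multiply it
    return out

def output_stationary(a, w):
    """AME-style (per my reading of the patent): accumulators live in
    the array; weights and input channels stream in together."""
    n, k = a.shape
    _, m = w.shape
    acc = np.zeros((n, m))    # output channel results accumulate in place
    for c in range(k):        # stream one input channel + one weight row
        acc += np.outer(a[:, c], w[c, :])
    return acc
```

Same math either way - the difference is what has to move on every clock, which is where the power-versus-flexibility tradeoff comes from.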
AME's engine size seems to be 96x96, which is close to the 96x128 optimum granularity that AKNET_V9 would want on a weight-resident systolic system like the TPU. AME is more flexible and may not have that kind of constraint - I haven't redone the analysis for AME's dataflow yet (patent language tends to obscure the kind of detail that I need). In any case, AME appears to be an excellent fit for running the biggest net we've seen in the firmware - the only one we've seen which is probably HW3 specific.
A single AME will perform 2*96^2 = 18,432 operations per clock - call it 20K. The maximum clock rate for the AME is probably in the 500MHz to 2GHz range, with the lower end of that being more likely. If we go with 500MHz you get roughly 9 TOPS (tera-operations per second) per AME. I think AKNET_V9 in actual vehicle use probably needs at least a half dozen AMEs at that speed. It would be fewer if Tesla manages a higher internal clock speed, but I think they probably won't go over 1GHz even if they can. The efficiency hit that you get at the higher clock rates probably adds more risk and system cost than they would save at the IC level from the reduced die area. Additionally they are not real estate constrained (unlike a cellphone or a datacenter) so a bigger, slower IC designed at a mature process node might well be a good tradeoff for them. This chip should be economical at 28nm or lower.
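The throughput arithmetic above can be written out as a quick back-of-envelope calculation (the function name and default clock are my assumptions, not patent figures):

```python
def ame_tops(size=96, clock_hz=500e6):
    """Back-of-envelope throughput for one AME: each of the size*size
    cells does a multiply and an add per clock, i.e. 2 ops."""
    ops_per_clock = 2 * size * size          # 2 * 96^2 = 18,432
    return ops_per_clock * clock_hz / 1e12   # tera-operations per second

print(round(ame_tops(), 1))  # 9.2 TOPS at 500 MHz
```

Half a dozen of these at 500MHz lands you around 55 TOPS, which is where the 50-100 TOPS estimate for the whole chip comes from.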
Anyway - I think this all points toward HW3 being able to realize 50 to 100 TOPs. That's about 10x what HW2 has. It's more than enough to allow for a substantial expansion of what the neural networks in AP2 can do.