Yeah.
So the reason I suggested a shared RAM design is that I think there's a chance that the Tesla AI NN chip has a
really radical design: DRAM integrated onto the NN CPU die itself, i.e. embedded DRAM (eDRAM). This is a relatively modern technique that Intel (Haswell and later) and IBM (POWER chips) have been using.
(Having all the weights in SRAM doesn't seem possible currently: with the standard 6-transistor (6T) SRAM cell, 1 GB of weight data would require roughly 48 billion transistors, which would result in too large a die - and indications are that they are using at least that much weight data.)
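Just to show the arithmetic behind that figure (a quick back-of-the-envelope, assuming the standard 6T cell and a decimal gigabyte):

    # Transistor count to hold 1 GB of weights in SRAM, assuming the
    # standard 6-transistor (6T) SRAM cell and a decimal gigabyte:
    weight_bytes = 1e9                    # 1 GB of weight data
    bits = weight_bytes * 8               # 8 gigabits
    transistors = bits * 6                # 6 transistors per bit cell
    print(f"{transistors / 1e9:.0f} billion transistors")   # -> 48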
The Tesla NN chip might have gone one step further and basically integrated the NN forward calculation functional units into the DRAM cells themselves. One possible design would be that there's an NN input/output SRAM area in the 10-30 MB size range, and the functional units propagate those values through the neural net almost like real neurons.
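To make the dataflow concrete, here's a minimal software sketch of that weights-stationary idea - all names, shapes and layer counts are my own placeholders, not anything known about the actual chip. The per-layer weights stay resident, and only a small activation buffer moves between layers:

    import numpy as np

    # Hypothetical weights-stationary forward pass: per-layer weights
    # stay resident (standing in for on-die "eDRAM"), only a small
    # activation buffer (standing in for the input/output SRAM area)
    # moves between layers.
    rng = np.random.default_rng(0)
    layer_weights = [rng.standard_normal((256, 256)).astype(np.float32)
                     for _ in range(4)]               # resident weights

    def forward(activations):
        for w in layer_weights:                       # propagate layer by layer
            activations = np.maximum(w @ activations, 0.0)   # MAC + ReLU
        return activations

    out = forward(rng.standard_normal(256).astype(np.float32))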
Such a design would have numerous advantages:
- Heat dissipation properties would be very good, as all the functional units would be distributed across the die evenly in a very homogeneous layout.
- Execution time would be very deterministic as there's effectively no caching required.
- Lack of caching also frees up a lot of die area to put the eDRAM cells on.
- This design would also allow very small gate-count mini-float functional units and very high inherent parallelism (see the mini-float sketch after this list).
- Scaling it up to higher frequencies would also be easier, due to the lower inherent complexity and the shorter critical paths.
- All of this makes it very power efficient as well, i.e. a very high NN throughput for a given die size, gate count and power envelope.
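As an illustration of what such a mini-float unit might compute (purely my own sketch - the actual format and saturation limits, if any, are unknown), here's a multiply with a reduced mantissa, a clamped exponent and a saturating accumulate:

    import numpy as np

    def quantize_minifloat(x, mant_bits=3, exp_min=-6, exp_max=7):
        # Round x to a hypothetical sign/exponent/mantissa mini-float:
        # keep mant_bits mantissa bits and clamp the exponent range.
        m, e = np.frexp(x)                  # x = m * 2**e, 0.5 <= |m| < 1
        m = np.round(m * 2**mant_bits) / 2**mant_bits
        return np.ldexp(m, np.clip(e, exp_min, exp_max))

    def saturated_mac(acc, a, b, limit=240.0):
        # Multiply two mini-float values and saturate the accumulator
        # instead of overflowing (the "saturated-add" mentioned below).
        prod = quantize_minifloat(a) * quantize_minifloat(b)
        return float(np.clip(acc + prod, -limit, limit))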
In such a design, external RAM modules have a secondary role: they are basically just there to initialize the internal "neurons" (multiplier plus saturated-add functional unit) and "axons" (weight values) with the static neural net, and to store the output results.
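A toy traffic count shows why that role is secondary (all sizes are placeholders I made up for illustration): external RAM then sees a one-time net upload plus small per-frame I/O, instead of re-reading every weight for every frame:

    # Toy external-RAM traffic comparison, with made-up sizes:
    weights_gb      = 1.0     # assumed static net size
    io_gb_per_frame = 0.02    # assumed per-frame input + output traffic
    frames          = 1000
    weights_offdie = frames * (weights_gb + io_gb_per_frame)  # re-read each frame
    weights_ondie  = weights_gb + frames * io_gb_per_frame    # uploaded once
    print(weights_offdie, weights_ondie)     # 1020.0 vs 21.0 GB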
Other designs are possible too - for example self-contained all-in-one "neuron" functional units that are programmable to perform a given loop of weight calculations, with no external communication other than input fetches from other functional units, eDRAM cell fetches and output stores (i.e. intermediate state would not be stored anywhere outside the functional unit; it all lives in small local registers within the unit, with no bus access to them whatsoever) - but the basic idea is the same: keep the NN weights data on-die.
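A rough software analogue of such a self-contained unit (again my own illustration, not a known design) keeps all intermediate state in "registers" that nothing else can address:

    class NeuronUnit:
        # Hypothetical self-contained unit: weights live in its private
        # on-die slice, the accumulator is a local "register" with no
        # bus access, and the only external traffic is input fetches
        # and a single output store.
        def __init__(self, local_weights):
            self._weights = list(local_weights)   # private weight slice
            self._acc = 0.0                       # local register only

        def fire(self, fetch_input, store_output):
            self._acc = 0.0
            for i, w in enumerate(self._weights): # the programmed weight loop
                self._acc += w * fetch_input(i)   # input fetch + MAC
            store_output(max(self._acc, 0.0))     # single output store

    unit = NeuronUnit([0.5, -1.0, 2.0])
    unit.fire(fetch_input=float, store_output=print)   # prints 3.0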
If that's the NN chip design Tesla invented, then I'd expect the NN chips on multi-chip boards to share any external RAM, as external RAM would no longer be a performance bottleneck.
But maybe I'm missing some complication that makes such a design impractical - for example, the latency of eDRAM cell fetches would be a critical property.
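For a sense of scale on that latency point (illustrative numbers only, nothing measured):

    # Illustrative eDRAM latency budget with assumed numbers:
    clock_ghz      = 2.0     # assumed functional-unit clock
    edram_fetch_ns = 3.0     # assumed eDRAM access latency
    cycles_per_fetch = edram_fetch_ns * clock_ghz    # = 6 cycles
    # To sustain one MAC per cycle, each unit would need ~6 weight
    # fetches in flight, i.e. the eDRAM access path would have to be
    # pipelined at least that deep.
    print(cycles_per_fetch)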