On-chip redundancy is good for improving yield when production defects are testable; trying to detect runtime failures on-chip and hoping you have duplicated the right things is more difficult. What if you get multiple errors on the chip and both copies of the NN are affected? How do you decide which to use?
Actually, you have somewhat the same problem with only 2 chips (rather than 3 or some other odd number), too.
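As a toy illustration of why the even/odd count matters (my own sketch, nothing to do with Tesla's actual implementation): with two copies you can only detect a disagreement, while with three you can vote and still keep a valid output.

```python
# Toy sketch (not Tesla's implementation): two redundant copies can only
# detect a fault, while three copies can also mask a single bad copy.

def dual_check(a, b):
    """Two redundant outputs: a mismatch says something is wrong,
    but gives no way to know which copy to trust."""
    if a == b:
        return a        # agreement -> use the result
    return None         # disagreement -> fault detected, no safe answer

def triple_vote(a, b, c):
    """Three redundant outputs: any two that agree outvote the third,
    so a single faulty copy is masked."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None         # all three differ -> multiple faults, give up

print(dual_check(42, 17))        # None (fault detected, can't decide)
print(triple_vote(42, 17, 42))   # 42   (majority wins)
```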
If they're using dual chips, it will be to protect against a hard failure of a chip (dying, locking up, overheating, etc.), not so much against a cosmic-ray bit flip or electromigration causing one chip to produce wrong output, since that output might still look valid and then you don't know which chip to use.
This is more redundancy along the lines of having dual steering motors or redundant brake boosters, rather than 3+ computers having their outputs voted on, as you would for spacecraft, for example.
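Fail-over against that kind of hard failure doesn't need output comparison at all; you just watch for the primary to stop responding. A minimal heartbeat sketch of the idea (purely illustrative, the names and timeout are assumptions of mine):

```python
# Minimal fail-over sketch (illustrative only): the backup takes over when
# the primary stops sending heartbeats. This protects against a chip dying
# or locking up, not against a chip quietly producing wrong answers.
import time

HEARTBEAT_TIMEOUT_S = 0.1   # assumed value, not from any Tesla spec

class Channel:
    def __init__(self, name):
        self.name = name
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def alive(self):
        return (time.monotonic() - self.last_heartbeat) < HEARTBEAT_TIMEOUT_S

def select_active(primary, backup):
    """Use the primary while it is alive; otherwise fail over to the backup."""
    if primary.alive():
        return primary
    if backup.alive():
        return backup
    return None   # both dead -> degrade and hand control back to the driver

primary, backup = Channel("SoC-A"), Channel("SoC-B")
print(select_active(primary, backup).name)   # SoC-A
time.sleep(0.2)                              # primary misses its deadline...
backup.heartbeat()                           # ...but the backup is still alive
print(select_active(primary, backup).name)   # SoC-B
```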
Elon answered this today:
Two, independent system-on-chip architecture, with each SoC having two NN accelerators that can perform simultaneous health-check calculations to protect against a soft error
So they have two chips, each with redundant NN accelerators, giving four duplicate copies of the computation on the FSD computer to compare against each other.
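One plausible way to read that (my interpretation, not a confirmed description of the FSD computer) is that each SoC compares its own two accelerator outputs to catch soft errors locally, and the car falls back to the other SoC if one flags itself as unhealthy:

```python
# Sketch of one possible reading of the "2 SoCs x 2 NN accelerators" setup.
# This is my interpretation of Elon's description, not a confirmed design:
# each SoC runs the same NN on both of its accelerators and compares results,
# so a soft error shows up as an internal mismatch on that SoC.

def soc_result(accel_a, accel_b):
    """Return (output, healthy) for one SoC given its two accelerator outputs."""
    if accel_a == accel_b:
        return accel_a, True
    return None, False          # internal mismatch -> SoC flags itself unhealthy

def vehicle_output(soc1, soc2):
    """Prefer a SoC that passed its own internal health check."""
    out1, ok1 = soc1
    out2, ok2 = soc2
    if ok1:
        return out1
    if ok2:
        return out2
    return None                 # neither SoC trusts its own result

soc1 = soc_result("steer +0.2", "steer +0.2")   # healthy SoC
soc2 = soc_result("steer +0.2", "steer -0.9")   # soft error on one accelerator
print(vehicle_output(soc1, soc2))               # steer +0.2
```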
Of course, he also said the load for the current software on the FSD computer is 5%, or "10% with full fail-over redundancy" (versus 80% on the HW2.5 computer).
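Presumably the jump from 5% to 10% is just the same workload running on both SoCs at once: roughly 5% of capacity per SoC, so about 2 × 5% = 10% of the total when everything is duplicated for fail-over. That's my reading of the arithmetic, not something he spelled out.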