I have investigated this error under a (scanning electron) microscope and by probing. The problem is related to an OTP register inside the chip's controller. The severity of the register fault varies a lot from chip to chip, but the register is definitely defective due to a silicon design flaw. It seems to affect all "GMRA" chips, but the failure rate depends on the production date, which also explains the variation in failure rate. When the memory area is nearly full, the corrupted OTP register makes the chip lose vital information, and write cycles are no longer performed. No data is overwritten, although a few garbage bytes (64) are sometimes written. The crazy logging going on is not really the root cause of the chips going bad. What matters much more is the percentage of cells occupied by data, and then time. The number of erase/read/write cycles is far less destructive.
The good news is that almost all of the bad chips can be read out. I do that professionally, and I know of two other companies that do the same. With my method, it takes about an hour to read out the vital key and certificates.