Notes on Using Error Correction with Flash Memory, March 2005, Morgan Colmer (CTO) Global
Silicon R&D Labs, Cambridge, UK
1.0 Background

All flash memory products can withstand only a finite number of erase cycles. With the die area of flash memories becoming larger all the time, the statistical probability of a bit somewhere in the memory becoming damaged increases.

For bulk storage applications, the popular choice is NAND flash because of its higher data density compared to NOR flash. The chief drawback of NAND flash is that individual bits or bytes cannot be randomly accessed; the device is arranged like a hard disc drive, in 512 byte sectors. When a flash IC is manufactured and tested, it is expected that some of these sectors will be damaged by the process, so extra sectors are available to replace those lost to general semiconductor yield issues. Typically there are 2% extra sectors available for this, and often there is a complex controller that makes the replacement invisible to the outside. A typical NAND flash sector can be reprogrammed about 10000 times.

Because of the inherently limited endurance of flash memories, many manufacturers put some simple error correction into the memory. Typically, they use Hamming codes and increase the sector size by a further 16 bytes to accommodate the error correction overhead, but this data space is not available to the outside system. Clearly all of these techniques take up extra die area on the flash device.

Using this error correction, the flash memory can correct only one bit in error per sector (1 bit in 4096 bits) and detect 2 bits in error per sector. The flash manufacturers claim that this is sufficient for most purposes; however, filing systems can greatly increase the level of damage sustained by certain sectors, causing the product to fail in a very short period of time.

Filing systems such as FAT16 and FAT32 save two copies of a table that tells the host processor where everything is stored on the device. Every time any part of the bulk memory is changed, both copies of this essential data are re-written. In NAND flash memory, a single location or byte cannot be individually erased; an entire block (covering several sectors) must be erased and re-written. This causes premature failures in many devices such as thumb drives.

One solution to this is to ensure that certain frequently accessed data items are not written back to the same area of the flash memory but rotate around the memory to spread the "wear and tear" over the entire device.

Clearly, the manufacturers do not want to be constantly increasing the level of in-built sophistication of flash drives, as this increases the cost without necessarily giving a perceived benefit to the user, and the flash manufacturers are unlikely to want to make a big issue about the inherent unreliability of their products!

1.1 The Consumer Audio Market

One of the unique problems faced by consumer audio manufacturers is that their costs are increasingly linked to commodity memory market pricing, and as the acceptance of digital media grows, this trend will increase. The end customers, such as Wal-Mart, do not allow their audio suppliers to factor this memory price fluctuation into their buy-price (as is the case in the PC market), and this leaves the audio suppliers exposed to the fickle whims and trends of the current memory market.

The consumer audio industry has constantly sought ways of overcoming this, and recycling has become commonplace. DRAM, another commodity memory product, is frequently salvaged from old SIMMs, often at a fraction of the prevailing market price. With a revolution in NAND flash demand from the audio electronics industry poised to happen, it seems very likely that this type of memory product will also be targeted by component recycling companies.

Recycled flash memory will be characterised by two factors: (i) older technology and (ii) a higher probability of defective sectors. Any flash controller from a new entrant to this market must be capable of accommodating recycled flash memory.

2.0 Extended Error Correction

To be able to make use of older recycled flash memory, an extended error correction scheme needs to be applied for two reasons: (i) older memory types do not even have the simple Hamming code error correction included in them, and (ii) it is likely that the capability of the Hamming codes has already been exceeded (that is why the memory is being recycled in the first place) and the flash memory is already considered "broken".

There are many methods of applying error correction to digital data streams, and all of them involve a computational and memory overhead, some more burdensome than others. All forward error correction (FEC) systems are complex and sophisticated pieces of IP that generally take considerable time and effort to develop. Fortunately for Global Silicon, the Sony Corporation did a lot of thinking about how to implement a powerful yet lightweight error correction algorithm called CIRC. CIRC, or Cross Interleaved Reed-Solomon Code, is a very powerful error correction algorithm that was designed in the 1970s for the CD player standard. Because memory and MIPS were both very expensive at that time, Sony invested a lot of time and money to come up with an algorithm that was efficient on both.

If only the second stage of CIRC error correction (C2) is used, then this, in conjunction with the de-interleaving buffer, would allow up to 4096 contiguous bits in error to be corrected without a single bit of the erroneous data being seen by the host CPU. This is clearly 4096 times better than the current error correction, and without the enormous overhead that might be expected from casual inspection.

In the typical flash memory error correction algorithms used by memory suppliers, the redundancy is 16 bytes in every 512, so the error correction overhead is 3.1% of the data stored to the flash memory. In a system based upon CIRC, this redundancy level rises to 12.5% (using only C2); however, this buys a 4096:1 increase in the error correction capacity. The CIRC error correction can also re-use the extra space available from the now-unused Hamming code system, which takes the additional data redundancy down to only 9.4%.

Allowing recycled flash memory to be used is not the only advantage that this system gives. The lifetime of flash products can also be considerably increased without the need for costly silicon solutions aboard the flash memory device. In the Global Silicon application of this technology it is anticipated that the great majority of the processing would be implemented in software, making use of the special instructions provided for the CD data decoder to accelerate the process. The processing overhead for a complete encoder and decoder should be no more than 1 to 4 MIPS for a typical 128 kbps MP3 file, depending on the level of errors found in the data, and the total memory use would be approximately 1.5 Kbytes.

Clearly this technology is equally applicable to DRAM or any other type of solid state memory device.

As a method of further improving the error correction capabilities, it is possible to additionally interleave the data to allow multiple sectors to be corrected. If, for example, the data was written to the memory device interleaved over 4 sectors, then the error correction system could fully recover the data from 4 consecutive sectors that were completely corrupted. This extra interleaving comes at the cost of extra memory being required to process the data, but it can clearly be extended to make the maximum correctable length as long as required, given sufficient working memory in the CPU.

The diagram below shows the increased interleaving structure when operated over four 512 byte sectors of a typical NAND flash memory.

[Diagram: interleaving structure over four 512 byte sectors]

Key: bn = bit number, wn = word number (in this case the words are 8 bits long).
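
The redundancy figures quoted in Section 2.0 can be checked with a few lines of arithmetic. The sketch below assumes that the 12.5% figure corresponds to 64 parity bytes per 512 byte sector; that byte count is inferred from the quoted percentages and is not stated in the text. The names are illustrative only.

```python
SECTOR_BYTES = 512        # user data per NAND sector (from the text)
SPARE_BYTES = 16          # spare area used by the built-in Hamming code (from the text)
CIRC_PARITY_BYTES = 64    # ASSUMPTION: inferred from the quoted 12.5% of 512 bytes


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Overhead of the manufacturers' Hamming scheme.
hamming_overhead = percent(SPARE_BYTES, SECTOR_BYTES)

# Overhead of C2-only CIRC parity, before re-using the spare area.
circ_overhead = percent(CIRC_PARITY_BYTES, SECTOR_BYTES)

# Overhead taken from the user data area once the 16 spare bytes are re-used.
circ_net_overhead = percent(CIRC_PARITY_BYTES - SPARE_BYTES, SECTOR_BYTES)

# The Hamming scheme corrects 1 bit out of this many bits per sector.
bits_per_sector = SECTOR_BYTES * 8

print(f"Hamming overhead:       {hamming_overhead:.3f}%")
print(f"CIRC (C2 only):         {circ_overhead:.3f}%")
print(f"CIRC after spare reuse: {circ_net_overhead:.3f}%")
print(f"Bits per sector:        {bits_per_sector}")
```

Under that assumption the three percentages agree with the rounded values of 3.1%, 12.5% and 9.4% given in the text.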
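
To make the limit of the built-in protection described in Section 1.0 concrete (one bit corrected, two bits detected per sector), here is a minimal Hamming-style sketch. It is not any manufacturer's actual spare-area layout: the check value is simply the XOR of the positions of all set bits plus an overall parity bit, and the function names are hypothetical. It also assumes the stored check value itself is error-free.

```python
def ecc(sector: bytes) -> tuple[int, int]:
    """Return (position_xor, parity) for a sector.

    position_xor is the XOR of the bit indices of every set bit;
    parity is the overall parity of the sector's bits.
    """
    position_xor = 0
    parity = 0
    for byte_index, byte in enumerate(sector):
        for bit in range(8):
            if (byte >> bit) & 1:
                position_xor ^= byte_index * 8 + bit
                parity ^= 1
    return position_xor, parity


def check_and_correct(sector: bytearray, stored: tuple[int, int]) -> str:
    """Compare a sector against its stored check value; fix a single flipped bit."""
    position_xor, parity = ecc(bytes(sector))
    diff = position_xor ^ stored[0]
    if parity != stored[1]:
        # Odd number of flipped bits: assume exactly one, located at index diff.
        sector[diff // 8] ^= 1 << (diff % 8)
        return "corrected"
    if diff != 0:
        # Even number of flips (two or more): detectable but not correctable.
        return "uncorrectable"
    return "ok"


original = bytearray(range(256)) * 2          # one 512 byte sector of test data
stored = ecc(bytes(original))

one_error = bytearray(original)
one_error[100] ^= 0x10                        # flip a single bit
result_one = check_and_correct(one_error, stored)

two_errors = bytearray(original)
two_errors[7] ^= 0x01                         # flip two different bits
two_errors[300] ^= 0x80
result_two = check_and_correct(two_errors, stored)

print(result_one, one_error == original)
print(result_two)
```

For a 4096 bit sector the check value needs only 12 position bits plus one parity bit. Three or more flipped bits can be miscorrected by a scheme of this kind, which is why worn or recycled sectors call for a stronger code.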
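
The effect of interleaving data over four sectors, as described in Section 2.0, can be illustrated with a simple byte-wise block interleaver. This is only a sketch of the spreading step: CIRC's own cross-interleave and the Reed-Solomon decoding are not modelled, and the round-robin layout and function names here are assumptions made for the example.

```python
SECTOR_BYTES = 512
DEPTH = 4  # number of sectors interleaved together (from the text's example)


def interleave(block: bytes) -> list[bytes]:
    """Spread consecutive bytes of block round-robin over DEPTH sectors."""
    assert len(block) == SECTOR_BYTES * DEPTH
    return [block[s::DEPTH] for s in range(DEPTH)]


def deinterleave(sectors: list[bytes]) -> bytes:
    """Rebuild the original byte order from DEPTH interleaved sectors."""
    out = bytearray(SECTOR_BYTES * DEPTH)
    for s, sector in enumerate(sectors):
        out[s::DEPTH] = sector
    return bytes(out)


block = bytes(i % 251 for i in range(SECTOR_BYTES * DEPTH))
sectors = interleave(block)

# Corrupt every byte of one whole sector to simulate a failed sector.
sectors[2] = bytes(b ^ 0xFF for b in sectors[2])

recovered = deinterleave(sectors)
bad = [i for i in range(len(block)) if recovered[i] != block[i]]

# The 512 byte burst now appears as isolated single-byte errors,
# each followed by DEPTH - 1 good bytes.
print(len(bad), bad[:5])
```

After de-interleaving, the failed sector shows up as one bad byte in every group of four rather than as a 512 byte burst, which is the form of damage a symbol-oriented code such as Reed-Solomon handles best. The cost is the working buffer of `SECTOR_BYTES * DEPTH` bytes, matching the text's remark that deeper interleaving needs more memory.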