Error Correction Codes, ECC, are not only important to today’s NAND flash market, but they have been a cause of concern to NAND users for a number of years. The Memory Guy has been intending for some time to write a low-level primer on ECC, and I am finally getting it done!
Why is ECC necessary on NAND flash when it’s not used for other memory technologies? The simple answer is that NAND’s purpose is to be the absolute cheapest memory on the market, and one way to achieve the lowest-possible cost is to relax the standards for data integrity, allowing bit errors every so often. This technique has been used for a long time in both communications channels and in hard disk drives. Data communication systems can transfer more data using less bandwidth and a weaker signal over longer distances if they use error correction to restore distorted data. Hard disk drives can pack more bits onto a platter if the bits don’t all have to work right. These markets (and probably certain others) have invested a lot of money in ECC research and development, and as a result ECC today is a very well-developed science.
Denali Software published a nice white paper way back in 2009 that covered ECC needs for flash memory. Although Cadence, which acquired Denali in 2010, removed the page, that paper can be found today at All-Electronics, a German news website.
The Denali paper centers around the Bose, Ray-Chaudhuri, and Hocquenghem (BCH) error-correction code, which was the leading ECC technique being used at that time. BCH is described in depth in Wikipedia for those who want to learn more. Brief descriptions of Hamming Codes and Reed-Solomon Coding, technologies that were used prior to BCH, can be found in a 2011 blog post by Cyclic Design.
As with any other ECC mechanism, including simple parity, BCH needs a few extra bits in addition to the data bits to determine whether the data bits have errors and to correct any that do. Error correction algorithms can use these extra bits to correct up to a certain number of data bits, and can detect even larger numbers of errors than they can correct. The formal name for these extra bits is “Parity Bits,” but few of my industry contacts use that term, calling them instead “ECC Bits” or “ECC Bytes.” The table below provides the number of ECC bytes required to correct different numbers of bit errors. The byte requirements also depend on the flash chip’s page size, since the ECC bytes are a portion of the flash chip’s memory page.
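To make the idea of "extra bits" concrete, here is a minimal Python sketch of a Hamming(7,4) code, one of the older schemes mentioned above. This is not BCH and is not taken from the Denali paper or from any real controller; it is only an illustration, under the assumption that a single-bit-correcting toy code is enough to show the principle. The function names are hypothetical. Three parity bits protect four data bits, and the decoder uses them to locate and flip one bad bit.

```python
# Illustrative only: Hamming(7,4), a toy single-bit-correcting code.
# Real NAND controllers use far stronger codes (BCH, LDPC) over whole pages.

def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword carrying 3 parity (ECC) bits."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    # Recompute each parity check; together they spell out the bad position.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the single bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


data = [1, 0, 1, 1]
stored = hamming74_encode(data)   # [0, 1, 1, 0, 0, 1, 1]
corrupted = list(stored)
corrupted[4] ^= 1                 # simulate one failed bit (position 5)
recovered, position = hamming74_decode(corrupted)
print(recovered == data, position)  # prints: True 5
```

This toy code can only repair one failed bit per codeword. Two failed bits in the same codeword would be miscorrected, which is why flash pages need codes that correct many bits at once.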
The first column (yellow) indicates the page size, from 128 bytes to 8,192 bytes (8KB). The first row (blue) lets you choose how many failed bits you might want to correct, from 1 to 26.
These seemingly-mysterious numbers aren’t all that hard to arrive at by calculation. The required number of ECC bytes is proportional to the log of the page size times the number of bits to be corrected. The specific formula used to calculate the numbers above is: ECC Bytes = (Log2(Page Size)+4)*(Bits Corrected/8). Since the number must be an integer, the result of the calculation is rounded up.
If your understanding of logarithms has grown rusty, just remember that the Log2 of any number is simply the number of bits required to express it: 2 bits can express 4 different values (i.e. the voltages in an MLC cell) so Log2(4)=2. Ten address bits can access 1,024 locations in a memory chip, so Log2(1,024)=10. Running down the left hand column from 128 bytes to 8,192 bytes the Log2 is simply: 7, 8, 9, 10, 11, 12, and 13.
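The ECC byte formula given earlier can be sketched in a few lines of Python. The function name `ecc_bytes` is my own, and the bit counts printed in each row are an assumed sample rather than the exact columns of the original table. The sketch also assumes the page size is a power of two, as it is for every size discussed here.

```python
# A sketch of: ECC Bytes = (Log2(Page Size) + 4) * (Bits Corrected / 8),
# rounded up to a whole number of bytes.

def ecc_bytes(page_size_bytes, bits_corrected):
    """Minimum ECC bytes for a page, per the formula quoted in the text."""
    if page_size_bytes <= 0 or page_size_bytes & (page_size_bytes - 1):
        raise ValueError("page size is assumed to be a power of two")
    log2_page = page_size_bytes.bit_length() - 1  # exact Log2 for powers of two
    total_ecc_bits = (log2_page + 4) * bits_corrected
    return (total_ecc_bits + 7) // 8              # round up to whole bytes


# Assumed sample of correction strengths; the original table runs from 1 to 26.
sample_bits = [1, 4, 8, 16, 26]
for page in [128, 256, 512, 1024, 2048, 4096, 8192]:
    row = [ecc_bytes(page, t) for t in sample_bits]
    print(page, row)
```

For example, `ecc_bytes(1024, 8)` gives 14 bytes, since (10 + 4) * 8 / 8 = 14, while correcting 26 bits on the same 1KB page needs 46 bytes once 45.5 is rounded up.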
So how many failed bits do you want to correct, anyway? This can be a hard choice – there’s always some chance that you’ll miss one, no matter how much ECC you use.
Furthermore, the number of bit errors increases as NAND flash processes shrink. As the cells shrink they store fewer electrons, making it harder to detect their value thanks to the noise inherent on the chip. In addition, as more bits are packed into each cell, going from SLC to MLC, TLC, and eventually QLC, the voltage difference between adjacent levels shrinks, adding further to this challenge. Finally, as the cells get closer and closer to each other they tend to have an increasing influence on the contents of neighboring cells: Writing a cell will often disturb the values of adjacent cells, and even reading a cell can draw electrons off of an adjoining cell to alter its value. So three independent factors work to increase the number of bit errors as costs are driven out of the chip.
The Denali white paper provides some input on the number of errors to anticipate through a chart that indicates the number of failed bits to expect for SLC, MLC, and TLC flash across a range of processes. The paper, however, is pretty dated, so I took the liberty of extending it down to 15nm and removing SLC, which is rarely used any longer. The updated chart appears in the figure below, which is also the graphic at the top of this post:
Denali says that 15,000 gates are necessary to correct 8 errors on a 1KB page, but that roughly 100,000 gates would be required to correct 26 errors. Assuming that the 100K number was rounded up from about 60K, we can calculate that the logic requirements for ECC are about 1,200 gates per ECC byte.
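One way to reproduce the figure of about 1,200 gates per ECC byte is sketched below. It rests on several assumptions that are mine rather than Denali's: that both gate counts refer to a 1KB page, that the ECC byte counts come from the formula given earlier, and that the two resulting ratios are simply averaged. The helper names are hypothetical.

```python
# Assumption-laden sketch: gates per ECC byte from the two Denali data points.

def ecc_bytes(page_size_bytes, bits_corrected):
    """ECC bytes per the formula in the text; page size assumed a power of two."""
    log2_page = page_size_bytes.bit_length() - 1
    return ((log2_page + 4) * bits_corrected + 7) // 8


def gates_per_ecc_byte(gates, page_size_bytes, bits_corrected):
    """Gate count divided by the ECC bytes needed for that correction strength."""
    return gates / ecc_bytes(page_size_bytes, bits_corrected)


PAGE = 1024  # 1KB page, assumed for both data points
low = gates_per_ecc_byte(15_000, PAGE, 8)    # 15,000 gates over 14 ECC bytes
high = gates_per_ecc_byte(60_000, PAGE, 26)  # assumed 60K gates over 46 ECC bytes
average = (low + high) / 2

print(round(low), round(high), round(average))
```

The two ratios come out near 1,070 and 1,300 gates per byte, and their average lands a little under 1,200, which is consistent with the rounded number above.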
Recall that all of these numbers are minimums. Controller designers try to minimize the added cost of error correction, and will use an algorithm of modest complexity and the smallest possible number of ECC bytes. If cost weren’t an issue the designer could use a much larger number of ECC bytes and a less complex algorithm (requiring fewer gates) to achieve the same level of error correction. Overall, though, this would increase the cost of the system. There’s a tricky balance between gate count and ECC byte count that has an optimum at some level.
As 16nm and 15nm process technologies moved to TLC, the bit error rate increased beyond a level that was economical to correct with BCH, and LDPC (Low-Density Parity Check) became popular. BCH is a less complex algorithm than LDPC, so it has a significantly smaller gate count. I was told a year ago that an LDPC ECC engine adds about $1.00 of cost to a new controller chip, but this number will shrink over time thanks to Moore’s Law. Readers who want a better understanding of LDPC can read the LDPC Wikipedia page.
All of the numbers above are for planar flash. Now that 3D NAND accounts for a large and growing share of overall NAND flash bit shipments, what is the impact on ECC? 3D NAND actually makes ECC simpler. Here’s why:
Two different mechanisms cause 3D NAND flash to have fewer bit errors than planar NAND chips of the same density. First, the process for 3D NAND is much looser than that of planar NAND flash. Where today’s most economical NAND flash uses a 15nm process geometry, 3D NAND is today built using a 40nm process. This automatically makes the bit’s storage area (whether a floating gate or a charge trap) significantly larger than that of the planar part. Second, the 3D NAND’s floating gate or charge trap forms a circle around the pillar that serves as the channel, increasing its area by another 3+ times. A gate on today’s 3D NAND chips is roughly equivalent to that of a 90nm planar NAND chip.
The graph above indicates that the ECC requirements for 3D NAND should then be fewer than ten bits for MLC and fewer than 15 for TLC, and we can infer that fewer than 20 bits will be necessary for QLC 3D NAND. This is why QLC appears far more practical for 3D NAND than it ever was for planar NAND.
SanDisk was the only company to ramp planar QLC NAND into volume production with its 43nm technology in 2009, and the company only produced this device for about a year. The SanDisk controller did not appear to use LDPC error correction.
Over the long term I would assume that LDPC will be used in most 3D NAND controllers to allow even more than 4 bits to be stored per cell, but it may take some time to get to that point. During the near term it’s reasonable to expect 3D NAND to largely shift to QLC using simple BCH algorithms.
Another tip of the hat to Chuck Sobey of Channel Science for pointing out a nomenclature error in my description of ECC.