Using ECC to Reduce Power
A couple of papers at last week’s ISSCC (the IEEE International Solid-State Circuits Conference) caught The Memory Guy’s attention. Both SK hynix and Samsung showed low-power DRAM designs in which the refresh rate of the DRAM was reduced in order to cut power consumption, with ECC applied to correct the resulting bit errors.
Although I had not heard of this approach before, I have recently learned that researchers at Carnegie Mellon University and my alma mater Georgia Tech presented the idea in a paper delivered at another IEEE conference in 2015: The International Conference on Dependable Systems and Networks.
Here’s the basic concept: DRAM consumes most of its power performing refresh cycles, the issue for which it was given the “Dynamic” part of its name: Dynamic Random-Access Memory. This use of the word “Dynamic” is a euphemism. In reality the bits are constantly decaying, but that doesn’t sound as nice.
When the technology was developed in the early 1970s, DRAM manufacturers offered to provide their customers with really inexpensive RAM bits (compared to SRAM) as long as the customer was content to refresh the entire array every 64 milliseconds. This means that every DRAM bit has to be read, evaluated, and pushed back to its original state, a 1 or a 0, every 64 milliseconds. This is a very manageable approach, but it consumes a lot of power.
Certain forward-thinking researchers recently looked at this situation and said: “What if I don’t?” What if they didn’t refresh every single bit as often as was required? Would the entire contents of the DRAM be lost if the refresh interval were stretched to, say, 65 milliseconds?
In a word: “No.”
The top graphic in this post is CMU professor Onur Mutlu’s visualization of the way DRAMs really work. The tiny blip in the upper left corner represents the share of cells that must be refreshed every 64-128 milliseconds. These are the weakest cells on a DRAM chip. Below that is a column representing more forgiving cells that can be refreshed every 128-256 milliseconds. These are weaker than most cells, but not the worst of the bunch; that distinction belongs to the 64-128 millisecond cells. The vast majority of the cells fall into the >256 millisecond category. That is to say, almost all DRAM cells can be left alone for 256 milliseconds or longer without being refreshed, and their data will still be good.
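To put rough numbers on that picture, here is a minimal Python sketch of what refreshing each group only as often as it needs might save. Everything in it is an assumption for illustration: the 8KB row size, the worst-case placement of every weak cell in a row of its own, and the weak-cell counts (about 30 cells below 128 milliseconds and about 1,000 below 256 milliseconds in a 32GB array, the figures this post quotes from the RAIDR chart). It is a back-of-the-envelope estimate, not the method of any of the papers mentioned here.

```python
# Rough estimate of refresh work when rows are refreshed by retention group.
# All constants are illustrative assumptions, not vendor data.

ARRAY_BYTES = 32 * 2**30      # 32GB array, taking GB as 2**30 bytes
ROW_BYTES = 8 * 1024          # assumed row size (hypothetical)
TOTAL_ROWS = ARRAY_BYTES // ROW_BYTES

# Assumed weak-cell counts, worst case: each weak cell sits in its own row.
ROWS_64_128_MS = 30           # rows that must be refreshed every 64 ms
ROWS_128_256_MS = 1000 - 30   # rows that can wait 128 ms
ROWS_OVER_256_MS = TOTAL_ROWS - ROWS_64_128_MS - ROWS_128_256_MS


def refreshes_per_second(rows, interval_ms):
    """Row refreshes per second if `rows` rows are refreshed every interval_ms."""
    return rows * 1000.0 / interval_ms


# Baseline: every row refreshed every 64 ms.
baseline = refreshes_per_second(TOTAL_ROWS, 64)

# Grouped: each row refreshed at the interval its weakest cell tolerates.
binned = (
    refreshes_per_second(ROWS_64_128_MS, 64)
    + refreshes_per_second(ROWS_128_256_MS, 128)
    + refreshes_per_second(ROWS_OVER_256_MS, 256)
)
relative_work = binned / baseline

print(f"rows in array:         {TOTAL_ROWS:,}")
print(f"refresh work vs 64 ms: {relative_work:.2%}")
```

Under these assumptions the weak rows are so few that the refresh work is set almost entirely by the 256 millisecond group, which is refreshed a quarter as often as the 64 millisecond baseline.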
Another way of looking at this is shown below. In this chart, from the CMU paper “RAIDR: Retention-Aware Intelligent DRAM Refresh,” the refresh interval on the X-axis is plotted against the probability of bit errors on the Y-axis, both on logarithmic scales. The chart on the left covers the range from 10 milliseconds to 10,000 seconds (about 3 hours). The chart on the right is a subset of this chart, the part circled on the left, ranging from 10 milliseconds to 1 second.
These charts show us that there would be about 30 bit errors in a 32GB DRAM array if the refresh interval were stretched to 128 milliseconds, and the number of bit errors would climb to 1,000 if it were stretched to 256 milliseconds. These are manageable numbers. Stretch the interval far enough, though, and the number of errors would grow so large that the ECC needed to correct them would drive costs prohibitively high. The trick, then, is to refresh often enough that the errors can be economically corrected.
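To see just how small those error counts are, they can be turned into a fraction of the array. The short Python sketch below does that arithmetic. The only inputs are the two counts quoted above; reading “32GB” as 32 × 2^30 bytes is my assumption, and the names are made up for the example.

```python
# Convert the quoted failing-bit counts into a fraction of all bits.

ARRAY_BYTES = 32 * 2**30          # assumes "GB" means 2**30 bytes
ARRAY_BITS = ARRAY_BYTES * 8

# Refresh interval (ms) -> approximate failing bits in the 32GB array,
# as quoted in the text.
errors_at_interval_ms = {128: 30, 256: 1000}


def bit_error_fraction(errors, total_bits=ARRAY_BITS):
    """Fraction of bits in the array that fail at a given refresh interval."""
    return errors / total_bits


for interval_ms, errors in errors_at_interval_ms.items():
    fraction = bit_error_fraction(errors)
    print(f"{interval_ms} ms: {errors} failing bits, about {fraction:.1e} of the array")
```

That works out to roughly one failing bit in ten billion at 128 milliseconds and a few per billion at 256 milliseconds.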
In the ISSCC papers, SK hynix lengthened the refresh period to 256ms and used ECC to cut self-refresh current by 75% to less than 100µA. Samsung lengthened its refresh period to 384ms and used a different ECC approach, combined with a number of other techniques, to reduce standby power by 66% to 0.15mW. Those are pretty impressive power savings!
How economical is it to correct these bits? The amount of logic required varies as a function of the number of bits that need to be corrected. The more errors you correct, the more logic you will need. Quite fortunately, that logic is constantly undergoing Moore’s Law cost reductions – what may have been too costly yesterday is very economical today.
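The link between ECC strength and the error rate it can tolerate can be sketched with a little probability. The Python below is a simplified model, not a description of either company's ECC: it assumes weak bits land independently and at random, uses a (72,64) single-error-correcting codeword and a (78,64) double-error-correcting codeword purely as illustrative sizes, and takes the bit failure fraction from the 1,000-errors-in-32GB figure for a 256 millisecond refresh interval.

```python
from math import comb


def p_uncorrectable(n_bits, t_correctable, p_bit):
    """Probability that an n_bits codeword holds more than t_correctable bad bits.

    Assumes each bit fails independently with probability p_bit.
    The tail is summed directly to avoid the rounding loss of 1 - (sum of head).
    """
    return sum(
        comb(n_bits, k) * p_bit**k * (1.0 - p_bit) ** (n_bits - k)
        for k in range(t_correctable + 1, n_bits + 1)
    )


# Assumed inputs: 1,000 failing bits in a 32GB array (GB taken as 2**30 bytes).
ARRAY_BITS = 32 * 2**30 * 8
P_BIT = 1000 / ARRAY_BITS
CODEWORDS = ARRAY_BITS // 64      # one codeword per 64 data bits

# (label, codeword bits, correctable bits); sizes are illustrative assumptions.
schemes = [
    ("no correction", 64, 0),
    ("single-error correction", 72, 1),
    ("double-error correction", 78, 2),
]

for label, n_bits, t in schemes:
    p_fail = p_uncorrectable(n_bits, t, P_BIT)
    print(f"{label:25s} expected bad codewords: {p_fail * CODEWORDS:.3g}")
```

Each extra correctable bit costs more check bits and more logic, but under this model it also drops the expected number of uncorrectable words by many orders of magnitude, which is the headroom that lets the refresh interval be stretched.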
When discussing NAND flash, I have been known to quip that ECC has been advancing at such a rapid rate that future controllers will be able to pull valid data out of non-functioning NAND flash chips. In light of this work I may have to add DRAMs to that comment, and postulate that future ECC may be able to keep anyone from ever having to refresh DRAM chips at all.
Seriously, that’s not going to happen, but it is interesting that ECC can be used to reduce refresh rates, and therefore power consumption. The stronger the ECC, the slower the refresh rate can be. The slower the refresh rate is, the lower the power consumption will be. That’s pretty amazing!