What is DRAM “Row Hammer”?
One of those nasty little secrets about DRAM is that bits may get corrupted by simply reading the bits in a different part of the chip. This has been given the name “Row Hammer” (or Rowhammer) because repeated accesses to a single one of the DRAM’s internal “rows” of bits can bleed charge off of the adjacent rows, causing bits to flip. These repeated accesses are referred to as “hammering”.
Although this was once thought to be an issue only with DDR3 DRAMs, recent papers (listed on the DDR Detective) show that DDR4 also suffers from Row Hammer issues, even though DRAM makers took pains to prevent it.
One of the most prominent voices raising awareness of this phenomenon is Barbara Aichinger of FuturePlus Systems, a test equipment maker that specializes in detecting Row Hammer issues. The Memory Guy has had the pleasure of talking with her about this issue and learning first-hand the kind of difficulties it creates.
How does Row Hammer work? It stems from the fact that today’s DRAM chips use very small process geometries that reduce the number of electrons stored in each cell, while the close proximity of neighboring cells accelerates the rate at which electrons leak onto and off the cell. If too many electrons are removed from or added to a cell, the cell will change state.
Row Hammer is most likely to occur when one DRAM row is repeatedly activated. If enough activations of one row occur before the internally-adjacent rows have been activated or refreshed, then charge from the over-activated row leaks into the adjacent rows, causing bits to change state. This is not a permanent error. The cell will maintain its data under normal circumstances, but will lose data if its neighbor experiences unusually high activity.
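The mechanism described above can be sketched as a toy model. This is only an illustration, not a model of any real DRAM: the class `ToyDram`, the helper `hammer`, and the value of `HAMMER_THRESHOLD` are all hypothetical, and real devices tolerate vastly more activations than the handful used here.

```python
# Toy model of Row Hammer. All names and numbers are illustrative assumptions.
HAMMER_THRESHOLD = 5  # hypothetical: neighbor activations a row tolerates between refreshes


class ToyDram:
    def __init__(self, num_rows):
        self.num_rows = num_rows
        # disturbance[r] counts activations of r's neighbors since r was
        # last activated or refreshed.
        self.disturbance = [0] * num_rows
        self.flipped = set()  # rows whose data has been corrupted

    def activate(self, row):
        # Activating a row restores its own charge, so its counter resets.
        self.disturbance[row] = 0
        for victim in (row - 1, row + 1):
            if 0 <= victim < self.num_rows:
                self.disturbance[victim] += 1
                if self.disturbance[victim] > HAMMER_THRESHOLD:
                    self.flipped.add(victim)

    def refresh_all(self):
        # A refresh restores the charge in every row.
        self.disturbance = [0] * self.num_rows


def hammer(dram, row, count):
    for _ in range(count):
        dram.activate(row)


# Ten activations of row 3, with a refresh halfway through: no corruption.
refreshed = ToyDram(8)
hammer(refreshed, 3, 5)
refreshed.refresh_all()
hammer(refreshed, 3, 5)
flips_with_refresh = sorted(refreshed.flipped)

# The same ten activations with no refresh in between: both neighbors flip.
unrefreshed = ToyDram(8)
hammer(unrefreshed, 3, 10)
flips_without_refresh = sorted(unrefreshed.flipped)

print(flips_with_refresh, flips_without_refresh)
```

In this sketch the hammered row itself is never harmed. Only its two neighbors lose data, and only when no refresh arrives in time.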
One simple approach to preventing Row Hammer is to double the DRAM’s refresh rate. This reduces the likelihood that there will be enough activity between refresh cycles to cause data corruption. This is not popular since it interferes with access to the DRAM, slowing the system down, while increasing the memory’s power dissipation.
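A rough calculation shows why doubling the refresh rate helps. The figures below are assumptions rather than numbers from any particular datasheet: a 64 ms refresh window and a 50 ns minimum interval between activations of rows in the same bank are used as typical ballpark values, and the time consumed by the refresh commands themselves is ignored.

```python
# Back-of-the-envelope estimate. Both constants are assumed ballpark values.
REFRESH_WINDOW_NS = 64_000_000  # assumed: every row is refreshed once per 64 ms
ROW_CYCLE_NS = 50               # assumed: minimum time between activations in one bank


def max_activations(refresh_window_ns, row_cycle_ns):
    """Upper bound on activations one row can receive between its refreshes."""
    return refresh_window_ns // row_cycle_ns


normal = max_activations(REFRESH_WINDOW_NS, ROW_CYCLE_NS)
doubled = max_activations(REFRESH_WINDOW_NS // 2, ROW_CYCLE_NS)

print(f"normal refresh rate:  up to {normal:,} activations")   # 1,280,000
print(f"doubled refresh rate: up to {doubled:,} activations")  # 640,000
```

Halving the window halves the worst-case number of times a neighbor can be hammered before it is restored, at the cost of twice as many refresh operations.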
Standard error correction techniques have been tried as well, but since an entire row is threatened, the number of potential bit errors within a row can be in the tens of thousands, making them impractical to correct. Even if they weren’t, ECC adds cost and slows the DRAM’s access time.
Why would a single row receive multiple activations? Certain systems use DRAM bits as semaphores to communicate between multiple processes or between multiple CPUs. One CPU might be idling in a loop waiting for another CPU to write something into that bit. Both CPUs will activate that bit’s row every time they read or write to it.
Companies like Google and Intel, as well as the three leading DRAM makers, Samsung, Micron, and SK Hynix, have researched this issue and determined their own approaches to solving it.
FuturePlus sells an analyzer that monitors activity on the DRAM module, allowing engineers to determine when and where Row Hammer is occurring so that steps can be taken to specifically avoid those situations.
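To give a feel for what such monitoring involves, here is a minimal sketch that counts row activations per refresh window in a recorded command trace. It is not a description of how the FuturePlus analyzer works. The trace format (time-ordered tuples of timestamp, command, bank, and row), the function name `find_hammered_rows`, and the threshold are all hypothetical.

```python
from collections import Counter


def find_hammered_rows(trace, window_ns, threshold):
    """Return {(bank, row): peak_count} for rows activated more than
    `threshold` times within a single window of `window_ns` nanoseconds.

    `trace` is assumed to be sorted by time and to hold tuples of
    (timestamp_ns, command, bank, row). This format is hypothetical.
    """
    counts = Counter()
    peaks = {}
    window_start = None
    for timestamp_ns, command, bank, row in trace:
        if window_start is None:
            window_start = timestamp_ns
        # Move to the window containing this timestamp and restart counting.
        if timestamp_ns - window_start >= window_ns:
            elapsed_windows = (timestamp_ns - window_start) // window_ns
            window_start += elapsed_windows * window_ns
            counts.clear()
        if command != "ACT":
            continue  # only activations contribute to hammering
        key = (bank, row)
        counts[key] += 1
        if counts[key] > threshold:
            peaks[key] = max(peaks.get(key, 0), counts[key])
    return peaks


# Row 7 of bank 0 is activated six times in the first window and three
# times in the second. Row 9 is activated once.
trace = (
    [(t * 100, "ACT", 0, 7) for t in range(6)]
    + [(600, "RD", 0, 7), (700, "ACT", 0, 9)]
    + [(1000 + t * 100, "ACT", 0, 7) for t in range(3)]
)
hammered = find_hammered_rows(trace, window_ns=1000, threshold=4)
print(hammered)
```

With a threshold of four activations per window, only row 7 of bank 0 is reported, with a peak count of six from the first window.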