In memory technology, a “disturb” refers to data loss in one or more memory cells in a memory array. Disturbs can result from many causes, ranging from environmental factors (for example, radiation by alpha particles or other ionized atoms) to power supply glitches. They can also result from operations on one or more other memory cells in the same array. Disturbs can occur in most memories. Failure mechanisms vary from technology to technology (e.g., DRAM, SRAM, Flash, etc.) and can differ between manufacturers and even between process generations of the same technology from the same manufacturer.
One of the characteristics of DRAM technology is that data is stored by capturing a quantity of charge on a capacitor in each memory cell. Accessing a memory cell is destructive, meaning that the data in all the cells in a row must be read and then rewritten to the cells in order to restore the charge level to its original condition before de-accessing the row. Thus a read access is effectively a read-restore operation and a write operation is effectively a read-modify-restore operation.
In most applications a DRAM controller is used to manage the complexities of DRAM operation details. If a row of memory cells is not accessed periodically in the course of operation, the charge in the memory cells can leak away resulting in data loss. The DRAM controller is responsible for managing this by issuing refresh (REF) commands to the DRAM with sufficient frequency that each memory cell undergoes a read-restore operation at least once during the specified refresh cycle.
In recent generations of DRAM devices, a disturb mechanism known as row hammering has been discovered that can be exploited by malicious persons who attempt to infiltrate a computer system and gain access and/or control (hereafter “attackers,” often colloquially known as “hackers”). This vulnerability results from the smaller, more densely packed memory cells in current generation DRAMs. Since the word lines are physically closer than in previous generations, the capacitive coupling between adjacent word lines is increased. Repeated activation of a word line (the “target row”) induces repeated partial activation on the two adjacent word lines (the “victim rows”). This in turn leads to charge loss from the cells on the victim rows, which can result in some cells losing their data prior to the next refresh of that row. A variation known as “double hammering” is an attack in which two target rows, one on either side of a single victim row, are repeatedly accessed, causing disturbs more quickly.
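The accumulation of disturbance between refreshes can be pictured with a toy model. The sketch below is not a physical simulation: the class `ToyBank`, the function `hammer`, the rule that each activation adds one unit of disturbance to each neighbor, and the threshold of 50,000 are all assumptions chosen for illustration.

```python
class ToyBank:
    """Toy model of one DRAM bank (hypothetical; not a physical simulation)."""

    def __init__(self, num_rows, disturb_threshold):
        self.num_rows = num_rows
        # Assumed number of neighbor activations a row tolerates between
        # refreshes before it is treated as having lost data.
        self.disturb_threshold = disturb_threshold
        self.disturb = [0] * num_rows

    def activate(self, row):
        """Activate a row: restores its own cells, disturbs both neighbors."""
        self.disturb[row] = 0  # an access is a read-restore of the row
        for victim in (row - 1, row + 1):
            if 0 <= victim < self.num_rows:
                self.disturb[victim] += 1

    def refresh(self, row):
        """A refresh restores full charge, clearing accumulated disturbance."""
        self.disturb[row] = 0

    def is_disturbed(self, row):
        return self.disturb[row] >= self.disturb_threshold


def hammer(bank, target, activations):
    """Issue `activations` back-to-back activations of the target row."""
    for _ in range(activations):
        bank.activate(target)


bank = ToyBank(num_rows=16, disturb_threshold=50_000)
hammer(bank, target=9, activations=50_000)
victims_lost = bank.is_disturbed(8) and bank.is_disturbed(10)  # True in this model
```

In this model a refresh of a victim row at any point before the threshold is reached resets its counter, which is why the timing of refreshes relative to the hammering rate matters.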
DRAM integrated circuits are typically organized into banks, which allows commands to be directed to different banks at different times substantially in parallel, so that multiple operations can be performed simultaneously in different parts of the memory. Typically, to perform an access operation (read or write) on a bank, a row is activated (or “opened”) by issuing a row activate command (ACT) for that bank and specifying a particular row address in that bank. This allows a succession of read and/or write operations to be performed at memory column addresses located on that row. When an access to a row is complete, the row must be deactivated (or “closed,” also known as pre-charging) by issuing a pre-charge command (PRE) to that bank or by issuing a pre-charge all command (PREA) to all banks at once.
Row hammering may involve issuing repeated pairs of an ACT command and a read with auto pre-charge command (RDA) to a particular target row (or rows) attempting to alter the data in one of the adjacent victim rows. The RDA command executes a combination of a normal read command (RD) with an immediately following pre-charge (an “auto pre-charge”) for that row. This may be the fastest way to execute a row hammer attack without being obvious (and thus easily detectable), since a series of ACT and immediate PRE commands without read or write operations would serve no legitimate purpose.
This is an effective attack method because typically one or more memory pages (usually four kilobytes in modern systems) can be stored into a single row allowing the processor to access one or more entire pages at a time. Thus row disturbs caused by accessing a particular page will occur in a completely different memory page—and therein lies the problem.
In most modern operating systems (OS), main memory is typically virtualized. This means each page has a “physical address” corresponding to the physical location in the DRAM and a “virtual address” which is what the operating system and user applications manipulate to emulate larger contiguous memory spaces. The OS maintains a “page table” which keeps track of the translations between each virtual page and its physical counterpart. Each page in the memory has a data record in the page table known as a page table entry (PTE). Since PTEs are also stored in main memory they are vulnerable to row hammering attacks.
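To illustrate why a disturbed PTE is dangerous, the following sketch translates an address through a toy single-level page table and then flips one bit in the stored entry. Real page tables are multi-level and their entries carry permission bits; the dictionary layout, the function name `translate`, and the specific addresses here are assumptions made only for illustration.

```python
PAGE_SIZE = 4096   # four-kilobyte pages
OFFSET_BITS = 12   # log2(PAGE_SIZE)


def translate(page_table, virtual_address):
    """Translate a virtual address using a toy single-level page table.

    `page_table` maps a virtual page number to a physical frame number;
    each entry stands in for a PTE reduced to its frame number.
    """
    vpn = virtual_address >> OFFSET_BITS
    offset = virtual_address & (PAGE_SIZE - 1)
    frame = page_table[vpn]
    return (frame << OFFSET_BITS) | offset


page_table = {0x10: 0x2A5}
before = translate(page_table, 0x10123)   # 0x2A5123

# A single disturbed bit in the stored entry silently redirects the mapping.
page_table[0x10] ^= 1 << 3                # 0x2A5 becomes 0x2AD
after = translate(page_table, 0x10123)    # 0x2AD123
```

The virtual address is unchanged, yet it now resolves to a different physical page, one the application may never have been granted access to.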
Typically, different pages have different levels of privilege (e.g., the user security level required to access that page). Thus an attacker can launch a non-privileged application running a row hammering attack, which can in turn corrupt data in memory locations for which it has no access privileges. These locations may belong to another application or even to the operating system itself. This creates a security violation. Once the violation occurs, the attacker can use a variety of techniques beyond the scope of this disclosure to gain access to and/or control of the system.
A recent paper based on research conducted jointly by Carnegie Mellon University and Intel Corporation entitled Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, by Yoongu Kim, et al., IEEE 41st International Symposium on Computer Architecture, June 2014 (henceforth Kim), which is hereby included by reference herein in its entirety, analyzed the problem and suggested seven possible solutions: [1] make better memory chips, [2] correct errors with error correction coding (ECC), [3] refresh more frequently, [4] retire weak cells (by the manufacturer), [5] retire weak cells (by the end user), [6] identify target rows and refresh their neighbors, and, as their proposed solution, [7] probabilistic adjacent row activation.
Kim solution [3], increasing the refresh rate, is the current conventional approach. In most current generation systems, doubling the refresh rate will eliminate the problem by ensuring each row gets refreshed before a row hammer attack can do sufficient damage to cell charge to cause errors. While this has the virtue of simplicity, it requires an increase in system power, which is undesirable in data center applications (due to the high power density) and in battery operated devices such as cellphones, tablets, and laptop computers (where long battery life is a major selling point). It also detracts from system performance since additional refresh cycles reduce memory system bandwidth.
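The bandwidth cost of doubling the refresh rate can be estimated with simple arithmetic. The figures below are assumptions and are not taken from this disclosure: a 64 ms refresh window covered by 8,192 REF commands (typical of DDR3 and DDR4 parts at normal temperature) and a device busy time of 350 ns per REF command. The function names are likewise illustrative.

```python
def refresh_interval_us(window_ms, ref_commands):
    """Average spacing between REF commands, in microseconds."""
    return window_ms * 1000.0 / ref_commands


def refresh_overhead(busy_ns, interval_us):
    """Fraction of time the device is unavailable because it is refreshing."""
    return busy_ns / (interval_us * 1000.0)


# Assumed baseline: 64 ms window, 8,192 REF commands, 350 ns per REF.
normal_interval = refresh_interval_us(64, 8192)            # 7.8125 us
normal_overhead = refresh_overhead(350, normal_interval)   # about 4.5%

# Doubling the refresh rate halves the window that 8,192 REFs must cover.
doubled_interval = refresh_interval_us(32, 8192)           # 3.90625 us
doubled_overhead = refresh_overhead(350, doubled_interval) # about 9.0%
```

Under these assumed numbers the share of time lost to refresh roughly doubles, from about 4.5% to about 9%, with a corresponding increase in refresh power.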
Kim solution [1] is to design better memories. The major DRAM manufacturers have attempted to improve their memory designs, with some success. For example, the JEDEC LPDDR4 (Low Power Double Data Rate 4) SDRAM Standard, JESD209-4, August 2014 (henceforth JEDEC LPDDR4)—which is hereby included by reference herein in its entirety—includes an optional feature called Target Row Refresh (TRR). If TRR is implemented, the LPDDR4 part is tested by the manufacturer to determine the Maximum Activate Count (MAC) for that particular part—the MAC being the number of repeated ACT and PRE (or PREA or RDA) commands between refresh cycles that can be tolerated in a single row before row hammering can cause a memory disturb.
The memory controller or operating system must track the number of row activations that have been issued to each row to determine if the MAC limit has been reached. Then the part must be put into its idle state (by pre-charging all banks) before entering TRR mode to perform three successive refreshes to the target row and its two adjacent neighbors. Since the memory controller only knows the target row, the SDRAM on-chip TRR circuit assists by internally identifying the two victim rows and handling their addressing for the controller. This places a substantial burden on the memory controller and/or the operating system software, thereby adding significant complexity to designing a secure system.
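The bookkeeping burden described above can be sketched as a per-row activation counter. This is a minimal illustration under stated assumptions, not the JEDEC-specified procedure: the class `ActivationTracker` and its methods are hypothetical, and the actual TRR entry sequence (idling the part and the mode register programming it requires) is not modeled.

```python
from collections import defaultdict


class ActivationTracker:
    """Hypothetical controller-side counter of ACT commands per (bank, row)."""

    def __init__(self, mac):
        self.mac = mac                    # Maximum Activate Count for the part
        self.counts = defaultdict(int)

    def on_activate(self, bank, row):
        """Record one ACT; return True when the row has reached the MAC limit.

        A True result means the caller should pre-charge all banks and
        request a TRR for this target row before issuing further ACTs to it.
        """
        key = (bank, row)
        self.counts[key] += 1
        if self.counts[key] >= self.mac:
            self.counts[key] = 0          # TRR refreshes the victims; start over
            return True
        return False

    def on_refresh_window_elapsed(self):
        """All rows have been refreshed, so every count can be discarded."""
        self.counts.clear()


tracker = ActivationTracker(mac=3)        # tiny MAC, purely for demonstration
results = [tracker.on_activate(bank=0, row=42) for _ in range(4)]
# results == [False, False, True, False]
```

Even this simplified version needs storage proportional to the number of distinct rows activated within a refresh window, which hints at why exact per-row tracking is costly.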
Although TRR is not a part of the JEDEC DDR4 (Double Data Rate 4) SDRAM Standard, JESD79-4, September 2012 (henceforth JEDEC DDR4)—which is hereby included by reference herein in its entirety—the major DRAM manufacturers have incorporated a TRR implementation into their most recent DDR4 offerings.
For example, Micron Technology offers a TRR circuit in their DDR4 parts which is similar (but not identical) to the LPDDR4 feature. Micron claims that while the circuitry is there, it is not usually needed since the majority of tested parts have no vulnerability. Unfortunately, most-but-not-all of the time leaves the system designer needing to deal with the not-all case which, in practice, is akin to the LPDDR4 solution.
SK Hynix also offers a TRR circuit on its recent DDR4 products similar (but not identical) to both the LPDDR4 and Micron solutions. This has the same drawbacks. Additionally, since these TRR circuits are not standardized, system designers must now make allowances for which manufacturer their DRAMs are sourced from and include the appropriate algorithms for each.
Samsung has a third solution known as “pseudo-TRR,” though the details are not publicly available. Samsung claims that the combination of pseudo-TRR and doubling the refresh rate will solve the row hammering problem, which suggests their answer to the problem is a combination of Kim solutions [1] and [3].
Kim solution [2], using error correction codes (ECC), is expensive and has limitations. Currently ECC is only used in data center and enterprise class memory modules, being too expensive for most consumer systems. ECC SDRAM modules typically use a Hamming single error correction, double error detection (SECDED) code. The Kim study notes that row hammering attacks frequently cause multiple errors in the typical 64-bit DRAM data word and that SECDED alone is insufficient to mitigate the problem. Stronger error correction codes (e.g., Reed-Solomon, binary BCH, etc.) can be used, but they are computationally intensive, requiring considerable time, power, additional memory cells (to hold the parity bits for each data word), and silicon area to implement. This makes them undesirable for use in fast system memory applications and expensive for low performance systems.
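The limits of SECDED can be seen in a small example. The sketch below uses an extended Hamming (8,4) code purely for brevity; ECC modules typically protect a 64-bit word with 8 check bits, and the bit layout and function names here are assumptions made for illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Returns a list of bits. Index 0 is the overall parity bit; indices
    1..7 are the classic Hamming(7,4) positions (parity at 1, 2 and 4).
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position          # XOR of the positions holding a 1
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        code[syndrome] ^= 1               # assume one flip; syndrome locates it
        status = "corrected"
    else:
        return "uncorrectable", None      # even parity, nonzero syndrome
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[6] ^= 1
single = decode(one_flip)                 # ('corrected', 11)

two_flips = list(word)
two_flips[3] ^= 1
two_flips[6] ^= 1
double = decode(two_flips)                # ('uncorrectable', None)
```

A single flipped bit is repaired and a double flip is at least detected, but three or more flips in one word can be mistaken for a single error and "corrected" to the wrong value, which is the weakness that multi-bit disturb errors expose.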
Another issue with ECC is that in order to correct errors the data must be read out of the DRAM (perhaps during a refresh cycle), decoded, corrected, re-encoded and then written back into the memory cells. This takes longer than a normal refresh cycle and further increases power while decreasing memory bandwidth.
The Kim paper is fairly dismissive of solutions [4] and [5]. It states that solution [4], having the manufacturers retire victim rows before shipping the product, is impractical due to both test time and to the potential number of spare rows needed. Kim also observes that solution [5], having the user retire victim cells, simply throws the same burden on the system designer who has to find and replace bad memory rows performing analogous operations at the system level at significant cost in processing time and available memory.
Kim is also dismissive of solution [6], which is to identify target rows and refresh their neighbors. Since it is impractical to have an access frequency counter for each row in a memory chip, complicated algorithms, searches and approximations must be used, and these can yield many false hits requiring many unnecessary additional refresh cycles.
Kim's advocated solution [7], probabilistic adjacent row activation, has the virtue of simplicity and low overhead but is not without its drawbacks. The approach is to “flip” a biased “coin” after each activate and pre-charge pair. With a small probability (Kim suggests on the order of one in a thousand row activations), one of the two adjacent rows is activated and then pre-charged (the equivalent of a refresh for that row). It may take many thousands of row activations to induce an error (50,000 or more according to Kim, or 200,000 or more according to JEDEC LPDDR4). Thus for a row targeted many times there is a high probability that both of the adjacent victims will be refreshed long before the hammering attack succeeds in causing a disturb error, resulting in an acceptably low error rate that can be tuned for a particular system.
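The tuning argument can be made concrete. If the coin comes up with probability p after each activation and one of the two neighbors is then chosen with equal likelihood, a given victim is refreshed with probability p/2 per activation, so the chance that it goes N consecutive activations without a refresh is (1 - p/2)^N. The sketch below evaluates this under those assumptions; the function names are hypothetical, and rows at the edge of the array are ignored for simplicity.

```python
import random


def survival_probability(p, activations):
    """Chance that one given victim row is never refreshed by the scheme
    during `activations` consecutive activations of its neighbor."""
    return (1.0 - p / 2.0) ** activations


def maybe_refresh_neighbor(row, p, rng):
    """Flip the biased coin after one activate/pre-charge pair.

    Returns the adjacent row to activate, or None when no extra
    activation should be issued. Array edges are not handled.
    """
    if rng.random() < p:
        return rng.choice((row - 1, row + 1))
    return None


# One in a thousand, with 50,000 activations needed to induce an error.
risk = survival_probability(0.001, 50_000)   # on the order of 1e-11

rng = random.Random(0)                       # seeded only for repeatability
choice = maybe_refresh_neighbor(100, 0.001, rng)
```

Raising p lowers the residual risk at the cost of more extra activations, which is the tuning knob referred to above.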
The downside to probabilistic adjacent row activation, like most of the other solutions, is that it places the burden, albeit lighter than most of the others, on the memory controller and/or software and requires adjacency information that the memory manufacturers typically do not provide and may not be willing to provide in the future. Kim suggests a possible work-around by making educated guesses about adjacency between rows, but this simply increases the overhead required (due to unnecessary refreshes when the educated guesses are wrong) while reducing the quality of the results (since the real victim row may be missed). Also, many engineers prefer to implement deterministic hardware and/or software (and/or may be required to do so by their managers) and may find the non-deterministic nature of probabilistic adjacent row activation to be unacceptable.
Thus it is highly desirable to have a solution to the row hammering problem that is substantially transparent to the memory controller and/or software and handles the issue internally to the DRAM with little overhead and minimal involvement from the memory controller or operating system.