Background memory layers L1-L2 are based on static RAM (SRAM) memories today. SRAMs nowadays are limited by sub-threshold leakage and susceptibility to read/write failures with dynamic voltage scaling schemes or a low supply voltage. As a result, considerable effort and resources are invested in developing emerging memory technologies like Resistive RAM (ReRAM), Ferroelectric RAM (FeRAM), Spin Transfer Torque Magnetic RAM (STT-MRAM) and Phase Change RAM (PRAM). Due to a variety of characteristics like low leakage, high density and inherent non-volatility, non-volatile memories (NVMs) are being explored as alternatives for SRAM memories even at higher levels of the memory hierarchy like scratch-pad and cache. Research on these NVMs has become even more necessary as memories increasingly dominate system-on-chip designs in terms of chip area, performance, power consumption and manufacturing yield. In almost all proposals to incorporate NVMs into the traditional memory hierarchy, they are utilized along with SRAM. In this way, the negative impacts (latency and reliability issues being the major ones) can be limited and the positive impacts maximized.
STT-MRAM and ReRAM are some of the more promising and mature NVM technologies. STT-MRAM is a good candidate to replace conventional SRAM technology for large-size and low-power on-chip caches. STT-MRAM has high density, lower power consumption, good performance (relative to other NVMs and Flash) and suffers minimal degradation over time (lifetime up to 10^16 cycles). ReRAM is also an attractive prospect due to, e.g., its large R-ratio, fast read access times, small read energy consumption and small area requirement. Note that the R-ratio is the ratio between the high resistive state resistance and the low resistive state resistance of the memory element. ReRAM and STT-MRAM technology are also CMOS logic compatible and can be integrated along with SRAM on chip. ReRAM, however, is plagued by severe endurance issues (lifetime ≤ 10^12 cycles). Therefore, STT-MRAM seems the most promising NVM.
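The R-ratio definition given above can be written out as a minimal sketch. The function names and the resistance values in the usage note are illustrative assumptions, not taken from any cited work. The second helper relies on the conventional definition of tunnel magnetoresistance, TMR = (R_high - R_low) / R_low, which equals the R-ratio minus one.

```python
def r_ratio(r_high_ohm: float, r_low_ohm: float) -> float:
    """Ratio of high-resistance-state to low-resistance-state resistance."""
    if r_high_ohm <= 0 or r_low_ohm <= 0:
        raise ValueError("resistances must be positive")
    return r_high_ohm / r_low_ohm


def tmr_from_r_ratio(ratio: float) -> float:
    """Conventional TMR, (R_high - R_low) / R_low, expressed via the R-ratio."""
    return ratio - 1.0
```

For example, a hypothetical element with R_high = 6000 ohm and R_low = 3000 ohm has an R-ratio of 2, which corresponds to a TMR of 1 (i.e. 100%).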
Despite the low energy, low leakage and very good endurance, STT-MRAM read and write latencies are an issue when higher level memories, i.e. memories closer to the computational data path, are targeted. As a result, a direct drop-in replacement of SRAM by STT-MRAM in the D-cache organization is not feasible.
There have been a number of proposals based on hybrid NVM/SRAM organizations for various levels of the memory hierarchy. Almost all of them use a combination of software techniques (memory mapping, data allocation) and hardware techniques (registers, buffers, circuit level changes) to overcome the problems plaguing these proposals. In “Optimizing data allocation and memory configuration for non-volatile memory based hybrid SPM on embedded CMPs” (J. Hu, et al., IEEE 26th Int'l Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), May 2012, pp. 982-989) a Hybrid Scratch Pad Memory (HSPM) architecture is proposed which consists of SRAM and NVM to utilize the ultra-low leakage power and high density of NVM and the fast read of SRAM. A novel data allocation algorithm as well as an algorithm to determine the NVM/SRAM ratio for the HSPM architecture are proposed.
To improve the write latency, an asymmetric write architecture with redundant blocks has been proposed, wherein the asymmetric write architecture utilizes the asymmetric write characteristics of 1T-1MTJ STT-MRAM bit-cells. The asymmetry arises from the nature of the storage element in STT-MRAM, wherein the time for the two-state transitions (1 to 0 and 0 to 1) is not identical. Others have attempted to supplement the MRAM L1 cache with several small SRAM buffers to mitigate the performance degradation and dynamic energy overhead induced by MRAM write operations. Nevertheless, only mitigating write latency does not sufficiently solve the performance issues of non-volatile memory technologies. Read latency can be addressed at the same time.
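The write asymmetry described above can be illustrated with a minimal sketch. It is not the cited asymmetric write architecture. The function names, the latency parameters and the assumption that bits whose value does not change incur no switching time are all hypothetical and chosen only for illustration.

```python
def bit_write_latency_ns(old_bit: int, new_bit: int,
                         t_0_to_1_ns: float, t_1_to_0_ns: float) -> float:
    """Switching time for one bit-cell; the two transitions differ."""
    if old_bit == new_bit:
        # Assumption: an unchanged bit needs no state transition.
        return 0.0
    return t_0_to_1_ns if new_bit == 1 else t_1_to_0_ns


def word_write_latency_ns(old_word: int, new_word: int, width: int,
                          t_0_to_1_ns: float, t_1_to_0_ns: float) -> float:
    """A word write completes when its slowest bit transition completes."""
    worst = 0.0
    for i in range(width):
        old_bit = (old_word >> i) & 1
        new_bit = (new_word >> i) & 1
        worst = max(worst, bit_write_latency_ns(
            old_bit, new_bit, t_0_to_1_ns, t_1_to_0_ns))
    return worst
```

Under these assumptions, a write that only contains the faster transition finishes earlier than one that contains at least one slower transition, which is the property an asymmetric write scheme would try to exploit.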
It is quite clear from work in related areas that NVMs have rarely been considered as options for the highest level of the memory hierarchy. Not much effort has been spent on alleviating or bypassing the read latency limitations. Additionally, the write latency oriented techniques do not lead to good results and do not really mitigate the actual latency penalty. However, when considering an ARM-like general purpose processing platform, the latency issues are crucial to the success of the overall system.
The rapid increase of leakage currents in CMOS transistors with technology scaling poses a major challenge for the integration of SRAM memories. This has accelerated the desire to shift towards newer and more promising options like STT-MRAM. However, as mentioned earlier, latency issues limit the use of STT-MRAM for higher level memories. Previous concerns related to STT-MRAM and other similar NVM technologies were along the lines of write-related issues. The read latency of STT-MRAM is significantly larger than its SRAM counterpart. The read-write latency depends a lot on the R-ratio (tunnel magnetoresistance in the case of STT-MRAM) in these NVM technologies. With the maturation of the STT-MRAM technology it has become clearer that a high R-ratio is, at least currently, not realistic, taking into account the cell stability and endurance (shift from 1T-1MTJ to 2T-2MTJ). Hence, the read latency has become the new major bottleneck to overcome for substituting SRAM by STT-MRAM, particularly at the L1 level of the memory hierarchy.
Write latency issues can still be managed by techniques like the inclusion of a small L0 cache or buffers. A simulation can show that these latency issues, in particular read latency, have a major impact on performance when NVMs are used in the first levels of the memory hierarchy, even for data caches, which are less read-dependent than instruction caches.
FIG. 1 shows the performance penalty on replacing just the SRAM D-cache by an NVM counterpart with similar characteristics (size, associativity, etc.). The instruction cache and the unified L2 cache remain SRAM based. Even for the minimal read latency issue considered here, a clear and unacceptably large performance overhead can be observed compared with the baseline. In fact, “reg-detect” may suffer up to 55% performance penalty if the NVM data cache is introduced instead of the regular SRAM one. FIG. 2 shows the performance penalty on replacing the SRAM D-cache by an NVM counterpart with similar characteristics for a VLIW processor specialized for wireless baseband processing.
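To make the link between extra read latency and performance penalty more concrete, the following is a minimal first-order sketch. It is not the simulation behind FIG. 1 or FIG. 2. The function name, all parameters and the simple in-order stall model (every extra cache cycle stalls the pipeline) are assumptions made for illustration only.

```python
def relative_penalty(base_cpi: float,
                     loads_per_instr: float,
                     extra_read_cycles: float,
                     stores_per_instr: float = 0.0,
                     extra_write_cycles: float = 0.0,
                     write_hidden_fraction: float = 1.0) -> float:
    """Fractional increase in cycles per instruction versus an SRAM baseline.

    write_hidden_fraction models how much of the extra write latency is
    absorbed by a small buffer or L0 cache (1.0 means fully hidden).
    """
    if base_cpi <= 0:
        raise ValueError("base_cpi must be positive")
    if not 0.0 <= write_hidden_fraction <= 1.0:
        raise ValueError("write_hidden_fraction must be in [0, 1]")
    # Each load pays the extra read cycles in full (assumed stall model).
    read_stall = loads_per_instr * extra_read_cycles
    # Only the part of the write latency that is not hidden causes stalls.
    write_stall = (stores_per_instr * extra_write_cycles
                   * (1.0 - write_hidden_fraction))
    return (read_stall + write_stall) / base_cpi
```

With hypothetical values of one cycle per instruction at baseline, one load every four instructions and two extra read cycles per load, the model gives a 50% penalty even when write latency is fully hidden. This mirrors the qualitative point of the text: buffering writes alone does not remove the overhead caused by slower reads.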
The main conclusion of this analysis is that although STT-MRAM can be a good candidate to replace SRAM data caches, a drop-in replacement may not be advisable and some architecture modifications may be used to reduce the impact of their latency limits.
Hence, there is a desire for a non-volatile memory structure for the levels of the memory hierarchy closest to the computational data path, wherein the above-mentioned problems related to read latency are at least alleviated.