Designers of next-generation Systems-on-Chip (SoCs) face daunting challenges in generating high-yielding architectures that integrate vast amounts of logic and memories in a minimum die size, while simultaneously minimizing power consumption. Traditional design approaches attempt to guarantee 100% error-free SoCs using a number of fault-tolerant architectural and circuit techniques. However, with advanced manufacturing technologies, the area and power cost of insisting on 100% error-free SoCs has become economically impractical.
Fortunately, many important application domains (e.g., communication and multimedia) are inherently error-aware, allowing a range of designs with a specified Quality of Service (QoS) to be generated for varying amounts of error in the system. However, exploitation of error-aware design to address these power, yield and cost challenges requires a significant shift from error-free to error-aware design methodologies.
In communication and multimedia systems, embedded memories are perfect candidates for this exploration, since the share of the SoC dedicated to memories has grown steadily and now exceeds 50% of the SoC area for wireless standards such as DVB, LTE and WiMAX. Furthermore, a large portion of the memory is typically used for buffering data that already has a high level of redundancy (e.g., buffering memories in wireless chips or the decoded picture buffer in H.264). Finally, from a network perspective, buffering memories are transparent across a hierarchy since they do not change the nature of the data stored, which allows for simple and efficient cross-layer techniques.
Dynamic voltage and frequency scaling (DVFS) is the traditional approach to power management. It trades power for delay: lower power is attained at the cost of larger delay, typically by running at a lower operating frequency that is set by the weakest performer in the overall system. In the majority of scenarios the culprit is the embedded memories, since they are more vulnerable to supply changes than logic. For this reason, when voltage scaling is used, memories are typically treated separately to maintain margins such that the device meets timing 100% of the time under the new settings. While this is necessary for some applications, such as processor memories, a wide variety of applications are error tolerant by design. Examples are wireless and multimedia devices, where redundancy is inserted in the data stream to compensate for a variety of error sources. In such systems, conventional DVFS may fail to trade the forgiving nature of the system for additional power savings.
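As a rough illustration of the power/delay tradeoff described above, the sketch below uses the standard first-order CMOS dynamic power model, P = a * C * Vdd^2 * f. The capacitance, voltage and frequency values are hypothetical and chosen only for illustration; leakage power and the actual voltage-frequency relationship of a given process are ignored.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order CMOS dynamic power in watts: activity * C * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq


# Hypothetical operating points (illustrative values, not taken from the text).
nominal = dynamic_power(c_eff=1e-9, vdd=1.0, freq=500e6)  # nominal supply and clock
scaled = dynamic_power(c_eff=1e-9, vdd=0.8, freq=400e6)   # 20% lower supply and clock

# Fraction of dynamic power saved by moving to the scaled operating point.
saving = 1.0 - scaled / nominal
print(f"nominal = {nominal:.3f} W, scaled = {scaled:.3f} W, saving = {saving:.1%}")
```

Under this simplified model the supply voltage enters quadratically, which is why the lowest voltage at which the weakest block still meets timing bounds the achievable saving.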
In prior work, the authors have shown that utilizing fault tolerant techniques on embedded memories (mainly through aggressive voltage scaling) will result in a) a 20%-35% power reduction in wireless systems, depending on the application, b) savings in cost and area by reducing or eliminating the need for circuit redundancy, and c) a higher “effective yield” by tolerating errors at the system level while keeping other parameters constant.
While the gains are attractive, accurately evaluating the impact of hardware errors on system performance is a challenge. Typically, hardware error statistics for certain operating conditions (supply voltage, frequency) are gathered and used in a system simulation to evaluate the effectiveness of the proposed fault tolerant technique and to quantify its gains in terms of power savings and system performance impact. This approach suffers from the following major drawbacks:
Lack of scalability: The design space is very large given the numerous possible combinations of system settings and operating conditions. Since each simulation result is valid only for a specific simulation setup, every change in the algorithm or policy requires a new system simulation, which limits how much of the design space can be explored.
Accuracy and simulation time: The accuracy of the obtained results depends on the size of the processed data.
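To make the link between accuracy and the amount of processed data concrete, the following minimal sketch estimates a memory bit error rate by Monte Carlo simulation. It assumes independent bit errors with a fixed probability, which is a simplification introduced here for illustration; the function names and the error probability are hypothetical.

```python
import math
import random


def relative_std_error(p_error, n_bits):
    """Relative standard error of a binomial error-rate estimate over n_bits samples."""
    return math.sqrt((1.0 - p_error) / (n_bits * p_error))


def simulate_error_rate(p_error, n_bits, seed=0):
    """Estimate the bit error rate by flipping each simulated bit with probability p_error."""
    rng = random.Random(seed)
    errors = sum(rng.random() < p_error for _ in range(n_bits))
    return errors / n_bits


# Hypothetical bit error probability at some scaled supply voltage (assumed value).
P_ERROR = 0.01

for n in (1_000, 10_000, 100_000):
    estimate = simulate_error_rate(P_ERROR, n)
    print(f"n = {n:>7}: estimate = {estimate:.5f}, "
          f"expected relative error = {relative_std_error(P_ERROR, n):.1%}")
```

Under these assumptions the relative error shrinks only as one over the square root of the number of simulated bits, so each tenfold gain in accuracy costs roughly a hundredfold increase in simulation time, and the cost grows further as the error probability decreases.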