Enterprise computer systems require very low failure rates to support the extreme reliability requirements of customers. In order to achieve these low failure rates, manufacturers of enterprise computers mandate that the components used in these systems have very low failure rates due to soft errors.
Most current component manufacturers use in-house estimation methods and spreadsheets to compute the failure rates of designs due to soft errors. These estimations are often limited to memories made with SRAMs, and thus do not account for other sequential elements made out of flip-flops. Moreover, these estimates frequently omit non-sequential elements in the design from consideration. Consequently, these estimates tend to underestimate failure rates.
As manufacturing technologies become more advanced, and as the critical dimensions of transistors become smaller (e.g., less than 20 nm), the failure rates due to flip-flops and non-sequential elements have begun to approach the failure rates of SRAMs. Hence, the failure rates of these elements also need to be computed to ensure that the components still meet their soft-error failure rate requirements.
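The roll-up described above can be sketched as follows. This is a minimal illustration and not the estimation method of any particular manufacturer: the class names, the `ElementClass` and `fit_by_class` helpers, and the derating factor are hypothetical, and rates are assumed to be expressed in FIT (failures per 10^9 device-hours). Any numbers passed in are placeholders supplied by the caller, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ElementClass:
    """One class of design elements (hypothetical model for illustration)."""
    name: str                # e.g. "sram", "flip_flop", "combinational"
    count: int               # number of bits, flops, or gates in the class
    fit_per_element: float   # assumed raw soft-error rate per element, in FIT
    derating: float = 1.0    # assumed fraction of upsets that become failures


def fit_by_class(classes):
    """Return a dict mapping each class name to its FIT contribution."""
    return {c.name: c.count * c.fit_per_element * c.derating for c in classes}


def total_fit(classes):
    """Sum the contributions of every class, not only the SRAMs."""
    return sum(fit_by_class(classes).values())


def sram_only_fit(classes):
    """The narrower estimate that considers memories alone."""
    return sum(v for k, v in fit_by_class(classes).items() if k == "sram")
```

Comparing `total_fit` with `sram_only_fit` for the same design shows how much of the failure rate an SRAM-only estimate leaves out once flip-flops and non-sequential elements are included.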
At present, the analysis of failure rates due to soft errors is commonly based on the results of questionnaires. These questionnaires, which are typically completed by the design engineers for various subcomponents of the system, are prone to human errors of omission and to errors arising from the use of stale data. Moreover, this analysis is typically conducted late in the design cycle due to the manual nature of the work involved. Accordingly, any surprises arising from the results of this analysis tend to cause delays in the execution of the project.
One technique used to reduce the failure rate due to soft errors is the addition of error correction codes to the memory. These include single-error-correct, double-error-detect (SECDED) codes and double-error-correct codes. These codes introduce logic into the memory path and cause the design to run slower than the target clock speed. This usually results in a re-design of the block that needs the SECDED logic in its memories. Thus, the use of this technique may cause significant delays in the project schedule.
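To make the SECDED behavior concrete, the following is a minimal software sketch of an extended Hamming (8,4) code, which corrects any single-bit error and detects any double-bit error in a 4-bit value. It is illustrative only: the function names are hypothetical, the 4-bit width is chosen for brevity, and production memories typically protect much wider words in hardware.

```python
def secded_encode(nibble):
    """Encode a 4-bit value as 8 bits: index 0 is the overall parity bit,
    indices 1..7 are the Hamming(7,4) positions (parity at 1, 2 and 4)."""
    code = [0] * 8
    # Data bits go to the non-power-of-two positions, LSB first.
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Overall parity over positions 1..7 enables double-error detection.
    for bit in code[1:]:
        code[0] ^= bit
    return code


def secded_decode(code):
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrectable'.
    The value is None when the error cannot be corrected."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 when valid
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return value, status
```

The syndrome and parity computations correspond to the extra XOR logic that a hardware implementation places in the memory read path, which is the source of the timing impact noted above.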
Another requirement for enterprise components is that any failure caused by soft errors that are not corrected needs to be flagged (in other words, no silent errors are permitted). One way of meeting this requirement is to add parity to banks of flip-flops, and then detect any parity errors. This tedious process causes additional delays to the project schedule if implemented late in the project, and may necessitate additional manpower to implement any necessary design changes and to verify both the main functionality and the added parity.
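The parity scheme mentioned above can be modeled in a few lines. This is a behavioral sketch under stated assumptions, not a hardware implementation: the `ParityProtectedBank` class, its methods, and the `inject_upset` helper are hypothetical names introduced for illustration, and even parity over the whole bank is assumed.

```python
def parity(value):
    """Return 1 if `value` has an odd number of set bits, else 0."""
    return bin(value).count("1") & 1


class ParityProtectedBank:
    """Behavioral model of a bank of flip-flops with one stored parity bit."""

    def __init__(self, width):
        self.mask = (1 << width) - 1
        self.value = 0
        self.stored_parity = 0

    def write(self, value):
        # The parity bit is computed and stored when the bank is written.
        self.value = value & self.mask
        self.stored_parity = parity(self.value)

    def inject_upset(self, bit):
        # Model a soft error: flip one flop without updating the parity bit.
        self.value ^= (1 << bit) & self.mask

    def read(self):
        # Return (value, error_flag); the flag is True on a parity mismatch.
        return self.value, parity(self.value) != self.stored_parity
```

A single flipped bit raises the error flag, so the failure is reported rather than silent. A single parity bit cannot see an even number of flips within the same bank, which is one reason the grouping of flip-flops into banks matters.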