Society's demand for high-availability computing systems is growing along with its dependence on computers for various services. For example, Internet Data Centers (IDCs), Internet Service Providers (ISPs), and Application Service Providers (ASPs) provide the support for many computing needs. To meet the demand in a way that is affordable to users, computing systems are increasingly being built with commodity hardware and software. Unfortunately, reliability is sometimes sacrificed in systems built with commodity parts.
For example, commodity memory components are susceptible to soft errors. A soft error is a transient memory error that has been detected by the hardware but not corrected. Many operating systems respond to soft errors by halting and then rebooting. System reboots are costly in terms of lost production time. If the resources of an IDC, ISP, or ASP are unavailable because of a system reboot, customers' needs may go unmet. If computing resources are unavailable too often or for too long, customer dissatisfaction and customer defections may result. Thus, while commodity parts address the requirement of affordability, they may do so at the expense of high availability.
A method and apparatus that address the aforementioned problems, as well as other related problems, are therefore desirable.