Modern microprocessors are vulnerable to transient hardware faults caused by alpha particles and cosmic rays. For example, cosmic rays can alter the voltage levels that represent data values in microprocessor chips. Strikes by cosmic ray particles, such as neutrons, are particularly critical because there is no practical way to shield microprocessor chips from them. Although the energy flux of these rays can be reduced to acceptable levels with six feet or more of concrete, this thickness is significantly greater than that of normal computer room roofs or walls.
Furthermore, as transistors shrink in size with succeeding technology generations, they become individually less vulnerable to cosmic ray strikes. However, decreasing voltage levels and exponentially increasing transistor counts cause overall chip susceptibility to increase rapidly. To compound the problem, achieving a particular failure rate for a large multiprocessor server requires an even lower failure rate for the individual microprocessors that make up the server.
Currently, the frequency of such transient faults is low, making fault-tolerant computers attractive only for mission-critical applications, such as on-line transaction processing and space programs. Unfortunately, as indicated above, future microprocessors will be more prone to transient faults due to their smaller feature sizes, reduced voltage levels, higher transistor counts and reduced noise margins. Accordingly, fault detection and recovery techniques, which are currently used only for mission-critical systems, may become common in all but the least expensive microprocessor devices.
Several redundant thread (RT) based approaches have been proposed to detect transient faults. The basic idea of these approaches is to replicate an application into two communicating threads, the leading thread and the trailing thread. The trailing thread repeats the computations performed by the leading thread, and the values produced by the two threads are compared for error detection. To reduce the performance impact on the original program execution, the leading thread and the trailing thread should run on different processors (or cores) in a multiprocessor (or many-core) system.
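The leading/trailing arrangement described above can be illustrated with a minimal software sketch. This is an assumption-laden illustration, not the mechanism of any particular RT system: the names `compute`, `leading`, `trailing`, `run_redundant` and the `fault_at` parameter are hypothetical, a Python `queue.Queue` stands in for the inter-thread communication channel, and a single flipped bit simulates a transient fault. In CPython the two threads do not execute on separate cores in parallel, so the sketch shows only the replicate-and-compare logic.

```python
import queue
import threading


def compute(x):
    # Hypothetical workload standing in for the replicated application code.
    return x * x + 1


def leading(inputs, channel, results, fault_at=None):
    # Leading thread: performs the computation and forwards each value.
    for i, x in enumerate(inputs):
        value = compute(x)
        if i == fault_at:
            value ^= 1  # simulate a transient single-bit fault (assumption)
        results.append(value)
        channel.put((x, value))  # data passing to the trailing thread
    channel.put(None)  # sentinel: no more values


def trailing(channel, mismatches):
    # Trailing thread: repeats each computation and compares the results.
    while True:
        item = channel.get()
        if item is None:
            break
        x, lead_value = item
        if compute(x) != lead_value:
            mismatches.append((x, lead_value))  # fault detected


def run_redundant(inputs, fault_at=None):
    channel = queue.Queue()
    results, mismatches = [], []
    lead = threading.Thread(
        target=leading, args=(inputs, channel, results, fault_at)
    )
    trail = threading.Thread(target=trailing, args=(channel, mismatches))
    lead.start()
    trail.start()
    lead.join()
    trail.join()
    return results, mismatches
```

Calling `run_redundant([1, 2, 3])` reports no mismatches, while `run_redundant([1, 2, 3], fault_at=1)` reports one for the corrupted second value. Every value produced by the leading thread crosses the queue, which gives a sense of how much communication such a scheme involves.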
Unfortunately, the RT approach requires thread synchronization and data passing between the replicated threads on each shared memory access. Since memory accesses account for thirty percent of instructions in, for example, Intel architecture 32-bit (IA32) programs, the interprocessor (or inter-core) communication and synchronization overhead becomes a substantial challenge. Existing techniques generally depend on special hardware to reduce this overhead.
Jiuxing Liu, Jiesheng Wu, Dhabaleswar K. Panda, “High Performance RDMA-based MPI Implementation over InfiniBand”, In Proceedings of the 17th Annual International Conference on Supercomputing, 2003, presents a remote direct memory access (RDMA)-based data communication approach to reduce the overhead of message passing between threads. The approach works on a distributed memory system, where all data communication is explicit through either a message passing interface (MPI) or an RDMA write. Conversely, in a shared memory system, all data communication may be implicit through the hardware cache coherence protocol. The underlying data communication mechanisms of these two kinds of systems are so different that the RDMA approach cannot solve the redundant data transfer and false sharing problems of a shared memory system.