As the transistors that are integrated to form electronic hardware, such as processors, become smaller and faster, the hardware formed from those transistors becomes more susceptible to hardware faults. Regardless of cause, such faults may result in incorrect software and/or firmware program execution by altering signal transfers or stored values. On many-core computer platforms, recovery from incorrect program execution is critical. Even leaving aside incorrect program execution, the overhead spent synchronizing thread (e.g., task, code stream) execution is an appreciable portion of total execution time; once incorrect program execution is considered, the time spent managing execution (e.g., synchronization and error detection) grows substantially greater.
Hardware failure mechanisms, whose causes vary widely, are generally classified according to the duration of the faulted state as either permanent or transient errors. A permanent error refers to lasting damage to a device from which recovery is not attained. For example, a permanent error may result from a damaged memory cell or register. In contrast, transient errors (i.e., soft errors, single-event upsets, SEUs) are short-term disturbances that change the internal logical state of a device (e.g., processor, memory, etc.) without causing permanent damage to the device. Computer platforms having complete fault-tolerance must be capable of handling both error types while minimizing execution latency.
Most modern microprocessors already incorporate certain hardware mechanisms for detecting transient errors, such as soft errors. Memory elements, particularly the caches of modern systems, are protected by mechanisms such as error-correcting codes ("ECC") and parity checking. The error protection in these systems is typically focused on memory because such techniques are well understood and do not require expensive extra circuitry. Moreover, caches occupy a large fraction of the chip area in modern microprocessors. Hardware-based approaches to error correction generally rely on inserting redundant hardware.
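The parity-checking mechanism mentioned above can be illustrated with a minimal sketch. The function and variable names below are illustrative, not taken from any particular hardware implementation; the sketch shows only the simplest single-bit even-parity scheme, which detects (but cannot correct) a single-event upset in a stored word.

```python
def parity_bit(word: int) -> int:
    """Compute an even-parity bit over the bits of `word`."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

def store(word: int) -> tuple[int, int]:
    """Store a word together with its parity bit, as a protected memory cell would."""
    return word, parity_bit(word)

def check(word: int, stored_parity: int) -> bool:
    """Return True if no odd-count bit error is detected on read-back."""
    return parity_bit(word) == stored_parity

# A transient fault that flips one bit is detected on the next read:
data, p = store(0b1011_0010)
assert check(data, p)            # clean read passes
corrupted = data ^ (1 << 3)      # single-event upset flips bit 3
assert not check(corrupted, p)   # parity mismatch flags the error
```

ECC schemes such as Hamming codes extend this idea with multiple check bits, allowing single-bit errors to be corrected rather than merely detected.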
As mentioned above, the difficulties in managing thread concurrency are appreciable, and they grow as the number of cores on a computing platform increases. To address this, methods of execution using transactional memory have been proposed to simplify concurrency management by supporting parallel tasks, i.e., transactions, that appear to execute atomically and in isolation. Using transactions and/or transactional memory, multi-core computer platforms can achieve increased parallel performance once suitable coarse-grained transactions are identified.
When using transactional approaches, programmers define atomic code sequences (i.e., transactions) that may include unstructured flow control and any number of memory references. The transactional memory system executes transactions correctly by generally providing: (1) atomicity, which means that either the whole transaction executes or none of it; (2) isolation, which means that partial memory updates are not visible to other transactions; and (3) consistency, which means there appears to be a single transaction completion order across the entire system. If these conditions are met at the end of a transaction's execution, the transaction commits its writes to shared memory; if the transaction violates one or more of these conditions, its writes are rolled back.