1. Field of the Invention
The invention relates to computing systems and, more particularly, to error handling in a multithreaded processing system.
2. Description of the Related Art
With the ever-expanding use of computing devices has come an increasing dependence of users on the availability and reliability of those devices. For example, if a computer network goes down, or is otherwise unavailable, costs to an enterprise may be significant. In addition, if corrupted data or other errors go undetected, significant costs may be incurred as a result of those errors. Consequently, a number of techniques have arisen which are designed to ensure that computing devices are sufficiently robust that they may detect and respond to problems without significantly impacting users. Computing devices which have in place mechanisms to prevent hardware or software problems from impacting their users may be referred to as being Highly Available. Some of the characteristics which may be considered when defining High Availability include protection of data (Reliability), continuous access to data (Availability), and techniques for correcting problems which minimally impact users (Serviceability). Collectively these characteristics are frequently referred to as RAS.
In addition to Reliability, Availability, and Serviceability, a high level of system performance is also generally desired. To that end, the performance of the processors within a given system may be very important. For example, the processors within servers and other network devices may have a significant impact on overall system performance. Consequently, efforts have been made to improve the performance of the processors themselves. In many cases, increasing the performance of a processor entails increasing the complexity of the processor as well. This complexity frequently results from the addition of logic designed to perform additional functions, or perform functions with higher performance levels.
In addition to adding new or improved logic and circuitry, another technique which may be used to increase performance involves shrinking feature sizes. However, as technology feature sizes shrink, circuits become more susceptible to hardware errors, and incorrect operation may result if those errors occur and go undetected. Frequently, these types of errors affect memory circuits, such as those found in caches, translation lookaside buffers (TLBs), register files, store buffers, and other memory arrays used for temporary storage. Consequently, using these and other techniques to increase processor performance may result in a reduction of the Reliability, Availability, and Serviceability of a system.
In order to deal with errors, processors and other devices may be configured to detect these errors using a variety of mechanisms. When an error is detected, the processor or device typically attempts to recover using one of a variety of recovery schemes. At one end of the spectrum, the hardware itself may be configured to correct the effects of the error. For example, the hardware may be configured to correct an erroneous data item fetched from a cache. In addition to correcting the effects of an error, hardware may be configured to clear an error by eliminating or correcting the source of the error. For example, if a cache line is detected to be in error, the cache line may be refetched. Subsequently, the instruction which caused the error may be retried. Alternatively, hardware may be configured only to report the fact that an error has occurred. Based upon this report, diagnostic software or a service processor may then be responsible for correcting and/or clearing the error.
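By way of illustration only, the spectrum of recovery schemes described above may be modeled in software roughly as follows. This is a minimal sketch and not a description of any particular hardware: the names Policy, Cache, CacheLine, and software_recover are hypothetical, a single parity bit stands in for whatever detection mechanism a given device uses, and the backing store is assumed to be error-free. Because parity alone can detect but not correct an error, the CORRECT case below simply delivers a good copy from the backing store as a stand-in for a correcting code.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Policy(Enum):
    """Hypothetical names for the three schemes described above."""
    CORRECT = auto()          # hardware corrects the effect of the error only
    CLEAR_AND_RETRY = auto()  # hardware refetches the line, then retries
    REPORT_ONLY = auto()      # hardware logs the error for software to handle


def parity(value: int) -> int:
    """Even-parity bit over the bits of a non-negative integer."""
    return bin(value).count("1") % 2


@dataclass
class CacheLine:
    data: int
    check: int  # parity bit recorded when the line was filled


@dataclass
class Cache:
    backing: dict   # assumed error-free next level of memory: address -> value
    policy: Policy
    lines: dict = field(default_factory=dict)
    error_log: list = field(default_factory=list)

    def fill(self, addr: int) -> None:
        """Fetch a line from the backing store and record its parity."""
        value = self.backing[addr]
        self.lines[addr] = CacheLine(value, parity(value))

    def load(self, addr: int):
        if addr not in self.lines:
            self.fill(addr)
        line = self.lines[addr]
        if parity(line.data) == line.check:
            return line.data                 # no error detected
        if self.policy is Policy.CORRECT:
            # Stand-in for a correcting code: deliver a good value,
            # but leave the faulty line (the source of the error) in place.
            return self.backing[addr]
        if self.policy is Policy.CLEAR_AND_RETRY:
            self.fill(addr)                  # clear the source of the error
            return self.load(addr)           # retry the access
        self.error_log.append(addr)          # report only; no recovery here
        return None


def software_recover(cache: Cache) -> None:
    """Stand-in for diagnostic software acting on the hardware's report."""
    for addr in cache.error_log:
        cache.fill(addr)
    cache.error_log.clear()
```

In this model, an error may be injected by flipping one bit of a cached line (for example, cache.lines[addr].data ^= 1). Under CORRECT the next load returns the right value while the faulty line remains; under CLEAR_AND_RETRY the line is refetched before the value is returned; under REPORT_ONLY the load returns None and the address is logged until software_recover is called.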
In addition to the performance enhancing techniques described above, highly threaded processors may be developed and utilized in order to improve performance. However, the use of highly threaded processors may exacerbate the complexity of handling hardware errors. In modern pipelined multithreaded processors, instructions from different threads may be executing in different portions of the machine simultaneously. Consequently, errors can occur simultaneously in a wide variety of hardware structures. For example, errors from the instruction translation lookaside buffer (TLB), instruction cache, integer register file, floating-point register file, arithmetic circuits, data TLB, data cache, store buffer, and other components such as level 2 (L2) caches and main memory arrays can all occur simultaneously. In addition, these errors can occur both for instructions within and across threads. Moreover, in a multicore processor, errors can occur across multiple cores or in structures which are shared by multiple cores, such as shared L2 caches.
Consequently, error recovery in a multithreaded, multicore processor can quickly become very complex. Further, validating hardware error handling flows also becomes very complex. Complex error flows reduce the reliability of the design, since they increase the likelihood of a design flaw (bug) preventing proper error recovery when an error occurs. Still further, designing the correction and/or clearing of errors into the hardware generally reduces flexibility in how such errors are to be handled. Thus, as the number of threads increases, a design where hardware both corrects and clears errors may be infeasible given cost, time, power, and other constraints.
Accordingly, an effective method and mechanism for providing a high level of error detection and recovery capability while simplifying hardware error handling is desired.