Computer processors comprise arithmetic, logic, and control circuitry that interpret and execute instructions from a computer program. Referring to FIG. 1, a typical computer system includes a microprocessor (22) having, among other things, a CPU (24), a memory controller (26), and an on-chip cache memory (30). The microprocessor (22) is connected to external cache memory (32) and a main memory (34) that both hold data and program instructions to be executed by the microprocessor (22). Internally, the execution of program instructions is carried out by the CPU (24). Data needed by the CPU (24) to carry out an instruction are fetched by the memory controller (26) and loaded into internal registers (28) of the CPU (24). Upon command from the CPU (24) requiring memory data, the fast on-chip cache memory (30) is searched. If the data are not found, then the external cache memory (32) and the slow main memory (34) are searched in turn using the memory controller (26). Finding the data in the cache memory is referred to as a "hit." Not finding the data in the cache memory is referred to as a "miss."
The time between when a CPU requests data and when the data is retrieved and available for use by the CPU is termed the "latency" of the system. If requested data is found in cache memory, i.e., a data hit occurs, the requested data can be accessed at the speed of the cache and the latency of the system is reduced. If, on the other hand, the data is not found in cache, i.e., a data miss occurs, the data must be retrieved from the external cache or the main memory at increased latency.
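The search order and its effect on latency can be illustrated with a minimal sketch. The `Level` class, the `load` function, and the cycle counts below are hypothetical values chosen for illustration only; they are not taken from FIG. 1 or from any particular processor.

```python
from dataclasses import dataclass, field


@dataclass
class Level:
    """One level of the memory hierarchy (hypothetical model)."""
    name: str
    latency_cycles: int  # assumed cost of searching this level
    contents: dict = field(default_factory=dict)


def load(address, levels):
    """Search each level in turn; return (value, level name, total latency)."""
    total = 0
    for level in levels:
        total += level.latency_cycles
        if address in level.contents:  # a "hit" at this level
            return level.contents[address], level.name, total
        # otherwise a "miss": fall through to the next, slower level
    raise KeyError(address)


# Illustrative hierarchy: fast on-chip cache, external cache, slow main memory.
on_chip = Level("on-chip cache", 2, {0x10: "a"})
external = Level("external cache", 20, {0x20: "b"})
main = Level("main memory", 200, {0x10: "a", 0x20: "b", 0x30: "c"})
hierarchy = [on_chip, external, main]
```

Under these assumed numbers, a request that hits in the on-chip cache costs 2 cycles, while a request that misses both caches costs 2 + 20 + 200 = 222 cycles.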
There are suitable devices for protecting memory areas in processors and for caching unmodified data. These devices allow data in the system to be protected from corruption if a data bit changes due to a cosmic event. In such a case, the processor either recovers or declares an unrecoverable event.
In computer systems, information is represented in binary format (1's and 0's). When binary information is passed from one point to another, there is a chance that a mistake can be made, e.g., a 1 interpreted as a 0 or a 0 interpreted as a 1. This can be caused by system events, media defects, electronic noise, component failures, poor connections, deterioration due to age, and other factors. When data is mistakenly interpreted, an error has occurred.
It is normal for a computer system to encounter errors during system requests, e.g., reading and writing data, as part of its regular operation. As computer systems improve, tracks, sectors, and other memory locations are spaced closer together, signals used to prevent interference get weaker, and spin rates produced by a spindle motor get faster. These effects increase the likelihood of errors occurring when system requests are issued. However, having actual errors appear to a user while using a computer system is not desirable. Therefore, computer systems incorporate special techniques that allow them to detect and correct errors.
One type of error detection and correction in computer systems is through the inclusion of redundant information and special hardware or software to use it. Typically, a sector of data on a hard disk contains 512 bytes, or 4,096 bits, of user data. In addition to these bits, an additional number of bits are added to each sector for the implementation of error correcting code or ECC (sometimes also called error correction code or error correcting circuits). These bits do not contain data; rather, they contain information about the data that can be used to correct problems encountered trying to access the real data bits.
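How redundant bits allow a corrupted data bit to be located and corrected can be shown with a small sketch. The example uses a Hamming(7,4) code, which is chosen here only because it is compact; it is not stated to be the code used for disk sectors, which typically rely on stronger codes over much larger blocks. The function names `encode` and `decode` are illustrative.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Bit positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit; return (data bits, error position).

    The error position is 0 when no error is detected, otherwise 1..7.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome names the faulty position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


# Usage: corrupt one bit in transit and recover the original data.
sent = encode([1, 0, 1, 1])
received = list(sent)
received[4] ^= 1  # simulate a single-bit error at position 5
recovered, error_position = decode(received)
```

The three check bits carry no user data; they only describe the data bits, which is what lets the decoder identify which single bit was flipped.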
Self-checking circuitry is also used for error detection. In such a case, the self-checking circuitry monitors the results of system requests to detect correctness cycle by cycle. In this approach, at least one location is designated in hardware that is responsible for comparing executing processes cycle by cycle. In other words, self-checking circuitry not only checks results of system requests from processes, but also compares the processes as they execute.
Another type of error detection in computer systems is the use of multiple processors that operate in a lock-step configuration. Such processors execute identical segments of a program and then compare their results to determine whether or not an error has occurred.
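A lock-step comparison can be modeled in software as a rough sketch. Here each "processor" is a hypothetical Python callable, and the fault is injected artificially by flipping one bit of the result; real lock-step hardware compares outputs in circuitry rather than in code.

```python
def healthy_cpu(segment, value):
    """Hypothetical processor that executes the segment correctly."""
    return segment(value)


def faulty_cpu(segment, value):
    """Hypothetical processor whose result has its lowest bit flipped."""
    return segment(value) ^ 1


def run_lockstep(segment, value, processors):
    """Run the same segment on every processor and compare the results.

    Returns (result of the first processor, True if all results agree).
    """
    results = [cpu(segment, value) for cpu in processors]
    agree = all(r == results[0] for r in results)
    return results[0], agree


def program_segment(x):
    """Illustrative program segment executed identically by each processor."""
    return x * 3 + 1


ok_result, ok_agree = run_lockstep(program_segment, 5, [healthy_cpu, healthy_cpu])
bad_result, bad_agree = run_lockstep(program_segment, 5, [healthy_cpu, faulty_cpu])
```

A disagreement indicates that an error has occurred, but with only two processors it does not by itself indicate which one is at fault.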
Multi-threaded processors exist such that when functions performed by a given thread in a processor come to a halt, e.g., when awaiting data to be returned from main memory after a read operation, the processor can perform other functions on a different thread in the meantime. These processors embody the ability to instantaneously switch execution flow, for example, from a Thread A to a Thread B, when Thread A is blocked from execution. As mentioned above, most often execution is blocked by waiting for an input-output (I/O) operation (typically, a read operation) to complete.
In general, in one aspect, the present invention is a self-checking multi-threaded processor comprising a first thread for generating a first I/O request; a second thread for generating a second I/O request; and a self-checking component for comparing the first I/O request and second I/O request. Processor operation is selectively suspended based on the comparison of the first I/O request and the second I/O request.
In general, in one aspect, the present invention is a method of self-checking multi-threaded processors, comprising processing code on a first thread to generate a first I/O request; processing the code on a second thread to generate a second I/O request; comparing the first I/O request and the second I/O request; and selectively suspending processor operation based on the comparison of the first I/O request and the second I/O request.
In general, in one aspect, the present invention is a self-checking apparatus comprising a first I/O request generating means; a second I/O request generating means; comparison means for comparing the first I/O request and the second I/O request; and processor operation suspension means for selectively suspending processor operation based on the comparison of the first I/O request and the second I/O request.
In general, in one aspect, the present invention is a method of self-checking multi-threaded processors comprising generating an I/O request on a Thread A; generating an I/O request on a Thread B; comparing the Thread A I/O request and the Thread B I/O request; suspending processor operation if the Thread A I/O request and the Thread B I/O request are different; and alternating between generating the I/O request on Thread A first and generating the I/O request on Thread B first when the processor operation is not suspended.
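The compare, suspend, and alternate steps of this aspect might be sketched as follows. This is a software model under stated assumptions, not the claimed hardware: each thread's I/O requests are represented as a pre-computed list of tuples, and the function name `self_check` and the request format are hypothetical.

```python
def self_check(requests_a, requests_b):
    """Compare paired I/O requests from Thread A and Thread B.

    Returns (confirmed requests, which thread went first per step,
    True if processor operation was suspended).
    """
    confirmed = []
    order = []
    a_first = True
    for req_a, req_b in zip(requests_a, requests_b):
        order.append("A" if a_first else "B")
        first, second = (req_a, req_b) if a_first else (req_b, req_a)
        if first != second:
            # Requests differ: suspend processor operation.
            return confirmed, order, True
        confirmed.append(first)
        # Not suspended: alternate which thread generates its request first.
        a_first = not a_first
    return confirmed, order, False


# Hypothetical request streams: (operation, address[, value]).
stream = [("read", 0x100), ("write", 0x104, 7), ("read", 0x108)]
corrupted = [("read", 0x100), ("write", 0x104, 6), ("read", 0x108)]

clean_run = self_check(stream, stream)
faulty_run = self_check(stream, corrupted)
```

In the clean run every request is confirmed and the leading thread alternates A, B, A; in the faulty run the mismatch at the second request suspends operation before that request is confirmed.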
In general, in one aspect, the present invention is a method of self-checking multi-threaded processors comprising executing code on a Thread A until an I/O request generating event occurs; issuing the I/O request if the I/O request is a read operation; executing the code on a Thread B until an I/O request generating event occurs; comparing the I/O request generated by Thread B and the I/O request generated by Thread A; and suspending processor operation if the I/O request generated by Thread B and the I/O request generated by Thread A are different.
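One way to picture this aspect is the sketch below, in which a read generated by Thread A is issued immediately and the comparison with Thread B happens afterward. The `Suspended` exception, the `step` function, the `bus` list, and the request tuples are hypothetical. Holding non-read requests until after the comparison is an assumption made for this sketch; the text above states only that the request is issued if it is a read operation.

```python
class Suspended(Exception):
    """Raised when the two threads generate different I/O requests."""


def step(req_a, req_b, bus):
    """Process one I/O request from Thread A and its counterpart from Thread B.

    `bus` is a list recording the requests actually issued.
    """
    is_read = req_a[0] == "read"
    if is_read:
        # Reads are issued as soon as Thread A generates them.
        bus.append(req_a)
    # Thread B has now run to its own I/O request; compare the two.
    if req_a != req_b:
        raise Suspended(f"mismatch: {req_a!r} vs {req_b!r}")
    if not is_read:
        # Assumption for this sketch: non-read requests wait for confirmation.
        bus.append(req_a)


bus = []
step(("read", 0x40), ("read", 0x40), bus)          # read issued, then confirmed
step(("write", 0x44, 9), ("write", 0x44, 9), bus)  # write issued after match
try:
    step(("write", 0x48, 1), ("write", 0x48, 2), bus)
    suspended = False
except Suspended:
    suspended = True
```

Issuing the read before Thread B catches up lets the memory access overlap with Thread B's execution, which is the latency-hiding behavior described earlier for multi-threaded processors.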
In general, in one aspect, the present invention is a method of self-checking multi-threaded processors comprising generating an I/O request on a Thread A; generating an I/O request on a Thread B; using the Thread B I/O request to confirm the Thread A I/O request; and suspending processor operation if the Thread A I/O request and the Thread B I/O request are different. Other aspects and advantages of the invention will be apparent from the following description and the appended claims.