1. Field of the Invention.
The invention relates to apparatus for detecting and correcting errors in data stored in a data processor.
2. Description of the Related Art.
A typical data processor includes a central processing unit (CPU) which executes instructions which either process data or cause the transfer of data among different functional units of the data processor. A main storage unit having a relatively large storage capacity ordinarily stores programs and data used by the programs. Data to be processed by the central processing unit ordinarily is transferred from the main storage unit to an intermediate storage unit, having cache memory, before processing actually begins. The cache memory interfaces directly with the central processing unit. Usually, the cache has a relatively low storage capacity but operates at relatively high speed to provide the data to the central processing unit for use during execution of a corresponding program.
Frequently, a variety of processes share the use of the CPU. Moreover, often the CPU interrupts the execution of a program corresponding to a first active process in order to execute a program corresponding to a second process which takes precedence over the first active process. In order to execute the program corresponding to the second process, however, it ordinarily is necessary for the data corresponding to the second process to be moved into the cache. Consequently, often it is necessary to move out the data corresponding to the first process from the cache in order to make room for the data corresponding to the second process. Typically, the data corresponding to the first process is moved into the main storage unit for storage during execution by the CPU of the program corresponding to the second process.
Subsequently, after the CPU has executed the program corresponding to the second process, the data corresponding to the first process can once again be moved into the cache. Moreover, a typical data processor ordinarily includes a data storage control system which controls the transfer of data between the cache and the main storage unit such that data associated with programs actively being executed by the central processing unit can be moved into the cache, and data associated with processes to be executed later by the CPU can be moved out of the cache and into the main storage unit.
One problem associated with the storage of data by a data processor in general, and associated with the transfer of data between the cache and the main storage unit in particular, stems from the occurrence of errors in the stored data. Errors are manifested, for example, as unwanted changes in the binary state of bits within a byte or line of data. Errors can occur in a variety of locations such as in the cache, in the main storage unit or in the course of transferring the data between the cache and the main storage unit. Since data errors detrimentally affect the performance of the data processor, the data storage control system ordinarily includes components directed to detecting and reporting such errors.
For example, in the past, error checking and correcting (ECC) codes frequently were used to detect errors occurring in data stored in the cache and to correct certain of the errors. More specifically, an ECC code was generated by apparatus in the intermediate storage unit each time data was moved into the cache. The ECC code, for example, could comprise a set of single bit binary signals, each of which represented a parity bit covering a particular set of data bits, each respective data bit being covered by more than one ECC code bit. The ECC code was stored in the cache in conjunction with the corresponding data. Subsequently, when the data was moved out of the cache, apparatus in the intermediate storage unit used the ECC code to detect the occurrence of errors in the data and to correct certain of those errors.
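The structure of such a code, in which each ECC bit is a parity bit over a particular set of data bits and each data bit is covered by more than one ECC bit, can be sketched as follows. This is a minimal illustration using a Hamming(7,4) arrangement, which is an assumption made here for brevity; the text does not say which code the intermediate storage unit used. The names `COVER`, `ecc_generate` and `ecc_check_and_correct` are hypothetical.

```python
# Illustrative only: a Hamming(7,4)-style code over 4 data bits.
# COVER[i] lists the data-bit positions covered by check bit i.
# Every data bit appears in at least two of the sets (assumed layout).
COVER = [
    (0, 1, 3),  # check bit c0
    (0, 2, 3),  # check bit c1
    (1, 2, 3),  # check bit c2
]


def ecc_generate(data):
    """Return the 3 check bits (parities) for a list of 4 data bits."""
    return [sum(data[j] for j in cover) % 2 for cover in COVER]


def ecc_check_and_correct(data, check):
    """Compare stored check bits with recomputed ones and fix one bad bit.

    Returns (data_after_correction, status).
    """
    syndrome = tuple(a ^ b for a, b in zip(ecc_generate(data), check))
    if not any(syndrome):
        return list(data), "no error"
    # A single flipped data bit disturbs exactly the check bits covering it.
    for j in range(len(data)):
        pattern = tuple(1 if j in cover else 0 for cover in COVER)
        if pattern == syndrome:
            fixed = list(data)
            fixed[j] ^= 1
            return fixed, "data bit corrected"
    # Only one check bit disagrees: the stored check bit itself was flipped.
    return list(data), "check bit error"


if __name__ == "__main__":
    stored = [1, 0, 1, 1]
    check = ecc_generate(stored)      # generated on move-in
    corrupted = [1, 0, 0, 1]          # data bit 2 flipped while stored
    print(ecc_check_and_correct(corrupted, check))  # checked on move-out
```

A code of this size corrects a single flipped bit; two flipped bits would be miscorrected, which is why practical codes add further check bits to detect double-bit errors.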
Since the intermediate storage unit and the main storage unit often were physically spaced apart within the data processor by a relatively significant distance, errors often could occur in the course of the transfer. Consequently, the data typically was covered by parity during the transfer in order to detect occurrences of errors in the course of the transfer.
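Parity coverage of a transfer can be sketched in a few lines. The example assumes one even-parity bit per byte, which is a common arrangement but is not specified in the text; the function names are hypothetical.

```python
def parity_bit(byte):
    """Even-parity bit: 1 when the byte holds an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def attach_parity(data):
    """Sender side: pair every byte with its parity bit."""
    return [(b, parity_bit(b)) for b in data]


def find_parity_errors(words):
    """Receiver side: indices of bytes whose parity no longer matches."""
    return [i for i, (b, p) in enumerate(words) if parity_bit(b) != p]


if __name__ == "__main__":
    words = attach_parity([0x0F, 0xA5])
    words[1] = (words[1][0] ^ 0x10, words[1][1])  # one bit flips in transit
    print(find_parity_errors(words))
```

Parity of this kind detects any odd number of flipped bits in a byte but cannot locate or correct them, and an even number of flips in the same byte passes unnoticed.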
For example, commonly assigned patent application Ser. No. 06/527,672 filed Aug. 30, 1983, now U.S. Pat. No. 4,625,273, issued 11-25-86, entitled, APPARATUS FOR FAST DATA STORAGE WITH DEFERRED ERROR REPORTING and commonly assigned continuation application Ser. No. 790,269 filed Oct. 22, 1985, now abandoned, entitled, APPARATUS FOR STORING DATA WITH DEFERRED UNCORRECTABLE ERROR REPORTING which is a continuation of commonly assigned application Ser. No. 527,621, filed Aug. 29, 1983, now U.S. Pat. No. 4,546,329, issued 10-8-85, generally pertain to the reporting of errors present or occurring in the course of the move-out of data signals from a cache to a main storage unit.
The move-in process for retrieving a requested operand from a main storage unit is a comparatively time-consuming operation in a computer. In high speed pipelined machines, the instruction and operand processing unit pipeline may interlock while awaiting the supply of a requested operand. If the operand resides in a line missing from the cache, the lengthy move-in process will result in an undesirably long interlock of the instruction and processing unit pipeline. Thus it is desirable to reduce the time required, called the cache miss penalty, for supplying a requested operand to the instruction and operand processing complex from a line missing state.
One example of a prior art solution to the problem of reducing a wait for a requested operand in a line missing state is described in U.S. patent application Ser. No. 527,673 filed Aug. 30, 1983, now U.S. Pat. No. 4,742,454 issued 5-3-88, entitled APPARATUS FOR BUFFER CONTROL BYPASS. In the APPARATUS FOR BUFFER CONTROL BYPASS application, the control of the buffer is modified in the line missing state so that a quicker transfer of a requested operand can occur. This buffer control bypass is possible because data in the cache is stored in units known as lines, while operands requested by the instruction and operand processing unit are typically less than an entire line of data. Further, when a line of data is being moved into the cache, it comes in a plurality of segments or flows, such as quarterlines. It was found in the buffer control bypass system that a line being moved in can be aligned to provide the requested operand from the move-in register to the cache first. Buffer control can then be bypassed to allow a read of the requested operand from the data location in the cache before the balance of the line is written to the cache. This was found to result in a significant improvement in system performance by reducing the waits caused by a line missing state.
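The idea of aligning a move-in so that the requested operand arrives first can be illustrated with a small sketch. It assumes four quarterlines per line and a wrap-around delivery order starting at the quarterline that holds the operand; the wrap-around order and the name `move_in_order` are assumptions made for illustration and are not taken from the referenced application.

```python
def move_in_order(requested_quarter, quarters_per_line=4):
    """Order in which quarterlines are delivered (assumed wrap-around).

    The quarterline holding the requested operand is sent first, so it
    can be read before the balance of the line has been written.
    """
    return [(requested_quarter + i) % quarters_per_line
            for i in range(quarters_per_line)]


if __name__ == "__main__":
    # Operand lies in quarterline 2 of the missing line.
    print(move_in_order(2))
```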
Another example of a prior art approach to reducing the time overhead of a line missing state has been to provide a data path in the intermediate storage unit directly from the output of the error checking and correcting logic in that intermediate storage unit, through a bypass data register for holding the requested operand, latched in parallel with the move-in register. In addition, complicated bypass match logic was required to indicate when the data in the bypass data register was the requested operand. By effectively moving the bypass path back to the move-in register in the intermediate storage unit prior to the cache, a significant saving was achieved over the buffer control bypass scheme; however, a significant penalty in logic complexity was paid.
Even when the delay in the move-in data path in the intermediate storage unit is minimized, the delay in reading data from the main storage array and providing it to the intermediate storage unit can still slow the operation of the data processing system. The error checking and correcting logic in the main storage unit (MSU), as distinguished from the error checking and correcting logic in the intermediate storage unit, is a significant component of this delay.
Broadly stated, ECC codes provide a method of adding redundancy to data. Techniques have been devised which generate a minimal number of ECC bits to be stored in association with a given block of data, called a checking block, which bits may be subsequently analyzed to determine whether an error has occurred, and if so, how it can be corrected. One tradeoff available to the designer with these techniques is that the greater the number of ECC bits used with a given checking block size, the greater the number of errors within a checking block which may be detected and/or corrected. For example, a 5-bit ECC may permit detection of double-bit errors and correction of single-bit errors, but a larger ECC, possibly made up of seven bits, may permit detection of triple-bit errors and correction of double-bit errors.
Typically, the entire width of a data path or storage array is used as the checking block. The mathematics of ECC codes teaches that the ratio of ECC bits to data bits decreases as the size of the checking block over which the ECC code is generated increases, for a fixed n-bit error detection and m-bit error correction capability. As data paths and storage arrays for high speed computers have become wider, therefore, the trend has been to generate and check ECC codes over these larger blocks. Designers have taken advantage of the savings in the total number of ECC bits either directly, by minimizing the total number of bits (data plus ECC) which need to be stored in a storage array or sent along a data path, or indirectly, by using the saved bits to enlarge the ECC code and thereby improve the error detection and/or correction capability as described above. Both alternatives, however, suffer from the fact that additional levels of logic are required as the size of the checking block increases. Additional levels of logic slow the process of detecting and correcting errors and add undesirable delay in the move-in data path.
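The falling ratio of ECC bits to data bits can be checked numerically. The sketch below assumes a textbook Hamming construction: single-error correction over k data bits needs the smallest r with 2^r >= k + r + 1, and one further overall parity bit adds double-error detection. This is offered only as an illustration of the trend; the codes used in any particular machine may differ, and the function name is hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for single-error-correct, double-error-detect (assumed
    Hamming construction plus one overall parity bit)."""
    r = 1
    while 2 ** r < data_bits + r + 1:  # Hamming bound for correction
        r += 1
    return r + 1                       # extra bit for double-bit detection


if __name__ == "__main__":
    for k in (8, 16, 32, 64, 128):
        c = secded_check_bits(k)
        print(f"{k:4d} data bits: {c} check bits, ratio {c / k:.3f}")
```

Under these assumptions, doubling the checking block adds only one check bit, so the overhead per data bit roughly halves with each doubling.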
This problem is particularly acute in a machine such as the Amdahl 5890 line of computers, in which a line of data is provided from the main store in four sequential flows of 16 bytes (one quarterline) per flow. Error checking and correcting based on a single ECC code over all 16 bytes in a flow would be extremely slow, requiring more time than is available in one clock cycle to complete.
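The growth in logic depth with checking block size can be estimated with a simple model. The sketch assumes each parity is formed by a balanced tree of XOR gates with a fixed fan-in; the fan-in, the gate delays and the number of levels that fit in one clock cycle are technology dependent and are not given in the text, so the figures are indicative only.

```python
def xor_tree_levels(n_inputs, fan_in=2):
    """Levels of XOR gates needed to reduce n_inputs to one parity bit,
    assuming a balanced tree of gates with the given fan-in."""
    levels = 0
    while n_inputs > 1:
        n_inputs = -(-n_inputs // fan_in)  # ceiling division
        levels += 1
    return levels


if __name__ == "__main__":
    # A 16-byte flow is 128 bits; compare with narrower checking blocks.
    for bits in (16, 32, 64, 128):
        print(bits, "inputs:", xor_tree_levels(bits), "levels (fan-in 2)")
```

In this model a parity spanning all 128 bits of a flow needs seven levels of two-input gates, against four levels for a 16-bit block, before any syndrome decoding or correction logic is counted.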
The problem is additionally complicated by the fact that the main storage unit for large mainframe computers is often too large to fit on a single printed circuit backplane. The storage unit may therefore be divided into a plurality of sub-units, for example four, each with its own backplane and each providing a portion of any given flow of data. In order to generate and check ECC codes over an entire flow of data, therefore, long cables may be required to cross-couple the sub-units to pass partial XOR results back and forth. These long cables significantly increase the time required to generate and check ECC codes.
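Why partial XOR results must be exchanged can be seen from a short sketch. It assumes a 128-bit flow split into four equal contiguous portions, one per sub-unit, and a check bit that covers the whole flow; the split and the function names are assumptions made for illustration.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """XOR of all bits in the sequence."""
    return reduce(xor, bits, 0)


def split_across_subunits(flow_bits, n_units=4):
    """Assumed layout: equal contiguous portions, one per sub-unit.
    The flow length must be divisible by n_units."""
    size = len(flow_bits) // n_units
    return [flow_bits[i * size:(i + 1) * size] for i in range(n_units)]


def check_bit_from_partials(flow_bits, n_units=4):
    """Each sub-unit XORs its own portion; the partial results must then
    be brought together (over the cross-coupling cables) and XORed."""
    partials = [parity(p) for p in split_across_subunits(flow_bits, n_units)]
    return parity(partials), partials


if __name__ == "__main__":
    flow = [(i * i + i // 3) % 2 for i in range(128)]  # arbitrary test data
    bit, partials = check_bit_from_partials(flow)
    print(partials, bit, bit == parity(flow))
```

No sub-unit can produce the final check bit from its own portion alone, so every check bit spanning the flow incurs at least one transfer of partial results between backplanes.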