The present invention relates to a method of detecting a failure of a computer system, and to a main memory controller of computer systems. In particular, this invention relates to a technology that is effectively applied to an error detection and correction method, a main memory controller for computer systems, and a computer system preferably used to avoid a system failure derived from occurrence of an error and to specify an error source.
As a method of avoiding a system failure when an uncorrectable error is detected in data to be written in a main memory over a CPU bus or an I/O bus, Japanese Patent Laid-open No. 6-89196, for example, has disclosed the approach described below. Namely, when a main memory controller detects an uncorrectable error in data transferred over a CPU bus or an I/O bus, the received data is rewritten into data having a specific pattern. The check bits produced from the specific pattern data are all inverted. Consequently, data that is encoded according to a specific error correcting code and has all of its check bits inverted is written in the main memory. When the data is read from the main memory, if the calculated syndrome exhibits an all-1 bit pattern and the data has the specific pattern, the received data is judged to be data struck with an uncorrectable error over the CPU bus or I/O bus. Consequently, fault information can be recorded without increasing the number of interface signals that provide an interface with the main memory and are needed to store fault information, and without increasing the storage capacity of the main memory.
Moreover, even in a case where no fault recovery means for retrying an instruction transferred over a CPU bus or an I/O bus is included, re-booting is not performed when a fault is detected. An interrupt is issued to the CPU to report that a fault has been detected only when the CPU attempts to read the above-mentioned data. Even when fault-stricken data is written in the main memory, as long as the CPU does not attempt to read the data, the fault in the data does not cause a system failure (a system halt, re-booting, or any other failure directly recognized by a user). This contributes to improved system availability.
The present inventor has studied the aforesaid method of constructing the code proposed in the prior art. As a result, the three drawbacks described below have become apparent: (1) when check bits are inverted, the syndrome calculated for received data exhibits an all-1 bit pattern, so this pattern must correspond to a multi-bit error whose occurrence frequency is low; (2) if another one-bit error occurs in the main memory, the data encoded according to the specific error correcting code may be wrongly corrected; (3) since the data is rewritten to have a specific pattern, the original pattern of the data cannot be referenced. These drawbacks will be further described by way of examples.
To begin with, the drawbacks (1) and (2) will be described after some introductory remarks. For brevity, a single-bit error correcting/double-bit error detecting code (SEC-DED code) will be taken as an example. The SEC-DED code is defined, as shown in FIG. 16, such that the code length is eight bits and the check bit length is four bits. For a description of the code, refer to "Error-Control Coding for Computer Systems" (p. 140) by T. R. N. Rao and E. Fujiwara.
FIG. 16 shows an example of a parity-check matrix H (hereinafter, matrix H) and an example of the arrangement of information bits and check bits. The column vectors of the matrix H are referred to as h0, h1, . . . , h7. FIG. 17 illustrates an example of the drawback (1). Assuming that a two-bit error occurs at the bit positions d0 and c3 shown in FIG. 16, an all-1 bit pattern is produced as the syndrome S. Thus, depending on the way a code is constructed, an all-1 syndrome does not necessarily correspond to a multi-bit error whose occurrence frequency is low; it may also be produced by an ordinary two-bit error.
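The collision described for FIG. 17 can be sketched as follows. FIG. 16 is not reproduced in this text, so the matrix below is an assumption: h0 = 1110 and an identity check part are chosen to be consistent with the two examples given in the text, while h1 to h3 are hypothetical odd-weight columns.

```python
# Columns of an assumed (8, 4) SEC-DED parity-check matrix, one 4-bit integer
# per column (c0 is the most significant bit). Only h0 = 1110 is implied by
# the text; h1..h3 are hypothetical.
H = [0b1110, 0b1101, 0b1011, 0b0111,   # d0..d3
     0b1000, 0b0100, 0b0010, 0b0001]   # c0..c3

def syndrome(word):
    """XOR of the columns of H selected by the 1 bits of word."""
    s = 0
    for bit, col in zip(word, H):
        if bit:
            s ^= col
    return s

two_bit_error = [1, 0, 0, 0, 0, 0, 0, 1]    # errors at d0 and c3
inverted_checks = [0, 0, 0, 0, 1, 1, 1, 1]  # code word 00000000, check bits inverted

s_error = syndrome(two_bit_error)
s_marker = syndrome(inverted_checks)
```

Under these assumptions both patterns yield the same all-1 syndrome, so a decoder cannot tell the conventional fault marker apart from a plain two-bit error.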
Referring to FIG. 18, the drawback (2) will be described. As shown in (1) of FIG. 18, the encoded word is [00000000]; when its check bits are all inverted according to the conventional method, data d=[00001111] is produced. Thereafter, a one-bit error occurs as shown in (2) of FIG. 18, and the data struck with the error is [00001110]. A syndrome for the data is calculated, as shown in (3) of FIG. 18, using the matrix H shown in FIG. 16. The syndrome coincides with the column vector h0 of the matrix H shown in FIG. 16. Consequently, it is judged that a one-bit error has occurred at the bit position d0, and the data is wrongly corrected into [10001110].
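The miscorrection of FIG. 18 can be reproduced with a minimal single-error-correcting decoder. As before, the matrix is an assumption consistent with the text (h0 = 1110, identity check part); h1 to h3 and the function names are hypothetical.

```python
# Assumed (8, 4) SEC-DED matrix; only h0 = 1110 is implied by the text.
H = [0b1110, 0b1101, 0b1011, 0b0111,   # d0..d3
     0b1000, 0b0100, 0b0010, 0b0001]   # c0..c3

def syndrome(word):
    """XOR of the columns of H selected by the 1 bits of word."""
    s = 0
    for bit, col in zip(word, H):
        if bit:
            s ^= col
    return s

def decode(word):
    """Return (word after correction, status) for a SEC-DED decoder."""
    s = syndrome(word)
    if s == 0:
        return list(word), "no error"
    if s in H:                      # syndrome equals one column: single-bit error
        fixed = list(word)
        fixed[H.index(s)] ^= 1
        return fixed, "corrected"
    return list(word), "uncorrectable"

stored = [0, 0, 0, 0, 1, 1, 1, 1]   # 00000000 with all check bits inverted
stored[7] ^= 1                      # one-bit memory error at c3 -> 00001110
result, status = decode(stored)
```

Here the syndrome equals h0, so the decoder flips d0 and reports a successful correction, which is the wrong correction described above.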
When the conventional method is applied to an error control code generally implemented in computer systems, the drawbacks (1) and (2) may arise. Therefore, the conventional method cannot be applied to all error control codes but only to an error control code that employs a matrix H having a specific bit pattern. However, the related art does not state to what kind of code the method can be applied.
Next, the drawback (3) will be described below. Inspection of several patterns of data in which an error was detected may reveal, for example, that a specific bit is struck with a stuck-at-zero error. If the patterns of such fault-stricken data are kept, they may help to analyze the cause of the error. Conversely, if the patterns of fault-stricken data are discarded, it takes much time to analyze the cause of the error, which increases the Mean-Time-To-Repair (MTTR).
An object of the present invention is to provide an error detection and correction method capable of encoding data so as to keep, as fault information, a detected result of an uncorrectable error in input data without changing the number of bits constituting the encoded word, and of storing the resultant data in a main memory. Moreover, this method can avoid a situation in which the data is wrongly corrected in decoding the encoded data because of a failure to reproduce the fault information.
Another object of the present invention is to provide an error detection and correction method that does not discard the pattern of fault-stricken data and does not hinder analysis of the cause of an error.
Still another object of the present invention is to provide an error detection and correction method making it possible to accomplish the above objects without greatly modifying a known encoding circuit or decoding circuit.
These and other objects of the present invention and novel features thereof will be apparent from the description of this specification and the appended drawings.
The representative aspects of the present invention disclosed in this specification will be briefly described below.
To begin with, the gist of the present invention will be described using the error control code described in conjunction with FIG. 16. The SEC-DED code shown in FIG. 16 is defined such that the code length is eight bits and the information bit length is four bits. This SEC-DED code may be referred to as an (8, 4) SEC-DED code. The maximum code length of a SEC-DED code in which the number of check bits is four is known to be eight bits, as described on page 139 of the above-mentioned literature. When the number of information bits that must be protected by an error control code is two, the column vectors associated with the bit positions not allocated to information bits are deleted from the matrix H as shown in FIG. 1, one column for each unused information bit. A SEC-DED code employing the resultant matrix is therefore a (6, 2) SEC-DED code. When data is encoded with only the bit positions of the necessary information bits associated with column vectors of the matrix H, the resulting code is referred to as a shortened code. The underlying idea of the present invention is that fault information is allocated to the bit positions associated with the deleted column vectors.
In the example shown in FIG. 2, bits of fault information e0 and e1 are allocated to the bit positions associated with the deleted column vectors. FIG. 3(1) and FIG. 3(2) show how bits are arranged in an encoded word. Normally, when data is encoded without fault information appended thereto, 0s (zeros) are arranged as the fault information as shown in FIG. 3(1). FIG. 4 describes an encoding and decoding procedure. Data 70 to be encoded and fault information 71 indicating whether the data 70 is struck with an error are encoded using the matrix H defined for the (8, 4) SEC-DED code described in conjunction with FIG. 2. In practice, encoding means producing check bits c0, c1, c2, and c3. The fault information is removed from the encoded word 73 thus produced, and only the remaining bits [d0 d1 c0 c1 c2 c3] are transmitted over a communication line or to a memory 75. Data 76 received over the communication line or from the memory 75 is decoded using the matrix H defined for the (8, 4) SEC-DED code described in conjunction with FIG. 2, on the assumption that the fault information represents the fixed bits [0 0]. Data 79 is obtained by the decoding 78 of the received data. Decoding means producing a syndrome from the received encoded word, decoding the syndrome, and inverting the bit at the position at which an error has occurred. What is important herein is that the fault information is not transmitted over the communication line or to the memory. Namely, data received from a transmitting side is always decoded at a receiving side on the assumption that the data 70 has not been struck with an error.
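The procedure of FIG. 4 can be sketched as follows. The bit layout [e0 e1 d0 d1 c0 c1 c2 c3] and the four information columns are assumptions (FIG. 2 is not reproduced here), and the function names are hypothetical.

```python
# Assumed columns of the (8, 4) matrix, c0 as the most significant bit.
H_INFO = [0b1110, 0b1101, 0b1011, 0b0111]    # e0, e1, d0, d1 (hypothetical)
H_CHECK = [0b1000, 0b0100, 0b0010, 0b0001]   # c0..c3 (identity part)

def check_bits(info):
    """Check bits [c0, c1, c2, c3] for the positions [e0, e1, d0, d1]."""
    s = 0
    for bit, col in zip(info, H_INFO):
        if bit:
            s ^= col
    return [(s >> shift) & 1 for shift in (3, 2, 1, 0)]

def encode(data, fault=(0, 0)):
    """Encode with the full matrix, then drop the fault bits before sending."""
    return list(data) + check_bits(list(fault) + list(data))

def syndrome(received):
    """Syndrome of a 6-bit word, with the untransmitted fault bits taken as 0 0."""
    word = [0, 0] + list(received)
    s = 0
    for bit, col in zip(word, H_INFO + H_CHECK):
        if bit:
            s ^= col
    return s

sent = encode([1, 0])          # [d0 d1 c0 c1 c2 c3]
s_clean = syndrome(sent)
```

With no fault information set, the six transmitted bits form a valid word of the shortened code and the receiver computes a zero syndrome.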
When fault information is appended to data, the fault information is, as shown in FIG. 3(2), allocated to bit position e0 or e1. As shown in FIG. 5, data produced by giving fault information 80 to the bit positions e0 and e1 is encoded (72). Received data is decoded on the assumption that the fault information represents 00. Therefore, the leading bit is judged to be in error. Consequently, data 81, in which the fault information is reproduced, is obtained by the decoding 78. However, as long as the code employed in this example is adopted, the bits at the positions e0 and e1 can never both be 1. This is because the error correcting capacity of the employed code is one bit, so that at most one bit of information can be reproduced. By adopting a code having a more powerful error-correcting capacity than the SEC-DED code, the amount of fault information that can be transferred at one time can be increased. When a shortened code is employed, as mentioned above, and the fault information is appended to the shortened code, the aforesaid drawback (1) can be solved. In this case, the data pattern that existed before encoding is kept as it is. Normally, the information bit lengths employed are 32 bits, 64 bits, or 128 bits, and the numbers of check bits required by the SEC-DED code for these lengths are 7 bits, 8 bits, or 9 bits, respectively. The maximum lengths of the information bits permitted by the SEC-DED code for these numbers of check bits are 57 bits, 120 bits, or 247 bits, respectively. In every case a shortened code is employed, so the method in accordance with the present invention can normally be used.
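A minimal sketch of FIG. 5 follows: the receiver assumes fault bits 00, so a set fault bit shows up as a correctable single-bit error and is restored by ordinary decoding. The matrix columns, bit layout, and function names are assumptions, not taken from the figures.

```python
# Assumed layout [e0 e1 d0 d1 c0 c1 c2 c3]; the first four columns are hypothetical.
H = [0b1110, 0b1101, 0b1011, 0b0111, 0b1000, 0b0100, 0b0010, 0b0001]

def syndrome(word):
    """XOR of the columns of H selected by the 1 bits of word."""
    s = 0
    for bit, col in zip(word, H):
        if bit:
            s ^= col
    return s

def encode(data, fault):
    """Return the six transmitted bits [d0 d1 c0 c1 c2 c3]; fault bits are dropped."""
    s = syndrome(list(fault) + list(data) + [0, 0, 0, 0])
    return list(data) + [(s >> shift) & 1 for shift in (3, 2, 1, 0)]

def decode(received):
    """Return (fault bits, data bits, status), assuming fault bits 0 0 on input."""
    word = [0, 0] + list(received)
    s = syndrome(word)
    if s == 0:
        status = "no error"
    elif s in H:
        word[H.index(s)] ^= 1       # single-bit correction restores a fault bit
        status = "corrected"
    else:
        status = "uncorrectable"
    return word[:2], word[2:4], status

one_fault = decode(encode([1, 0], [1, 0]))
two_faults = decode(encode([1, 0], [1, 1]))
```

With this code, a single fault bit is reproduced together with the unchanged data, whereas setting both e0 and e1 exceeds the one-bit correcting capacity and is only detected.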
Further, FIG. 6 shows that the drawback (2) can be solved. Referring to FIG. 6, if data 70 is encoded (72) together with fault information 80, and an error occurs in part 90 of the data transferred over a communication line or to a memory 75, then the received data 91 to be decoded is in error at two bit positions 92 and 93. This corresponds to a two-bit error, which the code can detect, so the data is not wrongly corrected.
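The argument of FIG. 6 can be checked exhaustively for a small code. Because the code is linear, the syndrome at the receiver is the XOR of the column for the set fault bit and the column of the bit hit in memory. The matrix below is an assumed odd-weight-column (8, 4) SEC-DED matrix; only its first column is implied by the text.

```python
# Assumed layout [e0 e1 d0 d1 c0 c1 c2 c3]; columns other than the first are hypothetical.
H = [0b1110, 0b1101, 0b1011, 0b0111, 0b1000, 0b0100, 0b0010, 0b0001]

fault_column = H[0]   # fault bit e0 was set before encoding

# One additional error at any of the six stored positions (d0, d1, c0..c3).
double_error_syndromes = [fault_column ^ H[j] for j in range(2, 8)]

# Detected means: syndrome is non-zero and matches no single column of H,
# so the decoder reports an uncorrectable error instead of flipping a bit.
detected = all(s != 0 and s not in H for s in double_error_syndromes)
```

For this matrix every such combination is flagged as a two-bit error rather than being miscorrected.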
A generalized example of the above case will be described in conjunction with FIG. 7A and FIG. 7B. FIG. 7A and FIG. 7B show a matrix He having a size ((m−k)×r), a matrix Hd having a size (k×r), and a unit matrix Ir having a size (r×r). (He, Hd) is associated with information bit positions, while the matrix Ir is associated with check bit positions. The matrix H is assumed to define a code having a maximum code length of m+r bits, comprising a maximum information bit length of m bits and a check bit length of r bits, and capable of t-bit (t > 0) error correction and u-bit (u > t) error detection. Incidentally, in claims 1 to 8, t is replaced with c and u is replaced with d. Herein, if the information bit length employed is k (m > k > 0) bits, fault information of at most min(t, m−k) bits can be encoded. Here, min(a, b)=a when a ≤ b, and min(a, b)=b when a ≥ b. Furthermore, even if data having the fault information appended thereto is struck with an error of u−min(t, m−k) bits over a communication line or in a memory, the fault-stricken data is not wrongly corrected. FIG. 7B shows the data positions in an encoded word that uses the matrix H shown in FIG. 7A. Fault information is allocated to the m−k bit positions e0, e1, . . . , e(m−k−1). Information bits are allocated to the k bit positions d0, d1, . . . , d(k−1). Check bits are allocated to the r bit positions c0, c1, . . . , c(r−1). However, the allocation of bits can be modified by switching bit positions to such an extent that the nature of the code is not altered. The present invention provides an error detecting and correcting means characterized by the coding system described in conjunction with FIG. 7A and FIG. 7B.
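The two bounds stated above can be evaluated with a small helper. The function name and the argument order are hypothetical; the formulas are the ones given in the text.

```python
def fault_capacity(t, u, m, k):
    """Return (max fault-information bits, additional error bits still not miscorrected).

    t: correctable bits, u: detectable bits, m: maximum information length,
    k: information length actually used.
    """
    if not (t > 0 and u > t and m > k > 0):
        raise ValueError("requires t > 0, u > t and m > k > 0")
    fault_bits = min(t, m - k)
    tolerated_errors = u - fault_bits
    return fault_bits, tolerated_errors

small = fault_capacity(t=1, u=2, m=4, k=2)      # the shortened (8, 4) example
wide = fault_capacity(t=1, u=2, m=120, k=64)    # 64 data bits with 8 check bits
```

For a SEC-DED code (t = 1, u = 2) both cases give one bit of fault information and tolerance of one further error, matching the examples of FIG. 5 and FIG. 6.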
Moreover, there is provided an error detecting and correcting means that makes the most of the fact that the sum of encoded words is an encoded word. Namely, when data to be transferred over a communication line or to a memory is encoded according to an error control code, data to be encoded and used for arithmetic operations by a CPU is not integrated with fault information of the data in order to produce check bits. Instead, an existing data encoding means is separated from a fault information encoding means. Check bits produced for the data and check bits produced for the fault information are linearly added to each other and then encoded.
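The linearity that this arrangement relies on can be illustrated as follows: check bits computed separately for the data and for the fault information, then added bit by bit modulo 2 (XOR), equal the check bits computed for both together. The columns and the layout [e0 e1 d0 d1] are assumptions used only for illustration.

```python
# Assumed information columns for positions e0, e1, d0, d1 (hypothetical values).
H_INFO = [0b1110, 0b1101, 0b1011, 0b0111]

def check_bits(info):
    """Check bits as a 4-bit integer for the positions [e0, e1, d0, d1]."""
    s = 0
    for bit, col in zip(info, H_INFO):
        if bit:
            s ^= col
    return s

data_checks = check_bits([0, 0, 1, 0])    # existing data encoder, fault bits zero
fault_checks = check_bits([1, 0, 0, 0])   # separate fault-information encoder
combined = data_checks ^ fault_checks     # linear addition of the two results
joint = check_bits([1, 0, 1, 0])          # data and fault encoded together
```

Since both routes give the same check bits, an existing data encoder can be kept unchanged and only an XOR stage with the fault-information check bits needs to be added.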
Moreover, the objects of the present invention can be achieved by using a computer system described below. The system consists mainly of processors, a main memory, a main memory controller, an I/O unit, a processor bus, a memory bus, and an I/O bus. The main memory stores data encoded according to an error control code. A plurality of processors is connected to the main memory controller over the processor bus. The main memory is connected to the main memory controller over the memory bus. The I/O unit is connected to the main memory controller over the I/O bus. When the main memory controller is connected to another main memory controller, the main memory controller included in the computer system is provided with a crossbar switch input/output control unit used to connect the main memory controller to a crossbar switch. In this case, the main memory controller includes a circuit for detecting an error in data on the processor bus, a circuit for detecting an error in data on the I/O bus, a circuit for detecting an error in data transferred from the crossbar switch and written into the main memory, an encoder circuit, a decoding circuit, and a fault information detection table. The encoder circuit produces check bits from information indicating that an uncorrectable error has been detected in data transferred over the processor bus and written in the main memory, or information indicating that an uncorrectable error has been detected in data transferred over the I/O bus and written in the main memory. The decoding circuit produces a syndrome for data read from the main memory, and detects and corrects an error according to the bit pattern of the syndrome. The fault information detection table is used to detect whether the bit pattern of the syndrome produced by the decoding circuit is a specified pattern, in order to identify the source of an uncorrectable error that occurred before the encoding performed by the encoder circuit.
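The role of the fault information detection table can be sketched as a lookup from syndrome pattern to error source. The assignment of e0 to the processor bus and e1 to the I/O bus, the column values, and the function name are all hypothetical; the text does not specify the table contents.

```python
# Hypothetical table: syndrome pattern -> source of the error detected before encoding.
FAULT_TABLE = {
    0b1110: "processor bus",   # column assumed for fault bit e0
    0b1101: "I/O bus",         # column assumed for fault bit e1
}

def identify_source(syndrome):
    """Return the recorded error source, or None if the syndrome is not a fault pattern."""
    return FAULT_TABLE.get(syndrome)

from_cpu = identify_source(0b1110)
from_io = identify_source(0b1101)
ordinary = identify_source(0b1000)   # single-bit error at c0, not fault information
```

A syndrome that matches a table entry is treated as reproduced fault information, while any other non-zero syndrome is handled by the normal correction or detection path.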