1. Field of the Invention
The present invention relates to a computer system, a bus controller, and a bus fault handling method used in the same computer system and bus controller, and specifically to a PCI (Peripheral Component Interconnect) bus controller controlling a PCI bus in a computer system.
2. Description of the Prior Art
Many types of computer systems have been developed and used in various fields. FIG. 11 shows an example of a minimum configuration of a computer system. This configuration has further been developed into a multiprocessor system consisting of 32 or 64 processors, with more processors or PCI devices connected to a single system, or a multiprocessor system including more than one hundred PCI bus slots.
In general, such large-scale computer systems are often used for applications that require high reliability and fault tolerance (mission-critical applications). The computer systems used for those applications are therefore required to provide high availability, and a technology that minimizes the influence of error propagation when a fault is detected is needed.
FIG. 11 shows an example of a computer system configuration including a conventional PCI bus. In FIG. 11, processor (CPU: central processing unit) 61 is connected to memory controller (MMC: Main Memory Control) 62 via processor bus (FSB: Front Side Bus) 100.
Memory controller 62 is provided with an I/F (interface) with main memory (DIMM: Dual In-line Memory Module) 63 and with I/O (Input/Output) controller (IOC: Input/Output Control) 64, as well as an I/F with processor 61.
Memory controller 62 is a unit consisting of one or more LSIs (large-scale integration) depending on the scale or configuration of a system, and routes transactions received from processor 61 and I/O controller 64. Main memory 63 stores an OS (Operating System) 631 including PCI device driver 632.
I/O controller 64 is provided with an I/F with memory controller 62 and incorporates PCI bus controller (PBC: PCI Bus Control) 65, which controls PCI bus 200 subordinately connected to I/O controller 64. I/O controller 64 is a unit consisting of one or more LSIs depending on the scale or configuration of a system. A plurality of PCI devices (peripheral devices) (not shown) can be connected to PCI bus 200.
The above-mentioned OS 631 and PCI device driver 632 control the input/output of signals or data between application program 71 and PCI device [hardware (HW) 72] as shown in FIG. 12. OS 631 monitors a flow of signals or data between PCI device driver 632 and application program 71 to detect a fault on PCI bus 200.
In the above-mentioned computer system, a direct access from processor 61 to a PCI device or an access originating from a PCI device to main memory 63 occurs. Those accesses will be described with reference to FIG. 11.
First, a direct access from processor 61 to a PCI device will be described for the case of an I/O read [Outbound Read] from processor 61. In the case of an I/O read from processor 61, a read transaction from processor 61 to a PCI device is input via processor bus 100 into memory controller 62 and then into I/O controller 64.
The read transaction that has arrived at I/O controller 64 is converted into a PCI bus transaction at PCI bus controller 65, sent out to PCI bus 200, and arrives at a targeted PCI device. As this transmission of a transaction over PCI bus 200 is in a common PCI cycle (memory cycle, I/O cycle, configuration cycle, etc.) and is generally known, the description of it will be omitted.
Then, a reply or read data from the PCI device returns along the above route in the opposite direction: from PCI bus 200 to PCI bus controller 65, then to memory controller 62, and via processor bus 100 to processor 61, which sent the transaction.
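The Outbound Read path described above can be summarized as a short sketch. The hop labels are illustrative names taken from the reference numerals of FIG. 11, and the function name `reply_route` is hypothetical; the only point shown is that the reply retraces the request path in reverse.

```python
# Hop sequence for an Outbound Read as described for FIG. 11.
# The labels are illustrative; PCI bus controller 65 is incorporated
# in I/O controller 64 but is listed as its own hop for clarity.
REQUEST_ROUTE = [
    "processor 61",
    "processor bus 100",
    "memory controller 62",
    "I/O controller 64",
    "PCI bus controller 65",
    "PCI bus 200",
    "PCI device",
]


def reply_route(request_route):
    """Return the reply path: the request path traversed in reverse."""
    return list(reversed(request_route))


if __name__ == "__main__":
    print(" -> ".join(REQUEST_ROUTE))
    print(" -> ".join(reply_route(REQUEST_ROUTE)))
```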
If a transaction fails to be sent out to PCI bus 200 because of a fault or the like, the system operates as follows. When an error such as an address parity error occurs on PCI bus 200, the PCI device detects the error and drives system error line SERR# (System Error), and PCI bus controller 65, on detecting the error, informs processor 61 of the error by means of an NMI (Non-Maskable Interrupt) signal line.
When an error such as a data parity error occurs on PCI bus 200, PCI bus controller 65 detects the error, drives parity error line PERR# (Parity Error), and returns an error reply instead of a read reply. Unlike a normal reply, an error reply informs processor 61 that a transaction sent from processor 61 did not complete normally.
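The description does not state how a parity error is detected. As background, on a 32-bit PCI bus the PAR signal carries even parity over AD[31:0] and C/BE[3:0]#, so that the total number of 1s across those lines and PAR is even. A minimal sketch of that check follows; the function names `pci_par` and `parity_ok` are hypothetical and the sketch is not taken from the source.

```python
def pci_par(ad: int, cbe: int) -> int:
    """Return the PAR bit that gives even parity over AD[31:0] and C/BE[3:0]#."""
    bits = (ad & 0xFFFFFFFF) | ((cbe & 0xF) << 32)
    # PAR is 1 when the covered lines carry an odd number of 1s,
    # so that the total including PAR becomes even.
    return bin(bits).count("1") & 1


def parity_ok(ad: int, cbe: int, par: int) -> bool:
    """Receiver-side check: does the sampled PAR match the expected value?"""
    return pci_par(ad, cbe) == (par & 1)


if __name__ == "__main__":
    ad, cbe = 0x00000003, 0x6          # four 1s in total
    par = pci_par(ad, cbe)
    print(parity_ok(ad, cbe, par))     # sampled values are consistent
    print(parity_ok(ad ^ 0x1, cbe, par))  # a single flipped bit is caught
```

A receiver that finds `parity_ok` false during an address phase or a data phase corresponds to the address parity error and data parity error cases distinguished in the text.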
Next, an I/O write in a direct access from processor 61 to a PCI device [Outbound Write] will be described. Two types of transactions are defined as a write from processor 61 to a PCI device: A Deferred type write (Deferred Write) where processor 61 waits for a response indicating the completion of writing into a PCI device and a Posted type write (Posted Write) where processor 61 does not wait for a response indicating the completion of writing.
A Deferred Write transaction is routed to PCI bus controller 65 along the same route as the above-mentioned route for an I/O read. The routed write transaction is converted into a PCI bus transaction at PCI bus controller 65, sent out to PCI bus 200, and arrives at a targeted PCI device. As this transmission of a transaction over PCI bus 200 is in a common PCI cycle and is generally known, the description of it will be omitted.
Then, after confirming that all the data has been sent (completion of a PCI cycle), PCI bus controller 65 issues a write reply. The write reply returns along the above route in the opposite direction: via memory controller 62 and processor bus 100 to processor 61, which sent the transaction.
In the case of a Posted Write transaction, processor 61 regards the write operation as complete when the transaction is sent out. The Posted Write is then routed to the targeted PCI device, and the operation as a transaction finishes when all the data has been sent out to PCI bus 200.
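The difference between the two write types lies in the point at which processor 61 regards the write as complete. The following sketch lays out the events paraphrased from the description; the event labels, the keys `"deferred"` and `"posted"`, and the function name `processor_done_index` are illustrative assumptions.

```python
# Event timelines paraphrased from the description; labels are illustrative.
TIMELINES = {
    "deferred": [
        "processor 61 sends the write",
        "PCI bus controller 65 sends the data on PCI bus 200",
        "PCI cycle completes",
        "write reply returns to processor 61",
    ],
    "posted": [
        "processor 61 sends the write",
        "PCI bus controller 65 sends the data on PCI bus 200",
        "all data sent; transaction finishes",
    ],
}


def processor_done_index(write_type):
    """Index of the event at which processor 61 regards the write as complete."""
    timeline = TIMELINES[write_type]
    if write_type == "deferred":
        return len(timeline) - 1  # waits for the write reply
    return 0  # posted: complete as soon as the transaction is sent out


if __name__ == "__main__":
    for kind, timeline in TIMELINES.items():
        print(kind, "->", timeline[processor_done_index(kind)])
```

Because a Posted Write is already complete from the processor's point of view before the data reaches PCI bus 200, a later fault can no longer be reported in a reply to that transaction.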
If a transaction fails to be sent out to PCI bus 200 because of a fault or the like, the system operates as follows. When an error such as an address parity error occurs on PCI bus 200, the PCI device detects the error and drives system error line SERR#, and PCI bus controller 65, on detecting the error, informs processor 61 of the error by means of an NMI signal line.
When an error such as a data parity error is detected on PCI bus 200, the PCI device drives parity error line PERR# (Parity Error), and PCI bus controller 65 detects the error and informs processor 61 of the error. For a data parity error in a Deferred Write, PCI bus controller 65 sends out an error reply instead of a normal reply. For a data parity error in a Posted Write, PCI bus controller 65 informs processor 61 of the error by means of an NMI signal line.
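The way processor 61 is informed of an Outbound Write fault therefore depends on both the error type and the write type. A minimal sketch of that decision follows; the function name `outbound_write_notification` and the string values are hypothetical labels for the cases named in the text.

```python
def outbound_write_notification(error, write_type):
    """Return how processor 61 is informed of an Outbound Write fault.

    error:      "address parity" or "data parity"
    write_type: "deferred" or "posted"
    """
    if write_type not in ("deferred", "posted"):
        raise ValueError(f"unknown write type: {write_type!r}")
    if error == "address parity":
        # SERR# is driven; the processor is informed by NMI in either case.
        return "NMI"
    if error == "data parity":
        # PERR# is driven; a Deferred Write still has a reply outstanding,
        # whereas a Posted Write does not, so NMI is used instead.
        return "error reply" if write_type == "deferred" else "NMI"
    raise ValueError(f"unknown error: {error!r}")


if __name__ == "__main__":
    for err in ("address parity", "data parity"):
        for wt in ("deferred", "posted"):
            print(err, "/", wt, "->", outbound_write_notification(err, wt))
```

Only one of the four combinations yields an error reply; the other three end in an NMI.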
Now, among accesses originating from a PCI device to main memory 63, a memory read from a PCI device subordinate to PCI bus 200 [Inbound Read] will be described. A read transaction sent out to PCI bus 200 is input into PCI bus controller 65 in I/O controller 64. PCI bus controller 65, having received the read transaction, converts the PCI transaction into a transaction to be used in the platform and sends the transaction out to memory controller 62. Memory controller 62 accesses main memory 63 according to the transaction received from I/O controller 64.
Reply data from main memory 63 is sent out from memory controller 62 to I/O controller 64 in the opposite direction to the above route, and sent to PCI bus controller 65 connected with the requesting PCI device. The routed reply data is converted into a PCI bus transaction at PCI bus controller 65, sent out to PCI bus 200, and delivered to the PCI device that issued the read. As this transmission of read data over PCI bus 200 is in a common PCI cycle and is generally known, the description of it will be omitted.
If a transaction fails to be sent out to PCI bus 200 because of a fault or the like, the system operates as follows. When an error such as an address parity error occurs on PCI bus 200, PCI bus controller 65 detects the error and drives system error line SERR#, while informing processor 61 of the error by means of an NMI signal line.
When a data parity error occurs while PCI bus controller 65 is sending read data to PCI bus 200, the PCI device detects the error, drives parity error line PERR#, and informs PCI bus controller 65 of the error. The PCI device that detected the data parity error performs its specific error handling.
Now, a memory write from a PCI device subordinate to PCI bus 200 [Inbound Write] among accesses originating from a PCI device to main memory 63 will be described. A write transaction from a PCI device subordinate to PCI bus 200 is input from PCI bus 200 to PCI bus controller 65 in I/O controller 64.
PCI bus controller 65, having received the write transaction, converts the PCI transaction into a transaction to be used in the platform and sends the transaction to memory controller 62. Memory controller 62 writes to main memory 63 according to the transaction received from I/O controller 64. As this transmission of write data over PCI bus 200 is in a common PCI cycle and is generally known, the description of it will be omitted.
If a transaction fails to be sent out to PCI bus 200 because of a fault or the like, the system operates as follows. When an error such as an address parity error occurs on PCI bus 200, PCI bus controller 65 detects the error and drives system error line SERR#, while informing processor 61 of the error by means of an NMI signal line.
When a data parity error occurs while a PCI device is sending write data to PCI bus 200, PCI bus controller 65 detects the error, drives parity error line PERR# and informs the PCI device of the error. The PCI device that received the information performs its specific error-handling.
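For the two inbound access types, the data parity cases differ only in which side receives the data and therefore detects the error. The sketch below records that relationship as described in the text; the dictionary layout and the function name `inbound_data_parity` are illustrative assumptions.

```python
PBC = "PCI bus controller 65"
DEVICE = "PCI device"

# For each inbound access: who sends the data on PCI bus 200, and who
# detects a data parity error, drives PERR#, and informs the other side.
INBOUND = {
    "Inbound Read": {"sender": PBC, "detector": DEVICE, "informed": PBC},
    "Inbound Write": {"sender": DEVICE, "detector": PBC, "informed": DEVICE},
}


def inbound_data_parity(access):
    """Return (detector, informed party, party performing error handling)."""
    entry = INBOUND[access]
    # In both cases the text leaves the error handling to the PCI device.
    return entry["detector"], entry["informed"], DEVICE


if __name__ == "__main__":
    for access in INBOUND:
        print(access, "->", inbound_data_parity(access))
```

In both cases the receiver of the data drives PERR#, and the resulting error handling is specific to the PCI device rather than to the processor.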
[Patent Document 1] Japanese Patent Laid-Open No. 2001-273200
The above-mentioned conventional computer system asserts system error signal line SERR# when it detects, during an access from a processor to a PCI device or during an access from a PCI device to main memory, an error such as an address parity error for which the faulty transaction on the PCI bus cannot be determined.
When system error signal line SERR# is asserted, the processor (OS) is usually informed of the error occurrence by means of an NMI signal line. Provided that system error signal line SERR# is driven in synchronization with a clock, a plurality of devices can drive it at the same time. The processor can therefore recognize that an error has occurred but cannot determine the source of the error. As a result, the processor cannot perform effective error handling and simply aborts the system (brings the system down) when the NMI occurs in order to prevent error propagation.
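The reason the error source cannot be determined can be illustrated by modeling SERR# as a single line shared by all devices. The sketch below treats the line as a logical OR of the devices' assertions, which is a simplification of the active-low, open-drain signal used on a real PCI bus; the device names and the function name `serr_asserted` are hypothetical.

```python
def serr_asserted(driving):
    """Shared SERR# line: asserted if at least one device drives it.

    driving maps a device name to True when that device drives SERR#.
    The caller of this function sees only the single resulting bit.
    """
    return any(driving.values())


if __name__ == "__main__":
    fault_in_a = {"device A": True, "device B": False}
    fault_in_b = {"device A": False, "device B": True}
    fault_in_both = {"device A": True, "device B": True}

    # All three situations look identical from the processor's side.
    print(serr_asserted(fault_in_a),
          serr_asserted(fault_in_b),
          serr_asserted(fault_in_both))
```

Since the three fault situations produce the same observation, an NMI raised from this line carries no information about which device or transaction failed.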
When a data parity error on a PCI bus is detected and the processor is informed of the error by an error reply, special error handling by an exception handler is required. However, as the specifications and control methods of connected PCI devices vary significantly from device to device, an exception handler cannot complete error handling for all the PCI devices by itself. Therefore, when an exception handler cannot perform effective error handling, it aborts the system (brings the system down) to prevent error propagation.
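The conventional behavior described here can be sketched as an exception handler that knows recovery routines for only some devices. This illustrates the limitation of the prior art, not a proposed design; the function name `handle_error_reply`, the device identifiers, and the routine table are all hypothetical.

```python
def handle_error_reply(device_id, recovery_routines):
    """Conventional handling of an error reply for one PCI device.

    recovery_routines maps a device identifier to a callable that
    returns True when the device-specific recovery succeeded.
    """
    routine = recovery_routines.get(device_id)
    if routine is not None and routine():
        return "recovered"
    # No effective handling is available for this device, so the
    # system is aborted to prevent error propagation.
    return "system down"


if __name__ == "__main__":
    routines = {
        "device A": lambda: True,    # recovery known and successful
        "device B": lambda: False,   # recovery attempted but insufficient
    }
    for dev in ("device A", "device B", "device C"):
        print(dev, "->", handle_error_reply(dev, routines))
```

Any device whose specifications the handler does not cover falls through to the system-down branch, which is the outcome the text identifies as the problem.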
As in the above examples, when a fault whose error source cannot be determined is reported by means of an NMI, or when an exception handler cannot perform sufficient error handling in response to fault information in an error reply, the system is brought down to prevent the error from propagating through the system. This is because the system lacks an effective fault handling procedure for a PCI bus fault.
Conventional computer systems thus have the following problem. When a PCI bus fault is reported to a processor by means of an NMI signal line or an error reply and an exception handler of an OS is to handle the error as mentioned above, the system is brought down in order to prevent error propagation if the exception handler cannot provide sufficient error handling.