The Peripheral Component Interconnect (PCI) bus, introduced by Intel in 1992, has been widely adopted because it meets the increasing bus-bandwidth demands of multimedia computers. Its advantages include Plug and Play support, processor independence, and good extensibility. It can also extend its bus bandwidth or operating frequency as demanded while maintaining software compatibility.
As the PCI bus developed in the personal computer field, it was gradually applied to other fields, including servers, notebook computers, and embedded systems. In 1994, the PCI Industrial Computer Manufacturers Group (PICMG) issued the Compact PCI specification, which extended the PCI bus to fields that require high reliability, such as telecommunications and industrial control. Under the Compact PCI specification, a Compact PCI system adopts the mechanical dimensions of a Eurocard and has the same electrical characteristics and data transmission protocols as the PCI bus. The Compact PCI system also supports Hot-Plugging, i.e., a board can be replaced while the system remains on-line. Furthermore, the Compact PCI system has good extensibility: the number of boards configured in the system can be changed as demanded. Additionally, the Compact PCI system supports switching between active and standby boards so as to improve system reliability, and adopts chips mass-produced for the personal computer field so as to reduce cost.
A typical Compact PCI system with 8 slots is shown in FIG. 1. The Compact PCI system is a structure including front boards and back boards, in which the front boards provide general processing capabilities while the back boards provide interfaces to the outside. The front boards include a system board for implementing the management and control of the Compact PCI system and service boards for implementing service processing. The Compact PCI system has a bus topology, in which interaction between the system board and a service board, and between two service boards, may be implemented via a bus. However, when a board fails, the failure is difficult to isolate; it easily affects other boards and may result in the failure of the whole Compact PCI system.
The communication process between two service boards connected to one Compact PCI bus is described as an example. A simplified schematic diagram of Service board 1 accessing Service board 2 is shown in FIG. 2. The CPU of Service board 1 initiates an access to the memory of Service board 2. The access information of Service board 1 is transmitted to the Compact PCI bus through the host bridge and the PCI to PCI (P2P) bridge of Service board 1, and then transmitted to the P2P bridge of Service board 2. A simplified schematic diagram of Service board 2 responding to Service board 1 is shown in FIG. 3. The P2P bridge of Service board 2 responds to the access and transmits the access information to the PCI bus in Service board 2. The host bridge of Service board 2, acting as a target device, responds to the access of the P2P bridge, receives the access information, and either writes the access information to the memory or reads data from the memory and passes the data to the P2P bridge. The P2P bridge then passes the response information to the Compact PCI bus, and the Compact PCI bus passes the response information to Service board 1. However, if Service board 2 fails, e.g., its host bridge operates abnormally, the host bridge cannot respond to the access of the P2P bridge. In this case, the P2P bridge of Service board 2 transmits a retry response to the P2P bridge of Service board 1, which in turn transmits a retry response to the host bridge of Service board 1. Some host bridges, on receiving a retry response after initiating an access, repeat the access to the target board unceasingly until it succeeds. In this case, the failure of Service board 2 affects Service board 1, and other functions that depend on the host bridge of Service board 1, such as inter-board communication, cannot be implemented.
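The retry behavior described above can be illustrated with a minimal sketch. It is a simplified software model, not a description of any actual bridge chip: the names `Response`, `TargetBoard`, and `access_without_retry_limit` are hypothetical, and the `step_budget` parameter is an assumption added only so that the simulation terminates, whereas the host bridges in question have no such bound.

```python
from enum import Enum
from typing import Tuple


class Response(Enum):
    OK = "ok"
    RETRY = "retry"


class TargetBoard:
    """Hypothetical model of Service board 2 (P2P bridge plus host bridge)."""

    def __init__(self, host_bridge_healthy: bool):
        self.host_bridge_healthy = host_bridge_healthy

    def access(self) -> Response:
        # The P2P bridge forwards the access to the local host bridge. If the
        # host bridge does not respond, the P2P bridge answers with a retry.
        return Response.OK if self.host_bridge_healthy else Response.RETRY


def access_without_retry_limit(target: TargetBoard, step_budget: int) -> Tuple[bool, int]:
    """Repeat the access until it succeeds, as some host bridges do.

    step_budget is an assumption of this simulation only; it stands in for
    "forever" so that the loop can be observed and stopped.
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while attempts < step_budget:
        attempts += 1
        if target.access() is Response.OK:
            return True, attempts
    return False, attempts
```

With a healthy target the access completes on the first attempt. With a failed target the initiator consumes the whole budget without making progress, which corresponds to the host bridge of Service board 1 being occupied indefinitely.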
Moreover, a board in such an abnormal state cannot send a reset signal to its watchdog circuit, which results in an abnormal reset of the board. If the board has no watchdog circuit, it hangs.
Therefore, how to solve the hang-up of the Compact PCI bus caused by a board failure has become a major issue. Because the spread of a board failure to other boards on the Compact PCI bus is caused by characteristics of the host bridge chip, a first conventional technical solution uses host bridge chips with a retry count function in the boards; specifically, a retry count threshold is predetermined via software. When the number of retry responses exceeds the threshold, the host bridge gives up the unsuccessful operation and continues with another operation, which avoids the hang-up of the Compact PCI bus caused by repeated retries. In a second conventional technical solution, a circuit whose function is similar to that of the above host bridge chip is used in a board to detect the retry response; if the number of retry responses exceeds a certain threshold, the host bridge is made to give up the access in some way, which achieves the same effect as the first technical solution.
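The retry count threshold of the first and second conventional solutions can be sketched as follows. This is an illustrative software model under stated assumptions: the function name `run_operations`, the representation of an operation as a callable that returns True on completion and False on a retry response, and the result strings are all hypothetical, and in the conventional solutions the counting is performed by the host bridge chip or by a dedicated circuit rather than by code of this kind.

```python
from typing import Callable, List


def run_operations(operations: List[Callable[[], bool]], retry_threshold: int) -> List[str]:
    """Run each operation in turn, abandoning one whose retries exceed the threshold.

    Each operation returns True when the access completes and False when the
    target answers with a retry response (an assumption of this sketch).
    """
    results = []
    for op in operations:
        retries = 0
        while True:
            if op():
                results.append("completed")
                break
            retries += 1
            if retries > retry_threshold:
                # Give up this operation and continue with the next one,
                # so a failed target cannot occupy the initiator forever.
                results.append("abandoned")
                break
    return results
```

For example, `run_operations([lambda: False, lambda: True], retry_threshold=3)` abandons the first operation after its fourth retry response and then completes the second, so the initiating board stays usable even though the target of the first operation never responds.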
In a third conventional technical solution, i.e., a method and system for monitoring a system bus, the access module to be monitored and its corresponding monitoring period, as well as an expiring event module and its corresponding operation, are set in advance. The information exchange between modules on the system bus is monitored. The monitoring period starts counting down when the modules start exchanging information with each other. If the information exchange between the modules is completed before the monitoring period counts down to zero, it is determined that the access module operates normally; otherwise, the expiring event is performed as a response for the access module.
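The countdown logic of the third conventional solution can be sketched as follows. This is a minimal model under stated assumptions: the class name `BusMonitor`, its method names, the use of abstract ticks instead of real time, and the `on_expire` callback are all hypothetical and are not taken from the referenced method and system.

```python
from typing import Callable, Dict


class BusMonitor:
    """Counts down a preset monitoring period for each ongoing exchange."""

    def __init__(self, periods: Dict[str, int], on_expire: Callable[[str], None]):
        self.periods = periods            # monitoring period per module, in ticks
        self.on_expire = on_expire        # preset operation for the expiring event
        self.remaining: Dict[str, int] = {}

    def start_exchange(self, module: str) -> None:
        # The countdown begins when the module starts exchanging information.
        self.remaining[module] = self.periods[module]

    def complete_exchange(self, module: str) -> None:
        # Completion before the countdown reaches zero means normal operation.
        self.remaining.pop(module, None)

    def tick(self) -> None:
        # Advance time by one tick and fire the expiring event for any
        # exchange whose monitoring period has counted down to zero.
        for module in list(self.remaining):
            self.remaining[module] -= 1
            if self.remaining[module] <= 0:
                del self.remaining[module]
                self.on_expire(module)
```

In this sketch the expiring event only reports which module timed out, for instance by appending its name to a log through `on_expire`. It does not locate or remove the underlying failure.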
In the above first technical solution, the count function of the host bridge chip is used to restore the normal operation of the board, so the solution depends on the type of host bridge chip selected; however, not all host bridge chips have the count function. The second technical solution, in which the function of the host bridge chip is replaced with a circuit, has better adaptability, but it costs more than the first technical solution. In the third technical solution, certain function modules perform the monitoring period and expiring operation set in advance, which can reflect the abnormal status of the monitored module in real time. However, the preset operations corresponding to an expiring event only include such functions as response, notification, and failure recording. These functions only monitor the failure rather than locate and eliminate it, so the third technical solution still has certain limitations. The above solutions only solve the problem of the hang-up of a board caused by a failure and restore the normal operation of the boards affected by the failure; they cannot locate the failed board or restore its normal operation. The failed board may therefore continue to affect other boards exchanging information with it.