Personal computer systems contain a number of integrated circuit (IC) devices that communicate with each other over a bus. Traditionally, a parallel, multi-drop bus is used to connect three or more devices, where each device is connected in parallel to the same set of transmission lines that make up the bus. More recently, serial, point-to-point buses, which consist of one or more serial links that connect only two devices, have been introduced in advanced personal computer systems (e.g., Peripheral Component Interconnect (PCI) Express bus systems). To increase throughput, the parallel bus has several data lines as well as several address lines that can simultaneously carry information between two devices that are communicating with each other. The bus also has control lines that carry corresponding control signals, which may include device select, device read, device write, and clock signals (the latter being used in synchronous systems, that is, where two devices communicate with each other in sync with a common clock).
A personal computer hardware platform that is based on a Pentium® processor by Intel Corp., Santa Clara, Calif., calls for a central processing unit (CPU), which may consist of one or more processors, communicating with a system interface chipset over a front side bus. The chipset may include a north bridge which allows the CPU to communicate with one or more parallel multi-drop buses in the system, e.g., a memory subsystem bus such as a synchronous dynamic random access memory (SDRAM) bus, a Peripheral Component Interconnect (PCI) bus, an Industry Standard Architecture (ISA) bus, an Accelerated Graphics Port (AGP) bus, and an Advanced Technology Attachment (ATA) bus. A device may be part of a larger module, such as a dual inline memory module, or an add-in PCI card. The module or card has an electrical connector with a number of pins that are to make contact with corresponding pins of a bus connector or bus slot.
In current consumer grade personal computer systems, each parallel bus may have upwards of twenty-five pins that must be properly connected with their corresponding pins in a module or a card that is inserted into its slot. This connection is susceptible to failure because of bent or broken pins that do not make contact or that cause a short with an adjacent pin. Dust or debris can also be lodged against a pin, thereby preventing a good electrical connection. Conventional, parallel multi-drop bus protocols respond to such failures by ceasing all communication over the bus. For example, according to the PCI protocol, if a device detects an address or a data phase error during a bus transaction, the device asserts a predetermined signal, which is connected to error logic in a bridge that in turn interrupts the CPU. After some error logging, the system shuts down.
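As an illustration of the error detection step in the PCI example above, the following is a minimal sketch. It assumes a single even-parity bit computed over 32 address/data lines and 4 command/byte-enable lines, as conventional PCI does. The function names are hypothetical, and the code models only the parity check itself, not the signaling to the bridge or the CPU interrupt.

```python
def even_parity_bit(ad: int, cbe: int) -> int:
    """Return the bit that makes the number of 1s even across the
    32 address/data lines, the 4 command/byte-enable lines, and the
    parity line itself."""
    lines = (ad & 0xFFFFFFFF) | ((cbe & 0xF) << 32)
    return bin(lines).count("1") & 1


def phase_has_parity_error(ad: int, cbe: int, par: int) -> bool:
    """True when the received parity bit does not match the received lines."""
    return even_parity_bit(ad, cbe) != (par & 1)


# The sender drives the lines together with the matching parity bit.
ad, cbe = 0xDEADBEEF, 0b0110
par = even_parity_bit(ad, cbe)
assert not phase_has_parity_error(ad, cbe, par)

# A faulty pin forces one line (here bit 7) to the wrong level.
assert phase_has_parity_error(ad ^ (1 << 7), cbe, par)

# Two wrong lines cancel each other out and pass the check unnoticed.
assert not phase_has_parity_error(ad ^ 0b11, cbe, par)
```

A single parity bit only reports that something went wrong. It cannot identify which line failed, which is one reason a protocol built on it has little choice but to stop the transaction.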
Conventional parallel bus protocols used in consumer grade computer systems expect a reliable connection between the IC devices that are to be connected by the bus. In other words, if a device fails to pass a handshake with a bus master, then the bus master will ignore the device, that is, it will indicate to the operating system that no such device is present in the system. If the device is part of a module or a card that causes one or more wires of the bus to exhibit a short circuit, then a conventional bus protocol would essentially ignore all devices on that bus, making the bus nonfunctional. If any one of such failing devices is part of a primary component of the system (e.g., main memory), then the system will shut itself down as a result.
Although error detection and correction mechanisms are used in personal computer bus protocols, and in particular in main memory systems, such mechanisms typically handle an error in the storage or transmission of only a single bit (within each multi-bit word being transferred through the bus). They make no attempt to allow the system to continue to function, using the same bus, in the event of an uncorrectable error (e.g., more than one bit is in error, or the error persists).
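The single-bit limitation described above can be illustrated with a small sketch. The code below is an assumption made for illustration, not a mechanism named in this text: it uses an extended Hamming code on a 4-bit word (7 Hamming bits plus one overall parity bit), a textbook scheme that corrects one flipped bit and detects, but cannot correct, two. The function names encode and decode are hypothetical.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 code bits.
    Index 0 holds overall parity; indices 1..7 are Hamming positions,
    with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); nibble is None when uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos            # XOR of the positions holding a 1
    odd_overall = sum(code) & 1        # 1 when an odd number of bits flipped
    word = list(code)
    if syndrome == 0 and odd_overall == 0:
        status = "ok"
    elif odd_overall == 1:
        word[syndrome] ^= 1            # syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        return "uncorrectable", None   # even number of flips, nonzero syndrome
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble


sent = encode(0b1011)
assert decode(sent) == ("ok", 0b1011)

one_bad_line = list(sent)
one_bad_line[6] ^= 1
assert decode(one_bad_line) == ("corrected", 0b1011)

two_bad_lines = list(one_bad_line)
two_bad_lines[3] ^= 1
assert decode(two_bad_lines) == ("uncorrectable", None)
```

In the two-bit case the decoder can only report failure. Nothing in the code itself offers a way to keep using the same bus, which is the gap this paragraph describes.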
Catastrophic shutdown may be avoided in systems that have redundancy, that is, multiple buses connecting two devices, so that if one of the buses exhibits an uncorrectable failure, normal communications content is automatically routed to a backup or redundant bus. Although such a solution may be justified in mission critical systems, such as those used in aircraft and spacecraft, a redundant or backup bus may be prohibitively expensive in consumer grade computer systems that are mass produced for the public.