1. Field of the Invention
This invention relates to fault-tolerant computer systems and more particularly to a protocol transmitted over a dedicated maintenance bus for use with such computer systems.
2. Background Information
Fault-tolerant computer systems are employed in situations and environments that demand high reliability and minimal downtime. Such computer systems may be employed in the tracking of financial markets, the control and routing of telecommunications and in other mission-critical functions such as air traffic control.
A common technique for incorporating fault-tolerance into a computer system is to provide a degree of redundancy to various components. In other words, important components are often paired with one or more backup components of the same type. As such, two or more components may operate in a so-called lockstep mode in which each component performs the same task at the same time, while only one is typically called upon for delivery of information. Where data collisions, race conditions and other complications may limit the use of lockstep architecture, redundant components may be employed in a failover mode. In failover mode, one component is selected as a primary component that operates under normal circumstances. If a failure in the primary component is detected, then the primary component is bypassed and the secondary (or tertiary) redundant component is brought on line. A variety of initialization and switchover techniques are employed to make a transition from one component to another during runtime of the computer system. A primary goal of these techniques is to minimize downtime and corresponding loss of function and/or data.
Fault-tolerant computer systems are often costly to implement since many commercially available components are not specifically designed for use in redundant systems. It is desirable to adapt conventional components and their built-in architecture whenever possible. All modern computer systems have particular capabilities directed to control and monitoring of functions. For example, large microprocessor chips such as the Pentium III™, available from Intel Corporation of Santa Clara, Calif., are designed to operate within a specific temperature range that is monitored by a commercially available environmental/temperature-sensing chip. One technique for interconnecting such an environmental monitor or other monitoring and control devices is to utilize a dedicated maintenance bus. The maintenance bus is typically separate from the system's main data and control bus structure. The maintenance bus generally connects to a single, centralized point of control, often implemented as a peripheral component interconnect (PCI) device.
However, as discussed above, conventional maintenance bus architecture is not specifically designed for redundant operation. Accordingly, prior fault-tolerant systems have utilized a customized architecture for transmitting monitor and control signals over the system's main buses (or dedicated proprietary buses) using, for example, a series of application specific integrated circuits (ASICs) mounted on each circuit board being monitored. To take advantage of current, commercially available maintenance bus architecture in a fault-tolerant computing environment, a more comprehensive and cost-effective approach is needed.
Accordingly, it is an object of this invention to provide a protocol for use with a maintenance bus architecture that displays a high degree of fault-tolerance. This maintenance bus architecture and associated protocol should be interoperable with commercially available components and should allow a fairly high degree of versatility in terms of monitoring and control of important computer system components.
This invention overcomes the disadvantages of the prior art by providing a protocol that is instantiated on a fault-tolerant maintenance bus architecture. The architecture includes two maintenance buses interconnecting each of a plurality of printed circuit boards, termed "parent" circuit boards. The two maintenance buses are each connected to a pair of system management modules (SMMs) that are configured to perform a variety of maintenance bus activities. The SMM can comprise any acceptable device for driving commands according to the protocol on the maintenance bus arrangement. The SMM has general knowledge of the circuit boards and their components. According to a preferred embodiment, the protocol is formatted to operate in accordance with Philips Semiconductors' I2C maintenance bus standard. Other standards are expressly contemplated. Within each parent board is a pair of redundant bridges, each having a unique address. One bridge is connected to the first maintenance bus while a second bridge is connected to the second maintenance bus of the pair. A child maintenance bus interconnects the two bridges through a "child" printed circuit board. The introduction of a separate board to implement the child maintenance bus can be useful, but is not essential according to this invention. The child maintenance bus is itself interconnected with a variety of monitor and control functions on maintenance bus-compatible subsystem components. Using the protocol, the SMMs can address components on each child printed circuit board individually and receive appropriate responses therefrom based upon appropriate response identifiers within the protocol. In the event of a bus or bridge failure, the SMM can still communicate with the child subsystem components via the redundant bus and bridge.
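The fallback from a failed bus or bridge to its redundant counterpart can be illustrated with a minimal sketch. The function name, the (name, send function) pairing, and the use of OSError to signal a bus or bridge failure are all assumptions made for illustration; none of them is defined by the protocol described here.

```python
from typing import Callable, List, Tuple


def send_via_redundant_paths(
    paths: List[Tuple[str, Callable[[bytes], bytes]]], packet: bytes
) -> Tuple[str, bytes]:
    """Try each (name, send_fn) path in order and return the first reply.

    Assumption: a hypothetical send_fn raises OSError when its maintenance
    bus or bridge has failed. Real hardware would signal this differently.
    """
    errors = []
    for name, send in paths:
        try:
            return name, send(packet)
        except OSError as exc:
            # Remember the failure and fall through to the redundant path.
            errors.append((name, exc))
    raise ConnectionError(f"all maintenance paths failed: {errors}")
```

With two paths listed, one per maintenance bus and bridge, a failure on the first leaves the child subsystem components reachable through the second.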
The protocol includes a unique data packet structure. The command message initiated by the SMM includes a target bridge header, a command byte (wherein a non-zero byte code designates the message as a command rather than a response), the message size and a unique originator tag value. The command message further includes one or more bytes of forwarding data for subordinate bridges on the child bus (leading to and from remote components/circuitry). Next, the command message has a response byte code to direct responses on the return trip through the bridge. The command message also includes one or more bytes of data to identify, and be used by, subsystem components. Finally, the command message includes one checksum byte that sums all preceding message bytes.
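The command message layout described above can be sketched as a byte-string builder. This is an illustration under stated assumptions, not the defined wire format: the one-byte width of the header, size, tag and response code fields, the modulo-256 checksum, and the convention that the size byte counts the whole message (checksum included) are all assumed, as are the function and parameter names.

```python
def checksum(payload: bytes) -> int:
    """Assumed checksum: sum of all preceding bytes, modulo 256."""
    return sum(payload) & 0xFF


def build_command(bridge_addr: int, command: int, tag: int,
                  forwarding: bytes, response_code: int, data: bytes) -> bytes:
    """Assemble a command message in the field order given in the text."""
    if command == 0:
        # A zero command byte would mark the message as a response.
        raise ValueError("command byte must be non-zero")
    # Assumed: size counts header, command, size, tag, forwarding bytes,
    # response code, data bytes and the trailing checksum byte.
    size = 4 + len(forwarding) + 1 + len(data) + 1
    if size > 0xFF:
        raise ValueError("message does not fit a one-byte size field")
    body = (bytes([bridge_addr, command, size, tag])
            + forwarding
            + bytes([response_code])
            + data)
    return body + bytes([checksum(body)])
```

For example, `build_command(0x20, 0x01, 0x7A, b"\x31", 0x05, b"\x10\x02")` yields a nine-byte message whose last byte is the checksum of the first eight.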
A similar message packet is provided by the bridge in response to the command message. The response includes an SMM address byte and a zero-value command byte (indicating a response). Also provided are a byte indicating the overall message size in bytes and the identical tag originally provided in the command packet. The tags allow the SMM to verify that the response is to a particular transmitted command. A one-byte status code field and a one-byte error message field are also provided. Unique status codes and error messages are generated by the bridge if a formatted message is incorrect or a commanded action was not (or may not have been) taken by the subsystem. One or more bytes of response data delivered from the subsystem component or bridge are also provided in the response message. Finally, a checksum byte is provided for error checking.
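A sketch of how an SMM might validate such a response follows. The field order is taken from the description above, while the one-byte field widths, the modulo-256 checksum, the size convention (total length including checksum), and all names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Response:
    smm_addr: int
    tag: int
    status: int
    error: int
    data: bytes


def parse_response(packet: bytes, expected_tag: int) -> Response:
    """Validate and split a response message (layout is an assumption)."""
    if len(packet) < 7:
        raise ValueError("response shorter than its fixed fields")
    body, check = packet[:-1], packet[-1]
    # Assumed checksum: sum of all preceding bytes, modulo 256.
    if sum(body) & 0xFF != check:
        raise ValueError("checksum mismatch")
    smm_addr, command, size, tag, status, error = body[:6]
    if command != 0:
        raise ValueError("non-zero command byte: not a response")
    if size != len(packet):
        raise ValueError("size byte disagrees with message length")
    if tag != expected_tag:
        # The tag ties a response to one particular transmitted command.
        raise ValueError("tag does not match the outstanding command")
    return Response(smm_addr, tag, status, error, bytes(body[6:]))
```

Checking the tag last means a corrupted or malformed message is rejected before it can be mistaken for a reply to the wrong command.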
Command message/data packets are transmitted by the SMM to be received by an appropriate component within a given time frame. If an expected response message/data packet is not returned from the component within that time frame, the SMM "times out" and performs various error procedures that may include an alarm condition, system shut-down and/or retransmission of the packet.
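The time-out behavior can be sketched as a bounded retransmission loop. The `send` and `receive` callables, the default time-out, and the attempt limit are hypothetical; the text does not specify them, and raising an exception here merely stands in for whichever error procedure (alarm, shut-down) a real SMM would take.

```python
from typing import Callable, Optional


def transact(send: Callable[[bytes], None],
             receive: Callable[[float], Optional[bytes]],
             packet: bytes,
             timeout_s: float = 0.1,
             max_attempts: int = 3) -> bytes:
    """Send a command and wait for its response, retransmitting on time-out.

    Assumption: receive(timeout_s) returns None when no response arrives
    within timeout_s seconds.
    """
    for _ in range(max_attempts):
        send(packet)
        reply = receive(timeout_s)
        if reply is not None:
            return reply
    # All attempts timed out; the caller decides on alarm or shut-down.
    raise TimeoutError(f"no response after {max_attempts} attempts")
```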
The bridge can include an interconnection to a further bridge. This remote bridge can, itself, be interconnected to additional microprocessors and associated memory. The remote bridge is addressed through one of the parent board's bridges so that communication to and from the SMM can occur. The forwarding data of the command packet enables the packets to be transferred through these further bridges, while stored response data in each subordinate bridge is used to route the return of a response back to the originating SMM.
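One way to picture this forwarding scheme is a toy model in which each bridge consumes one forwarding entry on the outbound trip and records, keyed by tag, which neighbor sent the command. Keying on the tag, the class and method names, and the use of string names in place of bus addresses are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class Bridge:
    def __init__(self, name: str) -> None:
        self.name = name
        self.return_routes: Dict[int, str] = {}  # tag -> upstream sender

    def forward(self, tag: int, hops: List[str],
                sender: str) -> Tuple[Optional[str], List[str]]:
        """Store the return route, then consume one forwarding entry."""
        self.return_routes[tag] = sender
        return (hops[0] if hops else None), hops[1:]

    def route_response(self, tag: int) -> str:
        """Look up (and clear) where the response for this tag must go."""
        return self.return_routes.pop(tag)


def round_trip(bridges: Dict[str, Bridge], first: str, tag: int,
               hops: List[str], origin: str = "SMM"):
    """Walk a command outward through the bridges, then the response back."""
    path_out, sender, current, remaining = [], origin, first, list(hops)
    while True:
        path_out.append(current)
        nxt, remaining = bridges[current].forward(tag, remaining, sender)
        if nxt is None:
            break
        sender, current = current, nxt
    path_back = []
    while current != origin:
        current = bridges[current].route_response(tag)
        path_back.append(current)
    return path_out, path_back
```

In this model, a command sent to the parent board's bridge with one forwarding entry naming the remote bridge travels parent, then remote, and the response retraces the stored routes back to the SMM.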
The SMM can be interconnected with a variety of other computer system peripherals and components, and can be accessed over a local network or through an Internet-based communication network.