This invention relates to a system and method for monitoring a distributed fault tolerant computer system. In particular, the invention is directed to monitoring and reporting the operation status of nodes of a distributed fault tolerant computer system. The invention can find application to the automatic configuration of a distributed fault tolerant computer system.
One application for a distributed fault tolerant system is in the telecommunications industry. The telecommunications industry is going through some fundamental changes that have caused a significant shift in the requirements placed on its information infrastructure. Deregulation of the services provided by the carriers, introduction of new wireless services, and the addition of information processing (IP) services have created new challenges and opportunities in this rapidly growing industry. The competition in the industry has resulted in a significant reduction in the time available to service providers to test and develop their own systems.
Traditionally, telecommunication companies have relied on hardware fault tolerant systems and extensive testing of their applications to discover system and application software faults. However, competition and the need to bring new services to market quickly mean that such an approach is no longer possible in all cases if the service providers are to provide new services while maintaining the level of service and reliability to which their customers are accustomed.
Distributed Fault Tolerant (DFT) systems provide the basis for one approach specifically to address the requirements of a changing telecommunication industry. A DFT system has the potential to tolerate not only the failures of the hardware components of the system, but also the failures of its software elements. A traditional lock-step hardware fault tolerant system is perfectly capable of masking hardware component failures from its users but it is unable to accomplish the same for a software failure. The difficulty arises from the fact that the redundant hardware components of such a system execute the same set of instructions at the same time on effectively the same system and are, therefore, subject to the same set of software failures.
While it is possible to discover and correct “functional” bugs in the software by a rigorous qualification cycle, it is far more difficult to detect and correct the failures associated with the execution environment of a program. Such “Heisenbugs”, as they are called, are rarely discovered and corrected during the normal testing and qualification cycle of the system and occur only under circumstances that are very difficult to reproduce. The observation that the execution of the same program on the same (or identically configured) system, but at a different time, does not result in the same “Heisenbug” is the key to making it possible to tolerate such failures via redundancy, fault isolation, and fault containment techniques. DFT is based on this observation and uses redundant hardware and software components to achieve both hardware and software fault tolerance by isolating and containing the domain of such failures to a single member of the distributed system. Accordingly, it is desirable that a DFT system should be able to identify at least software failures that lead to the inoperability of a node of the system.
Moreover, in the telecommunications industry, stringent timing and availability requirements are set. Most applications in this market differ from those in other commercial sectors by the requirement for “real-time” behavior. This places a requirement on the computing infrastructure, which must incorporate the notion of “real-time” into its design and effectively guarantee that certain actions occur within a specified period. While it may be acceptable for a “mission-critical” enterprise system to have a large degree of variance in the time that it takes to respond to the same service request at different times, such non-deterministic behavior cannot be tolerated by a telecommunications computer system. In order to meet these stringent timing requirements, the industry has resorted to proprietary hardware and software components, resulting in a complicated application development environment, increased time to market, and reluctance to adopt new and efficient programming techniques. It would be desirable to enable a DFT system to address the unique requirements of the telecommunications industry without introducing an unnecessarily complicated programming model. Thus, it would be desirable to use, wherever possible, standard Off-The-Shelf (OTS) hardware and software components that allow for application development in a modern environment. It would therefore be desirable to minimize the amount of special purpose hardware and software needed.
One of the most important requirements of a telecommunication computer system is its availability. This is typically measured in the percentage of time that the system is available. However, it can also be stated in terms of the time that the system is unavailable. From this figure it is possible to calculate the maximum length of service disruption due to a failure. However, such a derivation assumes that the maximum number of failures over a period of time is known and that failures (or unplanned outages) are the only cause of service unavailability. Instead, a second requirement is commonly used that determines the maximum length of the service unavailability due to a failure. Another requirement of a telecommunication computing system stems from its unique maintenance and service model. While it is perfectly reasonable to assume that an enterprise system will be serviced and maintained locally by a system administrator conversant in the current technology, such an assumption is not valid for a telecommunication system where the system is typically located in a Central Office (CO) miles away from the nearest suitable system administrator. This lack of trained service and maintenance personnel translates the implicit competence of such personnel into explicit system requirements. Accordingly, it would be desirable to provide a structure that provides the basis for achieving at least a degree of automation of fault reporting and system reconfiguration.
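The relationship noted above between an availability percentage, the corresponding unavailable time, and the maximum service disruption per failure can be illustrated with a short calculation. This is only a sketch: the function names, the 99.999% figure, and the assumed failure count are hypothetical illustrations and are not taken from the description, and the calculation makes the same assumption the text identifies, namely that the maximum number of failures is known and that failures are the only cause of unavailability.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Time per year that the system may be unavailable, in minutes."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def max_disruption_per_failure(availability_percent, max_failures_per_year):
    """Longest disruption each failure may cause, assuming the number of
    failures per year is known and failures are the only cause of outage."""
    return downtime_minutes_per_year(availability_percent) / max_failures_per_year


# Hypothetical figures: 99.999% availability and at most 4 failures per year.
budget = downtime_minutes_per_year(99.999)
per_failure = max_disruption_per_failure(99.999, 4)
```

Because the failure count is rarely known in advance, a separate requirement on the maximum outage per failure is normally stated directly, as the text explains.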
The invention seeks to provide a monitor system that provides the potential to address at least some of the problems and desires mentioned above.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Combinations of features from the dependent claims may be combined with features of the independent claims as appropriate and not merely as explicitly set out in the claims.
In accordance with one aspect of the invention, there is provided a monitor system for a distributed fault tolerant computer system. The monitor system includes a counter mechanism operable to count from a reset value towards a fault value and to output a fault signal if the fault value is reached. A counter reset routine is implemented in software and is operable repeatedly to reset the counter mechanism to its reset value during normal operation of the counter reset routine, thus preventing the counter mechanism from reaching the fault value during normal software operation. A unit connectable to a bus to supply a status signal indicative of the status of the unit is arranged to be responsive to a fault signal being output from the counter mechanism to provide an OFF status indication to the bus.
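The interaction between the counter mechanism, the software counter reset routine, and the unit that reports status to the bus can be sketched as a small simulation. The class and function names, the tick-based timing, the fault value of 5, and the latching of the fault once it is reached are all assumptions made for illustration; the description does not prescribe them.

```python
class WatchdogCounter:
    """Counts from a reset value towards a fault value (hypothetical model)."""

    def __init__(self, reset_value=0, fault_value=5):
        self.reset_value = reset_value
        self.fault_value = fault_value
        self.count = reset_value
        self.fault = False  # assumption: the fault signal latches once raised

    def tick(self):
        """Advance one step; stands in for a hardware clock pulse."""
        if self.fault:
            return
        self.count += 1
        if self.count >= self.fault_value:
            self.fault = True

    def reset(self):
        """Called by the software counter reset routine while it is running."""
        if not self.fault:
            self.count = self.reset_value


class StatusUnit:
    """Unit that supplies a status signal for its bus channel."""

    def __init__(self, counter):
        self.counter = counter

    def status(self):
        return "OFF" if self.counter.fault else "ON"


def run(ticks, reset_every=None):
    """Simulate `ticks` pulses; the software resets the counter every
    `reset_every` ticks, or never if `reset_every` is None (hung software)."""
    counter = WatchdogCounter()
    unit = StatusUnit(counter)
    for t in range(1, ticks + 1):
        counter.tick()
        if reset_every is not None and t % reset_every == 0:
            counter.reset()
    return unit.status()
```

In this sketch, `run(20, reset_every=3)` models healthy software and the unit keeps reporting ON, whereas `run(20)` models software that has stopped resetting the counter, so the unit reports OFF.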
In this manner, a monitor system is able to detect a fault in the software running on the node (for example if the operating system hangs) and to report this to the bus. This can be achieved through the minimum of special purpose hardware. Moreover, as will be described with respect to preferred embodiments of the invention, the monitor system provides the potential to achieve a degree of automation with respect to the reporting of faults and the configuration of the distributed fault tolerant system where hardware and/or software failures occur in a node and/or where a node is replaced and/or removed or added.
Preferably, each unit of respective nodes is connected to a respective channel on the bus, so that a fault condition from any particular unit can be identified by inspection of the channel on the bus. The channel could be implemented by, for example, time, phase or code division on a particular bus line or lines. In a preferred embodiment, which minimises the implementation logic required, each channel is a separate bus line.
A management subsystem is preferably employed to define a configuration for the distributed fault tolerant computer system. This management subsystem can be responsive to status signals on the bus and can be operable selectively to redefine the configuration of the distributed fault tolerant system dependent upon the state of the status signals. In this manner, a degree of automation with respect to the reporting of faults and the configuration of the distributed fault tolerant system can be achieved to take account of hardware and/or software failures that occur in a node and/or a situation where a node is replaced and/or removed or added.
The management subsystem can be responsive to respective status signals on respective channels to determine the state of respective nodes. The management subsystem can then be operable automatically to redefine the configuration of the distributed fault tolerant system in response to detection of a change of state of a node and to define a node as a member of the fault tolerant computer system when it is associated with an ON status signal.
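A minimal sketch of this behaviour follows, assuming the bus is represented as a mapping from a node name to the status signal on its channel. The dictionary representation, the node names, and the `poll` method are hypothetical and serve only to illustrate how membership might follow the ON status signals.

```python
def members_from_bus(bus):
    """Return the set of nodes whose channel carries an ON status signal."""
    return {node for node, status in bus.items() if status == "ON"}


class ManagementSubsystem:
    """Tracks which nodes are members of the fault tolerant system."""

    def __init__(self, bus):
        self.members = members_from_bus(bus)

    def poll(self, bus):
        """Re-read the bus, redefine the configuration, and return the
        nodes that joined and the nodes that left since the last poll."""
        current = members_from_bus(bus)
        joined = current - self.members
        left = self.members - current
        self.members = current
        return joined, left


bus = {"node0": "ON", "node1": "ON", "node2": "OFF"}
manager = ManagementSubsystem(bus)

# node1 fails and node2 is brought into service.
bus["node1"] = "OFF"
bus["node2"] = "ON"
joined, left = manager.poll(bus)
```

After the poll, `joined` contains `node2`, `left` contains `node1`, and the configuration held in `manager.members` reflects only the nodes currently signalling ON.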
The management subsystem could be provided centrally, possibly in one node, or alternatively could be provided at each node of the distributed fault tolerant computer system.
The counter mechanism could be formed of a hardware counter with a gate responsive to the counter reaching the fault value to pass a fault signal to the unit. The unit can be a power supply unit, and can be operable to turn off in response to a fault signal output by the counter mechanism. In this manner, the power supply for a node that has developed a fault can be turned off.
In a preferred embodiment, each node includes two power supplies, with respective counter mechanisms, such that a fault in the power supply or in the associated counter mechanism will not result in the whole node being powered down. In such a preferred embodiment, first and second counter mechanisms and first and second power supplies are provided, both counter mechanisms being responsive to a common counter reset routine. Thus, where there is a software or hardware failure that prevents the counter reset routine from resetting the counter mechanisms, both counter mechanisms output a fault signal that causes both power supplies to power down. As a result, the node will have been powered down and two OFF signals will have been provided to the bus, one for each power supply unit. The absence of an ON status for the power supply units of the node can thus be identified by the management subsystems as indicative that the node has failed, and result in a reconfiguration of the fault tolerant computer system by the management subsystem.
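The decision logic implied by the dual power supply arrangement can be sketched as follows. The function names and the "UP"/"FAILED" labels are hypothetical; the only rule taken from the description is that a node is regarded as failed when none of its power supply units reports ON.

```python
def psu_statuses(counter_faults):
    """Each power supply unit turns OFF when its own counter mechanism
    has output a fault signal; otherwise it reports ON."""
    return ["OFF" if fault else "ON" for fault in counter_faults]


def node_state(statuses):
    """A node is treated as failed only when no power supply unit is ON."""
    return "UP" if any(s == "ON" for s in statuses) else "FAILED"


# A fault confined to one counter mechanism or power supply: node stays up.
single_fault = node_state(psu_statuses([True, False]))

# The common counter reset routine stops: both counters raise a fault.
routine_stopped = node_state(psu_statuses([True, True]))
```

Here `single_fault` evaluates to "UP" and `routine_stopped` to "FAILED", mirroring the distinction between an isolated power supply or counter fault and a failure of the common counter reset routine.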
In accordance with other aspects of the invention, there is provided a node of a distributed fault tolerant computer system including such a monitor system, and indeed a distributed fault tolerant computer system including a plurality of such nodes. In accordance with a further aspect of the invention, there is provided a method of monitoring operation of such a node of a distributed fault tolerant computer system.
An embodiment of the invention can thus provide a mechanism that allows for the automatic detection and reporting of a fault, whether hardware and/or software, which causes a delay in the generation of a counter reset signal by software operating at the node. The absence of such a reset signal can be indicative of a complete failure at the node, or alternatively a failure that means that the node is no longer able to meet the stringent real-time operation requirements of the telecommunications industry. Accordingly, an embodiment of the invention enables the automatic detection and reporting of such errors and provides the basis for enabling at least a degree of automatic reconfiguration of the distributed fault tolerant computer system in the event of such a fault occurring at a node.