This invention relates to multi-component systems in which components are replicated to provide reliability.
Large systems are generally composed of many smaller components which may individually be hardware, software, firmware, or a combination of such. Different types of components perform different functions within the larger system. In order to ensure reliability should an individual component fail, components are replicated within the system, i.e., multiple copies of the same component type are present in the system. Thus, if a component fails, its function may be taken over by a backup component to the failed component.
In a large system, the condition of each component, working or backup, needs to be monitored. The primary condition of any component at any time may be either: 1) working, i.e., performing its intended function; 2) ready, i.e., not actually performing its intended function but healthy and capable of performing; 3) unready, i.e., not yet in an operative state and requiring some action before the component becomes operative; or 4) unusable, i.e., the component is not usable. The primary condition of a backup component is normally ready but occasionally may be unready.
In a replicated component system, a mechanism needs to be present to monitor and control the configuration of the system, where the configuration of the system is defined as the collection of all components and their conditions, and the relationships between the components. In the prior art, in addition to being a backup component to another component, other relationships between components have been recognized to exist. For example, one or more components can be a child of a parent component, where the child performs work for its parent. If a parent fails, the system needs to be reconfigured to find another parent for all of its children. Another well known relationship is known as sparing. In a sparing relationship, a minimum number of working components of the same type are required to perform the same function, each being able to do the work of one another. A special case of the sparing relationship between only two components of the same type is known as the mate relationship in which one of the components in the pair is working and the other one is a standby component.
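By way of illustration only, the sparing relationship described above can be sketched in Python. The names `Component`, `Primary`, and `promote_spares`, and the rule that spares are promoted in list order, are assumptions introduced for this sketch and are not taken from the specification.

```python
from dataclasses import dataclass
from enum import Enum


class Primary(Enum):
    WORKING = "working"
    READY = "ready"
    UNREADY = "unready"
    UNUSABLE = "unusable"


@dataclass
class Component:
    name: str          # hypothetical identifier
    ctype: str         # component type; spares must share it
    primary: Primary   # primary condition


def promote_spares(components, ctype, min_working):
    """Promote ready spares of one type until min_working are working.

    Returns the names of the promoted components. Raises RuntimeError,
    without changing anything, when there are too few ready spares.
    """
    same_type = [c for c in components if c.ctype == ctype]
    working = [c for c in same_type if c.primary is Primary.WORKING]
    ready = [c for c in same_type if c.primary is Primary.READY]
    shortfall = min_working - len(working)
    if shortfall > len(ready):
        raise RuntimeError(f"not enough ready spares of type {ctype!r}")
    promoted = []
    # Assumption: spares are promoted in list order.
    for spare in ready[:max(shortfall, 0)]:
        spare.primary = Primary.WORKING
        promoted.append(spare.name)
    return promoted
```

With a minimum of two and a pool holding one working, one unusable, and two ready components of the same type, the sketch promotes exactly one ready component and leaves the other as a spare. The mate relationship corresponds to a pool of two components with a minimum of one.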
A configuration controller is an entity associated with a system that allows the configuration of the system to be changed. For example, a configuration controller can determine that there are not enough working components of a given type and change the condition of a spare component to make it a working component. In the prior art, configuration control could be effected either from a central control location or in a distributed manner throughout the system. Prior art configuration controllers behave inconsistently depending upon what type of component is being changed. Specifically, such inconsistency may result from mistakes made by an individual who is manually performing the steps to effect a change. Further, in automatic fault detection systems in which component failures are detected and acted upon, the software that effects the changes is frequently written incorrectly because the programmer does not understand how each component may need to be changed differently from other components. Thus, such fault detection systems generally fail to determine the overall effect on a system that may result from changing one component. Specifically, making a change to one component can actually be more deleterious to the state of the system as a whole than leaving a possibly faulty component in the system. Also, such fault detection systems fail to consider the order in which the conditions of different components may need to be changed. Even further, in the prior art, every time a new type of component is introduced into a system, the configuration control for that component needs to be re-analyzed.
A need exists, therefore, to be able to consistently control the configuration of a system by monitoring the conditions of the components that comprise the system and allowing a change to the condition of each of the components.
The configuration controller, in accordance with the present invention, in response to a request to change the condition of a component, the request being one of a fixed set of request operations, performs one of a fixed set of common algorithms in which the request is either denied or carried out by following a predetermined sequence of steps determined in accordance with the requested operation. Advantageously, the fixed set of common algorithms can be applied to any system consisting of a plurality of replicated components of any kind, be they software, hardware, firmware, or any combination thereof, once the relationships between components, drawn from a fixed set of relationships, are defined and provided to the configuration controller.
In an embodiment of the present invention, when a request to change the condition of a component is received, the configuration controller first decides whether that change will result in a safe condition for the system as a whole, i.e., it determines whether the system will behave properly once the change is made. Thus, each request is validated before the request is realized through a particular algorithm from the fixed set of realization algorithms.
In order to process requests for a change in the condition of a component, various information relating to each component within the system and to the relationships that exist between components in the system is provided to the configuration controller. Thus, each component in the system has one and only one component type, where the type of each component in the system is defined as possessing certain attributes associated with all components of that type. Further, each component may have relationships with other components in the system, which relationships are specified. The inventors have discovered that further relationships can exist between components in addition to the previously noted prior art relationships of child, spare, and mate. These relationships between components include a helper, an interrupt physical unit group, a logical unit group, and a switch physical unit group. Components of one particular type may be related to components of other types only by certain relationships.
Each component in a system at any instant of time may have one and only one condition. The condition of a component is a combination of four independent conditions: (1) a primary condition, which may be one of: working, ready, unready, and unusable; (2) a secondary condition, which may be one of: primary, secondary, forced, automatic, manual, growth, and null; (3) a tertiary condition, which may be one of: campon, degraded, diagfail, diagnostic, facfail, family of equipment, farfail, helper, init, interdiag, maintiprog, nearfail, pwralarm, pwroff, roudiag, trouble, update, warming, and null; and (4) an inhibit state, which may be either inhibited or uninhibited.
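The four-part condition described above may be represented, purely as an illustrative sketch, by a small Python data structure. The value names are those listed above; the class names, the `_enum` helper, the `describe` method, and the default values of null, null, and uninhibited are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


def _enum(name, values):
    # Build an Enum whose member names are upper-cased forms of the values,
    # with spaces replaced by underscores (e.g. FAMILY_OF_EQUIPMENT).
    return Enum(name, {v.upper().replace(" ", "_"): v for v in values})


Primary = _enum("Primary", ["working", "ready", "unready", "unusable"])
Secondary = _enum("Secondary", [
    "primary", "secondary", "forced", "automatic", "manual", "growth", "null",
])
Tertiary = _enum("Tertiary", [
    "campon", "degraded", "diagfail", "diagnostic", "facfail",
    "family of equipment", "farfail", "helper", "init", "interdiag",
    "maintiprog", "nearfail", "pwralarm", "pwroff", "roudiag", "trouble",
    "update", "warming", "null",
])


@dataclass(frozen=True)
class Condition:
    """One component condition: four independent parts held together."""
    primary: Primary
    secondary: Secondary = Secondary.NULL   # default is an assumption
    tertiary: Tertiary = Tertiary.NULL      # default is an assumption
    inhibited: bool = False                 # default is an assumption

    def describe(self):
        inhibit = "inhibited" if self.inhibited else "uninhibited"
        return "/".join([self.primary.value, self.secondary.value,
                         self.tertiary.value, inhibit])
```

Because the four parts are independent, the sketch admits 4 x 7 x 19 x 2 = 1064 distinct conditions. Making the structure immutable means a condition change replaces the whole value, so a component never holds a partially updated condition.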
When a request to change the configuration of a system is received, it can be one of a defined fixed plurality of request types that are applied to the system: remove, restore, switch, inhibit, allow, restart, refresh, resynchronize, and update. Further, in the exemplary embodiment of the invention, a desired condition of the component is specified with each request as well as a level of validation, which, in the particular embodiment, is an integer between one and six. For each request type, one and only one predetermined fixed algorithm describes the validation processing that is used to effect such a request. Further, for each request type, one and only one predetermined fixed algorithm describes the realization processing that is used to effect such a request. Both the set of validation algorithms and the set of realization algorithms are commonly applied to any replicated component system regardless of what those components are or how they function in the system as a whole, and whether they are hardware, software, or a combination of both.
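The pairing of each request type with exactly one validation algorithm and exactly one realization algorithm can be sketched as a dispatch table. The following Python is a minimal illustration: the request type names and the one-to-six validation level come from the text above, while `Request`, `Controller`, `register`, `handle`, and the string results are hypothetical names. The algorithms themselves are not reproduced here and are supplied by the caller as placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class RequestType(Enum):
    REMOVE = "remove"
    RESTORE = "restore"
    SWITCH = "switch"
    INHIBIT = "inhibit"
    ALLOW = "allow"
    RESTART = "restart"
    REFRESH = "refresh"
    RESYNCHRONIZE = "resynchronize"
    UPDATE = "update"


@dataclass(frozen=True)
class Request:
    rtype: RequestType
    component: str   # hypothetical component identifier
    desired: str     # desired condition, kept as text in this sketch
    level: int       # validation level, an integer from one to six

    def __post_init__(self):
        if not 1 <= self.level <= 6:
            raise ValueError("validation level must be between 1 and 6")


class Controller:
    def __init__(self):
        self._validators = {}
        self._realizers = {}

    def register(self, rtype, validate, realize):
        # Enforce one and only one algorithm pair per request type.
        if rtype in self._validators:
            raise ValueError(f"{rtype.value} is already registered")
        self._validators[rtype] = validate
        self._realizers[rtype] = realize

    def handle(self, request):
        # Validate first; realize only a validated request.
        if not self._validators[request.rtype](request):
            return "denied"
        self._realizers[request.rtype](request)
        return "done"
```

A caller would register one pair per request type and then pass every request through `handle`, so that no realization algorithm runs on an unvalidated request.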
Advantageously, once all of the components of a system are individually defined by their type and by the relationships that exist between them, which information initializes the configuration controller, a subsequent request for a condition change to one or more components, drawn from a fixed set of possible requests, is effected by first validating the request and then, if the request is validated, realizing that request by performing the steps of the particular realization algorithm associated with the request type. Thus, for example, a component may be suspected of being faulty, which information might be signaled to a maintenance administrator through, for example, conventional fault detection circuitry. As a result, removal of that component from the system may be desired so that it can be diagnosed. That component thus needs to be changed from being an active component to a standby component, for example. In addition to changing the condition of the suspect component, the conditions of many other components that have a relationship with the suspect component may also need to be changed, such as those components that were sending data to or receiving data from the suspect component. Further, other components whose conditions were originally standby may need to become active. By using the relationships that exist between components and by following the steps in the remove realization algorithm, an appropriate sequence of component condition changes is effected to safely achieve the desired condition change of the subject component.
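The removal example above can be illustrated by a short Python sketch that derives an ordered sequence of changes from the relationships between components. This is not the remove realization algorithm itself, which is not set out in this passage. The function name `plan_remove`, the step tuples, and the particular ordering (promote the mate, then move the children, then change the subject component) are assumptions chosen only to show that order matters.

```python
def plan_remove(target, desired, condition, mate_of, children_of):
    """Plan the removal of `target`, returning ordered steps or None.

    condition   maps component name -> primary condition (text)
    mate_of     maps component name -> its mate, if any
    children_of maps parent name -> list of child names
    Each step is ("set", component, new_condition) or
    ("reparent", child, new_parent). None means the request is denied.
    """
    if condition[target] != "working":
        # Nothing depends on a component that is not working.
        return [("set", target, desired)]
    mate = mate_of.get(target)
    if mate is None or condition[mate] != "ready":
        # No healthy standby: removing the target would leave the
        # function unserved, so the request is denied.
        return None
    # Assumed order: bring the standby into service first, ...
    steps = [("set", mate, "working")]
    # ... then hand the children over to the new working component, ...
    steps += [("reparent", child, mate) for child in children_of.get(target, [])]
    # ... and only then change the condition of the suspect component.
    steps.append(("set", target, desired))
    return steps
```

For a working component `cpu0` with a ready mate `cpu1` and one child `io0`, the sketch yields three steps: set `cpu1` to working, reparent `io0` to `cpu1`, and set `cpu0` to the desired condition. If `cpu1` is unready, the request is denied.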