With the growth of distributed operations in telecommunications and computing, there are now often collections of equipment and software, i.e., devices, which operate in concert with one another to provide some service to the larger telecommunications, computing, electrical, or electro-mechanical system in which they reside.
This can be as simple as a CPU which accesses a database stored on a separate server in responding to a plurality of workstations or can consist of a more elaborate collection of equipment.
For example, in a computer environment, a group of computers may each act as a tasking station to provide information to a larger group of workstations. No tasking station computer contains all of the populated databases needed to perform the tasks requested by the workstations it serves. Instead, when information from a database is required, the tasking station computer accesses a server on which the populated databases reside. The separation of server from tasking station allows efficient use of system resources. For example, the server can have greater memory and speed than the querying tasking stations. Moreover, more than a single server can be used to reduce response time and to segregate specific types of data. Further, to assure the efficient utilization of all servers, several controllers can be used to monitor data requests from the tasking stations and to direct and distribute the tasking stations' requests among all servers. In this example, communication between the servers and the tasking stations must be through the controllers. This interactive subset of system components, from the perspective of the remainder of the system, could just as well be a single piece of equipment. It is unimportant which workstation connects to which tasking station, since each tasking station has the potential to provide the same results in response to a request from the workstation. This seamless collaboration of devices providing a service to the other resources in the system is referred to hereafter as an orchestrated entity (OE).
As long as the OE is fully functional, that is, each and every device making up the OE is functioning and the communication lines between the devices are operational, the OE can properly serve the other system resources. However, should any device or communication line go down, there is the potential that the OE may not be able to adequately serve the other system resources. For example, if a server goes down, pertinent database information cannot be accessed; if a controller goes down, not all servers are accessible; and if a tasking station goes down, requests from other system resources accessing the OE through that particular tasking station will not be recognized by the OE. On the other hand, OE's can be constructed to have a certain resilience against loss of one or more devices within the OE, being capable of operating without all devices operating.
The problem, then, is to determine whether the OE is sufficiently operational at any point in time to adequately provide the desired service to the other network resources. In conventional systems, this has been the task of a master device or master node which monitors the condition of each other device and communication channel. If a device or communication channel does not respond to the master node's direct polling, that device is declared "Out of Service". Based on the master node's polling results, the master device then determines whether the OE, as a whole, is sufficiently operational to adequately provide the desired service to the other network resources and, if not, declares the OE "Out of Service" to the other network resources which must then either seek the service from alternative system resources or await the OE returning to "In Service" status.
The conventional approach to determining resource status for an OE suffers from several problems. First, the master node must be selected and connected to each device and line in the OE. Second, if the master node fails, the OE must be declared out of service because the status of other devices and lines in the OE cannot be otherwise determined.
There is therefore a need for an approach for determining whether devices in the OE are In Service and whether the OE is In Service without the use of a master polling device.
Solution
Each and Every Device Comprising Part of the OE determines OE Status for Itself.
This problem is solved, and a major advance over the prior art is achieved, by the instant invention, which provides a distributed method by which each device in an OE self-diagnoses whether the OE should be considered In Service: each device comprising part of the OE independently determines whether sufficient other devices comprising the OE are operational for that device to declare itself In Service.
Conventional processor-based electronic equipment and software, i.e., devices, are capable of polling, that is, sending a signal through a communication line to another device, receiving a signal in response, and recognizing that responding signal. In one type of polling called "Echo" polling, the initiating device sends a signal which is simply returned by the polled devices. In a second type of polling, the initiating device sends a more complex signal, one which includes as part of the signal an identifier which identifies the initiating device, and the responding device likewise responds with a more complex signal, one which includes as part of the signal an identifier which identifies the responding device as part of the returned signal.
Both types of polling are used in providing device status to monitoring equipment.
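The two types of polling described above can be illustrated with a minimal sketch. The class name `Device`, its fields, and the shape of the reply (a pair of responder and sender identifiers) are hypothetical and chosen only for illustration; the specification does not prescribe a signal format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Device:
    ident: str        # hypothetical identifier, e.g. "A" for RDa
    up: bool = True   # whether the device currently answers polls

    def echo(self, payload: str) -> Optional[str]:
        """Echo polling: return the received signal unchanged, or nothing if down."""
        return payload if self.up else None

    def identify(self, sender: str) -> Optional[Tuple[str, str]]:
        """Identified polling: reply with (responder id, sender id), or nothing if down."""
        return (self.ident, sender) if self.up else None


rda = Device("A")
rdb = Device("B", up=False)

print(rda.echo("ping"))     # the signal comes back unchanged
print(rda.identify("1"))    # the reply carries the responder's identifier
print(rdb.echo("ping"))     # a failed device returns nothing
```

In this sketch a failed device is modeled as returning nothing, which is how the polling device would recognize that it cannot talk to it.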
The instant invention recognizes and implements a series of rules discovered by the inventors which, through device polling, permit the devices comprising the OE to self-diagnose OE status.
To simplify this discussion, a device which relays signals from an initiating device to a responding device is hereafter called a "hub". An initiating device and a responding device are both called "nodes". The pathways by which signals are sent between the nodes and hubs are called "lines". Devices are said to "talk" to one another when an initiating signal results in a responding signal, regardless of the type of signal.
A Rule-Based Determination of Device Status
A. Simplistic Determination--"In Service" Permits Only Failure of One Hub
A node declares itself "In Service" if it passes either of the following tests:
Rule 1) if the node is able to talk to all hubs in the OE, then the node is "In Service"; or
Rule 2) if the node fails Rule 1, but the node can talk to all other nodes, then the node is "In Service".
Otherwise, the node is "Out-of-Service".
A third test, used in the weighted determination of Section B below, is:
Rule 3) if the node can talk to all hubs and nodes that any other node says it can talk to, and the sum of the values of all these nodes and hubs is more than a predetermined "threshold" value, then the node is "In Service".
Referring to FIG. 1, an OE is illustrated which has two hubs, Hi and Hii, each connecting four Tasking Stations, TS1-TS4, to two Responding Devices, RDa and RDb. Consider RDa as the node of interest; consider further that hub Hi, which is normally capable of sending a return signal to RDa, is not currently able to do so; and consider further still that all other hubs and nodes are capable of sending a return signal.
As a consequence of not being able to talk to Hi, RDa fails to be "In Service" under Rule 1 and must apply Rule 2 to determine its final status. Under Rule 2, in the particular configuration shown in FIG. 1, RDa connects through operational hub Hii to all other nodes, RDb and TS1-4; consequently, RDa will declare itself "In Service".
Similarly considering every other node in FIG. 1 against Rule 1 and Rule 2 will likewise result in a determination that the node is "In Service" because, as the OE is configured, lines extend from each node to each hub. Thus, every node can talk with every other node through hub Hii, even if hub Hi is not operational, thereby satisfying Rule 2.
Next consider that not only Hi but also RDb is not capable of sending a signal. RDa fails to be "In Service" under Rule 1 since it cannot talk with all hubs and further fails to be "In Service" under Rule 2 because it cannot talk with RDb.
Similar investigation of every other node in FIG. 1 against Rule 1 and Rule 2 will result in a determination that each node is likewise "Out of Service": first, because no node can talk with hub Hi, failing to satisfy Rule 1, and second, because no other node can talk with node RDb, failing to satisfy Rule 2.
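The FIG. 1 walk-through of Rules 1 and 2 can be reproduced with a short sketch. It assumes the fully connected topology described above (a line from every node to every hub) and that the node under test is itself operational; the function name `status` and the set `down` of failed devices are hypothetical.

```python
HUBS = {"Hi", "Hii"}
NODES = {"TS1", "TS2", "TS3", "TS4", "RDa", "RDb"}


def status(node, down):
    """Apply Rule 1, then Rule 2, to `node` given the set of failed devices."""
    up_hubs = HUBS - down
    # Rule 1: the node can talk to every hub in the OE.
    if up_hubs == HUBS:
        return "In Service"
    # Rule 2: the node can talk to every other node. In FIG. 1 this requires
    # at least one working hub to relay signals and no other node being down.
    others = NODES - {node}
    if up_hubs and not (others & down):
        return "In Service"
    return "Out of Service"


print(status("RDa", {"Hi"}))          # hub Hi down: Rule 2 is satisfied via Hii
print(status("RDa", {"Hi", "RDb"}))   # Hi and RDb down: both rules fail
print(status("TS1", {"RDb"}))         # only a node down: Rule 1 is still satisfied
```

The third call illustrates that Rule 1 alone tolerates the loss of a node so long as every hub still responds.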
Using the above rules, there will be circumstances in which certain nodes declare themselves "In Service" under Rule 1 because they are able to talk to all hubs while other nodes in the OE are not operational. Rule 1 thus permits the operational devices in the OE to be "node resilient". However, should a hub subsequently fail, that resilience ceases: with both a node and a hub "Out of Service", every node in the OE fails Rule 1, then fails Rule 2, and declares itself "Out of Service". Thus, if the OE is considered In Service only when each and every node comprising the OE declares that it is In Service, then, outside the application of Rule 1, any node declaring an Out-of-Service status means that the OE will in such a circumstance also be Out of Service. Thus, in the most stringent and simplistic approach to OE status, testing any single operational node under Rules 1 and 2 likewise determines OE status, no further tests being required.
B. Operational Importance Determination--"In Service" Permits Failures Until Weighted Functionality Degrades Below Acceptable Level
However, there may be in some systems, and for some OE's, the need or ability for the OE to have more resilience, that is, to function without all nodes being declared In Service. This is easily accomplished under the present invention by assigning an identifier to each node; listing each node by its identifier in the processor memory in each node and in each hub; assigning a value to each node and to each hub; and correlating the assigned value to the appropriate listed nodes and hubs in the processor memory of each node such that identification of a node likewise identifies the assigned value of the node. Further, the polling signal and responding signal sent by each node are constructed to include the identifier of each node and each hub with which that node can talk. Once this is implemented, each node can be tested under Rule 3, stated above, which compares the summed values of the nodes and hubs with which the node can talk against a predetermined threshold value.
Consider FIG. 1 again, noting that RDa and RDb have both been assigned a value of 4; that Hi and Hii have both been assigned a value of 0; and that TS1, TS2, TS3 and TS4 have each been assigned a value of 2. Assuming that Hii and TS4 are nonfunctioning, each node commences polling by sending a polling signal which includes only its own identifier. Each responding node responds with its own identifier, plus the identifier of every other node from which it has received a responding signal, plus the identifier of the node which sent the polling signal. In FIG. 1, this means that TS1 receives a responding signal from TS2, TS3, Hi, RDa and RDb. Likewise, every other node in the OE will receive responding signals from the set of devices (TS1, TS2, TS3, Hi, RDa, RDb), since every operating node can communicate through hub Hi. Consequently, the first part of Rule 3 is satisfied: each polling node can talk to all hubs and nodes that any other node says it can talk to.
For the purpose of this discussion, assume that the last character of the node designation is the node identifier, e.g., the node identifier for TS4 is 4 and that for RDb is B. The polling signal for TS1 initially includes "1" as its identifier. As each node and hub responds, each identifier for each responding node and hub which is not then part of the polling signal for TS1, will be added to and made a part of the polling signal for TS1. This is also the case with every other node.
Moreover, as each node and hub responds to TS1, it includes as part of its responding signal the identifier for every node and hub which has responded to its polling signal. Eventually, the polling signal for TS1 includes the identifiers (1-2-3-I-A-B). This means that TS1 has received a responding signal from TS2, TS3, Hi, RDa and RDb, but has not received a responding signal from either TS4 or Hii. In one implementation of the instant invention, the order in which the identifiers appear in a signal determines which is the identifier for the polling node, which is the identifier for the responding node or hub, and which are the identifiers for the nodes and hubs with which the node can talk.
Likewise, each of the other functioning nodes will ultimately have an initiating signal that includes the identifiers (1-2-3-I-A-B), and each node will ultimately provide a responding signal that includes the identifiers (1-2-3-I-A-B) as well.
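The accumulation of identifiers in the polling signals can be sketched as a simple fixed-point iteration. This is only one possible illustration under stated assumptions: every functioning device is taken to be reachable from every other through the working hub Hi, the identifier "II" for Hii is hypothetical (the text gives "I" for Hi only), and the names `propagate` and `signal` are invented for the example.

```python
# Identifiers per device; "II" for Hii is an assumption made for this sketch.
IDS = {"TS1": "1", "TS2": "2", "TS3": "3", "TS4": "4",
       "Hi": "I", "Hii": "II", "RDa": "A", "RDb": "B"}


def propagate(down):
    """Each functioning device starts with its own identifier and merges in
    the identifiers carried by every response until nothing changes."""
    up = [d for d in IDS if d not in down]
    known = {d: {IDS[d]} for d in up}
    changed = True
    while changed:
        changed = False
        for poller in up:
            for responder in up:
                if responder == poller:
                    continue
                # The response carries the responder's identifier and every
                # identifier the responder has collected so far.
                merged = known[poller] | known[responder]
                if merged != known[poller]:
                    known[poller] = merged
                    changed = True
    return known


def signal(idents):
    """Render a set of identifiers in a fixed display order."""
    return "-".join(i for i in IDS.values() if i in idents)


known = propagate(down={"Hii", "TS4"})
print(signal(known["TS1"]))   # 1-2-3-I-A-B
print(signal(known["RDb"]))   # 1-2-3-I-A-B
```

Every functioning device converges on the same identifier list, which is the condition the first part of Rule 3 relies upon.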
Applying Rules 1 and 2, the failure of Hii and TS4 would result in all active nodes in the OE declaring themselves "Out of Service". However, in this extension of the invention, using the same polling concepts but assigning weighted values to the various devices comprising the OE enables the nodes comprising the OE to continue to declare themselves In Service unless and until the functionality of the OE degrades below a determined level, thus increasing the operational resilience of the OE. In the instant invention, that level is expressed as a selected threshold value which the correlated value of the nodes and hubs operating in the OE must exceed.
Applying Rule 3 to this example and specifically to TS1, TS1 will ultimately receive a signal having a correlated value of 14 (TS1=2; TS2=2; TS3=2; Hi=0; RDa=4; RDb=4), since neither TS4 nor Hii is functioning.
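The correlated value in this example is a plain sum of the assigned values, which may be checked as follows. The dictionary name `VALUES` and the list `reachable` are hypothetical; the numbers are those assigned in the FIG. 1 example.

```python
# Values assigned to each device in the FIG. 1 example.
VALUES = {"TS1": 2, "TS2": 2, "TS3": 2, "TS4": 2,
          "Hi": 0, "Hii": 0, "RDa": 4, "RDb": 4}

# Devices represented in TS1's signal when Hii and TS4 are not functioning.
reachable = ["TS1", "TS2", "TS3", "Hi", "RDa", "RDb"]

correlated = sum(VALUES[d] for d in reachable)
print(correlated)   # 14
```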
Avoidance of "Split Personalities" by Threshold
The present invention appreciates that, in order to prevent a single OE from dividing into two OE's that both think they are In Service, the correlated value for the nodes and hubs providing responding signals to a polling node must be more than half the sum of the correlated values of all the nodes and hubs comprising the OE. Thus, in the example of TS1 above, in order to be declared In Service, first, no other node can talk to more nodes than TS1 and, second, the sum of the correlated values for the nodes and hubs providing responding signals to TS1 must be more than 8, which is half of 16, the sum of the correlated values for all nodes and hubs comprising the OE in the instant example. The value 8 is thus the threshold value which must be exceeded in order to prevent the nodes comprising two halves of the OE from separately declaring themselves "In Service" and operating independently as though they were each the OE.
In this example, since 14 is greater than 8, and since TS1 is able to talk with the same nodes that any other node is able to talk to, TS1 will declare itself "In Service". Investigation of every other functioning node in the OE of FIG. 1 will result in the determination that every other node will likewise have a correlated value of 14 for the sum of the nodes and hubs with which it can talk and will declare itself "In Service". This indicates that any node is capable of determining the service status of the OE. Hence, if under Rule 3 any node declares itself In Service, then the OE as a corollary will be "In Service".
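A minimal sketch of the Rule 3 decision, including the split case the threshold is meant to exclude, might look like the following. The function name `rule3`, the `views` mapping (each functioning node mapped to the set of devices it reports it can talk to, itself included), and the particular even split shown are all assumptions made for illustration.

```python
VALUES = {"TS1": 2, "TS2": 2, "TS3": 2, "TS4": 2,
          "Hi": 0, "Hii": 0, "RDa": 4, "RDb": 4}


def rule3(node, views, values=VALUES):
    """Return True if `node` may declare itself In Service under Rule 3."""
    mine = views[node]
    # First part: the node can talk to everything any other node reports.
    agrees = all(view <= mine for view in views.values())
    # Second part: the summed values must exceed half the total for the OE.
    threshold = sum(values.values()) / 2
    return agrees and sum(values[d] for d in mine) > threshold


# The example from the text: Hii and TS4 are not functioning.
group = {"TS1", "TS2", "TS3", "Hi", "RDa", "RDb"}
views = {d: group for d in group if d != "Hi"}
print(rule3("TS1", views))   # True, since 14 > 8

# A hypothetical even split of the OE into two isolated halves of value 8 each.
half_a = {"TS1", "TS2", "Hi", "RDa"}
half_b = {"TS3", "TS4", "Hii", "RDb"}
views_a = {d: half_a for d in half_a if d != "Hi"}
views_b = {d: half_b for d in half_b if d != "Hii"}
print(rule3("TS1", views_a), rule3("TS3", views_b))   # False False
```

Because the two halves together cannot sum to more than the total, at most one of them can exceed half of it, so two isolated groups can never both declare themselves In Service.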
The reader may note that no attempt is made to independently test the lines between nodes. This is unnecessary inasmuch as a failure of a line will result in no signal being communicated across the line, so that a node or hub at either end of the line is unable to talk with its counterpart. Consequently, the loss of a line is the same as the loss of a hub or a node from the perspective of the individual node and from the perspective of the OE's service status.
In accordance with one aspect of our invention, distributed determination of OE status is achieved by each node comprising part of the OE independently determining its service status.
In accordance with another aspect of our invention, the status of a node by corollary determines the status of the OE.
In accordance with yet another aspect of our invention, the status of a node is determined by applying two conditions: first, if the node can talk to all hubs in the OE, then it is In Service; and second, if the node fails to be declared In Service under the first condition but can talk to all other nodes in the OE, then it is In Service.
In accordance with still yet another aspect of our invention, an OE can be considered still In Service, although not all nodes or hubs are In Service, by a node determining which nodes and hubs are In Service, weighing the importance of the In Service nodes and hubs to OE In Service status, and determining if the importance of the In Service nodes and hubs is sufficient to consider the OE In Service.
In accordance with a further aspect of our invention, the importance of nodes and hubs is represented by weighted values, each node and each hub in the OE being assigned a weighted value.
In accordance with a still further aspect of our invention, whether the aggregate importance of the responding nodes and hubs is sufficient for a polling node to declare the OE In Service is determined by summing the assigned values of the In Service nodes and hubs and determining if that sum exceeds a threshold value.
In accordance with a still yet further aspect of our invention, to avoid an OE splitting into two OE's, the threshold value must be greater than one half the sum of all the values for all nodes and hubs in the OE.