1. Field of the Invention
The present invention relates to the design of computer systems, and, in particular, to computer system architectures that provide automatic protection switching from active functions to standby functions to maintain uninterrupted system operations.
2. Description of the Related Art
In certain computer systems, it is important to maintain uninterrupted system operations, even in the event of catastrophic failure of one or more of the functions of those systems. For example, a switch node in a telecommunication system may be responsible for receiving, routing, and re-transmitting a large number of signals to support telecommunications between many pairs of end users of the system. The switch node may consist of a number of circuit boards operating together to support the overall signal switching operations of the node.
FIG. 1 shows a block diagram of a switch node 100 for a telecommunication system. Switch node 100 comprises working switch function 102 connected to a plurality of working port functions 104. Each working port function 104 provides the interface between working switch function 102 and the rest of the telecommunication system for a particular subset of the signals that are routed by working switch function 102. Depending on the implementation, each working switch and port function may be implemented on a separate circuit board.
In order to maintain uninterrupted telecommunication services to the various end users, it is desirable to provide a hardware design that enables switch node 100 to continue to operate even if one of the working circuit boards fails or otherwise begins to operate in a degraded manner. This may be achieved by providing an additional switch circuit board as a protection (i.e., backup) switch function and an additional port circuit board as a protection port function. If working switch function 102 fails, the protection switch function can assume its switch operations. Likewise, if any one of the working port functions 104 fails, the protection port function can assume its port operations. Such schemes are referred to as protection switching, where a protection function is switched on-line to assume the responsibilities of a failed working function. The word "switching" in the term "protection switching" refers to the switching of functions and is not related to the switching of signals provided by the switch functions in the telecommunication example of FIG. 1.
FIG. 2 shows a block diagram of a generic system 200 that provides protection switching in the event of function failure. In system 200, a working server function 202 communicates with client function 206 to support system operations. System 200 also has a protection server function 204, identical to working server function 202, as a backup in case working server function 202 stops operating properly. Protection server function 204 may be inactive (i.e., cold standby) or it may be operating off-line (i.e., hot standby). In addition, system 200 has control function 208, which monitors the operations of both working and protection server functions 202 and 204 for failures.
As used in this specification, the terms "server" and "client" are merely used to distinguish functions and are not intended to limit the types of operations performed by those functions. For example, referring to the switch node of FIG. 1, working server function 202 of FIG. 2 may be a working switch function, protection server function 204 may be a protection switch function, and client function 206 may be one of the port functions. In that case, FIG. 2 corresponds to protection switching provided for working switch function 102 in FIG. 1. Analogous protection switching may also be provided for working port functions 104.
According to one conventional equipment protection scheme, if a failure in working server function 202 is detected, control function 208 activates protection server function 204 if necessary (e.g., if protection server function 204 was in a cold standby mode) and directly instructs client function 206, via communication link 210, to switch its selection of the active server function from working server function 202 to protection server function 204.
One drawback with this scheme is that it requires control function 208 to maintain knowledge of client function 206 and to communicate directly with client function 206. There are some applications in which a single server function may support a large number of client functions, each similar to client function 206 (e.g., where each client function is a port function as in FIG. 1). In such applications, control function 208 must maintain knowledge of a large number of client functions. Whenever the configuration of the client functions changes (e.g., a client function is added to or deleted from the system), the database of information in control function 208 must be updated. In addition, unless the client functions are designed to transmit acknowledgment messages back to control function 208, control function 208 cannot be sure that all of the client functions have received its instructions. If, for example, a particular client function is temporarily off-line when protection switching instructions are sent, the client function will not receive the instructions, control function 208 will not know that the client function has not received them, and, when the client function is brought back on-line, it will assume that the failed working server function 202 is still the active server function. Moreover, the requirement for control function 208 to communicate directly with each client function may cause a relatively long delay after a server function failure before all of the client functions can be instructed to switch to protection server function 204, which can result in an interruption of the overall system operations. For example, in a telecommunication system, under such circumstances, signals between one or more, and possibly all, pairs of end users may be dropped.
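The missed-instruction drawback of the control-driven scheme can be illustrated with a minimal sketch. The class names, method names, and port labels below are hypothetical and serve only to model control function 208, client function 206, and communication link 210; they do not represent any actual implementation.

```python
from dataclasses import dataclass, field

WORKING = "working server function 202"
PROTECTION = "protection server function 204"


@dataclass
class ClientFunction:
    """Hypothetical model of a client function such as client function 206."""
    name: str
    online: bool = True
    selected: str = WORKING

    def receive_switch_instruction(self, target: str) -> bool:
        """Apply an instruction sent over link 210; return False if it was lost."""
        if not self.online:
            return False  # an off-line client never sees the message
        self.selected = target
        return True


@dataclass
class ControlFunction:
    """Hypothetical model of control function 208, which must know every client."""
    clients: list = field(default_factory=list)

    def on_working_server_failure(self) -> None:
        # Assumption: no acknowledgments are returned, so delivery results
        # are discarded and the control function cannot detect a lost message.
        for client in self.clients:
            client.receive_switch_instruction(PROTECTION)


def simulate() -> dict:
    port_a = ClientFunction("port A")
    port_b = ClientFunction("port B", online=False)  # temporarily off-line
    control = ControlFunction(clients=[port_a, port_b])
    control.on_working_server_failure()
    port_b.online = True  # brought back on-line after the instructions were sent
    return {client.name: client.selected for client in control.clients}


if __name__ == "__main__":
    for name, selected in simulate().items():
        print(f"{name} selects {selected}")
```

In this sketch, the client that was off-line when the instruction was sent still selects the failed working server function after it returns, and the control function has no indication that this has happened.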
According to another equipment protection scheme, if control function 208 detects a failure in working server function 202, control function 208 informs only server functions 202 and 204 of the need to switch from working server function 202 to protection server function 204. Server functions 202 and 204, in turn, notify the client functions of the need to switch to protection server function 204. Each server function typically implements this using in-band signaling, in which a specific status bit in the overhead data communicated to each client function identifies whether or not the server function is active. Each client function monitors that status bit from each server function to determine whether to continue to operate with working server function 202 or to switch to protection server function 204. If a failure in working server function 202 causes all communications with the client functions to cease, the client functions will use the lack of signal from working server function 202 as an indication of the need to switch operations to protection server function 204. Under this equipment protection scheme, communication link 210 of FIG. 2 is not needed. This scheme alleviates many of the problems associated with requiring the control function to communicate directly with each client function. There are, however, possible situations (e.g., race conditions and other ambiguous states) that are not adequately addressed by this scheme.
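The client-side selection under the in-band signaling scheme can be sketched as follows. This is an illustrative model only: the function name, the encoding of a lost signal as None, and in particular the choice to keep the current selection in ambiguous cases are assumptions that are not specified by the scheme described above.

```python
from typing import Optional

WORKING = "working"
PROTECTION = "protection"


def select_server(current: str,
                  working_bit: Optional[bool],
                  protection_bit: Optional[bool]) -> str:
    """Pick a server function from the in-band status bits.

    Each bit is True (server reports active), False (server reports
    standby), or None (no signal received from that server).
    """
    # Loss of signal from the working server is treated as a request to switch.
    if working_bit is None and protection_bit is not None:
        return PROTECTION
    # Exactly one server reports itself active: follow it.
    if working_bit and not protection_bit:
        return WORKING
    if protection_bit and not working_bit:
        return PROTECTION
    # Ambiguous: both active, neither active, or both silent.
    # Assumption: hold the current selection; the scheme defines no rule here.
    return current


if __name__ == "__main__":
    print(select_server(WORKING, True, False))     # normal operation
    print(select_server(WORKING, False, True))     # commanded switch
    print(select_server(WORKING, None, False))     # working server went silent
    print(select_server(WORKING, True, True))      # both claim to be active
    print(select_server(PROTECTION, True, True))   # same bits, different outcome
```

The last two calls receive identical status bits yet return different selections, because the outcome depends on what each client had previously selected. Client functions that observe such a state can therefore end up operating with different server functions, which is one way an ambiguous state of the kind mentioned above could arise.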