A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The invention relates to communications between two computer systems. More particularly, the invention relates to providing communications between two servers in a server network, for monitoring the operational status of the servers, synchronizing events or actions initiated by the servers, and providing messaging capability between the two servers.
2. Description of the Related Technology
As computer systems and networks become more complex, various systems for promoting fault tolerance in these networks have been developed. One method of preventing network down-time due to the failure or removal of a fileserver from a server network is to implement “server mirroring.” Server mirroring as it is currently implemented requires a primary server, a primary storage device, a backup server, a backup storage device and a unified operating system linking the two servers and storage devices. The purpose of the backup server is to resume the operations of the primary server should it become inoperative. An example of a mirrored server product is provided by Software Fault Tolerance Level 3 (SFT III) provided by NOVELL INC., 1555 North Technology Way, Orem, Utah, as an add-on to its NetWare® 4.x product. SFT III maintains servers in an identical state of data update. It separates hardware-related operating system (OS) functions on the mirrored servers so that a fault on one hardware platform does not affect the other.
The server OS is designed to work in tandem with two servers. One server is designated as a primary server, and the other is a secondary server. The primary server is the main point of update; the secondary server is in a constant state of readiness to take over. Both servers receive all updates through a special link called a mirrored server link (MSL), which is dedicated to this purpose. The servers also communicate over the local area network (LAN) that they share in common, so that one knows if the other has failed even if the MSL has failed. When a failure occurs, the second server automatically takes over without interrupting communications in any user-detectable way. Each server monitors the other server's NetWare® Core Protocol (NCP) acknowledgments over the LAN to see that all requests for that server are serviced and that OSs are constantly maintained in a mirrored state.
When the primary server fails, the secondary server detects the failure and immediately takes over as the primary server. The failure is detected in one or both of two ways: the MSL link generates an error condition when no activity is noticed, or the servers communicate over the LAN, each one monitoring the other's NCP acknowledgment. The primary server is simply the first server of the pair that is brought up. It then becomes the server used at all times and it processes all requests. When the primary server fails, the secondary server is immediately substituted as the primary server with identical configurations. The switch-over is handled entirely at the server end, and work continues without any perceivable interruption.
Although server mirroring increases security against down-time caused by a failed server, it does so at a considerable cost. This method of providing fault tolerance requires the additional expense and complexity of standby hardware that is not used unless there is a failure in the primary server.
Another method of providing fault tolerance in a server network which does not require additional redundant (mirrored) hardware is referred to as “clustering” the servers in the network. Under one type of clustering method, a replicated Network Directory Database (NDD) operates in conjunction with server resident processes, running on a cooperating set of servers called a cluster, to remap a network resource to an alternate server, in the event of a primary server failure. A remappable resource is called a clustered resource. The records/objects in the replicated database contain, for each clustered network resource, a primary and a secondary server affiliation. Initially, all users access a network resource through the server identified in the replicated database as being the primary server for the network resource. When server resident processes detect a failure of that primary server, the replicated database is updated to reflect the failure of the primary server, and to change the affiliation of that network resource from its primary to its backup server.
This remapping occurs transparently to whichever user/client is accessing the network resource.
As a result of the remapping, all users access the clustered network resource through the server identified in the replicated database as the backup server for the resource. When the primary server returns to service, the server resident processes detect the return to service of the primary server, and the replicated database is again updated to reflect the resumed operation of the primary server. As a result of these latter updates to the replicated database, all users once again access the network resource through the server identified in the replicated database as the primary server for the clustered network resource. This remapping of clustered network resource affiliations also occurs transparently to whichever user/client is accessing the network resource, and returns the resource to its original fault tolerant state. A further discussion of the operation and theory of clustered networks is provided in a U.S. provisional patent application, entitled, “Clustering Of Computer Systems Using Uniform Object Naming And Distributed Software For Locating Objects,” which is listed above under the heading “Priority Claim.”
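The remapping behavior described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the class `ReplicatedDirectory`, its method names, and the server and resource names are hypothetical, and a real replicated database would be distributed across the cluster rather than held in one dictionary.

```python
class ReplicatedDirectory:
    """Hypothetical stand-in for the replicated database of clustered resources."""

    def __init__(self):
        self._records = {}

    def register(self, resource, primary, backup):
        # Each record holds a primary and a backup affiliation; users are
        # initially directed to the primary server.
        self._records[resource] = {
            "primary": primary,
            "backup": backup,
            "active": primary,
        }

    def lookup(self, resource):
        # Clients always ask the directory, so remapping is transparent to them.
        return self._records[resource]["active"]

    def mark_failed(self, server):
        # Remap every resource whose primary server has failed to its backup.
        for record in self._records.values():
            if record["primary"] == server and record["active"] == server:
                record["active"] = record["backup"]

    def mark_restored(self, server):
        # Return resources to their primary server once it is back in service.
        for record in self._records.values():
            if record["primary"] == server:
                record["active"] = server


directory = ReplicatedDirectory()
directory.register("volume1", primary="serverA", backup="serverB")
directory.mark_failed("serverA")
after_failure = directory.lookup("volume1")
directory.mark_restored("serverA")
after_restore = directory.lookup("volume1")
```

In this sketch a client never stores a server name itself; it calls `lookup` each time, which is what makes the change of affiliation invisible to the user.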
The clustering method of remapping the affiliation of a network resource reduces the amount of hardware, software and processing overhead required to provide fault tolerance, when compared with the mirrored server technique. However, both of these methods and systems rely on a rather inefficient and costly method of monitoring the status of each server in the network. In order to detect that a primary server has failed, for example, these methods require both a primary server and a secondary server to communicate messages and commands across a LAN line and to process received messages and commands in accordance with a specified monitoring protocol.
One drawback of this method of providing communications between two or more servers within a server network is that it relies on a dedicated communications line, the LAN line, to communicate messages between the servers in the network. The LAN line is a valuable system resource which should be allocated only when necessary. Additionally, communicating across the LAN line is not totally reliable. If the bandwidth capacity of the LAN line is reached, or if the LAN line becomes physically damaged, it will not be able to handle communications from one server to another. Therefore, in order to provide a reliable method of monitoring and/or communicating between servers, a secondary method of communicating in the event that the LAN line becomes disabled is typically required. One such prior art secondary method includes a first server writing data, commands or information to an intermediate hard drive connected to a SCSI bus and a second server which reads the data, commands or information from the hard drive. Therefore, the hard drive serves as an intermediate depository for communicating between the SCSI adapters of two or more servers. One problem with this approach is that it creates a dependency on that device, which is often a central point of failure. For example, if the hard drive “crashes,” the two servers will not be able to communicate with each other.
A typical prior art LAN handshake protocol between two servers includes the following steps: a first adapter of a first server will send a NetWare® Core Protocol (NCP) packet to a second adapter card of a second server in order to check whether a second server is handling all its requests. The first adapter card must then wait for the second adapter card to receive the NCP signal, process it, and then send a response, which contains the intranetware address of the second adapter card. If the first adapter does not receive the intranetware address data in response to its NCP packet, the first adapter will wait for a specified amount of time after which the handshake protocol “times out” and ends, resulting in a failure to achieve a communications link with the first server.
This approach is time-consuming and requires much “overhead” in terms of processing time and logic circuitry to process and synchronize the series of commands and data transferred between the adapters of two servers trying to communicate with one another. Therefore, what is needed is a method and system for establishing communications between two or more servers within a server network such that the status of a server may be monitored, events and actions initiated by the servers may be synchronized with one another, and two or more servers may communicate with one another in a cost efficient and reliable manner. Additionally, such a method and system should reduce the amount of required “overhead” in terms of processing time and system resources.
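The request, wait, and time-out sequence of such a handshake can be sketched as follows. This is an illustration only: two in-process queues stand in for the LAN, and the function name `ncp_handshake`, the request string, and the address value are hypothetical rather than part of any actual NCP implementation.

```python
import queue


def ncp_handshake(outbox, inbox, timeout_s):
    """Send a status request, then wait up to timeout_s seconds for a reply.

    Returns the peer's address on success, or None if the handshake times out.
    """
    outbox.put("NCP_STATUS_REQUEST")  # hypothetical request payload
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        return None  # no response arrived: the handshake "times out"


# A responsive peer has already placed its address in the reply queue.
to_peer, from_peer = queue.Queue(), queue.Queue()
from_peer.put("peer-address-01")
reply = ncp_handshake(to_peer, from_peer, timeout_s=0.05)

# An unresponsive peer never replies, so the caller waits out the full time-out.
silent_reply = ncp_handshake(queue.Queue(), queue.Queue(), timeout_s=0.05)
```

The second call shows the cost noted in the text: when the peer is down, the requesting server learns nothing until the whole time-out interval has elapsed.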
The invention addresses the above and other needs by providing a method and system of communicating between two servers of a server network so as to monitor the status of each server by the other, to time and/or synchronize the events and actions initiated by one server with respect to the other, and further to provide bi-directional messaging capability between the two servers.
In one embodiment of the invention, a method of monitoring an operational status of a first server with a second server, includes: successively transmitting first and second command signals to a device coupled to the first server, wherein the first command signal places the device in a first status condition and the second command signal places the device in a second status condition; and monitoring a status condition of the device with the second server, coupled to the device, wherein a change in the status condition of the device indicates that the first server is operational.
In another embodiment, a method of monitoring a status condition of a first server with a second server in a server network, includes: transmitting a software-generated pulse waveform from the first server to a device coupled to the first server, wherein the software-generated pulse waveform comprises a first command corresponding to a logic level low and a second command corresponding to a logic level high; setting the device to a first state during logic level lows of the pulse waveform and to a second state during logic level highs of the pulse waveform; receiving the software-generated pulse waveform with the second server by determining when the device is in the first state and when it is in the second state; and determining when the device no longer changes from the first state to the second state.
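As a rough illustration of this embodiment, the following simulation models the device as an object with two states and advances time in discrete ticks. The names `Device`, `PulseReceiver`, and `first_failure_tick`, the half-period of two ticks, and the threshold of three unchanged samples are all assumptions chosen for the sketch; the text above does not specify them.

```python
class Device:
    """Hypothetical shared device whose status has two states."""

    def __init__(self):
        self.state = "first"

    def command(self, level):
        # Logic level low sets the first state; logic level high sets the second.
        self.state = "first" if level == 0 else "second"


class PulseReceiver:
    """Samples the device and tracks how long its state has gone unchanged."""

    def __init__(self, max_quiet_ticks):
        self.max_quiet_ticks = max_quiet_ticks
        self.last_state = None
        self.quiet_ticks = 0

    def sample(self, device):
        if device.state != self.last_state:
            self.last_state = device.state
            self.quiet_ticks = 0
        else:
            self.quiet_ticks += 1
        # True while the transmitting server still appears operational.
        return self.quiet_ticks <= self.max_quiet_ticks


def first_failure_tick(fail_at, total_ticks, half_period=2, max_quiet_ticks=3):
    """Return the tick at which the receiver declares a failure, or None."""
    device = Device()
    receiver = PulseReceiver(max_quiet_ticks)
    for t in range(total_ticks):
        if fail_at is None or t < fail_at:
            # The first server emits a square wave: low, low, high, high, ...
            device.command((t // half_period) % 2)
        if not receiver.sample(device):
            return t
    return None


detected = first_failure_tick(fail_at=10, total_ticks=20)   # 12
healthy = first_failure_tick(fail_at=None, total_ticks=20)  # None
```

The threshold must exceed the longest run of unchanged samples in a healthy waveform (here, one), otherwise the receiver would report a failure while the first server is still running.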
In a further embodiment, a method of monitoring a status condition of a first server by a second server, includes: transmitting SCSI Reserve and Release commands from the first server to a SCSI device, coupled to the first server; and monitoring a released/reserved status of the SCSI device with the second server.
In yet a further embodiment, a method of assigning control over a network resource, includes: transmitting SCSI Reserve and Release commands from a first server to a SCSI device, coupled to the first server; monitoring a released/reserved status of the SCSI device with a second server; determining if the first server is operational; and if it is determined that the first server has failed, assigning control over the SCSI device to the second server.
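The following sketch simulates this arrangement without touching real hardware. `SimulatedScsiDevice` is a hypothetical model of the reserved/released status only, not a SCSI driver interface; the `reset` step used to clear a stale reservation before takeover, the server names, and the threshold of three unchanged samples are assumptions made for the example.

```python
class SimulatedScsiDevice:
    """Hypothetical model of a device's reserved/released status."""

    def __init__(self):
        self.reserved_by = None

    def reserve(self, initiator):
        if self.reserved_by in (None, initiator):
            self.reserved_by = initiator
            return True
        return False  # another initiator holds the reservation

    def release(self, initiator):
        if self.reserved_by == initiator:
            self.reserved_by = None

    def reset(self):
        # Assumption: a reset clears any reservation left by a failed server.
        self.reserved_by = None

    @property
    def is_reserved(self):
        return self.reserved_by is not None


def run_failover(primary_dies_at, total_ticks, max_quiet_ticks=3):
    """Return (owner of the device, tick of takeover or None)."""
    device = SimulatedScsiDevice()
    last_status, quiet = None, 0
    for t in range(total_ticks):
        if primary_dies_at is None or t < primary_dies_at:
            # The first server alternates Reserve and Release while it is running.
            if t % 2 == 0:
                device.reserve("server1")
            else:
                device.release("server1")
        # The second server monitors the released/reserved status.
        status = device.is_reserved
        if status != last_status:
            last_status, quiet = status, 0
        else:
            quiet += 1
        if quiet > max_quiet_ticks:
            # Status stopped changing: treat the first server as failed.
            device.reset()
            device.reserve("server2")
            return device.reserved_by, t
    return device.reserved_by, None


owner, takeover_tick = run_failover(primary_dies_at=5, total_ticks=20)
```

In this run the first server stops while holding the reservation, which is why the sketch clears the reservation before the second server reserves the device for itself.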
In another embodiment, a method of synchronizing a first operation carried out by a first server with a second operation carried out by a second server, includes: transmitting a software-generated pulse waveform, having a first frequency, from the first server to a device coupled to the first server; receiving the pulse waveform with the second server by monitoring a status condition of the device; transmitting from the first server a synchronization signal to the device by changing the frequency of the pulse waveform to a second frequency; detecting by the second server the synchronization signal by detecting a change in frequency of the pulse waveform; changing at the first server the frequency of the pulse waveform back to the first frequency; detecting by the second server a change in frequency from the second frequency back to the first frequency; and setting in both servers a reference point in time at a beginning of a first cycle of the pulse waveform after it has returned to the first frequency.
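One way to picture this synchronization is with a waveform sampled once per tick, where frequency is expressed as the width of each half-cycle. The sketch below is an assumption-laden illustration: the function names, the half-cycle widths of four and two ticks, and the segment lengths are hypothetical, and the receiver identifies the reference point only after the first half-cycle at the restored frequency has completed.

```python
def build_waveform(segments):
    """Build per-tick logic levels from (half_period_ticks, half_cycle_count) pairs."""
    levels, level = [], 0
    for half_period, half_cycles in segments:
        for _ in range(half_cycles):
            levels.extend([level] * half_period)
            level ^= 1  # toggle between logic low and logic high
    return levels


def transmitter_reference(segments):
    """Tick at which the first cycle after the synchronization burst begins."""
    return sum(half_period * count for half_period, count in segments[:2])


def receiver_reference(levels, first_width, second_width):
    """Locate the same tick using only the sampled waveform."""
    edges = [t for t in range(1, len(levels)) if levels[t] != levels[t - 1]]
    seen_sync = False
    for start, end in zip(edges, edges[1:]):
        width = end - start
        if width == second_width:
            seen_sync = True  # the frequency changed: synchronization signal
        elif width == first_width and seen_sync:
            return start  # first half-cycle back at the first frequency
    return None


# First frequency, then a burst at the second frequency, then the first again.
segments = [(4, 6), (2, 4), (4, 6)]
waveform = build_waveform(segments)
tx_ref = transmitter_reference(segments)
rx_ref = receiver_reference(waveform, first_width=4, second_width=2)
```

Both servers arrive at the same tick, so operations scheduled relative to that tick on either server share a common time base.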
In yet another embodiment, a method of providing communications between a first server and a second server, includes: transmitting a first software-generated pulse waveform from the first server to a first device coupled to the first server, wherein the first pulse waveform changes a status condition of the first device between a first state and a second state; receiving the first software-generated pulse waveform with the second server by sampling the status condition of the first device; frequency modulating the first pulse waveform so as to encode a message into the pulse waveform; and reading the message with the second server by sampling the status condition of the first device at a predetermined first sampling rate.
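A message could be carried on such a waveform in many ways; the passage above does not fix an encoding. The sketch below assumes one simple scheme, in which each bit is sent as a single half-cycle whose width depends on the bit value, and the receiver samples once per tick and measures the width of each run of identical samples. The function names and the widths of four and two ticks are hypothetical.

```python
from itertools import groupby


def text_to_bits(text):
    # Most significant bit first, eight bits per ASCII character.
    return [(byte >> i) & 1 for byte in text.encode("ascii") for i in range(7, -1, -1)]


def bits_to_text(bits):
    chunks = (bits[i:i + 8] for i in range(0, len(bits), 8))
    return bytes(int("".join(map(str, chunk)), 2) for chunk in chunks).decode("ascii")


def encode(bits, width0=4, width1=2):
    """Return per-tick device states; a wide half-cycle is 0, a narrow one is 1."""
    levels, level = [], 0
    for bit in bits:
        levels.extend([level] * (width1 if bit else width0))
        level ^= 1  # the state toggles after every bit, separating the runs
    return levels


def decode(samples, width0=4, width1=2):
    """Recover bits by measuring the width of each run of identical samples."""
    bits = []
    for _, run in groupby(samples):
        width = len(list(run))
        bits.append(1 if abs(width - width1) < abs(width - width0) else 0)
    return bits


samples = encode(text_to_bits("OK"))
message = bits_to_text(decode(samples))
```

Because the state toggles after every bit, two consecutive bits never merge into one run, so the receiver needs no separate clock signal; it does need a sampling rate high enough to distinguish the two widths.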
In another embodiment, a method of providing communications between a first server and a second server, includes: executing a first pulse transmitter program in the first server so as to transmit a first software-generated pulse waveform from the first server to a first device coupled to the first server, wherein the pulse waveform changes a status condition of the first device between a first state and a second state; executing a first pulse receiver program in the second server so as to receive the first software-generated pulse waveform by sampling the status condition of the first device; frequency modulating the first pulse waveform so as to encode a first message into the first pulse waveform; reading the first message with the second server by sampling the status condition of the first device at a predetermined first sampling rate; executing a second pulse transmitter program in the second server for transmitting a second software-generated pulse waveform from the second server to a second device coupled to the second server, wherein the second pulse waveform changes a status condition of the second device between a third state and a fourth state; executing a second pulse receiver program in the first server for receiving the second software-generated pulse waveform with the first server by sampling the status condition of the second device; frequency modulating the second pulse waveform so as to encode a second message into the second pulse waveform; and reading the second message with the first server by sampling the status condition of the second device at a predetermined second sampling rate.