1. Field of the Invention
The present invention generally relates to fault tolerant network systems, and, more specifically, the present invention relates to an active/pseudo-active/stand-by queue computer system in which transparent request processing takeover is enabled in the case of a software and operating system failure in an active queue computer.
2. Description of the Background
As computer networks become more prevalent in everyday life, the number of both business and personal activities that rely on such networks has increased dramatically. Generally speaking, in a typical network environment, there are a plurality of client computers (users) that issue requests to a plurality of application servers (AP servers) that process those requests. The client computers and AP servers are referred to as existing on the client server layer and the AP server layer, respectively, with the interconnection points on the system known as nodes.
Because so many different computers or other equipment are connected to a single network, the system must provide an efficient way for one machine to communicate with another. Because a direct communication link from every computer and/or piece of equipment to every other computer would be costly, networks typically allow for some type of shared communication. The most popular type of network communication scheme involves assigning each machine connected to the network a unique address, and then sending information back and forth across the network in discrete units of information (e.g., “packets”) that carry the address of both the sending computer (“sender”) and the receiving computer (“receiver”). In this way, each machine can identify information that is addressed to it, and respond accordingly. One popular protocol for such a communication procedure is TCP/IP (Transmission Control Protocol/Internet Protocol).
Each machine on the network physically interconnects with the wired or wireless communication medium via a network interface card or “NIC.” Every NIC has a media access control (MAC) address that is a unique identifier of that machine. TCP/IP packets that travel throughout a network contain not only data but also address information including the IP address (actual or alias) and the MAC address of both the sending and the receiving computers. The NIC examines all of the packets that travel throughout the network, searching for packets that contain its own MAC/IP address. If the NIC determines that a packet contains its own address as the destination, the NIC will receive the entire packet. Likewise, if a packet does not include the NIC's address, the packet will generally be ignored by the NIC.
A NIC may also have a “promiscuous mode” in which all packets are received. If the promiscuous mode is set for a NIC, all packets are received by the NIC without determining whether or not the packet includes that NIC's MAC address. This process of receiving all packets is known as “sniffing” all of the packets on the network. This process may be useful in the design of fault tolerant systems.
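The address filtering described above, together with the promiscuous mode, may be sketched as follows. This is only a minimal illustration and not a driver implementation; the names `Frame`, `Nic` and `accepts`, as well as the example addresses, are hypothetical and are introduced here for illustration. The sketch also assumes that a NIC accepts frames sent to the broadcast hardware address.

```python
from dataclasses import dataclass

# Broadcast hardware address: accepted by every NIC on the LAN.
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


@dataclass
class Frame:
    """Hypothetical, simplified frame carrying only the fields needed here."""
    src_mac: str
    dst_mac: str
    payload: bytes


@dataclass
class Nic:
    """Hypothetical model of a NIC's receive filter."""
    mac: str
    promiscuous: bool = False

    def accepts(self, frame: Frame) -> bool:
        # In promiscuous mode every frame is received ("sniffing").
        if self.promiscuous:
            return True
        # Otherwise only frames addressed to this NIC (or broadcast) are received.
        return frame.dst_mac.lower() in (self.mac.lower(), BROADCAST_MAC)
```

For example, a `Nic` created with `promiscuous=False` ignores a frame whose `dst_mac` belongs to another machine, and receives the same frame once `promiscuous` is set to `True`.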
As mentioned above, every TCP/IP packet that travels throughout a system includes the IP address and the MAC address of both the sender and the intended receiver of the packet. These IP and MAC addresses are added by a network driver running on the sending computer with a NIC installed therein. In some fault tolerant systems, the “sending” IP address and MAC address are set by the NIC so as to reflect the IP and MAC addresses of a machine other than that which actually sends the packet. Through this process, the NIC which receives the packet with the “incorrect” IP and MAC address information will not respond to the computer that sent the packet (because it will not have that computer's proper addresses). This process is known as “spoofing” the packets, and any reply to such a spoofed packet will be sent to the spoofed machine. This technique is often used by hackers who do not wish to be identified.
In order to communicate with another machine over a TCP/IP network, a packet sending computer needs to know the IP address and the MAC address of a receiving computer. The method for determining the correspondence between a NIC's MAC address and the IP address is known as Address Resolution Protocol (ARP). To determine a host B's MAC address, host A sends out an ARP request for the hardware address of host B. Host B sees the request on the network and sends an ARP reply containing the hardware address for the interface with the IP address in question. Host A then records the hardware address in its ARP cache for future use in sending TCP/IP packets.
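The request/reply exchange and the caching step described above may be illustrated by the following simulation. It is a sketch under simplifying assumptions: the LAN is modeled as an in-memory list of hosts rather than a real network, and the names `Host`, `Lan` and `resolve` are hypothetical.

```python
class Host:
    """Hypothetical host with one interface and an ARP cache."""

    def __init__(self, ip, mac):
        self.ip = ip
        self.mac = mac
        self.arp_cache = {}  # maps IP address -> MAC address

    def handle_arp_request(self, target_ip):
        # A host replies only when the request asks for its own IP address.
        return self.mac if target_ip == self.ip else None


class Lan:
    """Hypothetical shared medium: every attached host sees every request."""

    def __init__(self):
        self.hosts = []

    def attach(self, host):
        self.hosts.append(host)

    def broadcast_arp_request(self, sender, target_ip):
        for host in self.hosts:
            if host is sender:
                continue
            mac = host.handle_arp_request(target_ip)
            if mac is not None:
                return mac  # the ARP reply
        return None  # no host owns the requested IP address


def resolve(lan, sender, target_ip):
    """Return the MAC address for target_ip, using the cache when possible."""
    if target_ip in sender.arp_cache:
        return sender.arp_cache[target_ip]
    mac = lan.broadcast_arp_request(sender, target_ip)
    if mac is not None:
        sender.arp_cache[target_ip] = mac  # recorded for future packets
    return mac
```

In this sketch, host A calling `resolve` for host B's IP address triggers one simulated broadcast; later calls for the same address are answered from host A's cache.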
There is also a similar process known as gratuitous ARP. A gratuitous ARP is an ARP reply that is sent by a host even when no ARP request is made to that host. A gratuitous ARP reply is typically addressed to the broadcast hardware address so that all hosts on the local area network (LAN) will receive the ARP reply. Upon receipt of the gratuitous ARP reply, all hosts on the network refresh their ARP cache to reflect the gratuitous ARP. Gratuitous ARP is generally used when the relationship between an IP address and a MAC address changes, i.e., when an IP takeover occurs. With the gratuitous ARP, the change can be communicated to all other hosts on the system in an efficient manner, so that future communications reflect the change.
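By way of illustration, the following sketch builds the 28-byte ARP payload of a gratuitous ARP reply for IPv4 over Ethernet and shows how a receiving host might refresh its cache from it. The field layout follows the standard ARP packet format; the helper names (`build_gratuitous_arp_reply`, `apply_arp`) and the example addresses are hypothetical, and the sketch neither adds the Ethernet header nor transmits anything on a network.

```python
import socket
import struct

# ARP payload for IPv4 over Ethernet: hardware type, protocol type,
# hardware length, protocol length, operation, sender MAC, sender IP,
# target MAC, target IP (network byte order, 28 bytes in total).
ARP_FORMAT = "!HHBBH6s4s6s4s"
BROADCAST = b"\xff" * 6
ARP_REPLY = 2


def mac_to_bytes(mac: str) -> bytes:
    return bytes.fromhex(mac.replace(":", ""))


def bytes_to_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def build_gratuitous_arp_reply(ip: str, mac: str) -> bytes:
    """Unsolicited reply announcing that `ip` is now reachable at `mac`."""
    ip_raw = socket.inet_aton(ip)
    # Sender and target IP are the same address in a gratuitous ARP.
    return struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, ARP_REPLY,
                       mac_to_bytes(mac), ip_raw, BROADCAST, ip_raw)


def apply_arp(cache: dict, packet: bytes) -> None:
    """Refresh a host's ARP cache from the sender fields of an ARP packet."""
    _, _, _, _, _, sender_mac, sender_ip, _, _ = struct.unpack(ARP_FORMAT, packet)
    cache[socket.inet_ntoa(sender_ip)] = bytes_to_mac(sender_mac)
```

A host whose cache maps an IP address to the failed machine's MAC address would, after `apply_arp` processes such a reply, map the same IP address to the MAC address of the machine that took over.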
Because these network communication schemes are so complex, it is often difficult to determine whether or not a certain machine or a certain node on the network is not functioning properly. Because a machine may be engaging in many simultaneous communications with different computers, it may also be difficult to track when these errors occur. Additionally, when an error occurs in a network machine, the other machines must either be made aware of this error (failure state), or some background process must be undertaken to correct the error without the knowledge of the other computers on the network. Such a correction procedure is considered “transparent” to the other computers.
The general category of these error correction or mitigation systems is known as fault tolerant processing. One of the main problems to be addressed by fault tolerant processing is the taking over by another computer (either temporarily or permanently) of the operations of one computer which enters a fault state. Traditionally, this takeover can only occur in either a limited fashion, or with some disruption caused to the users or client computers (i.e., it is not transparent). Some of the more recent fault tolerant systems are now described.
Japanese Patent Publication No. 11-296396 entitled “Widely Applicable System With Changeover Concealing Function” generally describes a fault recovery system including an active system host and a stand-by system host. Each of these hosts has an associated NIC attached thereto, and each of the hosts is also attached to the NIC of the other system host (active host to stand-by NIC and stand-by host to active NIC). If a system failure occurs in the active system, the stand-by system host will use the NIC of the active system host (via the direct connection), and then MAC address and IP takeover may be achieved.
This system is undesirable to implement because it requires the use of additional hardware (the host/NIC interconnections). Additionally, if the failure error occurs on the active host NIC, rather than on the active host itself, this system is unable to perform the requisite MAC address takeover. Finally, this system is limited in that only MAC address takeover occurs. Any transactions that are being processed during the time when the active host system enters failure mode will be lost, and cannot be recovered to be executed by the stand-by host.
U.S. Pat. No. 6,049,825 entitled “Method and System for Switching Between Duplicated Network Interface Adapters for Host Computer Communications” generally describes a method for switching between duplicated network adapters to provide some level of fault tolerance. This system uses a gratuitous ARP to enable NIC takeover upon detection of a fault.
This type of gratuitous ARP method is not preferred because it is not transparent to the client. In other words, the client computers are aware of the fault. Additionally, because the NIC takeover relies on the gratuitous ARP process, the takeover takes an undesirable amount of time to complete. Finally, in a similar problem to that discussed in the above case, this system only provides for NIC takeover, and any transaction that is being processed at the time the fault occurs will be lost and must be re-sent by the client.
Japanese Patent Publication No. 2000-056996 entitled “Decentralized Queue System” generally describes a method in which a plurality of queues, that are communicatively connected to each other, are interspersed between several client computers and several application servers. A client request is en-queued in a first queue, and the request is then additionally en-queued into a second queue before the client sends out a second request to the queues. If a fault occurs in the first queue during processing of the request, the second queue is capable of sending the request to the application servers for processing (as well as potentially copying the request to yet another queue in case of a fault in the second queue).
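The general idea of copying each request into a second queue before the next request is accepted may be sketched as follows. This is a simplified, single-process illustration and not the method of the cited publication; the class `ReplicatedQueue` and its methods are hypothetical, and the failure is simulated by a flag.

```python
from collections import deque


class ReplicatedQueue:
    """Hypothetical pair of queues in which every request is stored twice."""

    def __init__(self):
        self.first = deque()
        self.second = deque()
        self.first_failed = False

    def enqueue(self, request):
        # The request is copied into the second queue before enqueue()
        # returns, i.e. before the client can send its next request.
        self.first.append(request)
        self.second.append(request)

    def fail_first(self):
        # Simulate a fault in the first queue: its contents are lost.
        self.first_failed = True
        self.first.clear()

    def dispatch(self):
        """Return the next request for the application servers, or None."""
        source = self.second if self.first_failed else self.first
        if not source:
            return None
        request = source.popleft()
        if not self.first_failed:
            # Both queues hold requests in the same order, so the copy
            # of the dispatched request is dropped from the second queue.
            self.second.popleft()
        return request
```

The sketch makes the cost visible: every `enqueue` performs two writes, which corresponds to the overhead of passing each request to more than one queue.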
First, this method is not satisfactory for communications systems that rely on the speed of data travel because of the large amount of overhead involved in passing every request to more than one queue. Also, although this method somewhat lessens the chance of losing a request, it does not accomplish an IP takeover procedure. Finally, once a queue fails, the connection between the queue system and the server is lost, and the clients must be made aware of the error to re-route future packets, all of which creates even more overhead.
Japanese Patent Publication No. 10-135993 entitled “Address-Setting System in Duplicate System” generally provides a system capable of performing MAC address and IP address takeover. The system includes an address buffer that stores a plurality of MAC addresses. If a system failure occurs, a stand-by computer performs an IP takeover using an alias IP address and MAC takeover using this address buffer. A system rebooting process for the originally active computer (now in stand-by) can test if the network environment is functioning normally by using the original IP address and MAC address.
Although this system is capable of performing MAC address and IP address takeover, it is not capable of addressing any of the requests that are currently being processed by the host when a failure occurs. Again, these requests are lost unless and until they are resent by the client. Also, the above MAC address and IP address processes are not efficient—these takeover processes take an extended period of time to perform.
Japanese Patent Publication No. 2000-322350 entitled “Server Switching System For Client Server System” generally describes a system for rerouting requests to bypass servers that have encountered a failure. The system includes a switching mechanism inserted between a client and a plurality of application servers. All requests from the client to the application servers pass through this switching mechanism so that, if a failure occurs in one of the application servers, the switching mechanism will reroute the request to a substitute server. This rerouting is transparent to the client.
First, this system is not preferred because it necessitates an additional layer of hardware in the network system. Also, this system only conceals errors that occur in the application servers, that is, errors that are “below” the switching mechanism. Therefore, errors that occur in the switching mechanism or “above” the switching mechanism are not addressed by this system and are not transparent to the client. Finally, this system does not address requests that are currently being processed in the application server at the time when the server encounters a fault state.
In all, these conventional systems do not provide a transparent and efficient solution to the problem of temporary or permanent IP and MAC address takeover when a failure occurs in an active computer. The ability to both transparently switch processing from one computer to another, as well as to perform this operation without losing any of the requests or transactions that have been queued but not sent out of the active computer at the time of failure is desired. The present invention preferably addresses at least some of the above problems with the conventional systems.