1. Field of the Invention
The present invention relates to clustered computer systems with multiple nodes that provide services in a scalable manner. More specifically, the present invention relates to a method and an apparatus that uses a packet distribution table to distribute packets between a cluster of server nodes that operate in concert to provide a service.
2. Related Art
The recent explosive growth of electronic commerce has led to a proliferation of web sites on the Internet selling products as diverse as toys, books and automobiles, and providing services, such as insurance and stock trading. Millions of consumers are presently surfing through web sites in order to gather information, to make purchases, or purely for entertainment.
The increasing traffic on the Internet often places a tremendous load on the servers that host web sites. Some popular web sites receive over a million “hits” per day. In order to process this much traffic without subjecting web surfers to annoying delays in retrieving web pages, it is necessary to distribute the traffic between multiple server nodes, so that the multiple server nodes can operate in parallel to process the traffic.
In designing such a system to distribute traffic between multiple server nodes, a number of characteristics are desirable. It is desirable for such a system to be efficient in order to accommodate as much traffic as possible with a minimal amount of response time. It is desirable for such a system to be “scalable,” so that additional server nodes can be added, and the distribution to the nodes can be modified, to provide a service as demand for the service increases. In doing so, it is important to ensure that response time does not increase as additional server nodes are added. It is also desirable for such a system to be constantly available, even when individual server nodes or communication pathways between server nodes fail.
A system that distributes traffic between multiple server nodes typically performs a number of tasks. Upon receiving a packet, the system looks up a service that the packet is directed to. (Note that a collection of server nodes will often host a number of different servers.) What is needed is a method and an apparatus for performing a service lookup that is efficient, scalable and highly available.
Once the service is determined, the system distributes workload involved in providing the service between the server nodes that are able to provide the service. For efficiency reasons it is important to ensure that packets originating from the same client are directed to the same server. What is needed is a method and an apparatus for distributing workload between server nodes that is efficient, scalable and highly available.
Once a server node is selected for the packet, the packet is forwarded to the server node. The conventional technique of using a remote procedure call (RPC) or an interface definition language (IDL) call to forward a packet typically involves traversing an Internet Protocol (IP) stack from an RPC/IDL endpoint to a transport driver at the sender side, and then traversing another IP stack on the receiver side, from a transport driver to an RPC/IDL endpoint. Note that traversing these two IP stacks is highly inefficient. What is needed is a method and an apparatus for forwarding packets to server nodes that is efficient, scalable and highly available.
One embodiment of the present invention provides a system that uses a packet distribution table to distribute packets to server nodes in a cluster of nodes that operate in concert to provide at least one service. The system operates by receiving a packet at an interface node in the cluster of nodes. This packet includes a source address specifying a location of a client that the packet originated from, and a destination address specifying a service provided by the cluster of nodes (and possibly a protocol). The system uses the destination address to look up a packet distribution table. The system then performs a function that maps the source address to an entry in the packet distribution table, and retrieves an identifier specifying a server node from the entry in the packet distribution table. Next, the system forwards the packet to the server node specified by the identifier so that the server node can perform a service for the client. In this way, packets directed to a service specified by a single destination address are distributed across multiple server nodes in a manner specified by the packet distribution table.
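The lookup sequence described above can be illustrated with a minimal sketch. It is not the claimed implementation: the `Packet` type, the `PDT_BY_SERVICE` mapping, the node names, the example addresses, and the use of CRC-32 as the mapping function are all assumptions introduced only for illustration.

```python
import zlib
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical packet header fields used by the sketch."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # e.g. "TCP" or "UDP"


ServiceKey = Tuple[str, int, str]

# Assumed layout: one packet distribution table (a list of server node
# identifiers) per destination address (IP address, port, protocol).
PDT_BY_SERVICE: Dict[ServiceKey, List[str]] = {
    ("192.0.2.10", 80, "TCP"): ["node-a", "node-b", "node-c", "node-a"],
}


def select_server(packet: Packet,
                  tables: Dict[ServiceKey, List[str]] = PDT_BY_SERVICE) -> str:
    """Return the identifier of the server node that should receive the packet."""
    # Step 1: the destination address selects the packet distribution table.
    pdt = tables[(packet.dst_ip, packet.dst_port, packet.protocol)]
    # Step 2: a function maps the source address to an entry in that table.
    source_key = f"{packet.src_ip}:{packet.src_port}".encode("ascii")
    index = zlib.crc32(source_key) % len(pdt)
    # Step 3: the entry holds the identifier of the server node.
    return pdt[index]


packet = Packet("198.51.100.7", 40312, "192.0.2.10", 80, "TCP")
node = select_server(packet)
```

In this sketch the interface node would then forward the packet to `node`; the forwarding step itself is omitted.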
In one embodiment of the present invention, the system allows the server node to send return communications directly back to the client without forwarding the communications through the interface node.
In one embodiment of the present invention, the function includes a hash function that maps different source addresses to different entries in the packet distribution table in a substantially random manner. Note that this hash function always maps a given source address to the same entry in the packet distribution table.
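One possible form of such a hash function is sketched below. The choice of SHA-256, the function name `pdt_index`, and the example addresses are assumptions; the source does not name a particular hash. A digest-based hash is used here because Python's built-in `hash()` for strings is salted per process, so it would not map a given source address to the same entry on every node.

```python
import hashlib


def pdt_index(src_ip: str, src_port: int, table_size: int) -> int:
    """Map a source address to a table index; the same input always gives the same index."""
    key = f"{src_ip}:{src_port}".encode("ascii")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % table_size


# Count how 1000 client ports from one host spread over an 8-entry table.
counts = [0] * 8
for port in range(1024, 2024):
    counts[pdt_index("198.51.100.7", port, 8)] += 1
```

Inspecting `counts` shows how evenly the example sources spread across the entries.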
In one embodiment of the present invention, a policy for distributing packets between server nodes in the cluster of nodes is enforced by varying a number of entries in the packet distribution table for each server node. In this way, a server node with more entries receives packets more frequently than a server node with fewer entries.
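As a hedged illustration of this weighting policy, the sketch below builds a table from a per-node entry count. The function name `build_pdt`, the weights, and the node names are hypothetical.

```python
from typing import Dict, List


def build_pdt(entries_per_node: Dict[str, int]) -> List[str]:
    """Build a packet distribution table with the requested number of entries per node."""
    table: List[str] = []
    for node, count in entries_per_node.items():
        table.extend([node] * count)
    return table


# node-a holds three of the four entries, so a uniform hash over source
# addresses would direct roughly three quarters of the clients to node-a.
pdt = build_pdt({"node-a": 3, "node-b": 1})
```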
In one embodiment of the present invention, the source address includes an Internet Protocol (IP) address and a client port number. In one embodiment of the present invention, the destination address includes an Internet Protocol (IP) address, an associated port number for the service and a protocol identifier (such as transmission control protocol (TCP) or user datagram protocol (UDP)).
One embodiment of the present invention uses the destination address to select the packet distribution table associated with the service from a plurality of packet distribution tables. In a variation on this embodiment, each packet distribution table is associated with a service group including at least one service provided by the cluster of nodes.
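The selection of a table through a service group can be sketched as two lookups. The dictionaries `SERVICE_TO_GROUP` and `GROUP_TO_PDT`, the group names, and the addresses are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

ServiceKey = Tuple[str, int, str]  # (IP address, port, protocol)

# Assumed mapping: several services may belong to the same service group.
SERVICE_TO_GROUP: Dict[ServiceKey, str] = {
    ("192.0.2.10", 80, "TCP"): "web",
    ("192.0.2.10", 443, "TCP"): "web",
    ("192.0.2.20", 53, "UDP"): "dns",
}

# One packet distribution table per service group.
GROUP_TO_PDT: Dict[str, List[str]] = {
    "web": ["node-a", "node-b", "node-c"],
    "dns": ["node-c", "node-d"],
}


def pdt_for(destination: ServiceKey) -> List[str]:
    """Select the packet distribution table for the service's group."""
    return GROUP_TO_PDT[SERVICE_TO_GROUP[destination]]
```

Because both web services share one table in this sketch, a given client address hashes to the same server node for either service.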
In one embodiment of the present invention, the system periodically sends checkpointing information from a packet distribution table (PDT) server node to a secondary PDT server node so that the secondary PDT server node is kept in a consistent state with the PDT server node. This allows the secondary PDT server node to take over for the PDT server node if the PDT server node fails.
In one embodiment of the present invention, the system periodically sends checkpointing information from a master PDT server node to at least one slave PDT server node so that the slave PDT servers are kept in a consistent state with the master PDT server.
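The checkpointing described in the two preceding paragraphs can be sketched as copying table state from one PDT server to its replicas. The `PdtServer` class, its methods, and the full-snapshot approach are assumptions; the source does not specify the checkpoint format or transport, and the periodic trigger is left out.

```python
import copy
from typing import Dict, List


class PdtServer:
    """Hypothetical PDT server holding one table per service group."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[str]] = {}

    def snapshot(self) -> Dict[str, List[str]]:
        # Copy the state so later changes on the sender do not leak through.
        return copy.deepcopy(self.tables)

    def apply_checkpoint(self, state: Dict[str, List[str]]) -> None:
        self.tables = copy.deepcopy(state)


def checkpoint(primary: PdtServer, replicas: List[PdtServer]) -> None:
    """Send the primary's current state to every secondary (or slave) PDT server."""
    state = primary.snapshot()
    for replica in replicas:
        replica.apply_checkpoint(state)


primary, secondary = PdtServer(), PdtServer()
primary.tables["web"] = ["node-a", "node-b"]
checkpoint(primary, [secondary])
```

After the call, `secondary` holds the same tables as `primary` and could take over if the primary failed; changes made on the primary afterwards reach the secondary only at the next checkpoint.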
In one embodiment of the present invention, the system examines the destination address to determine whether a service specified by the destination address is a scalable service that is provided by multiple server nodes, or a non-scalable service that is provided by a single server node. If the service is a non-scalable service, the system sends the packet to a service instance on the interface node.
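The scalable versus non-scalable decision can be sketched as follows. The `SCALABLE_SERVICES` mapping, the `route` function, the `"interface-node"` identifier, and the CRC-32 mapping are hypothetical names and choices used only to illustrate the branch.

```python
import zlib
from typing import Dict, List, Tuple

ServiceKey = Tuple[str, int, str]  # (IP address, port, protocol)

# Assumed registry: only scalable services have a packet distribution table.
SCALABLE_SERVICES: Dict[ServiceKey, List[str]] = {
    ("192.0.2.10", 80, "TCP"): ["node-a", "node-b"],
}


def route(destination: ServiceKey, source: Tuple[str, int],
          interface_node: str = "interface-node") -> str:
    """Return the node that should handle a packet for the given destination."""
    pdt = SCALABLE_SERVICES.get(destination)
    if pdt is None:
        # Non-scalable service: deliver to the service instance on the interface node.
        return interface_node
    # Scalable service: map the source address to an entry in the table.
    key = f"{source[0]}:{source[1]}".encode("ascii")
    return pdt[zlib.crc32(key) % len(pdt)]
```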
In one embodiment of the present invention, if a new server becomes available for the service, the system adds at least one entry for the new server in the packet distribution table.
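One way to give a new server entries is sketched below. It assumes a fixed-size table in which entries are reassigned from the node that currently holds the most; the source does not state how entries are added, so this strategy and the name `add_server` are assumptions. Keeping the table size fixed means clients that hash to untouched entries keep their existing server.

```python
from collections import Counter
from typing import List


def add_server(pdt: List[str], new_node: str, entries: int = 1) -> List[str]:
    """Return a copy of the table with up to `entries` slots reassigned to new_node."""
    table = list(pdt)
    for _ in range(entries):
        counts = Counter(node for node in table if node != new_node)
        # Stop rather than take the last remaining entry of any existing node.
        if not counts or max(counts.values()) <= 1:
            break
        donor = counts.most_common(1)[0][0]
        table[table.index(donor)] = new_node
    return table


updated = add_server(["node-a", "node-a", "node-a", "node-b"], "node-c")
# updated == ["node-c", "node-a", "node-a", "node-b"]
```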
Note that the mechanism for providing scalable services provided by the instant invention does not interfere with other non-scalable services, which are not distributed across nodes in the cluster of nodes.