Distributed computing systems have found application in a number of different computing environments, particularly those requiring high performance and/or high availability and fault tolerance. In a distributed computing system, multiple computers connected by a network are permitted to communicate and/or share workload. Distributed computing systems support practically all types of computing models, including peer-to-peer and client-server computing.
One particular type of distributed computing system is referred to as a clustered computing system. “Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a client or user, the nodes in a cluster appear collectively as a single computer, or entity. In a client-server computing model, for example, the nodes of a cluster collectively appear as a single server to any clients that attempt to access the cluster.
Clustering is often used in relatively large multi-user computing systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Often, load balancing can also be used to ensure that tasks are distributed fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
As the complexity of, and the demands placed on, clustered and other distributed computing systems increase, scalability and performance become increasing concerns. It is not unreasonable to expect a distributed computing system to provide services for potentially millions of clients, and it has been found that as the complexity of the distributed computing systems used to service these clients increases, the distribution of workload between the servers, nodes or other computers constituting such systems becomes increasingly critical to the stability and performance of such systems. At the forefront of appropriately distributing the workload is the routing of client requests to appropriate computers, e.g., servers, in a distributed computing system.
Effectively coordinating the routing of requests from potentially millions of clients has become a significant factor in the overall performance of a distributed computing system. Routing protocols require not only the even distribution of workload across the available servers, but also the ability to handle the unavailability of certain servers, as well as the distribution of services among only subsets of servers.
It has been found that centralized routing, where client requests are all initially sent to one server or component and thereafter routed to the appropriate server for handling, can be both a source of failure and a bottleneck on performance. In larger distributed computing systems, it has been found that offloading routing decisions to the clients themselves, or otherwise to components that serve as proxies for the clients, can overcome many of the obstacles associated with a centralized routing approach. In many conventional designs, clients that make their own routing decisions are referred to as “smart clients.”
In order for clients to make correct decisions on where to route client requests, clients must be provided with routing information that can be used to make educated decisions. Typically, even in the client-side routing approach, the routing information is generated and updated on the servers, or otherwise outside of the clients, because the overhead associated with monitoring the status of the distributed computing system can be prohibitively large for a client, and in some instances, clients may not have access to some of the system status information required to make educated routing decisions. As routing information is centrally updated, the routing information is then propagated out to the clients to update local routing information stored on each of the clients.
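By way of illustration only, client-side routing from locally stored routing information may be sketched as follows. The sketch is not drawn from any particular conventional design; the names RouteEntry and SmartClient, the fields of each entry, and the use of weighted random selection are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class RouteEntry:
    """One server as seen by the client (hypothetical fields)."""
    server: str
    weight: int                      # relative share of requests (assumed)
    available: bool = True
    services: frozenset = frozenset()


@dataclass
class SmartClient:
    """Client that routes its own requests from a local routing table."""
    table: list = field(default_factory=list)

    def update_table(self, new_table):
        # Routing information is generated outside the client and
        # propagated to it; the client only stores a local copy.
        self.table = list(new_table)

    def route(self, service, rng=random):
        # Consider only available servers that host the requested service.
        candidates = [e for e in self.table
                      if e.available and service in e.services and e.weight > 0]
        if not candidates:
            raise LookupError(f"no available server for {service!r}")
        # Weighted random choice spreads requests in proportion to weight.
        weights = [e.weight for e in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0].server


client = SmartClient()
client.update_table([
    RouteEntry("s1", 3, True, frozenset({"orders"})),
    RouteEntry("s2", 5, False, frozenset({"orders"})),   # unavailable server
    RouteEntry("s3", 1, True, frozenset({"billing"})),
])
chosen = client.route("orders")  # only s1 is available for "orders"
```

In this sketch the client never monitors the servers itself; the quality of its routing decisions depends entirely on how current its local copy of the table is.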
Many conventional designs use a “push” or epoch approach, whereby the propagation of routing information to clients is initiated by a server or other central component whenever the routing information on the server has changed. By doing so, clients are assured of having the most up-to-date routing information available most of the time.
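A push approach of this general kind may be sketched as follows, under stated assumptions. The names RoutingServer and PushClient are hypothetical, and the treatment of an epoch as a counter incremented on every change to the routing information is an assumption made for the example rather than a description of any specific protocol.

```python
class PushClient:
    """Client holding a local copy of the routing information."""

    def __init__(self):
        self.epoch = -1      # no routing information received yet
        self.table = {}

    def receive(self, epoch, table):
        # Accept only information newer than what is already held.
        if epoch > self.epoch:
            self.epoch = epoch
            self.table = table


class RoutingServer:
    """Central component that pushes every change to every client."""

    def __init__(self):
        self.epoch = 0
        self.table = {}          # server name -> weight (assumed layout)
        self.clients = []
        self.messages_sent = 0   # counts pushed updates only

    def register(self, client):
        self.clients.append(client)

    def change(self, server, weight):
        # Any change to the routing information starts a new epoch ...
        self.table[server] = weight
        self.epoch += 1
        # ... and is propagated to all registered clients.
        for client in self.clients:
            client.receive(self.epoch, dict(self.table))
            self.messages_sent += 1


server = RoutingServer()
clients = [PushClient() for _ in range(3)]
for c in clients:
    server.register(c)
server.change("s1", 3)
server.change("s2", 5)
# Two changes pushed to three clients: six update messages in total.
```

The number of update messages in this sketch is the number of changes multiplied by the number of clients, so the cost of keeping every client current grows with both how often the routing information changes and how many clients are served.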
As distributed computing systems become more complex, powerful and dynamic, however, the routing information maintained in such systems becomes significantly more dynamic in nature. Servers may crash or become bogged down, additional servers may be brought online, and services may be added, removed, or moved to different servers. The workloads of individual servers may change, as may the number of clients vying for the limited system resources. Servers may also experience changes in resource usage (e.g., pending requests, threads, CPU usage, memory usage, I/O usage, data lock usage, etc.) and may change in configuration. As a result, the optimal routing information for a distributed computing system is constantly in a state of flux.
Under conventional protocols, updates to routing information resulting from changes in system configuration and operating conditions are propagated to all clients. As such, in complex distributed computing systems that are constantly changing, and that serve potentially millions of clients, the overhead associated with propagating current routing information to all of the clients can be overwhelming.
Therefore, a significant need exists in the art for a more efficient and less costly manner of propagating the routing information used for routing client requests in a distributed computing system.