1. Field of the Invention
The present invention relates to a fault-tolerant multi-computing system using a group-to-group communication scheme.
2. Brief Description of the Related Arts
For optimal resource utilization, flexibility, and reduced management costs, the industry demands solutions based on a “utility computing” model in which processing power and storage capacity can be added as needed and resources are provisioned dynamically to meet changing needs. Conventional mainframe solutions are beyond the reach of average enterprises due to their high cost. A large number of high-performance but low-cost “blade servers” and networking technologies are available in the market. However, a solution that aggregates these resources efficiently and flexibly and can run a wide range of applications to meet utility computing needs does not exist today.
The client-server paradigm, in which a client makes a request and a server responds with an answer, is popular in the industry due to its simplicity. To enable this paradigm, a popular communications protocol used between a client and a server in a communication network is Transmission Control Protocol/Internet Protocol, or simply “TCP/IP.” In the communication network, a client (or client system or machine) views a server (or server system or machine) as a single logical host or entity. A single physical server is often incapable of effectively servicing a large number of clients. Further, a failed server leaves clients inoperable.
To address the shortcomings of a single physical server, cluster configurations having many servers running in parallel, or in a grid, to serve clients were developed using load-balancers. These configurations provide potential benefits such as fault-tolerance, lower cost, efficiency, and flexibility comparable to mainframes. However, these and other benefits remain largely unrealized due to the inherent limitations of such configurations and the lack of a standard platform on which most applications can build.
In addition to physical clustering, conventional software systems have also made efforts to introduce clustering at the application level and the operating system level. However, when clustering is embedded in an application, usage of that application is limited. Similarly, although operating system level clustering is attractive, conventional efforts in this area have not been successful due to the large number of abstractions that must be virtualized.
In contrast to physical server clustering and to software application and operating system clustering, network level clustering does not suffer from either of these problems and provides some attractive benefits. For example, the ability to address the cluster of server nodes as a single virtual entity is a requirement for a cluster to be useful in client-server programming. Further, the ability to easily create virtual clusters from a pool of nodes improves utilization and provides mainframe-class flexibility.
A conventional network level clustering platform must be generic and usable by a wide range of applications. These applications range from web servers, storage servers, and database servers to scientific and application grid computing. These conventional network level clusters must enable aggregation of the compute power and capacity of nodes, such that applications scale seamlessly. Existing applications must be able to run with minimal or no changes. However, conventional network level clusters have had only limited success.
To the extent there has been any success of the Symmetric Multi-Processor (SMP) architecture, it can be attributed to the simplicity of the bus, which made processor and memory location transparent to applications. For clustering too, the simplicity of a virtual bus connecting server nodes provides node location transparency and node identity transparency. However, such conventional systems lack the capability of allowing the bus to be directly tapped by client applications for efficiency. Similarly, buses based on User Datagram Protocol (“UDP”) packet broadcast and multicast lack data delivery guarantees, which pushes clustering up to the application level.
The protocol with delivery guarantees most widely used by the industry is TCP/IP. TCP's data delivery guarantee, ordered delivery guarantee, and ubiquity make it particularly desirable for virtualization. However, TCP's support for just two endpoints per connection has limited its potential. An asymmetrical organization of processing elements/nodes that have pre-assigned tasks, such as distributing incoming requests to the cluster, is inherently inflexible, difficult to manage, and difficult to load balance. Asymmetrical nodes are often single points of failure and bottlenecks. In order for MC (Multi Computing) to succeed, there is a need for a symmetrical, as opposed to an asymmetrical, node organization.
Another problem with asymmetry in a client-server environment is latency. Switches and routers employ specialized hardware to reduce the latency of data passing through them. When data must pass through a node's UDP/TCP/IP stack, significant latency is added due to copying and processing. Hence, in order to achieve optimal performance, systems must avoid passing data through the intervening nodes of an asymmetric organization. However, if a server node's CPUs must handle a large amount of network traffic, application throughput and processing suffer. Thus, conventional systems must use hardware accelerators, such as specialized adaptor cards or Integrated Circuit chips, to reduce latency at the endpoints and improve application performance. This increases system cost and complexity.
Low-cost fault-tolerance is highly desired by many enterprise applications. Solutions in which a fixed number of redundant hardware components are used suffer from a lack of flexibility, an inability to be repaired easily, and higher cost due to complexity. Solutions today offer high availability by quickly switching services to a stand-by server after a fault has occurred. Because the stand-by systems are passive, their resources are not utilized, resulting in higher cost. In the simplest yet powerful form of fault tolerance by replication, the service over a connection continues without disruption upon failure of nodes.
On traditional clusters, an active node performs tasks and passive nodes are later updated with the changes. In many instances, there are fewer updates compared to other tasks such as queries. Machines are best utilized when load is shared among all replicas while updates are reflected on all replicas. Replica updates must be synchronous and must be made in the same order for consistency. With atomic delivery, data is guaranteed to be delivered to all target endpoints before the client is sent a TCP ACK indicating receipt of the data. In the event of a replica failure, the remainder of the replicas can continue service, avoiding connection disruption and thereby effecting fault-tolerance. Non-atomic replication lacks usability. Specifically, when a client request is received by replicas of a service, each produces a response. Because the client views the server as a single entity, it must be ensured that only one instance of the response is sent back to the client. Similarly, when multiple client replicas attempt to send the same request, it must be ensured that only one instance is sent out to the server. Conventional systems often fail to provide atomicity and therefore lack fault tolerance that avoids connection disruption.
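The atomic delivery and single-response requirements described above can be illustrated with a minimal in-memory sketch. It is not taken from any system discussed here; the names `Replica`, `ReplicaGroup`, `receive`, and `respond` are hypothetical, and a real implementation would operate on TCP segments rather than Python objects.

```python
class Replica:
    """Hypothetical replica that logs updates in the order they arrive."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.log = []

    def deliver(self, seq, data):
        # A failed replica cannot accept data.
        if not self.alive:
            return False
        self.log.append((seq, data))
        return True


class ReplicaGroup:
    """Hypothetical group that the client sees as a single server."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.next_seq = 0

    def receive(self, data):
        # Deliver to every live replica with the same sequence number,
        # and acknowledge only after all of them hold the data.
        seq = self.next_seq
        delivered = [r for r in self.replicas if r.deliver(seq, data)]
        if not delivered:
            return False  # no replica left: no ACK is sent
        self.next_seq += 1
        return True  # stands in for the TCP ACK sent to the client

    def respond(self, handler):
        # Every live replica computes a response, but only one
        # instance is returned to the client.
        responses = [handler(r.log) for r in self.replicas if r.alive]
        return responses[0] if responses else None
```

In this sketch a failed replica simply stops receiving updates while the remaining replicas keep identical, identically ordered logs, so service over the connection can continue.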
Another problem with conventional clustering systems is load balancing. As with any system, the ability to balance load evenly among nodes is necessary for optimal application performance. However, conventional clustering systems provide only limited support for standard load balancing schemes, for example, round-robin, content hashed, and weighted priority. Moreover, many conventional clustering systems are unable to support application-specific load-balancing schemes.
Many services have load levels in a cluster that vary significantly over time. Running processes may need to be migrated in order to retire an active server. Conventional cluster systems often lack support for adding nodes/replicas to, or removing them from, a cluster easily and without disrupting the service.
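The three standard schemes named above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and "weighted priority" is interpreted here as choosing the node with the fewest assigned requests relative to its weight, which is one of several possible readings.

```python
import hashlib
import itertools


def round_robin(nodes):
    """Return an iterator that cycles through the nodes in order."""
    return itertools.cycle(nodes)


def content_hashed(nodes, request_key):
    """Map a request key to a node so the same key always hits the same node."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def weighted_priority(weights, assigned):
    """Pick the node with the lowest assigned/weight ratio (assumed reading).

    `weights` maps node name to a positive weight; `assigned` maps node
    name to the number of requests already given to it and is updated
    in place. Ties are broken by node name.
    """
    node = min(weights, key=lambda n: (assigned[n] / weights[n], n))
    assigned[node] += 1
    return node
```

With weights of 2 and 1 for two nodes, repeated calls to `weighted_priority` assign requests in roughly a 2:1 ratio, while `content_hashed` keeps requests for the same key on the same node as long as the node list does not change.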
A number of attempts have been made to address network level virtualization. However, each attempt has still resulted in significant shortcomings. For example, one conventional solution popular in the industry is a device for balancing load in a cluster of web servers. This load-balancing device, which is also disclosed in U.S. Pat. Nos. 6,006,264 and 6,449,647, switches incoming client TCP connections to a server in a pool of servers. A conventional server for this process is Microsoft's Network Load balancer software, in which client packets are broadcast or multicast to all nodes by a switch or router. However, once a connection is mapped, the same server handles all client requests for the life of the TCP connection in a conventional one-to-one relationship.
A problem with conventional systems such as those above is that, when a service comprises different types of tasks running on different nodes, they fail to provide a complete solution, because a mapped server that does not run all of the services a client requests over a connection causes service failure. This limits the use of such systems to web-page serving, in which only the one task of serving pages is replicated to many nodes. In addition, any mapping device implemented external to a server is a bottleneck and a single point of failure. Further, because a connection has only two endpoints, replication is not supported. Therefore, with such single-ended TCP, updates are not reflected on replicas, and hence there are considerable limits on usability.
To address some of the shortcomings of the above conventional systems, other conventional systems attempted to distribute client requests arriving over a connection to nodes serving different tasks. Ravi Kokku et al. disclosed one such system in “Half Pipe Anchoring.” Half pipe anchoring was based on backend forwarding. In this scheme, when a client request arrives at the cluster of servers, a designated server accepts the request and, after examining the data, forwards it to an optimal server. The optimal server, given the connection state information, later responds to the client directly after altering the addresses to match the original target address. Here a single TCP endpoint is dynamically mapped to nodes to distribute requests. This scheme is an example of an “asymmetric” approach in that an intervening node intercepts the data and distributes it based on the data content.
Another conventional system attempting to achieve asymmetric organization is disclosed in two whitepapers by EMIC Networks Inc. In this conventional system, a designated node intercepts and captures incoming data and later reliably delivers it to multiple nodes using proprietary protocols. Sometimes only one node is permitted to transmit data, and data must be transmitted first to a designated server, which later retransmits it to the client. Here also the single endpoint is dynamically mapped, and the TCP connection terminates at the intervening node where replication is initiated. This scheme is another example of an “asymmetric” approach in that an intervening node intercepts the data and replicates it.
Both schemes described above maintain the TCP definition of two endpoints, although the endpoints may be mapped to different nodes. Replication in these conventional schemes is performed at the application level using proprietary protocols. Further, these conventional schemes employ an asymmetric node organization, in which select nodes act as application level routers that distribute requests. However, such asymmetry results in scalability limitations, as noted in “Scalable Content Aware Request Distribution in Cluster Based Network Servers” by Aron et al. These limitations include a single point of failure, data throughput bottlenecks, suboptimal performance due to higher latency, and lack of location transparency.
Therefore, there is a need for a symmetric system and a method for using the current definition of TCP's two endpoints to provide m-to-n connections (m and n each being any integer, which may be the same or different).