Very large scale computer resource cluster systems, particularly large scale cellular machines, introduce significant system management challenges. The ability to track and analyze every possible fault condition in large cellular machines, whether it is a transient (soft) or permanent (hard) condition, is a major issue from the points of view of systems software, hardware, and architecture. The difficulty arises primarily because the number of entities to be monitored is so large that interaction between the management system and the managed entities is overwhelmingly complex and expensive.
There are a number of available system management tools for clusters of computer resources. However, the existing technologies typically target small to medium size clusters. Typically, a cluster resource management system consists of one or more centralized control workstations (CWSs), with all of the nodes that report to a CWS being termed client nodes (C-nodes). Small and medium size cluster management approaches cannot be directly applied to a system which is at least two orders of magnitude larger than the existing systems, for the following reasons:
1. The current systems offer no clear road map or scalability feature for scaling up to a very large cluster (e.g., 65536 nodes).
2. Most available tools are based on popular operating systems (e.g., Linux, AIX, or Solaris), and adapting them to specialized operating systems is an overwhelming task.
3. Many existing tools rely on a single centralized control point (the CWS), which both limits the size of the cluster and becomes a single point of failure for the cluster operation.
FIGS. 1 and 2 depict representative prior art hierarchical approaches to cluster management. A three-level cascading model is shown in FIG. 1 with two different levels of CWSs, specifically server node 101 over midlevel server nodes 110, 120 and 130, wherein midlevel server 110 manages client nodes 115, 117, and 119, midlevel server 120 manages client nodes 125, 127, and 129, and midlevel server 130 manages client nodes 135, 137, and 139. Alternatively, a very powerful centralized CWS can be provided to handle several thousands of C-nodes simultaneously. As illustrated in FIG. 2, centralized management server 201 directly manages the client nodes 210, 220, 230, 240, 250, and 260 in a standard two-level hierarchical system.
However, each of the foregoing approaches not only adds complexity and requires more resources, but also significantly reduces the reliability and performance of the system because of the load on the central server and the presence of many single points of failure.
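The single-point-of-failure concern can be illustrated with a minimal sketch that models the topologies of FIGS. 1 and 2 as manager-to-managed maps. The node numbers are taken from the figures; the function names and the failure model (a failed node simply drops out of the tree) are assumptions made for illustration only.

```python
from collections import deque

# Manager -> managed nodes, using the reference numerals of FIGS. 1 and 2.
FIG1 = {
    101: [110, 120, 130],
    110: [115, 117, 119],
    120: [125, 127, 129],
    130: [135, 137, 139],
}
FIG2 = {201: [210, 220, 230, 240, 250, 260]}


def clients(tree):
    """Return the client nodes: nodes that are managed but manage nothing."""
    managed = {child for kids in tree.values() for child in kids}
    return managed - set(tree)


def unmanaged_after_failure(tree, root, failed):
    """Return the clients cut off from the root when `failed` goes down.

    Assumed failure model: the failed node is removed and nothing
    takes over its role.
    """
    reachable = set()
    queue = deque([root] if root != failed else [])
    while queue:
        node = queue.popleft()
        reachable.add(node)
        for child in tree.get(node, []):
            if child != failed:
                queue.append(child)
    return clients(tree) - reachable - {failed}


def direct_load(tree):
    """Return how many nodes each server manages directly."""
    return {manager: len(kids) for manager, kids in tree.items()}


if __name__ == "__main__":
    print(sorted(unmanaged_after_failure(FIG1, 101, 110)))  # clients of midlevel server 110
    print(len(unmanaged_after_failure(FIG1, 101, 101)))     # every client in FIG. 1
    print(len(unmanaged_after_failure(FIG2, 201, 201)))     # every client in FIG. 2
    print(direct_load(FIG2))
```

Under these assumptions, the loss of any server in either figure leaves its whole subtree unmanaged, and the two-level arrangement of FIG. 2 concentrates the entire management load on server 201.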
Therefore, it is apparent that the current technologies may not be directly applied to very large clusters since they cannot be easily scaled up to manage large numbers of computers (e.g., 65536 nodes). Even with multiple CWSs, it would be necessary to introduce another level of management, which again introduces more complexity and at least one other single point of failure at the top management level.
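The need for an additional management level can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that each server manages at most a fixed number of nodes (the fan-out); the figure of 4096 stands in for "several thousands" of C-nodes and is not a value taken from any particular system.

```python
def management_levels(num_clients, fanout):
    """Return the number of servers needed at each level above the clients.

    Assumes every server manages at most `fanout` nodes and that levels
    are stacked until a single top-level server remains.
    """
    if num_clients < 1 or fanout < 2:
        raise ValueError("need at least one client and a fan-out of 2 or more")
    levels = []
    count = num_clients
    while count > 1:
        count = -(-count // fanout)  # ceiling division
        levels.append(count)
    return levels


if __name__ == "__main__":
    print(management_levels(9, 3))         # [3, 1]  the layout of FIG. 1
    print(management_levels(6, 6))         # [1]     the layout of FIG. 2
    print(management_levels(65536, 4096))  # [16, 1]
```

With these assumed numbers, 65536 nodes require 16 CWSs, which in turn require one further server above them. That top-level server is the additional single point of failure noted above.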
It is therefore an objective of the present invention to provide a management system and method for clustered computer resources which is scalable to manage very large clusters.
It is another objective of the present invention to provide a management system and method for clustered computer resources which is flexible enough to react to fail-over conditions.