1. Technical Field
The present disclosure relates to computer clusters and, more specifically, to a system and method of creating a virtual private cluster.
2. Introduction
The present disclosure applies to computer clusters and computer grids. A computer cluster can be defined as a parallel computer that is constructed of commodity components and runs commodity software. FIG. 1 illustrates in a general way an example relationship between clusters and grids. A cluster 110 is made up of a plurality of nodes 108A, 108B, 108C connected by a network. Each node contains computer processors, memory that is shared by the processors in the node, and other peripheral devices such as storage disks. A resource manager 106A for the cluster 110 manages jobs submitted by users to be processed by the cluster. Other resource managers 106B, 106C are also illustrated that can manage other clusters (not shown). An example job is a compute-intensive weather forecast analysis that must be scheduled on a cluster of computers so that it completes in time for the evening news report.
A cluster scheduler 104A can receive job submissions and use information from the resource managers 106A, 106B, 106C to identify which cluster has available resources. The job is then submitted to that cluster's resource manager for processing. Other cluster schedulers 104B and 104C are shown by way of illustration. A grid scheduler 102 can also receive job submissions, identify which clusters have available resources based on information from a plurality of cluster schedulers 104A, 104B, 104C, and then submit the job accordingly.
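The two-level routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class names `ResourceManager`, `ClusterScheduler`, `GridScheduler` and `Job`, the `free_nodes` counter, and the first-fit selection policy are all hypothetical, chosen only to show how a grid scheduler defers to cluster schedulers, which in turn defer to per-cluster resource managers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    name: str
    nodes_needed: int


@dataclass
class ResourceManager:
    """Hypothetical stand-in for a per-cluster resource manager (e.g. 106A)."""
    cluster: str
    free_nodes: int
    jobs: List[str] = field(default_factory=list)

    def submit(self, job: Job) -> None:
        # Reserve the requested nodes and record the job on this cluster.
        self.free_nodes -= job.nodes_needed
        self.jobs.append(job.name)


@dataclass
class ClusterScheduler:
    """Hypothetical cluster scheduler (e.g. 104A) consulting resource managers."""
    managers: List[ResourceManager]

    def find_cluster(self, job: Job) -> Optional[ResourceManager]:
        # Assumed first-fit policy: the first manager with enough free nodes wins.
        for rm in self.managers:
            if rm.free_nodes >= job.nodes_needed:
                return rm
        return None

    def submit(self, job: Job) -> Optional[str]:
        rm = self.find_cluster(job)
        if rm is None:
            return None
        rm.submit(job)
        return rm.cluster


@dataclass
class GridScheduler:
    """Hypothetical grid scheduler (e.g. 102) consulting cluster schedulers."""
    schedulers: List[ClusterScheduler]

    def submit(self, job: Job) -> Optional[str]:
        # Ask each cluster scheduler in turn; route to the first that can place the job.
        for cs in self.schedulers:
            if cs.find_cluster(job) is not None:
                return cs.submit(job)
        return None


if __name__ == "__main__":
    grid = GridScheduler([
        ClusterScheduler([ResourceManager("A", 4), ResourceManager("B", 16)]),
        ClusterScheduler([ResourceManager("C", 64)]),
    ])
    print(grid.submit(Job("weather-forecast", 32)))  # C
    print(grid.submit(Job("small", 8)))              # B
```

First-fit is used here only because it is the simplest policy to read; a real scheduler could instead weigh priorities, reservations, or policy constraints when choosing among clusters.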
Several books provide background information on how to organize and create a cluster or a grid and related technologies. See, e.g., Grid Resource Management, State of the Art and Future Trends, Jarek Nabrzyski, Jennifer M. Schopf, and Jan Weglarz, Kluwer Academic Publishers, 2004; and Beowulf Cluster Computing with Linux, edited by William Gropp, Ewing Lusk, and Thomas Sterling, Mass. Institute of Technology, 2003.
FIG. 2 illustrates a known arrangement 200 comprising a group of computer clusters 214, 216, 218, each consisting of a number of computer nodes 202, 204, 206, with each node having its own memory, disks and swap space local to the computer itself. In addition, there can exist a number of services that are part of each cluster. Block 218 comprises two components: a cluster 202 and a storage manager 212 providing network storage services such as LAN-type services. Block 218 illustrates that the network storage services 212 and the cluster or object 202 are organized into a single, independently administered cluster. An example is a marketing department in a large company whose information technology ("IT") staff administers this cluster for that department.
Storage manager 212 can also communicate with nodes or objects 204 in other clusters, such as those shown in FIG. 1. Block 216 shows a computer cluster 204 and a network manager 210 that communicates with cluster 204 and can also affect other clusters, in this case clusters 202 and 206.
Block 214 illustrates a computer cluster 206 and a software license manager 208. The license manager 208 is responsible for providing software licenses to various user applications and ensures that an entity stays within the bounds of its negotiated licenses with software vendors. The license manager 208 can also communicate with other clusters, such as cluster 204.
Assuming that computer clusters 214, 216 and 218 are all part of a single company's computer resources, that company would likely have a separate IT team managing each of clusters 214, 216, 218. Typically, there is little or no crossover in management and administration from one cluster to another, other than through shared services such as the example storage manager 212, network manager 210 or license manager 208.
There are also many additional services that are local and internal to each cluster. Examples of local services found within each cluster 214, 216, 218 include cluster scheduling, message passing, a network file system automounter, network information services and password services, shown as feature 220 in block 214. These services are unique to each cluster and locally managed, so each must be independently administered by the respective IT staff.
Assuming that a company owns and administers each cluster 218, 216 and 214, there are reasons both for aggregating and for partitioning the compute resources. Each organization within the company wants complete ownership of and administrative control over its compute resources. Take the example of a large automobile manufacturer. Organizations within the company include sales, engineering, marketing, and research and development. The sales organization does market research, looking at sales and historical information, analyzing related data and determining how to target the next sales campaign. Designing graphics and rendering advertising can also require significant processing power. The engineering department performs aerodynamics and materials science studies and analyses. Each organization has its own goals and computer resource requirements to ensure that it can deliver to its customers.
While this model gives each organization control over its resources, the arrangement has downsides. A major cost is the need for independent IT teams to administer each cluster. There is also no opportunity for load balancing: if the sales organization has unused resources, there is no way to connect the clusters so that the engineering teams can use them.
Another cause of reduced efficiency with individual clusters as shown in FIG. 1 is over- or under-restraining. Users who submit jobs to the cluster for processing expect a certain response time according to their parameters and permissions. To ensure that response time, cluster managers typically must significantly over-specify the cluster resources to achieve the results they want or to control the distribution of compute cycles. When an over-specified job is submitted to the cluster, it often does not use all of the specified resources, leaving a portion of them idle.
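The cost of over-specification can be made concrete as the fraction of reserved capacity that goes unused. The sketch below computes that fraction from per-job requested and actually used node-hours; the function name `idle_fraction` and the example figures are hypothetical and are not measurements from any real cluster.

```python
from typing import Sequence


def idle_fraction(requested: Sequence[float], used: Sequence[float]) -> float:
    """Fraction of reserved node-hours that went unused (hypothetical metric).

    requested[i] is what job i reserved; used[i] is what it actually consumed.
    """
    if len(requested) != len(used):
        raise ValueError("requested and used must describe the same jobs")
    total_requested = sum(requested)
    if total_requested == 0:
        raise ValueError("no resources were requested")
    # Idle capacity is what was reserved but never consumed.
    return (total_requested - sum(used)) / total_requested


if __name__ == "__main__":
    # Illustrative numbers only: three over-specified jobs.
    print(idle_fraction([64, 32, 24], [48, 24, 18]))  # 0.25
```

With these illustrative figures, 120 node-hours are reserved but only 90 are used, so a quarter of the reserved capacity sits idle and is unavailable to other users of the cluster.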
What is needed in the art is a means of maintaining cluster partitions while also sharing resources where needed, to improve the efficiency of a cluster or a group of clusters.