Cluster Computing
Conceptually, a computer cluster or grid is a collection of computing resources (e.g., computers, servers, storage devices or systems, printers, scientific instruments, etc.) connected through a network or networks. Cluster middleware aggregates these resources and provides access to them when needed. Typically, a cluster computing system may include compute nodes configured to execute jobs and one or more nodes that implement the middleware; the latter may be referred to as management nodes, with the compute nodes being among the managed resources. Generally, in these cluster computing systems, a job submitter submits jobs to the cluster. The middleware dispatches the jobs to various compute nodes. The compute nodes perform their assigned jobs and return results, for example back to a management node, which aggregates results from one or more compute nodes and provides the results to the job submitter.
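The submit, dispatch, and aggregate flow described above can be illustrated with a minimal Python sketch. The class names (ComputeNode, ManagementNode), the helper make_square_job, and the round-robin placement policy are all hypothetical and chosen only for illustration; actual middleware typically applies richer scheduling policies and runs jobs on remote machines rather than in-process.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Callable, Dict, List


@dataclass
class ComputeNode:
    """A managed resource that executes jobs (hypothetical name)."""
    name: str

    def run(self, job: Callable[[], int]) -> int:
        # In a real cluster this would execute on a remote machine.
        return job()


@dataclass
class ManagementNode:
    """Stands in for the cluster middleware (hypothetical name)."""
    nodes: List[ComputeNode]
    placements: Dict[int, str] = field(default_factory=dict)

    def submit(self, jobs: List[Callable[[], int]]) -> List[int]:
        """Dispatch each job to a compute node and aggregate the results."""
        results = []
        # Round-robin placement is an assumption made for this sketch.
        for job_id, (job, node) in enumerate(zip(jobs, cycle(self.nodes))):
            self.placements[job_id] = node.name
            results.append(node.run(job))
        return results


def make_square_job(n: int) -> Callable[[], int]:
    """Build a trivial job that returns n squared (illustrative only)."""
    return lambda: n * n


manager = ManagementNode([ComputeNode("node-a"), ComputeNode("node-b")])
results = manager.submit([make_square_job(n) for n in range(4)])
```

Here the job submitter receives the aggregated results list from the management node, while the placements mapping records which compute node handled each job.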
Metadata Storage
Many cluster computing systems generate metadata that is used, for example, in tracking the configuration and availability of resources, in tracking the distribution, status, and progress of jobs on the cluster, and possibly in recording other information used by some cluster applications, such as order, shipping, and delivery information. Job metadata may be generally defined as including any information that may be used in executing jobs in a cluster system. Many conventional cluster systems employ a centralized database or databases to store this metadata. The centralized database or databases are “fixed”; that is, the database(s) reside on particular servers or storage systems. Users may access, or may request access to, the databases, for example to determine the status of jobs, orders, shipping, delivery, and so on. However, as a cluster system grows, the fixed, centralized database architecture may result in heavy load on the databases, reducing the cluster's ability to scale. Thus, the fixed, centralized database architecture may be a bottleneck in conventional cluster systems.
Metadata Transport
Many cluster computing systems transport job metadata, for example between management nodes and compute nodes or between cluster nodes and a centralized database, according to a transport architecture that employs some protocol, for example via XML-encoded structures (SOAP, XML-RPC) or via a proprietary protocol (ICE, raw sockets, etc.). This conventional transport architecture requires many protocol messages encapsulating various metadata to be passed between cluster nodes; these messages are often deserialized, parsed, modified, and serialized again at each node, which consumes CPU processing time. Thus, this conventional transport architecture may result in performance bottlenecks due to the CPU and network bandwidth required to transport and process many protocol messages at the nodes.
Cluster computing systems that do not transport metadata for jobs according to the above transport architecture may instead allow direct access to the centralized database(s). However, this architecture may result in scalability issues, since access to the centralized database(s) generally has a fixed available bandwidth.
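The per-node processing cost of message-based metadata transport can be illustrated with a minimal Python sketch using the standard library's xml.etree.ElementTree module. The message layout (a job element with status and hops children), the node names, and the function names are hypothetical; the point is only that every hop repeats a full deserialize, modify, and serialize cycle.

```python
import xml.etree.ElementTree as ET
from typing import List


def handle_at_node(message: bytes, node_name: str) -> bytes:
    """Deserialize, modify, and reserialize one metadata message at a node."""
    root = ET.fromstring(message)        # deserialize and parse
    root.set("last_node", node_name)     # modify: record the handling node
    hops = root.find("hops")
    hops.text = str(int(hops.text) + 1)  # modify: count this hop
    return ET.tostring(root)             # serialize again for the next hop


def relay(message: bytes, route: List[str]) -> bytes:
    """Pass a message along a route, paying the full cycle at every node."""
    for node_name in route:
        message = handle_at_node(message, node_name)
    return message


# Hypothetical message layout, used only for this sketch.
job_message = b"<job id='42'><status>queued</status><hops>0</hops></job>"
final_message = relay(job_message, ["mgmt-1", "compute-7", "mgmt-1"])
```

In this sketch a single job's metadata is parsed and rebuilt three times along a three-node route, so the CPU cost grows with both the number of messages and the number of nodes each message visits.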
Cluster Resource Management
Conventional cluster computing systems exist that can manage a collection of network resources. However, these conventional systems typically involve a bulky infrastructure that requires significant setup and management by system administrators. Examples of such conventional cluster systems include Oracle Corporation's Grid Engine technology (formerly Sun Grid Engine technology) and the SETI@home project. Generally, in these conventional cluster computing systems, a managed node (e.g., a compute node) must have an installed client or agent that communicates with one or more management nodes. The agent relays status, performance, and availability information for the managed node to the management node(s); the management node(s) (the middleware) make job distribution decisions according to the information received from the managed node(s). However, these conventional cluster computing systems generally use a polling technique in which a management node or nodes periodically poll the managed resources on the cluster to gather this information. This polling generates considerable network traffic, which consumes available bandwidth and thus adds significantly to the load on the cluster system. Furthermore, these conventional cluster computing systems generally restrict which types of systems may be used as cluster resources, since a node must be able to support the agent provided by the infrastructure.
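The traffic generated by periodic polling can be estimated with a short Python sketch. The function name, the assumption of one request and one reply per poll, and the example figures (1,000 managed nodes polled every 10 seconds) are illustrative assumptions, not measurements of any particular system.

```python
def polling_messages_per_hour(
    managed_nodes: int,
    poll_interval_s: float,
    messages_per_poll: int = 2,  # assumption: one request plus one reply
) -> float:
    """Estimate protocol messages per hour caused by periodic polling."""
    polls_per_hour = 3600 / poll_interval_s
    return managed_nodes * polls_per_hour * messages_per_poll


# Illustrative figures only: 1,000 managed nodes polled every 10 seconds.
hourly_messages = polling_messages_per_hour(1000, 10)
```

Under these assumptions the estimate is 720,000 messages per hour, and the count grows linearly with the number of managed nodes whether or not any node's status has actually changed between polls.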