High-performance computing (HPC) or cluster computing is increasingly used for a large number of computationally intense tasks, such as webscale data mining, machine learning, network traffic analysis, and various engineering and scientific tasks. In such systems, jobs may be scheduled to execute concurrently on a computing cluster in which application data is stored on multiple compute nodes.
Previous implementations of HPC clusters have maintained multiple node databases, between management and scheduler subsystems (with one-to-one mapping between the node-entries in each subsystem). This can lead to several problems, including the following: (1) Interaction between subsystems is informal and fragile; (2) scalability of a cluster is limited to the least scalable subsystem (for example, a system management subsystem may struggle if there are more than 1000 nodes); and (3) different types of HPC nodes may require different types of management and scheduling solutions.