High-performance computing (HPC) generally refers to the use of parallel processing systems to run application programs efficiently, reliably, and quickly. HPC allows scientists and engineers to solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities. Two common approaches used to support HPC include grid computing and cluster computing.
Cluster computing systems are typically assembled using large numbers of server systems and CPUs, for example, using racks of commodity servers connected by high-performance network interconnects. In contrast, grid computing systems are often assembled over a wide area using a heterogeneous collection of servers and CPU types. Typically, nodes in the grid can inform what computing resources are available (e.g., CPU cores and memory) to a manager node and the manager node can offer these resources to frameworks. The frameworks can then schedule tasks to run using the available resources. This approach has led to large scale grid systems sometimes referred to as “data center computing,” where rather than running several specialized clusters or grids for individual applications (each at relatively low utilization rates) mixed workloads all run on the same cluster. Doing so improves scalability, elasticity, fault tolerance, and performance utilization.
While enterprises can build and maintain the physical computing infrastructure to host both HPC grids and clusters, HPC systems can also be built using cloud computing services. Generally, cloud computing services allow an enterprise to access large amounts of computing resources without having to maintain an underlying physical infrastructure. For example, cloud computing services allow an enterprise to launch large numbers virtual computing instances—commonly referred to as virtual machines—which can run arbitrary applications. In addition to compute services (e.g., virtual machine instances), cloud computing services can also provide database, persistent storage, networking, load balancing, auto-scaling, messaging, cloud formation services, monitoring services, etc.
Moreover, cloud computing services may offer access to HPC grids or clusters. For example, a cloud computing service could allow an enterprise to launch and configure a fleet of virtual machine instances configured to operate as nodes in an HPC grid or cluster. Similarly, a cloud computing service could offer access to HPC grids using a variety of available frameworks and cluster management tools. Enterprises could use the HPC grid to access variable numbers of servers (or virtual machine instances) as grid nodes. One challenge faced by operators of HPC systems such as these is that changes or updates to the frameworks and cluster management applications should be tested before being put into production use.