High-performance computing (HPC) is often characterized by the computing systems used by scientists and engineers for modeling, simulating, and analyzing complex physical or algorithmic phenomena. Currently, HPC machines are typically designed using numerous HPC clusters of one or more processors, referred to as nodes. For most large scientific and engineering applications, performance is chiefly determined by parallel scalability and not by the speed of individual nodes; therefore, scalability is often a limiting factor in building or purchasing such high-performance clusters. Scalability is generally considered to be based on i) hardware; ii) memory, input/output (I/O), and communication bandwidth; iii) software; iv) architecture; and v) applications. The processing, memory, and I/O bandwidth in most conventional HPC environments are normally not well balanced and, therefore, do not scale well. Many HPC environments do not have the I/O bandwidth to satisfy high-end data processing requirements or are built with blades that have too many unneeded components installed, which tend to dramatically reduce the system's reliability. Moreover, many HPC environments may not provide robust cluster management software for efficient operation in production-oriented environments.
Typically, when a computer system experiences a hardware failure, software and data at a storage device coupled to the computer system remain unavailable until the failure has been resolved (which may require replacing one or more hardware components of the computer system or replacing the entire computer system). Scientific and data-center applications often use clusters of commodity computer systems (such as PCs), but such clusters often lack fault tolerance and recovery capabilities.
Typically, a cluster of commodity computer systems includes one or more storage devices shared among the commodity computer systems for storing applications and application data. In such clusters, requirements imposed on the applications often necessitate integrating the applications into the software managing the clusters, restricting processing at the applications, or both, which drives up the complexity of applications providing fault tolerance in such clusters and the costs associated with developing such applications. To provide at least some fault tolerance, such clusters often rely on shared-disk systems that use network file systems (NFSs) across Ethernet networks. Such systems are inadequate in HPC systems that require high-speed accessibility to applications, application data, or both.