High-performance computing (HPC) provides accurate and rapid solutions for scientific and engineering problems based on powerful computing engines and the highly parallelized management of computing resources. Cloud computing as a technology and paradigm for the new HPC era is set to become one of the mainstream choices for high-performance computing customers and service providers. The cloud offers end users a variety of services covering the entire computing stack of hardware, software, and applications. Charges can be levied on a pay-per-use basis, and technicians can scale their computing infrastructures up or down in line with application requirements and budgets. Cloud computing technologies provide easy access to distributed infrastructures and enable customized execution environments to be easily established. The computing cloud allows users to immediately access required resources without capacity planning and freely release resources that are no longer needed.
Each cloud can support HPC with virtualized Infrastructure as a Service (IaaS). IaaS is managed by a cloud provider that enables external customers to deploy and execute applications. FIG. 1 shows the layer correspondences between cluster computing and cloud computing models. The main challenges facing HPC-based clouds are cloud interconnection speeds and the noise of virtualized operating systems. Technical problems include system virtualization, task submission, cloud data input/output (I/O), security, and reliability. HPC applications require considerable computing power, high-performance interconnections, and rapid connectivity for storage or file systems; supercomputers, for example, commonly use InfiniBand and proprietary interconnections. However, most clouds are built around heterogeneous distributed systems connected by low-performance interconnection mechanisms, such as 10-Gigabit Ethernet, which do not offer optimal environments for HPC applications. Table 1 below compares the technical characteristics of the cluster computing and cloud computing models. Differences in infrastructure between cluster computing and cloud computing have increased the need to develop and deploy fault tolerance solutions on cloud computers.
TABLE 1

                     Cloud Computing                       Cluster Computing
Performance factors  1. Computation cost                   1. Computation cost
                     2. Storage cost                       2. Communication latencies
                     3. Data transfer cost (in or out)     3. Data dependencies
                        for each service                   4. Synchronization
Performance tuning   1. Specifying a particular service    1. Defining the data size
                        for a particular task                 to be distributed
                     2. Archiving intermediate data on     2. Scheduling the send and
                        a particular storage device           receive workload
                     3. Choosing a set of locations for    3. Task synchronization
                        input and output data
Fault tolerance      1. Resend                             1. Checkpointing protocols
                     2. Reroute                            2. Membership protocol
                     3. Graph scheduling                   3. System synchronization
                     4. QoS
Goal                 Minimizing the total cost of          Minimizing the total
                     execution while meeting all the       execution time; performing
                     user-specified constraints            on users' hardware platforms
Reliability          No                                    Yes
Task size            Single large                          Small and medium
Scalable             No                                    Yes
Switching            Low                                   High
Application          HPC, HTC                              SME interactive
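As a rough illustration of the cloud-side performance factors listed in Table 1, the cost of an execution can be sketched as the sum of computation, storage, and data transfer charges for each service used. The sketch below is not taken from the source: the class names, field names, and prices are hypothetical and serve only to show how pay-per-use charges could be added up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ServiceUsage:
    """Hypothetical usage record for one cloud service (illustrative fields)."""
    compute_hours: float      # billed computation time
    storage_gb_months: float  # data held in storage
    transfer_gb: float        # data moved in or out of the service


@dataclass
class Rates:
    """Hypothetical pay-per-use prices; real provider pricing will differ."""
    per_compute_hour: float
    per_gb_month: float
    per_gb_transferred: float


def service_cost(usage: ServiceUsage, rates: Rates) -> float:
    """Add up the three cloud cost factors for a single service."""
    return (usage.compute_hours * rates.per_compute_hour
            + usage.storage_gb_months * rates.per_gb_month
            + usage.transfer_gb * rates.per_gb_transferred)


def total_cost(usages: Iterable[ServiceUsage], rates: Rates) -> float:
    """Total cost of an execution that spans several services."""
    return sum(service_cost(u, rates) for u in usages)


if __name__ == "__main__":
    # Assumed example prices and usage, chosen only for easy arithmetic.
    rates = Rates(per_compute_hour=0.5, per_gb_month=0.25, per_gb_transferred=0.125)
    job = [ServiceUsage(compute_hours=10, storage_gb_months=4, transfer_gb=8)]
    print(total_cost(job, rates))
```

Under this simplified model, the cloud goal in Table 1 corresponds to keeping the value returned by total_cost as low as possible while the user-specified constraints are still met, whereas a cluster would instead aim to minimize total execution time.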