Computer clusters, or groups of linked computers, have been widely used to improve performance over that provided by a single computer, especially in extended computations, for example, involving simulations of complex physical phenomena. Conventionally, as shown in FIG. 1, in a computer cluster 10, compute nodes (also referred to herein as client nodes) 12 are linked by a High Speed Network (HSN) 14 which permits the sharing of the computers' resources and memory.
Data transfers to and/or from the compute nodes are performed through the High Speed Network 14 and are managed by additional computer structure, also referred to as a File System 16. The File System includes File Servers 18 which file data from multiple compute nodes and assign a unique location for each compute node in the overall File System.
Typically, data migrate from the File Servers 18 to be stored in a data Storage Network 20, such as, for example, hard disk drives (HDD) (or Flash Memory) 22 arranged in storage device arrays.
In a high performance compute cluster, applications periodically checkpoint the computed results of their simulations. Checkpointing is a technique for inserting fault tolerance into a computing system. It basically includes the operation of storing a “snapshot” of the current application state, and subsequently using it for restarting the application execution in case of hardware failures which may cause the application to crash. These checkpoint results are generally stored into an HDD-based parallel file system 16 which is written to by many or all of an application's threads.
The compute cluster may assume either the compute state (or compute cycle) or the input/output (I/O) state (or I/O cycle), which are typically mutually exclusive. The process of checkpointing and moving data is carried out during the I/O cycle of the compute nodes, i.e., the data transfers are executed during time intervals when the compute activity has ceased. Since during the I/O cycle no actual computations occur, it is important to keep the I/O cycle as short as possible to maximize the overall compute duty cycle of the compute cluster.
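The relationship between the I/O cycle length and the duty cycle can be illustrated with a short calculation. The following is a minimal sketch, assuming that compute cycles and I/O cycles strictly alternate and never overlap; the function name and the timing figures are hypothetical and serve only as an illustration.

```python
def compute_duty_cycle(compute_seconds: float, io_seconds: float) -> float:
    """Fraction of wall-clock time spent computing.

    Assumes one compute cycle is followed by one I/O cycle and that
    the two cycles are mutually exclusive.
    """
    total = compute_seconds + io_seconds
    if total <= 0:
        raise ValueError("total cycle time must be positive")
    return compute_seconds / total


# Hypothetical figures: one hour of computation per checkpoint interval.
slow_checkpoint = compute_duty_cycle(3600.0, 400.0)
fast_checkpoint = compute_duty_cycle(3600.0, 40.0)
print(f"duty cycle with a 400 s I/O cycle: {slow_checkpoint:.3f}")
print(f"duty cycle with a  40 s I/O cycle: {fast_checkpoint:.3f}")
```

Under these assumed figures, shortening the I/O cycle by a factor of ten raises the fraction of time available for computation, which is the motivation for minimizing the I/O cycle.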
In large compute clusters, the number of threads may approach millions, and in the coming years, application thread count will increase significantly. The thread count of large parallel applications increases proportionally to the core counts of today's microprocessors, where the trend over the last several years has been an increase in cores over increases in processor clock speed. The trend of increasing core counts will continue into the near future, guaranteeing that large compute cluster applications will also become increasingly parallel.
As shown in FIG. 1, the applications issue “write” requests which are forwarded to I/O nodes 24 through the HSN 14. The I/O nodes 24 relay the “write” requests to the file system servers 18. All the data that needs to be written is copied from the CN (compute nodes) 12 to the memory in the I/O nodes 24 before the File System's “write” request is issued.
In servicing the file system “write” request, the data is copied to File System (FS) buffer 26 before being written to the storage devices 22 in the data storage network 20.
High degrees of application parallelism create challenges for the parallel file systems which are responsible for storing the applications' checkpoint output. This is largely due to the stringent coherency protocols and data layout policies employed by parallel file systems. These phenomena create hurdles for large applications which seek to store their output quickly and efficiently.
It is common for applications to store checkpoint data at a mere fraction of the parallel File System's peak performance. Unfortunately, it is uncommon for applications to actually achieve this peak. The primary culprits are serialization due to enforcement of coherency protocols and static file layouts which prevent dynamic load balancing. In the latter case, an application only stores as fast as is permitted by the slowest unit of storage in the parallel File System.
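The effect of a static file layout can be illustrated numerically. The following is a minimal sketch, assuming that each storage target receives a fixed share of the data and that the write completes only when the slowest target finishes; the function names, data sizes, and bandwidth figures are hypothetical. The "balanced" figure is an idealized bound in which data could be freely redistributed, not a description of any particular file system.

```python
def static_layout_write_time(mb_per_target, mb_per_sec_per_target):
    """Completion time when each target must store a fixed share.

    The write finishes only when the slowest target finishes.
    """
    return max(size / bw for size, bw in zip(mb_per_target, mb_per_sec_per_target))


def ideal_balanced_write_time(total_mb, mb_per_sec_per_target):
    """Idealized bound: data is spread in proportion to target speed."""
    return total_mb / sum(mb_per_sec_per_target)


# Hypothetical case: four targets, one of which is running slowly.
shares = [100.0, 100.0, 100.0, 100.0]      # MB assigned to each target
bandwidths = [100.0, 100.0, 100.0, 25.0]   # MB/s currently delivered

print(static_layout_write_time(shares, bandwidths))         # gated by the 25 MB/s target
print(ideal_balanced_write_time(sum(shares), bandwidths))   # idealized lower bound
```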
Since these prohibitive aspects of parallel File Systems limit the system's ability to scale horizontally (i.e. by adding more storage components to achieve increased throughput), new methods for storing checkpoint data have been developed to service the next generation of high performance compute clusters.
These new methods use a multi-fold approach aimed at dealing with the deficiencies of current parallel file system implementations, as well as incorporating a faster tier 28 of storage hardware. The new tier 28 of storage hardware shown in FIG. 2 is based on Non-Volatile Memory (NVM) technology which is positioned between the HDD-based parallel File System 16 and the application running on compute cluster 10. The tier 28 is called a “Burst Buffer tier,” also referred to herein as BB tier.
As shown in FIG. 2, in this implementation, the I/O nodes 24 are augmented with Burst Buffers 29 which form the BB tier 28. In this system, the applications checkpoint their state to the Burst Buffers 29 and resume computational activity once their output has been made durable in the Burst Buffer. The Burst Buffer tier's input performance is at least one order of magnitude faster than that of the HDD-based parallel File System 16. This increase in speed allows applications to complete their checkpoint activity in an expedited manner.
When an application issues a “write” request with the intent of “pushing” the data to the BB tier 28, the “write” request at the Burst Buffer tier 28 will be directed to an algorithmically determined BB node. The data is written to NVRAM where the request is received, while the corresponding metadata are forwarded to the identified BB node, also referred to herein as a primary node.
At a later time, when the data residing in the BB node is to be forwarded to the File System 16, the primary BB node uses the metadata to construct a File System data stripe from data fragments which possibly reside in multiple participating BB nodes. Once a buffer with the file system stripe data is ready at the primary BB node, the primary BB node issues a File System (FS) “write” request, and the full data stripe is copied to the FS buffer 26 before being written to the data storage network 20.
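The algorithm used to determine the primary BB node is not specified here. One possible approach, shown below purely as an assumption, is to hash the file identifier together with the stripe index so that every fragment of the same stripe maps to the same primary node. The function name `primary_bb_node` and its parameters are hypothetical.

```python
import hashlib


def primary_bb_node(file_id: str, offset: int, stripe_size: int, num_bb_nodes: int) -> int:
    """Map a write at `offset` in `file_id` to a primary BB node index.

    Assumed scheme (not taken from the source): hash the file identifier
    and the index of the stripe containing the offset, then reduce the
    digest modulo the number of BB nodes.
    """
    if stripe_size <= 0 or num_bb_nodes <= 0:
        raise ValueError("stripe_size and num_bb_nodes must be positive")
    stripe_index = offset // stripe_size
    key = f"{file_id}:{stripe_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_bb_nodes


# Two writes that fall in the same 1 MiB stripe resolve to the same primary node.
STRIPE = 1 << 20
a = primary_bb_node("checkpoint.0001", 0, STRIPE, 3)
b = primary_bb_node("checkpoint.0001", 4096, STRIPE, 3)
print(a == b)
```

A cryptographic hash is used here only because it is deterministic across processes; any stable hash function would serve the same illustrative purpose.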
However, this performance technique requires additional data management activities. The checkpoint data resident in a Burst Buffer (or some portion thereof) must be moved into the parallel File System at some point to make room for the next set of checkpoints. Furthermore, this movement must occur in an efficient manner, to minimize the impact of the parallel File System's innate inefficiencies.
Migrating data from the Burst Buffer tier 28 to the parallel File System 16 may be further complicated by the hardware topology of the compute cluster 10. In many cases the individual compute servers utilize specific blade technologies aimed at improving density, cooling, and cabling efficiency. This may cause form factor limitations which limit the hardware configuration of an I/O node, i.e., a cluster server responsible for burst buffer or parallel file system activity.
For instance, an I/O node 24 may have a limited number of PCIe adapter slots which may be dedicated to either NVM cards or host-bus adapters (HBA) used to access the storage network 20 where the parallel File System 16 resides.
Further complicating the issue is the bandwidth differential between the Burst Buffer tier 28 and the parallel File System 16. In many situations, especially in view of the prescribed ratio of NVM bandwidth to storage network bandwidth, the performance differential may not be properly embodied within a single I/O node.
Even in cases where a cluster server may be designed as such, the number of links to the storage network 20 will far exceed the number necessary to saturate the bandwidth of the parallel File System, thus further increasing the impracticality of the approach.
To deal with these issues, the system's I/O nodes 24 may be divided into two groups, i.e., a burst buffer group 30 (or BBIO group) and parallel File System gateway group 32 (or PFSIO group), as shown in FIG. 3. By establishing two I/O node groups, system architects may tailor the system to meet the bandwidth demands of both the Burst Buffer tier and parallel File System without requiring unnecessary storage network hardware.
Compute cluster systems which utilize the I/O grouping strategy shown in FIG. 3, may however experience a new challenge, i.e., the efficient movement of data between the Burst Buffer I/O nodes (BBIO) 30 and the parallel File System gateway I/O nodes (PFSIO) 32.
When staging data from the BBIO 30 to the PFSIO 32, the staging process assigns an evenly divided percentage of BBIO nodes to each PFSIO node, and data fragments are directed accordingly. Unfortunately, this approach is prohibitive since it does not ensure that I/O to the parallel File System is formatted in a manner which will attain a high percentage of the parallel File System's peak bandwidth. One of the advantages of the Burst Buffer tier is to provide input which has been groomed in a manner germane to the parallel File System. Typically, this means coalescing data fragments which are logically adjacent in a file's address space. Coalesced buffers are aligned to the parallel File System's full data stripe size, which is typically on the order of one to several megabytes.
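Coalescing of logically adjacent fragments and alignment to the full data stripe size can be sketched as follows. This is a minimal illustration, assuming that fragments are described by (offset, length) pairs in the file's address space; the function names are hypothetical and the sketch does not represent the staging process of any particular system.

```python
def coalesce(fragments):
    """Merge (offset, length) fragments that are adjacent or overlapping."""
    merged = []
    for offset, length in sorted(fragments):
        end = offset + length
        if merged and offset <= merged[-1][1]:
            # Fragment touches or overlaps the previous extent: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


def fully_covered_stripes(fragments, stripe_size):
    """Return indices of stripes completely covered by the fragments."""
    covered = []
    for offset, length in coalesce(fragments):
        end = offset + length
        first = -(-offset // stripe_size)  # first stripe starting at or after offset
        last = end // stripe_size          # one past the last stripe ending at or before end
        covered.extend(range(first, last))
    return covered


# Small hypothetical example with a stripe size of 4 address units.
frags = [(0, 2), (2, 2), (6, 2), (4, 1)]
print(coalesce(frags))
print(fully_covered_stripes(frags, 4))
```

In this example only the first stripe is fully covered; the second stripe is missing one address unit and would have to be written as a partial stripe or held back until its remaining fragment arrives.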
Due to the characteristics of the NVM, the BB tier is well suited to this task. In order to properly incorporate the PFSIO layer 32, coalescing of file fragments must be taken into account as data is moved from the BBIO nodes 30 to the PFSIO nodes 32.
As shown in FIGS. 4A and 4B, the full data stripe 34 received from the application includes client data fragments 36, 38 and 40 which are distributed among BB nodes in a deterministic way beneficial for load balancing.
Under the assumption that the FS full data stripe 34 is distributed to the BBIOs, i.e., BB0, BB1 and BB2, and assuming the BB0 node is assigned the function of a primary BB node for the FS full data stripe “write” request 34, the BB0 node will allocate a full data stripe sized buffer 42 and initiate “read” requests to the participating BB nodes, i.e., BB1 and BB2, that hold the fragments 38 and 40, respectively, which make up the full data stripe 34. The data is transferred to the buffer 42.
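The assembly step performed by the primary BB node can be sketched in a few lines. The following is a minimal illustration, assuming that each fragment is delivered as a byte string together with its offset inside the stripe; the function name `assemble_stripe` is hypothetical, and the "read" requests to the participating BB nodes are represented simply by the fragment list passed in.

```python
def assemble_stripe(stripe_size, fragments):
    """Copy fragments into a full-stripe-sized buffer.

    `fragments` is an iterable of (offset_in_stripe, data) pairs, where
    `data` is a bytes object. Raises ValueError if a fragment falls
    outside the stripe or if the fragments leave part of it unfilled.
    """
    buffer = bytearray(stripe_size)    # full data stripe sized buffer
    filled = bytearray(stripe_size)    # 1 for every byte that was written
    for offset, data in fragments:
        end = offset + len(data)
        if offset < 0 or end > stripe_size:
            raise ValueError("fragment lies outside the stripe")
        buffer[offset:end] = data
        filled[offset:end] = b"\x01" * len(data)
    if not all(filled):
        raise ValueError("stripe is incomplete")
    return bytes(buffer)


# Hypothetical 8-byte stripe built from a local fragment and two remote ones.
local_fragment = (0, b"AAAA")                  # already held by the primary node
remote_fragments = [(6, b"CC"), (4, b"BB")]    # returned by the other BB nodes
print(assemble_stripe(8, [local_fragment, *remote_fragments]))
```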
Subsequently, the BB0 node will issue a File System “write” request. Since the File System server requires the data in its buffer before it is stored in the data storage network, the data stripe 34 is copied from the BB0 buffer 42 to the FS buffer prior to transfer to the File System.
However, this method causes overly extensive network traffic when data in the Burst Buffer tier is highly fragmented amongst the BBIO nodes. Also, an additional network transfer is required when the full data stripe size buffer 42 is moved between the BBIO and PFSIO sections.
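The amount of network traffic generated by this method can be estimated with a short calculation. The following is a minimal sketch, assuming that every fragment not held by the primary BB node crosses the network once during assembly and that the assembled stripe then crosses the network once more on its way to the PFSIO section; the function name and the fragment sizes are hypothetical.

```python
def primary_assembly_traffic(fragment_bytes_by_node, primary):
    """Return (gather_bytes, staging_bytes) for one full data stripe.

    gather_bytes:  fragments read by the primary node from other BB nodes.
    staging_bytes: the assembled stripe moved from the BBIO to the PFSIO section.
    """
    stripe_bytes = sum(fragment_bytes_by_node.values())
    gather_bytes = stripe_bytes - fragment_bytes_by_node.get(primary, 0)
    staging_bytes = stripe_bytes
    return gather_bytes, staging_bytes


# Hypothetical 1024 KiB stripe spread over three BB nodes (sizes in KiB).
layout = {"BB0": 512, "BB1": 256, "BB2": 256}
gather, staging = primary_assembly_traffic(layout, primary="BB0")
total = gather + staging
print(f"network traffic: {total} KiB for a {staging} KiB stripe "
      f"({total / staging:.2f}x the stripe size)")
```

When the primary node holds none of the fragments, the gathered bytes equal the stripe size and the total traffic under these assumptions reaches twice the stripe size, which illustrates why highly fragmented data is costly to migrate in this manner.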
Therefore, it would be highly desirable to provide a system where superfluous transfers could be avoided, and where the data migration is performed in a highly efficient manner.