Computational power in modern High Performance Computing (HPC) platforms is rapidly increasing. Moore's Law alone accounts for a doubling of processing power roughly every 18 months. A historical analysis of the fastest computing platforms in fact shows a doubling of compute power in HPC systems roughly every 14 months, with the first petaflop computing platform occurring in 2008. This accelerated growth trend is due largely to an increase in the number of processor cores in HPC platforms; the current fastest computer has roughly 265,000 cores. An increase in the number of cores imposes two types of burdens on the storage subsystem: (1) larger data volume and (2) more requests. The data volume increases because the physical memory per core is generally kept balanced, resulting in a larger aggregate data volume, typically on the order of petabytes for petascale HPC systems. More cores also mean more file system clients, more input/output (I/O) requests to the storage servers, and ultimately more seeking at the back-end storage media while storing that data. This results in higher observed latencies and lower performance.
HPC sites typically implement parallel file systems to optimize the I/O subsystem for checkpointing. Checkpointing is a procedure, executed from time to time on an HPC node, in which the current state of an application is stored, typically on a disk-based storage system. Checkpointing, which involves periodic, heavy bursts of data followed by long latent periods, is the dominant I/O activity on most HPC systems. Because compute performance is greatly outpacing storage performance, storage systems are consuming an increasing percentage of the overall HPC machine budget. Consequently, storage systems now comprise an increasing number of distributed storage nodes. In the current environment, however, disk bandwidth performance greatly lags behind that of CPU, memory, and interconnects. This means that as the number of cores continues to increase and outpace the performance improvement trends of storage devices, disproportionately larger storage systems will be necessary to accommodate the equivalent I/O workload.
Typically, large parallel storage systems expose only a portion of their aggregate spindle bandwidth to the application being executed by an HPC system. Optimally, the only bandwidth loss in the storage system would come from redundancy overhead. In practice, however, the modules used to compose parallel storage systems for HPC attain less than 50%, around 40%, of their aggregate spindle bandwidth. There are several possible reasons for this: (1) the aggregate spindle bandwidth is greater than the bandwidth of the connecting bus; (2) the output of the RAID controller's parity calculation engine is slower than the connecting bus; and (3) sub-optimal LBA (logical block addressing) request ordering caused by the filesystem. The first two factors are direct functions of the storage controller and may be rectified by matching input and output bandwidth from the host to disk. The last factor, which is essentially the “seek” overhead, is more difficult to overcome because of the codependence of the disk layer and filesystem on the simple linear block interface. The RAID layer further complicates matters by incorporating several spindles into the same block device address range and forcing them to be managed in strict unison.
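The first two factors above amount to a bottleneck calculation: the storage module can deliver no more than its slowest stage allows. The following is a minimal sketch of that arithmetic, not a model of any particular controller. The function name, its parameters, and the example figures are hypothetical, and the seek overhead of factor (3) is reduced here to a single assumed efficiency fraction.

```python
def deliverable_bandwidth(spindle_count, per_spindle_mb_s, bus_mb_s,
                          parity_engine_mb_s, seek_efficiency=1.0):
    """Return (delivered MB/s, fraction of aggregate spindle bandwidth).

    All parameters are illustrative assumptions:
      spindle_count, per_spindle_mb_s -- define the aggregate spindle bandwidth
      bus_mb_s                        -- bandwidth of the connecting bus (factor 1)
      parity_engine_mb_s              -- parity engine output rate (factor 2)
      seek_efficiency                 -- fraction of streaming rate left after
                                         seek overhead (factor 3), between 0 and 1
    """
    aggregate = spindle_count * per_spindle_mb_s
    # The slowest stage between host and media bounds what the application sees.
    delivered = min(aggregate * seek_efficiency, bus_mb_s, parity_engine_mb_s)
    return delivered, delivered / aggregate


# Hypothetical module: 16 spindles at 100 MB/s behind an 800 MB/s bus
# and a 640 MB/s parity engine.
delivered, fraction = deliverable_bandwidth(16, 100.0, 800.0, 640.0)
```

With these assumed figures the parity engine is the limiting stage, so raising the spindle count alone would not raise the delivered bandwidth.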
Because the data storage process ties up the compute nodes, the computational application does not run while data is being stored, which reduces the net operational time of the computing system. HPC systems must store data frequently, either for recovery, to avoid the potential loss of data due to overall system instability, or for post-processing. The downtime in computational application operations that is associated with the data storage process creates a significant drain on the overall operations of the computing system.
Many parallel file systems address this problem by increasing the number of distributed storage nodes and making the data placement on disk more predictable, concentrating on the effective channeling of data to its final destination. This approach of increasing the number of storage nodes adds significant costs to the overall computing system.
In addition, today's parallel I/O infrastructures typically use two inferential systems for data storage that inhibit improvements in spindle bandwidth. They are: (1) object-based parallel file system metadata schema and (2) block-level RAID parity group association. Object-based parallel file systems use file-object maps to describe the locations of a file's data. These maps are key components to the efficiency of the object-storage method because they allow arbitrary amounts of data to be indexed by a very small data structure composed merely of an ordered list of storage servers and a stride. In essence, the map describes the location of the file's sub-files and the number of bytes that may be accessed before proceeding to the next sub-file or stripe. Despite the obvious advantages for metadata storage, there are several caveats to this process. The most obvious is that the sub-files are the static products of an object metadata model that was designed with its own efficiency in mind. The result is an overly deterministic data placement method that, by forcing I/O into a specific sub-file, increases complexity at the spindle because the backing filesystem's block allocation schemes cannot guarantee sequentiality in the face of thousands or millions of simultaneous I/O streams.
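To make the file-object map concrete, the following is a minimal sketch of how an ordered list of storage servers and a stride can resolve any file offset to a sub-file location. It assumes simple round-robin striping; the class name `FileObjectMap`, the method `locate`, and the server names are hypothetical and do not correspond to any specific parallel file system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FileObjectMap:
    """Hypothetical file-object map: an ordered server list plus a stride."""
    servers: Tuple[str, ...]  # ordered list of storage servers, one sub-file each
    stride: int               # bytes written to one sub-file before moving on

    def locate(self, offset: int) -> Tuple[str, int]:
        """Map a file offset to (server, offset within that server's sub-file).

        Assumes round-robin striping across the servers in list order.
        """
        if offset < 0:
            raise ValueError("offset must be non-negative")
        stripe_index, within_stripe = divmod(offset, self.stride)
        server_index = stripe_index % len(self.servers)
        # Number of complete passes over the server list before this stripe.
        pass_number = stripe_index // len(self.servers)
        return self.servers[server_index], pass_number * self.stride + within_stripe


# A three-server map with a 4-byte stride (tiny values chosen for readability).
fmap = FileObjectMap(servers=("oss0", "oss1", "oss2"), stride=4)
first = fmap.locate(0)    # first byte of the file
later = fmap.locate(13)   # second pass over the server list
```

The placement is fully determined by the offset and the two map fields, which is what keeps the metadata small and also what makes the placement rigid: every client writing a given offset is forced into the same sub-file.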
RAID systems infer that every same-numbered block within the respective set of spindles is bound together to form a protected unit. This method is effective because only the address of a failed block is needed to determine the location of its protection unit “cohorts”, with no further state being stored. Despite this inferential advantage, strict or loose parity clustering can be detrimental to performance because it pushes data to specific regions on specific disks.
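The inference described above can be sketched in a few lines. The example assumes single XOR parity with one parity block per protected unit, which is only one possible scheme; the function names and the in-memory `disks` layout are hypothetical and chosen for illustration.

```python
from functools import reduce
from typing import List, Sequence, Tuple


def cohorts(failed_disk: int, block: int, num_disks: int) -> List[Tuple[int, int]]:
    """Infer the (disk, block) addresses protecting a failed block.

    No lookup table is consulted: the cohorts are the same-numbered block
    on every other spindle in the set.
    """
    return [(d, block) for d in range(num_disks) if d != failed_disk]


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(disks: Sequence[Sequence[bytes]], failed_disk: int, block: int) -> bytes:
    """Rebuild a lost block from its cohorts (assumes single XOR parity)."""
    survivors = [disks[d][b] for d, b in cohorts(failed_disk, block, len(disks))]
    return xor_blocks(survivors)


# Two data spindles and one parity spindle, each holding a single 2-byte block.
data0, data1 = b"\x01\x02", b"\x0f\x10"
parity = xor_blocks([data0, data1])
disks = [[data0], [data1], [parity]]
rebuilt = reconstruct(disks, failed_disk=1, block=0)
```

Because the cohort addresses are fixed by the block number, each block of a protected unit must reside at the same address on its spindle, which is the placement constraint the paragraph above identifies as detrimental to performance.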