Data-intensive scientific and commercial applications running on today's most powerful computing systems place heavy demands on high-performance input/output (I/O) systems. While emerging massively parallel systems with thousands of processors (e.g. IBM's Blue Gene/L) handle most computationally intensive challenges, data-intensive computing, for scientific and non-scientific applications alike, remains a major area of interest because of the gap between computation speed and I/O speed. The seamless transfer of data between memory and a file system is crucial if large-scale parallel programs are to achieve useful performance in a high-performance computing environment.
Scalable parallel I/O support in a parallel supercomputer consists mainly of high-performance file systems and effective parallel I/O application programming interfaces (APIs). There have been many efforts to develop parallel file systems for supercomputers, such as GPFS (General Parallel File System) [Refer to F. B. Schmuck and R. L. Haskin, GPFS: a shared-disk file system for large computing clusters, in Proceedings of the Conference on File and Storage Technologies (FAST'02), 2002] for IBM SP systems as well as Linux clusters, and PVFS [See “PVFS2 Parallel Virtual File System,”] and Lustre [See “Lustre scalable storage,” Copyright 2006 Cluster File Systems] for Linux-based platforms. In terms of application programming interfaces, MPI (Message Passing Interface) I/O is effectively synonymous with parallel file I/O for scientific computing, because of its wide use and its base in MPI. MPI I/O supports relatively rich file access patterns and operations for concurrently accessing a single file, which allows aggressive optimizations to be integrated. However, to exploit the I/O performance of data-intensive applications, parallel I/O support is needed at several levels: the computation system, the file system, and the parallel I/O API. GPFS is highly optimized for large-chunk I/O operations with regular access patterns (contiguous or regularly striped). Its performance for small-chunk, non-contiguous I/O operations with irregular access patterns (non-constant striped), on the other hand, is less optimized. In particular, concurrent accesses from distinct processes to different file regions in the same GPFS striping block introduce additional file system activity associated with the file locking mechanism, which can hurt performance.
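The striping-block contention described above can be illustrated with simple offset arithmetic. The following is a minimal sketch, not a model of the GPFS lock manager: the 1 MiB block size, the request tuples and the function names are all hypothetical, chosen only to show when requests from distinct processes land in the same block.

```python
from collections import defaultdict

def blocks_touched(offset, length, block_size):
    """Return the block indices covered by the byte range [offset, offset + length)."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)

def shared_blocks(requests, block_size):
    """Map block index -> set of ranks, keeping only blocks touched by more than one rank.

    `requests` is an iterable of (rank, offset, length) tuples (hypothetical layout).
    """
    owners = defaultdict(set)
    for rank, offset, length in requests:
        if length <= 0:
            continue  # an empty request touches no block
        for block in blocks_touched(offset, length, block_size):
            owners[block].add(rank)
    return {b: ranks for b, ranks in owners.items() if len(ranks) > 1}

BLOCK = 1 << 20  # assumed 1 MiB striping block, for illustration only

# Four ranks each access a distinct 4 KiB region, but all regions sit in block 0.
small = [(rank, rank * 4096, 4096) for rank in range(4)]
# Four ranks each access one whole, block-aligned region.
aligned = [(rank, rank * BLOCK, BLOCK) for rank in range(4)]

print(shared_blocks(small, BLOCK))    # {0: {0, 1, 2, 3}}
print(shared_blocks(aligned, BLOCK))  # {}
```

In the first case every process touches a different file region yet all of them share one block, which is the situation in which additional locking activity would be expected; in the block-aligned case no block is shared.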
In the past, computation far outweighed file I/O in scientific applications. However, many of today's large-scale scientific operations require real-time data input and output (I/O), increasing the demand for both large-scale computation and file I/O, either simultaneously or at regular or irregular intervals. As a result, such applications need both large-chunk and small-chunk file access, at both regular and irregular intervals.
File I/O within a compute node (computer processor) begins when an application calls into a library (usually libc), for example through a file positioning operation (fseek). The library initiates any read operation within the compute node. With a modern network-based file system (NFS), the compute node kernel is usually assisted by an input/output (I/O) daemon in communicating with the outside world. Any communication with the outside world (including other compute nodes, file systems or disks) is carried out over computer networks (referred to as the communication tree, Ethernet) by sending data as communication packets. These packets normally contain the output results, which are sent to file servers (or disks) to be stored, or carry file I/O data for further processing or future use.
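The application-side step of this path, positioning within a file and then reading, can be sketched as follows. This is an illustrative example only: Python's buffered file object stands in for the libc layer (its `seek` plays the role of `fseek`), and the file name `data.bin` and the helper `read_at` are hypothetical.

```python
import os
import tempfile

def read_at(path, offset, length):
    """Position the file pointer at `offset`, then read up to `length` bytes."""
    with open(path, "rb") as f:
        f.seek(offset)         # reposition, as fseek would in C
        return f.read(length)  # the library issues the underlying read

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    # Create a small test file holding the byte values 0..255.
    with open(path, "wb") as f:
        f.write(bytes(range(256)))
    chunk = read_at(path, 16, 4)

print(list(chunk))  # [16, 17, 18, 19]
```

On a node that mounts a network file system, the same two calls would be serviced by the kernel and an I/O daemon rather than by a local disk; nothing in the application code changes.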
Referring to FIG. 1, there is shown an example of a basic file I/O mechanism for a simple single-node computer system 100. The Central Processing Unit (CPU) 101 performs a read function, which is re-issued by the operating system kernel. The operation is a request to transfer data from disk 104 to regions of memory 103. The CPU controls the disk and memory, directly or indirectly (e.g. through DMA), to perform the data transfer. In the general case, the data transfer does not involve cache memory 102.
Referring to FIG. 2, there is shown a simple illustration of the hardware and software stacks associated with a file I/O mechanism 200. The single node computer system 100 contains the operating system (OS) 201 and an interface for accessing the file server 203 through the Ethernet 202.
In parallel supercomputers, however, the presence of multiple computation cores, multiple file I/O daemons and multiple file servers makes the scheduling, assignment and allocation of file I/O disks or file servers very complex. Depending on the requirements (large-scale computation or file I/O), the numbers of compute nodes, associated file I/O daemons and file servers are varied to minimize or maximize the application-initiated operations. One common procedure in recent parallel supercomputers for handling large chunks of file I/O followed by a long phase of computation is to separate the computation from the I/O operations and carry them out on separate compute-intensive and I/O-intensive processors. Assigning separate processors to computation and to I/O makes it easier to parallelize both, and also makes it possible to optimize the data distribution, thereby minimizing data traffic and operations between different computer processors and file servers.
Referring to FIG. 3, there is shown a typical compute node (CN) 301 (also referred to as a single node computer system 100 specializing in computation) and an I/O node 302 (a computer processor specializing in and running I/O daemons). A file server 203 for carrying out file I/O in a typical parallel supercomputer is also shown. The compute node kernel 201 interacts with a computation network 303 and an internal I/O network 304. This illustrates the separation of computation and I/O functionalities for a computer system with at least one node functioning as a compute node 301 and another node functioning as an I/O node 302. The dedicated I/O node 302 contains its own kernel 305 that supports normal file I/O operations.
Referring to FIG. 4, there is shown an example of a computing system optimized for high-performance file I/O that exploits state-of-the-art technologies. The system is composed of three primary components: a cluster of compute nodes 301 (CN), a storage sub-system 402, and an Ethernet fabric 403 that connects the cluster of compute nodes 301 to the storage sub-system 402. In the system, the storage sub-system 402 can be viewed as a set of file servers that provide data to the compute nodes 301, which in turn can be viewed as file clients. The storage sub-system 402 is composed of magnetic disks 104 (physical persistent storage) and Network Shared Disks (NSD) 404 (a cluster of computers that collectively optimizes accesses to the disks). The disks connect to the NSD 404 via a Storage Area Network (SAN) fabric 405 (a type of network designed to attach computer storage devices).
Referring now to FIG. 5, a dedicated computation network 501 is used for communication among compute nodes. Some compute nodes 301 are chosen as I/O aggregators 502. Among all the compute nodes, only the I/O aggregators 502 communicate with the backend file system, which is composed of an array of NSD (network shared disks) servers 404, the storage area network (SAN) fabric 405, and the actual disks 104. The interconnect between the I/O aggregators 502 and the NSD servers 404 is Ethernet 403.
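The aggregator arrangement of FIG. 5 can be sketched as a simple routing step. The text does not state how aggregators are chosen, so the policy below (the first node of every fixed-size group) is an assumption made purely for illustration, as are the function names and the request layout.

```python
from collections import defaultdict

def choose_aggregators(num_nodes, group_size):
    """Assumed policy: the first node of each group of `group_size` nodes aggregates."""
    return list(range(0, num_nodes, group_size))

def aggregator_of(node, group_size):
    """Return the aggregator responsible for `node` under the assumed policy."""
    return (node // group_size) * group_size

def route_requests(requests, group_size):
    """Group (node, offset, length) requests by the aggregator that forwards them."""
    routed = defaultdict(list)
    for node, offset, length in requests:
        routed[aggregator_of(node, group_size)].append((offset, length))
    # Sort by offset so each aggregator sees its requests in file order.
    return {agg: sorted(reqs) for agg, reqs in routed.items()}

# Eight compute nodes, each wanting one 512-byte piece of the file.
requests = [(node, node * 512, 512) for node in range(8)]
aggregators = choose_aggregators(8, 4)
routed = route_requests(requests, 4)

print(aggregators)  # [0, 4]
print(routed[4])    # [(2048, 512), (2560, 512), (3072, 512), (3584, 512)]
```

Only the two aggregators would then talk to the backend file system; the other six nodes hand their requests over through the computation network.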
In terms of a parallel I/O application programming interface, the most widely used interface appears to be the POSIX file I/O interface (e.g. the POSIX write( )/read( ) system calls). However, such an interface is not efficient enough for parallel I/O. First, it supports only contiguous I/O requests. Second, it provides no means for the concurrent processes of a large-scale parallel application (which usually use message passing for inter-process communication) to coordinate their I/O requests. Modern parallel file systems usually deliver high data throughput when file I/O requests access disk data in large, contiguous chunks. On the other hand, most data-intensive applications issue many file data requests with small-chunk, non-contiguous patterns. There is a requirement for a mechanism or method that handles these different scenarios in a manner that is transparent to the user.
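The first limitation can be made concrete with a short sketch. Because each POSIX read covers one contiguous byte range, a non-contiguous (strided) access has to be issued as one call per piece. The example assumes a Unix-like system where Python's `os.pread` is available; the helper `read_strided` and the test file are hypothetical.

```python
import os
import tempfile

def read_strided(fd, start, piece, stride, count):
    """Read `count` pieces of `piece` bytes spaced `stride` bytes apart.

    Each piece needs its own positioned read, so the number of calls
    grows with the number of pieces rather than with the bytes moved.
    """
    pieces = []
    for i in range(count):
        pieces.append(os.pread(fd, piece, start + i * stride))
    return b"".join(pieces), count

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    # Create a small test file holding the byte values 0..63.
    with open(path, "wb") as f:
        f.write(bytes(range(64)))
    fd = os.open(path, os.O_RDONLY)
    try:
        data, calls = read_strided(fd, start=0, piece=2, stride=16, count=4)
    finally:
        os.close(fd)

print(list(data), calls)  # [0, 1, 16, 17, 32, 33, 48, 49] 4
```

Eight bytes cost four separate calls here. With thousands of processes each issuing many such small requests, and no way for them to coordinate, the request count rather than the data volume tends to dominate.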
Existing parallel file systems (for example GPFS) perform well for contiguous and regularly striped file access with large-chunk I/O operations. However, they handle poorly the small-chunk, irregular file access patterns frequently encountered in scientific applications, which leads to severe performance penalties.
Therefore, there is a need for a solution that meets such demands by delivering useful performance for massively parallel file systems.