It is common in high-performance computing environments and other information processing system applications for multiple compute nodes to access a shared file system. For example, high-performance computer systems such as supercomputers typically include large numbers of compute nodes that access a parallel file system, distributed file system, or other type of cluster file system. A cluster file system, as the term is broadly used herein, generally allows multiple client devices to share access to files over a network.
Well-known examples of cluster file systems include the Lustre file system and distributed file systems such as the Hadoop Distributed File System (HDFS). These and other file systems utilized by high-performance computer systems can readily scale to support tens of thousands of clients, petabytes of storage, and hundreds of gigabytes per second of aggregate input-output (IO) throughput.
A problem that arises in these and other information processing system applications relates to the handling of small data files generated by processes running on the various compute nodes. If a large number of such data files are generated substantially concurrently by multiple compute nodes, an excessive number of accesses to the file system may be required, thereby undermining the IO throughput performance.