The present invention relates to computer processing, and more specifically, to transferring data between a compute node and a file system.
Applications running on compute nodes often need to access common files. In high performance computing (HPC) systems, the files are traditionally stored on large parallel file systems, such as GPFS (General Parallel File System) or Lustre. HPC systems place a large demand on file systems to deliver data to the compute nodes on demand for fetch and store operations. Programs running on these systems often generate or use data faster than the data can be written to or read from the file system. One particular concern is checkpoint files. For some systems, a single checkpoint might exceed two petabytes. Even assuming peak throughput of the file system, a checkpoint file of this size may take around 20-30 minutes to read, and writing each new checkpoint to disk takes another 20-30 minutes. This data needs to be transferred before a job on the compute node can start or resume any useful calculation.
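As a rough illustration of the figures above, the following sketch computes the aggregate file system throughput implied by moving a two-petabyte checkpoint in 20 or 30 minutes. The decimal definition of a petabyte and the function name are assumptions made for this example only; the result is a back-of-the-envelope estimate, not a measured value.

```python
# Back-of-the-envelope estimate of the throughput implied by the checkpoint
# figures in the text. Assumption: decimal units (1 PB = 10**15 bytes).
PETABYTE = 10**15
TERABYTE = 10**12


def implied_throughput_tb_per_s(size_bytes, minutes):
    """Return the sustained throughput, in TB/s, needed to move
    size_bytes in the given number of minutes."""
    seconds = minutes * 60
    return size_bytes / seconds / TERABYTE


checkpoint_bytes = 2 * PETABYTE  # "a single checkpoint might exceed two petabytes"

for minutes in (20, 30):
    rate = implied_throughput_tb_per_s(checkpoint_bytes, minutes)
    print(f"{minutes} minutes -> about {rate:.2f} TB/s")
```

Under these assumptions the file system would need to sustain on the order of one to two terabytes per second for the whole transfer, during which the job performs no useful calculation.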
Classical file transfer would be performed as a file-level cp (copy), mv (move), rsync, or ftp (file transfer protocol) operation. These solutions use the processors and operating system to read and write the file system data on both sides of the communication. They also use CPU (central processing unit) cycles to transfer the data. This activity takes performance away from applications that are running on the system.
In addition, these types of classical file transfers can cause “jitter” on the compute nodes. Jitter refers to the interference experienced by an application due to scheduling of background processes and handling of asynchronous events such as interrupts. Aggregate interrupts across the compute nodes occur at random times with respect to each other, but the compute nodes may be working simultaneously on the same job. The operating system (OS) is interrupted, data is copied into the file system, data is instantiated in the file system, and then data is pushed to a solid state disk (SSD) or other non-volatile storage associated with the compute node. With systems that may have 5,000 or more compute nodes, this type of OS noise degrades performance of the system.
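One simplified way to see why jitter matters more as the node count grows is to assume that all nodes of a job proceed in lockstep, so a step is delayed whenever any single node is interrupted. The sketch below uses that assumption, together with independent interrupts and a hypothetical per-node interrupt probability, neither of which comes from the text above; it is an illustrative model, not a characterization of any particular system.

```python
# Illustrative model only. Assumptions (not from the source text):
#   - nodes work in lockstep, so one interrupted node delays the whole step
#   - interrupts on different nodes are independent
#   - P_PER_NODE is a hypothetical per-step interrupt probability
def prob_any_node_interrupted(num_nodes, p_per_node):
    """Probability that at least one of num_nodes is interrupted in a step."""
    return 1.0 - (1.0 - p_per_node) ** num_nodes


P_PER_NODE = 0.001  # hypothetical value chosen for illustration

for num_nodes in (1, 100, 5000):
    p = prob_any_node_interrupted(num_nodes, P_PER_NODE)
    print(f"{num_nodes:>5} nodes: probability a step is delayed = {p:.3f}")
```

In this model an interrupt that is rare on any one node becomes close to certain somewhere in a 5,000-node job, which is one way to read the statement that OS noise degrades performance of the system as a whole.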
An alternative solution might be to extend the parallel file system, using the compute node SSDs as a large cache. This solution adds complexity to the file system since it would now need to know compute node affinity for certain files. This solution would also likely use CPU cycles to manage coherency between the disjoint compute nodes. File systems may also run into scaling issues with 5,000 or more file servers.