Operating systems generally manage and store information on one or more memory devices using a file system that organizes data in a file tree. File trees identify relationships between directories, subdirectories, and files.
In a distributed file system, data is stored among a plurality of network nodes. Files and directories are stored on individual nodes in the network and are combined into a file tree for the distributed file system. The file tree identifies the relationships among, and the locations of, the directories, subdirectories, and files distributed among the nodes in the network. Files in distributed file systems are typically accessed by traversing the overall file tree.
Occasionally, a file system may scan some or all of its files. For example, the file system or a user may want to search for files created or modified in a certain range of dates and/or times, files that have not been accessed for a certain period of time, files that are of a certain type, files that are a certain size, files with data stored on a particular memory device (e.g., a failed memory device), files that have other particular attributes, or combinations of the foregoing. Scanning for files by traversing multiple file tree paths in parallel is difficult because the tree may be very wide or very deep. Thus, file systems generally scan for files by sequentially traversing the file tree. However, file systems, and particularly distributed file systems, can be large enough to store hundreds of thousands of files or more. As a result, it can take a considerable amount of time for the file system to sequentially traverse the entire file tree.
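As a minimal illustration of such a scan, the following Python sketch walks a directory tree sequentially and yields the files that satisfy a caller-supplied predicate. The names `scan_tree`, `larger_than`, and `not_accessed_for` are assumptions introduced here for illustration and do not correspond to any particular file system's interface.

```python
import os
import time
from typing import Callable, Iterator


def scan_tree(root: str, matches: Callable[[os.stat_result], bool]) -> Iterator[str]:
    """Sequentially traverse the tree under root, yielding paths of matching files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if matches(info):
                yield path


def larger_than(min_bytes: int) -> Callable[[os.stat_result], bool]:
    """Hypothetical predicate: files whose size exceeds min_bytes."""
    return lambda info: info.st_size > min_bytes


def not_accessed_for(seconds: float) -> Callable[[os.stat_result], bool]:
    """Hypothetical predicate: files whose last access is older than the given age."""
    cutoff = time.time() - seconds
    return lambda info: info.st_atime < cutoff
```

A call such as `list(scan_tree(root, larger_than(1 << 20)))` visits every directory under `root` one after another, so the running time grows with the total number of entries in the tree regardless of how few files match.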
Further, sequentially traversing the file tree wastes valuable system resources, such as the availability of central processing units to execute commands or the bandwidth available to send messages between nodes in a network. System resources are wasted, for example, by accessing structures stored throughout a cluster from one location, which may require significant communication between the nodes and scattered access to memory devices. The performance characteristics of disk drives, for example, vary considerably based on the access pattern. Thus, the scattered access to a disk drive that results from sequentially traversing a file tree can significantly increase the amount of time used to scan the file system.
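The effect of scattered access can be made concrete with a toy model. In the Python sketch below, each tree entry records which node stores it; the `Entry` class, the node identifiers, and the "switch" metric are hypothetical and serve only to contrast the order in which a tree traversal visits entries with the order in which they are stored.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Entry:
    name: str
    node: int  # hypothetical id of the network node that stores this entry
    children: List["Entry"] = field(default_factory=list)


def tree_order(root: Entry) -> Iterator[Entry]:
    """Depth-first, pre-order traversal: the order a sequential tree scan visits entries."""
    stack = [root]
    while stack:
        entry = stack.pop()
        yield entry
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(entry.children))


def node_switches(entries: List[Entry]) -> int:
    """Count how often two consecutive accesses go to different nodes."""
    return sum(1 for a, b in zip(entries, entries[1:]) if a.node != b.node)


root = Entry("/", 0, [
    Entry("a", 1, [Entry("a/x", 2), Entry("a/y", 0)]),
    Entry("b", 2, [Entry("b/x", 1), Entry("b/y", 0)]),
])

in_tree_order = list(tree_order(root))
grouped_by_node = sorted(in_tree_order, key=lambda e: e.node)

print(node_switches(in_tree_order))    # 6: every step lands on a different node
print(node_switches(grouped_by_node))  # 2: one switch per additional node
```

The model ignores caching, seek distance, and message sizes; it only shows that the order imposed by the file tree need not coincide with the order in which entries are laid out across nodes.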