Today a file system with billions of files, millions of directories and petabytes of storage is no longer an exception [32]. As file systems grow, users and administrators are increasingly keen to perform complex queries [40], [50], such as “How many files have been updated in the last ten days?” and “Which are the five largest files that belong to John?”. The first is an example of an aggregate query, which provides a high-level summary of all or part of the file system, while the second is a top-k query, which locates the k files and/or directories that have the highest scores according to a scoring function. Fast processing of aggregate and top-k queries is often needed by applications that require just-in-time analytics over large file systems, such as data management and archiving. The just-in-time requirement is defined by two properties: (1) file-system analytics must be completed with a small access cost, i.e., after accessing only a small percentage of the directories/files in the system (in order to ensure efficiency), and (2) the analyzer holds no prior knowledge (e.g., pre-processing results) of the file system being analyzed. For example, in order for a librarian to determine how to build an image archive from an external storage medium (e.g., a Blu-ray disc), he/she may first have to estimate the total size of the picture files stored on it; the librarian needs to complete the analysis quickly, over an alien file system that has never been seen before.
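To make the semantics of the two example queries concrete, the following minimal Python sketch answers them by exhaustively scanning a directory tree. It is an illustration only: a full scan is precisely what the just-in-time requirement rules out, and the function names (`scan_files`, `count_modified_since`, `top_k_largest`) and the use of file size as the scoring function are assumptions made for this example, not part of any system described here.

```python
import heapq
import os
import stat
import time


def scan_files(root):
    """Yield (path, stat_result) for every regular file under root (full scan)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield path, st


def count_modified_since(root, days, now=None):
    """Aggregate query: number of files modified within the last `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    return sum(1 for _path, st in scan_files(root) if st.st_mtime >= cutoff)


def top_k_largest(root, k, owner_uid=None):
    """Top-k query: the k largest files, optionally restricted to one owner.

    The scoring function here is simply the file size in bytes (an assumption
    for this example); results are (size, path) pairs, largest first.
    """
    candidates = (
        (st.st_size, path)
        for path, st in scan_files(root)
        if owner_uid is None or st.st_uid == owner_uid
    )
    return heapq.nlargest(k, candidates)
```

For instance, `count_modified_since("/data", 10)` corresponds to the first query and `top_k_largest("/data", 5, owner_uid=uid)` to the second, where `uid` is the numeric id of the user in question (on Unix it could be looked up with `pwd.getpwnam`). Both calls touch every file under the root, so their cost grows linearly with the size of the file system.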
Unfortunately, hierarchical file systems (e.g., ext3 and NTFS) are not well equipped for the task of just-in-time analytics [46]. The deficiency is generally due to the lack of a global view (i.e., high-level statistics) of metadata information (e.g., size and creation, access and modification times). For efficiency reasons, a hierarchical file system is usually designed to limit the update of metadata information to individual files and/or the immediately preceding directories, leading to localized views. For example, while the last modification time of an individual file is easily retrievable, the most recent modification time among all files that belong to user John is difficult to obtain because such metadata information is not available at the global level.
Currently, there are two approaches for generating high-level statistics from a hierarchical file system, and thereby answering aggregate and top-k queries: (1) The first approach is to scan the file system upon the arrival of each query, e.g., with the find command in Linux, which is inefficient for large file systems. While storage capacity increases at approximately 60% per year, storage throughput and latency have improved much more slowly. Thus the amount of time required to scan an off-the-shelf hard drive or external storage medium has increased significantly over time, to the point that a full scan is infeasible for just-in-time analytics. The above-mentioned image-archiving application is a typical example, as it is usually impossible to completely scan an alien Blu-ray disc efficiently. (2) The second approach is to utilize prebuilt indexes which are regularly updated [3], [7], [27], [35], [39], [43]. Many desktop search products belong to this category, e.g., Google Desktop [24] and Beagle [5].
While this approach is capable of fast query processing once the (slow) index building process is complete, it may not be suitable or applicable to many just-in-time applications. For instance, index building can be unrealistic for many applications that require just-in-time analytics over an alien file system. Even if an index can be built up-front, its significant cost may not be justifiable if the index is not frequently used afterwards. Unfortunately, this is common for some large file systems: storage archives or scratch data for scientific applications, for example, scarcely require the global search function offered by the index, and may only need analytical queries to be answered infrequently (e.g., once every few days). In this case, building and updating an index is often overkill given the high amortized cost.
There are also other limitations of maintaining an index. For example, prior work [49] has shown that even after a file has been completely removed (from both the file system and the index), the (former) existence of this file can still be inferred from the index structure. Thus, a file system owner may choose to avoid building an index because of privacy concerns.