Operating systems for data processing computers typically provide an interface between the application programs running on the computers and files which are physically stored on disks (e.g., optical or magnetic) using a hierarchical or "tree" file structure. The hierarchical file structure of the disks includes a root directory, sub-directories located at any of several levels from the root directory, and multiple files typically stored in the sub-directories. Each file on the disk may be accessed using either its absolute pathname to specify a path from the root directory to the file, or using its relative pathname to specify a path from the current working directory (CWD) to the file. Each directory or file which is specified in the pathname forms a component of the pathname for the given file.
Referring to FIG. 1, an exemplary hierarchical file structure includes a root directory 10 and sub-directories 12 and 14 located at a first level from root directory 10. Sub-directories 16 and 18 are located in sub-directory 12 at a second level from root directory 10. Sub-directories 20 and 22 and file 24 are located in sub-directory 16 at a third level from root directory 10. Files 26 and 28 are stored in sub-directory 20, and files 30 and 32 are stored in sub-directory 22, at a fourth level from root directory 10. Sub-directory 34 and file 36 are located in sub-directory 18 at the third level, and file 38 is stored in sub-directory 34 at the fourth level. Similarly, sub-directories 40 and 42 are located in sub-directory 14 at the second level. Files 44, 46 and 48 are stored in sub-directory 40 at the third level. Sub-directory 50 and file 52 are stored in sub-directory 42 at the third level, and files 54 and 56 are stored in sub-directory 50 at the fourth level. Thus, a given file is stored at the end of a path which starts at root directory 10 and passes through possibly several sub-directories. A given file can be accessed using either its absolute or its relative pathname plus an appended filename, referred to hereinafter as a "pathname/filename." For example, assuming that the CWD is DIR3, the absolute pathname/filename of file 32 (i.e., FILE10) can be specified as /DIR1/DIR3/DIR8/FILE10, and its relative pathname/filename can be specified as DIR8/FILE10. The manner in which file 32 (i.e., FILE10) is accessed will be used throughout this document for illustrative purposes. The same techniques could be used to access any other file in the structure. As will be readily apparent to a person of skill in the art, other exemplary file structures may have more or fewer levels, and more or fewer sub-directories and files located at any directory or sub-directory at any level from the root.
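The component structure of the two pathname/filenames in this example can be sketched as follows. This is a hypothetical Python illustration, not part of the OS/400 operating system; the helper `components` is introduced only to show how a pathname/filename breaks into its components:

```python
# Hypothetical sketch: the absolute and relative pathname/filenames of
# file 32 (FILE10) from FIG. 1, assuming the CWD is DIR3.
absolute = "/DIR1/DIR3/DIR8/FILE10"
relative = "DIR8/FILE10"

def components(pathname):
    """Each directory or file named in the pathname is one component."""
    return pathname.strip("/").split("/")

# The absolute form names every component from the root down to the file.
assert components(absolute) == ["DIR1", "DIR3", "DIR8", "FILE10"]
# The relative form names only the components below the CWD.
assert components(relative) == ["DIR8", "FILE10"]
# By convention, an absolute pathname starts with "/"; a relative one does not.
assert absolute.startswith("/") and not relative.startswith("/")
```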
The operating system interface to the directories and files that are physically stored on a disk typically includes a set of interface functions or procedures which are designed to access the files in a standard manner. For example, the OS/400 operating system, available from International Business Machines Corporation (IBM) of Armonk, New York, includes a set of application programming interface (API) functions such as open(), close(), read(), write() and stat(). These API functions are operating system services that can be called or invoked by application programs to access files or directories stored on a disk or another memory medium installed in the computer, thereby avoiding the need for application programs to interface directly with the physical media. Thus, the API function set provides system services which allow application programs to be easily interfaced with files stored on any physical medium, including magnetic disks, optical disks (e.g., CD-ROMs), etc.
Each call or invocation of a file interface function or procedure that takes a pathname/filename as a parameter (e.g., the opening of a file such as file 32 in FIG. 1) typically requires the operating system to access the drive for the disk and to search through all of the components (i.e., directories and sub-directories) specified in the file's pathname. The number of searches will depend upon the number of components in the pathname. For example, when file 32 is specified by its absolute pathname, beginning at root directory 10, the operating system will search for sub-directory 12 in root directory 10, then search for sub-directory 16 in sub-directory 12, then search for sub-directory 22 in sub-directory 16, and finally search for file 32 in sub-directory 22. Thus, four searches are required since file 32 is located at the fourth level from root directory 10. Similarly, when the CWD is DIR3 (i.e., sub-directory 16) and file 32 is specified by its relative pathname/filename, the operating system will search for sub-directory 22 in sub-directory 16, and then search for file 32 in sub-directory 22. Thus, two directory searches will be needed since file 32 is two levels from the CWD. Each search generally requires a separate access of the physical disk which stores a representation of the directory.
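The relationship between search count and pathname length can be made concrete with a short sketch. This is a hypothetical Python illustration (not operating system code): one on-disk directory search is charged per pathname component, whether resolution begins at the root or at the CWD:

```python
def search_count(pathname):
    """Hypothetical model of resolution cost: one on-disk directory search
    per pathname component, regardless of whether resolution starts at the
    root (absolute pathname) or at the CWD (relative pathname)."""
    return len(pathname.strip("/").split("/"))

# Absolute pathname of file 32: four components, hence four searches.
assert search_count("/DIR1/DIR3/DIR8/FILE10") == 4
# Relative pathname from the CWD (DIR3): two components, hence two searches.
assert search_count("DIR8/FILE10") == 2
```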
"Namespace resolution" refers to the process of converting or translating a hierarchical pathname/filename into a representation of the file named by the path. In the context of the OS/400 operating system, namespace resolution refers to translation of the pathname for a given file or directory into the in-core or "vnode" representation of the given file or directory. For absolute pathname/filenames (i.e., those starting with a "/"), translation starts at the root directory of the namespace. For relative pathname/filenames (i.e., those not starting with a "/"), translation starts at the CWD. To illustrate namespace resolution by the OS/400 operating system, assume that the pathname/filename to be translated is "/DIR1/DIR3/DIR8/FILE10". Namespace resolution of this pathname begins by searching the root directory ("/") for the first component of the pathname (i.e., "DIR1"). The logical file system ("LFS") of the OS/400 operating system searches for a component in a directory by calling the "vn_lookup" operation of the vnode representing the directory using the name of the component as an input to the operation. When invoked, the operation generally must access the on-disk representation of the directory to search for the named component. If the search is successful, the operation returns the vnode representation of the component (e.g., the vnode representation of "/DIR1"). Then, the LFS searches for the next pathname component (i.e., "DIR3") using the "vn_lookup" operation of the vnode representing the previous component ("/DIR1") that was just found. If the search is again successful, the operation returns the vnode representation of this next pathname component (i.e., the vnode representation of "/DIR1/DIR3"). This main loop is repeated until the entire pathname is resolved (i.e., until the vnode representing the pathname "/DIR1/DIR3/DIR8/FILE10" is obtained).
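The main loop described above can be sketched as follows. This is a hypothetical Python rendering, not the OS/400 implementation: the Vnode class and the tree built below are illustrative stand-ins, and vn_lookup here merely indexes an in-memory dictionary where the real operation would have to read the on-disk representation of the directory:

```python
class Vnode:
    """Hypothetical stand-in for the in-core representation of a directory
    or file; not the actual OS/400 data structure."""
    def __init__(self, name, entries=None):
        self.name = name
        self.entries = entries or {}        # component name -> Vnode

    def vn_lookup(self, component):
        """Search this directory for a named component.  In a real system
        this step accesses the on-disk representation of the directory."""
        return self.entries[component]      # raises KeyError if not found

def namespace_resolve(pathname, root, cwd):
    """Translate a pathname/filename into its vnode, one component at a
    time: start at the root for an absolute pathname, else at the CWD."""
    vnode = root if pathname.startswith("/") else cwd
    for component in pathname.strip("/").split("/"):
        vnode = vnode.vn_lookup(component)  # one directory search per component
    return vnode

# Build the portion of the FIG. 1 example named in the text:
# / -> DIR1 -> DIR3 -> DIR8 -> FILE10.
file10 = Vnode("FILE10")
dir8 = Vnode("DIR8", {"FILE10": file10})
dir3 = Vnode("DIR3", {"DIR8": dir8})
dir1 = Vnode("DIR1", {"DIR3": dir3})
root = Vnode("/", {"DIR1": dir1})

# Absolute and relative resolution reach the same vnode when the CWD is DIR3.
assert namespace_resolve("/DIR1/DIR3/DIR8/FILE10", root, cwd=dir3) is file10
assert namespace_resolve("DIR8/FILE10", root, cwd=dir3) is file10
```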
A similar process is used to resolve relative pathnames, except that it starts at the CWD. This process is similar to the traditional namespace resolution process used by the UNIX operating system, referred to as "namei" in the UNIX environment.
The cost (i.e., execution time) of performing the namespace resolution for the pathname operated on by a file interface function or procedure (e.g., an API function in the OS/400 operating system) can be modeled by the number of directory searches that occur during the namespace resolution. Thus, the cost of the namespace resolution of the pathname "/DIR1/DIR3/DIR8/FILE10" in the above example may be modeled by the four searches required. The directory searches are the most expensive operation the resolution process performs repetitively, for several reasons. First, the time required for the search can be adversely affected where there are many files or sub-directories in the directory being searched. The number of components in each directory being searched, however, depends upon the particular application. Second, the search for each component often requires the operating system to access the disk where the previous component is physically stored, and then search for the component on the disk. Disk accesses are relatively slow since the physical block or blocks on the disk where the component resides must be computed, and a sensing or read head must be moved there. The physical movement of the head to the position of the component takes a relatively long time in comparison to the typical operating speed of the computer. Thus, it is desirable to minimize the number of searches required during the namespace resolution, thereby minimizing the disk head movement needed to fully resolve a pathname.
Caching can be used to minimize the number of searches required on the disk, thus minimizing the cost of namespace resolution. A cache is a relatively small amount of memory space allocated for storage of copies of frequently used data. When an instruction or data item stored on a disk has been cached, and the instruction or data item is accessed, a cache "hit" occurs, and the instruction or data item is read from the cache instead of from the slower disk, thereby increasing system speed. However, when the item being accessed is not available in the cache (i.e., a cache miss), the slower memory medium must still be accessed. In the ideal case, system speed could be increased by caching all of the instructions or data items which can be accessed. For example, by caching the entire on-disk representation of a file structure directory, namespace resolutions could be performed quickly since no disk accesses would be required. However, practical constraints, such as the relatively high cost of cache memory, limit the number of items which can be cached. For example, practical constraints may allow only a subset of the available pathname/filenames in a namespace to be cached (e.g., the pathname/filename "/DIR1/DIR3/DIR8/FILE9" may be cached, while the pathname/filename "/DIR1/DIR3/DIR8/FILE10" is not cached). Then, a cache hit would allow the entire pathname/filename to be fully resolved without the need to perform any searches of the disk (e.g., "/DIR1/DIR3/DIR8/FILE9" could be fully resolved without accessing the disk). However, the cost of namespace resolution would not be improved at all by a cache miss (e.g., "/DIR1/DIR3/DIR8/FILE10" would not be resolved by reference to the cache, and all four searches would still be required as if there were no cache).
Traditional caching techniques attempt to maximize performance by caching the most frequently used items (e.g., for namespace resolution, traditional caching techniques would try to cache the most frequently used directories).
To reduce the effect on system performance of the limit on the size of cache memory, traditional caching techniques use statistical heuristics to replace cached items used relatively infrequently with new items expected to be used more frequently. Typically, the statistical heuristic technique monitors the usage of the cache items to create a usage profile, and then replaces the least recently used cache items. For example, if a namespace in a file system could be divided a priori into n sub-trees, with each sub-tree having a given expected usage, the performance of the data processing system could be improved by using statistical heuristics to update a cache used for namespace resolution. However, statistical heuristics do not provide good results in situations where the expected usage of the cache items is difficult to predict. For example, usage of the namespace in a file system usually varies dramatically depending on the particular application of the data processing system. Thus, traditional caching techniques cannot be applied, or may yield only relatively minor improvements, when used for resolving pathnames in file systems. Therefore, it would be desirable to provide a caching method or apparatus which can be efficiently applied to namespace resolution in a hierarchical file system. It would also be desirable to provide a caching method or apparatus which is independent of statistical assumptions such as expected usage. It would also be desirable to provide a caching method or apparatus providing some level of improved performance for every memory access despite having a limited cache size.
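The least-recently-used replacement heuristic described above can be sketched with an ordered mapping. This is a hypothetical Python illustration, not part of any operating system: Python's OrderedDict stands in for the usage profile, and the pathname "/DIR1/DIR2/FILE5" and the "vnode-N" values are invented for the example:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-size cache that evicts the least recently used item when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()          # oldest use first, newest use last

    def get(self, key):
        if key not in self.items:
            return None                     # cache miss
        self.items.move_to_end(key)         # record the use in the profile
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item

cache = LRUCache(capacity=2)
cache.put("/DIR1/DIR3/DIR8/FILE9", "vnode-9")
cache.put("/DIR1/DIR3/DIR8/FILE10", "vnode-10")
cache.get("/DIR1/DIR3/DIR8/FILE9")          # FILE9 becomes most recently used
cache.put("/DIR1/DIR2/FILE5", "vnode-5")    # evicts FILE10, now least recently used
assert cache.get("/DIR1/DIR3/DIR8/FILE10") is None
assert cache.get("/DIR1/DIR3/DIR8/FILE9") == "vnode-9"
```

As the text notes, such a heuristic helps only to the extent that past usage predicts future usage; when namespace usage varies with the application, the usage profile offers little guidance.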