1. Field of the Invention
The present invention relates to techniques for performing caching in a file system.
2. Related Art
In a typical file system, approximately 50 percent of I/O operations access less than 5 percent of the data that is accessed. Hence, if the most frequently used 5 percent of the data is cached in memory, at least half of the I/O activity can be made as fast as memory accesses.
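The effect of such a hit ratio on average access time can be worked through with a short calculation. The following is a minimal sketch only; the function name and the latency values used with it are illustrative assumptions and do not come from any measurement.

```python
def mean_access_time(hit_ratio, t_memory, t_io):
    """Average time per access when a fraction hit_ratio is served from memory.

    hit_ratio: fraction of accesses found in the cache (0.0 to 1.0).
    t_memory:  assumed time for a memory access (any consistent unit).
    t_io:      assumed time for a storage I/O operation (same unit).
    """
    return hit_ratio * t_memory + (1.0 - hit_ratio) * t_io
```

For example, with a hit ratio of 0.5 and assumed costs of 1 unit for memory and 101 units for I/O, the average cost per access is 51 units, so the remaining uncached accesses dominate the total.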
Data caching eliminates I/O operations by caching the most-recently/most-frequently accessed data in memory and by facilitating prefetching for data that will likely be accessed in the near future. For example, pages which are allocated for database redo and rollback operations as well as log indexes and database tables will most likely be used many times. Similarly, pages that are allocated for executables or shared libraries are likely to be used frequently by many applications. In contrast, during a file copying operation, the allocated pages are most likely to be used only once regardless of whether a write operation or a read operation was performed on the page.
When caching data for a file system, it is often difficult to determine which pages should be cached because access patterns for applications can differ greatly. One solution to this problem is to use hidden Markov models and Markov chains to predict access patterns for files. Unfortunately, the complexity of these models makes them impractical to use.
Another solution is to examine how data is organized on the file system to predict how the data will be accessed by an application. Typically, data segments for files are well organized and fall into several categories:
(1) sequential data segments are stored sequentially from the beginning to the end. There are no gaps between sequential data segments into which data segments from other files are inserted. Examples of sequential data include digitized videos and voicemail;
(2) strided data segments are stored such that each segment is located a specified stride from the previous segment. The stride can be of fixed length or of variable length;
(3) mapped data segments are logically mapped using a table or a table of tables;
(4) linked data segments are logically linked to each other using references. For example, a previously-read data segment can be used to determine the offset of the next data segment; and
(5) random data segments can appear in any order in the file system. For example, while attempting to identify a prime number larger than any previously-determined prime number, the prime number candidates are divided by each of the previously-determined prime numbers. The order of the known prime numbers does not matter. Hence, the data segments for these prime numbers can be stored randomly on the file system.
Data is typically organized using one (or a combination) of the five techniques described above. As previously mentioned, the organization of data on the file system can indicate how the data is accessed by an application. Hence, if the data is arranged sequentially, then the access pattern will most likely be sequential. Similarly, if the data is strided, then the access pattern will most likely be strided. Furthermore, if the data is logically mapped or linked, then the access pattern will most likely be random. Unfortunately, the organization of the data for a file may not reflect the actual data access pattern.
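The distinction between sequential, strided, and random patterns can be illustrated by classifying a trace of segment offsets. The following is a minimal sketch under stated assumptions: the function name is hypothetical, offsets are expressed in units of one segment, and only fixed-length strides are recognized (variable-length strides, mapped segments, and linked segments would all be reported as random).

```python
def classify_offsets(offsets):
    """Label a trace of segment offsets as sequential, strided, or random.

    offsets: list of integer segment offsets in the order they were accessed.
    Returns "unknown" when the trace is too short to judge.
    """
    if len(offsets) < 3:
        return "unknown"
    # Distance between each access and the one before it.
    deltas = [b - a for a, b in zip(offsets, offsets[1:])]
    if all(d == 1 for d in deltas):
        return "sequential"
    # A single repeated non-zero distance is treated as a fixed stride.
    if len(set(deltas)) == 1 and deltas[0] != 0:
        return "strided"
    return "random"
```

Such a classifier describes only the trace that is observed; as noted above, the layout of a file on the file system may suggest a different pattern than the one an application actually produces.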
Operating systems, such as Solaris™, typically use free memory as a system-wide cache. (Note that Solaris is a trademark or registered trademark of Sun Microsystems, Inc. in the U.S. or other countries.) However, some applications cannot take full advantage of this type of cache because their data access patterns do not match the caching policy implemented by the kernel of the operating system (OS).
Solaris uses free pages in memory to cache file system data. This makes I/O operations as fast as memory accesses as long as the data that the application wants to access is located in the cache list (i.e., it can be found using page_lookup()). When caching a page to memory, Solaris applies one of the two policies described below to all applications system-wide. Furthermore, Solaris can only choose one policy at a time.
FIG. 2A presents a block diagram illustrating a first caching policy (which corresponds to the caching policy used by Solaris when the parameter freebehind is set to 1). When an application requests a page of a file from the file system, the OS attempts to allocate a page from free page list 202 to the page requested from the file system. If there are no free pages in free page list 202, then a page from the head of cache list 204 is allocated to the page. Note that a request to access the file system can either be a read operation or a write operation.
After the application releases a page, if the page is involved in a write operation, the OS puts the page involved at the tail of cache list 204. In contrast, if the page is involved in a read operation, the OS puts the page involved at the head of cache list 204. Note that a page can also be released from cache list 204 when the page is copied to the application address space 206.
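The allocation and release rules of this first caching policy can be modeled with two queues. The following is a simplified sketch and not Solaris kernel code; the class name, method names, and the use of integer frame numbers are assumptions made for illustration.

```python
from collections import deque


class FreeBehindCache:
    """Toy model of the first caching policy (free page list plus cache list)."""

    def __init__(self, free_frames):
        # Frames that currently hold no file data.
        self.free_list = deque(range(free_frames))
        # Cached pages as (frame, page_id); left end is the head, right end the tail.
        self.cache_list = deque()

    def allocate(self):
        """Return a frame for a newly requested page."""
        if self.free_list:
            return self.free_list.popleft()
        if self.cache_list:
            # No free pages: reclaim the page at the head of the cache list.
            frame, _evicted_page = self.cache_list.popleft()
            return frame
        raise MemoryError("no free or cached frames available")

    def release(self, frame, page_id, was_write):
        """Place a released page on the cache list according to how it was used."""
        entry = (frame, page_id)
        if was_write:
            self.cache_list.append(entry)      # written pages go to the tail
        else:
            self.cache_list.appendleft(entry)  # read pages go to the head
```

In this model, a page released after a read sits at the head of the cache list and is therefore the first page reclaimed once the free page list is empty, whereas a page released after a write survives longer at the tail.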
FIG. 2B presents a block diagram illustrating another caching policy. (Note that this caching policy corresponds to the caching policy in Solaris when the parameter freebehind is set to 0.) In this caching policy, all pages are retained regardless of whether they are being read from or being written to. The pages stay in the address space of the application until they are stolen for other purposes (e.g., when the kernel is out of memory). For example, after a page is allocated from free page list 202 or cache list 204 into application address space 208, a second application can “steal” pages used by the first application. These stolen pages are then allocated to another application address space 210.
When freebehind is set to 1, the life cycle of a page used in a read operation is very short because free pages are allocated from the head of the cache list, which is where pages released after a read operation are placed. Note that applications typically experience different performance results depending on whether freebehind is set to 1 or is set to 0. In order to cache as much data as possible, freebehind can be set to 0, but this can cause memory to be used up quickly.
FIG. 3 presents a block diagram illustrating a caching architecture which includes applications 302-306 and vnodes 308-312, wherein a vnode is a reference to a file on the storage device. When an application accesses a page of a file, cache list 315 is first checked to determine whether the page exists in cache list 315. If so, the page is copied to the address space of the application and the page is released from cache list 315. If the application performs a read operation on the page, then the released page is placed at the head of cache list 315. If the application performs a write operation on the page, then the released page is placed at the tail of cache list 315.
If the page requested by the application is not in cache list 315 or in the address space for the application, device driver 316 sends an I/O request to the storage device to retrieve the page from the storage device. After the page is received from the storage device, if there are free pages in the free page list, a page from the free page list is allocated to the retrieved page. Otherwise, a page from the head of cache list 315 is allocated to the retrieved page.
When the system is about to run out of physical memory, a paging daemon uses a “two-handed scanner” to scan all the physical pages, and to “steal” pages that have not been accessed/modified recently. A two-handed scanner is analogous to a two-handed clock, wherein the physical memory is represented by the face of the clock and the two hands of the scanner are set a specified distance apart in time. The leading hand clears the “reference” flag for a page. The trailing hand checks the page at a later time to determine whether the page has been referenced after the leading hand cleared the reference flag. If not, the page becomes a candidate to be freed. Hence, a page is paged out only if the page has not been referenced recently. Unfortunately, the page scanner has no information about actual access patterns. The only information it can use is the reference flag for the page, regardless of how these pages were used in the past.
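The behavior of the two hands can be illustrated with a small simulation. The following is a minimal sketch, not the paging daemon's actual code: the function name, the `touches` parameter, and the use of a single linear sweep (rather than a continuous circular scan) are assumptions made for illustration.

```python
def two_handed_scan(referenced, hand_gap, touches=None):
    """Simulate one sweep of a two-handed scanner over a list of pages.

    referenced: list of bools, one reference flag per page (modified in place).
    hand_gap:   number of pages separating the leading and trailing hands.
    touches:    hypothetical dict mapping a step number to the page indices
                that applications reference during that step.
    Returns the indices of pages that become candidates to be freed.
    """
    n = len(referenced)
    touches = touches or {}
    candidates = []
    for step in range(n + hand_gap):
        # Application activity during this step sets reference flags.
        for page in touches.get(step, []):
            referenced[page] = True
        # Leading hand clears the reference flag of the page it passes.
        if step < n:
            referenced[step] = False
        # Trailing hand checks the page the leading hand passed hand_gap steps ago.
        trail = step - hand_gap
        if 0 <= trail < n and not referenced[trail]:
            candidates.append(trail)
    return candidates
```

In this sketch, a page escapes being freed only if it is referenced in the window between the two hands; nothing about how often or in what pattern the page was used earlier is taken into account.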
The two-handed scanner can scan the physical page list quickly. However, it can also invalidate pages which may be reused shortly thereafter. For example, when the system is under high memory pressure, kernel-level activity can be delayed. As a result, page references and page modifications may not occur during this period. Instead, page references may be delayed until after the high-memory activity period. Hence, the two-handed scanner can incorrectly free pages which may be used after the high-memory activity period.
Moreover, Solaris (like other UNIX file systems) assumes that the subsequent read pattern is sequential if two consecutive data blocks read from the storage device are adjacent to each other (i.e., sequential). If this pattern changes, the OS only reads one page at a time and does not prefetch pages. There are several situations that can break this pattern. For example, a file may be distributed across several threads in a process, wherein each thread of the process appears to be accessing the file locally in a strided manner. However, the interleaved access stream is globally sequential. In another example, a file is shared by multiple applications, so the access pattern may be difficult to determine.
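The multi-threaded case can be made concrete with a small example. The following is an illustrative sketch only: the function name and the block numbers are hypothetical, the adjacency test is a simplification of the heuristic described above, and the two threads are assumed to issue their reads in strict alternation.

```python
def sequential_pairs(blocks):
    """Count consecutive reads whose block numbers are adjacent."""
    return sum(1 for a, b in zip(blocks, blocks[1:]) if b == a + 1)


# Two threads of one process, each reading every second block of the same file.
thread_a = [0, 2, 4, 6]
thread_b = [1, 3, 5, 7]

# The stream of reads when the two threads strictly alternate.
interleaved = [blk for pair in zip(thread_a, thread_b) for blk in pair]
```

Viewed per thread, no two consecutive reads are adjacent, so an adjacency test finds nothing to prefetch; viewed as the interleaved stream, every consecutive pair is adjacent. Which view a detector sees, and whether real arrivals interleave this cleanly, depends on the system.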
Hence, what is needed is a method and an apparatus for performing caching in a file system without the problems described above.