1. Field of the Invention
The present invention is generally related to high-performance computer filesystem designs used in conjunction with contemporary operating systems and, in particular, to a multi-tasking computer system employing a log device to support a log structured filesystem paradigm over an independent filesystem and the operation of the log device to dynamically balance filesystem I/O transactions.
2. Description of the Related Art
In the operation of conventional computer systems, the overall performance of the system is often constrained by the practically achievable throughput rates of secondary mass storage units, typically implemented variously utilizing single disk drives and cooperatively organized arrays of disk drives. As the peak performance of central processing units has dramatically increased, performance constraints have significantly increased due to the relatively lesser advances in performance achievable by secondary mass storage units. Factors that affect the performance of disk drive type devices include, in particular, the inherent mechanical operation and geometric relations imposed by the fundamental mechanical construction and operation of conventional disk drives. The essentially sequential operating nature of disk drives and the extremely disparate rates of data read/write, actuator seek rates and rotational latencies result in the performance of secondary mass storage devices being highly dependent on the layout and logical organization of data on the physical data storage surfaces of disk drives.
Due to inherent asymmetries in performing read and write disk drive data storage operations, particularly to ensure that physical storage space is correctly allocated and subsequently referenced, a substantial tension exists between optimization of the data layout for data reads and writes. Typically, available physical storage space must be determined from file allocation tables, written to, and then cataloged in directory entries to perform even basic data writes. A sequence of physical data and directory reads is all that is typically required for data reads.
Another factor that can significantly influence the optimum layout of data is the nature of the software applications executed by the central processing unit at any given time. Different optimizations can be effectively employed depending on whether there is a preponderance of data reads as compared to data writes, whether large or small data block transfers are being performed, and whether physical disk accesses are highly random or substantially sequential. However, the mix of concurrently executing applications in most computer systems is difficult if not practically impossible to manage purely to enforce disk drive operation optimizations. Conventionally, the various trade-offs between different optimizations are statically established when defining the basic parameters of a filesystem layout. Although some filesystem parameters may be changeable without re-installing the filesystem, fundamental filesystem control parameters are not changeable, and certainly not dynamically tunable during active filesystem operation.
An early effort to improve the performance of secondary mass storage devices involved providing a buffer cache within the primary memory of the computer system. Conventional buffer caches are logically established in the file read/write data stream. Repeated file accesses and random accesses of a particular file close in time establish initial image copies of the file contents within the buffer cache. The subsequent references, either for reading or writing, are executed directly against the buffer cache with any file write accesses to the mass storage subsystem delayed subject to a periodic flushing of write data from the buffer cache to secondary mass storage. The buffer cache thus enables many file read and write operations to complete at the speed of main memory accesses while tending to average down the peak access frequency of the physical secondary mass storage devices.
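By way of illustration only, the delayed-write behavior of such a buffer cache can be sketched as follows. The class name `WriteBackCache`, its methods, and the use of a dictionary to stand in for secondary mass storage are all assumptions made for this sketch; they do not describe any particular operating system implementation.

```python
class WriteBackCache:
    """Toy write-back buffer cache in front of a slow backing store.

    All names here are hypothetical; a dict stands in for the disk.
    """

    def __init__(self, backing):
        self.backing = backing   # block number -> data on "disk"
        self.blocks = {}         # block number -> cached copy in main memory
        self.dirty = set()       # blocks written in cache but not yet flushed
        self.disk_reads = 0
        self.disk_writes = 0

    def read(self, block_no):
        # Only the first reference to a block touches the backing store.
        if block_no not in self.blocks:
            self.disk_reads += 1
            self.blocks[block_no] = self.backing.get(block_no, b"")
        return self.blocks[block_no]

    def write(self, block_no, data):
        # Writes complete at memory speed; the disk write is deferred.
        self.blocks[block_no] = data
        self.dirty.add(block_no)

    def flush(self):
        # Periodic flush: each dirty block is written once, however
        # many times it was overwritten in the cache.
        for block_no in sorted(self.dirty):
            self.backing[block_no] = self.blocks[block_no]
            self.disk_writes += 1
        self.dirty.clear()


disk = {7: b"old"}
cache = WriteBackCache(disk)
cache.read(7)                # one physical read establishes the cached image
cache.write(7, b"new-1")     # absorbed by the cache
cache.write(7, b"new-2")     # absorbed by the cache
unflushed = disk[7]          # disk still holds the stale data at this point
cache.flush()                # a single physical write carries the final data
```

Until `flush()` runs, the modified block exists only in volatile memory, which is the exposure addressed in the discussion of data integrity requirements.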
A significant drawback of merely using a buffer cache to improve overall system performance arises in circumstances where data integrity considerations require write data to be written to a non-volatile store before the write access can be deemed complete. In many networked computer system applications, particularly where connectionless communication protocols are utilized for file data transport over the network, the requirement that file write data accesses be completed to non-volatile store is a fundamental requirement of the network protocol itself. Thus, conventionally, the file access latencies incurred in writing data to secondary mass storage devices are a component of and compounded by the latencies associated with data transport over both local and wide area networks.
One approach to minimizing the performance impact of non-volatile storage write requirements has been to establish at least a portion of the buffer cache utilizing non-volatile RAM memory devices. Write data transferred to the buffer cache intended for storage by the secondary mass storage device is preferentially stored in the non-volatile RAM portion of the buffer cache. Once so stored, the file write data request can then be immediately confirmed as succeeding in writing the file data to a non-volatile store.
There are a number of rather significant complexities in utilizing non-volatile RAM buffer caches. The write and read file data streams are typically separated so as to optimize the use of the non-volatile RAM memory for storing write data only. Also, substantial complexities exist under failure conditions where write file data in the non-volatile RAM cache must be cleared to secondary mass storage without reliance on any other information or data beyond what has been preserved in the non-volatile RAM. Even with these complexities, which all must be comprehensively and concurrently handled, the use of a non-volatile RAM store does succeed in again reducing file write access latency to essentially that of non-volatile RAM access speeds.
One particular and practical drawback to the use of non-volatile RAM caches is the rather substantial increased cost and necessarily concomitant limited size of the non-volatile write cache. The establishment of a non-volatile RAM cache either through the use of flash memory chips or conventional static RAM memory subsystems supported with an uninterruptible power supply is relatively expensive as compared to the cost of ordinary dynamic RAM memory. Furthermore, the additional power requirements and physical size of a non-volatile RAM memory unit may present somewhat less significant but nonetheless practical constraints on the total size of the file write non-volatile RAM cache. Consequently, circumstances may exist where the non-volatile write cache, due to its limited size, saturates with file write data requests, resulting in degraded response times that are potentially even slower than simply writing file data directly to the secondary mass storage devices.
In order to alleviate some of the limitations of non-volatile RAM caches, disk caches have been proposed. The use of a disk drive for the non-volatile storage of write cached data is significantly more cost effective and capable of supporting substantially larger cache memory sizes than can be realized through the use of non-volatile RAM memories. Although by definition a non-volatile store and capable of being scaled to rather large capacities, disk caches again have file write access times that are several if not many orders of magnitude slower than conventional main memory accesses. Consequently, disk caches are selectively constructed utilizing exceedingly high performance disk drives, resulting in a typically modest improvement in file write access times, but again at significantly increased cost.
In addition to the practical issues associated with using a disk drive as a cache memory, logical data management problems are also encountered. Preferably, the file read data stream is logically routed around the disk cache and supported exclusively through the operation of the RAM buffer cache. File write data bypasses the main memory buffer cache and is written exclusively to the disk cache. Particularly in multiuser and networked computer system environments, multiple independent read and write file accesses may be directed against a single file within a rather small time frame. Since the requests are ultimately associated with potentially independent processes and applications, the computer operating system or at least the subsystem managing the buffer and disk caches must provide a mechanism for preserving data integrity. Write data requests must be continually resolved against prior writes of the same block of file data as stored by the disk cache. Each read of a file data block must be evaluated against all of the data blocks held by the write disk cache. While many different bypass mechanisms and data integrity management algorithms have been developed, the fundamental limitation of a disk cache remains. Repeated accesses to the disk cache are required not only in the ordinary transfer of write file data to the cache but also in management of the cache structure and in continually maintaining the current integrity of the write file data stream. Consequently, the potential performance improvements achievable by a disk cache are further limited in practice.
Significant work has been done in developing new and modified filesystems that tend to permit the optimal use of the available mass storage subsystem read and write data bandwidth. In connection with many conventional filesystems, a substantial portion of the available data access bandwidth of a disk drive based mass storage subsystem is consumed in seeking operations between data directories and potentially fragmented parts of data files. The actual drive bandwidth available for writing new data to the mass storage subsystem can be as low as five to ten percent of the total drive bandwidth. Early approaches to improving write data efficiency include pre-ordering or reordering of seek and write operations to reduce the effective seek length necessary to write a current portion of write stream data. Further optimizations actually encourage the writing of data anywhere on the disk drive recording surfaces consistent with the current position of the write head and the availability of write data space. Directory entries are then scheduled for later update consistent with the minimum seek algorithms of earlier filesystems.
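As a rough illustration of the reordering approach, the following sketch compares the total head travel for a queue of pending requests serviced in arrival order against the same queue serviced in a single elevator-style sweep. The elevator ordering is a technique chosen for this example rather than one named above, and the function names, cylinder numbers, and starting head position are hypothetical.

```python
def total_seek_distance(start, cylinders):
    """Sum of head movements, in cylinders, when servicing in the given order."""
    distance = 0
    position = start
    for cyl in cylinders:
        distance += abs(cyl - position)
        position = cyl
    return distance


def elevator_order(start, cylinders):
    """Service everything at or above the head moving up, then the rest moving down."""
    upward = sorted(c for c in cylinders if c >= start)
    downward = sorted((c for c in cylinders if c < start), reverse=True)
    return upward + downward


# Hypothetical queue of pending write requests, identified by cylinder.
pending = [98, 183, 37, 122, 14, 124, 65, 67]
head = 53

fifo = total_seek_distance(head, pending)
swept = total_seek_distance(head, elevator_order(head, pending))
```

For this particular queue the reordered sweep travels 299 cylinders against 640 in arrival order. The seek overhead is reduced but not eliminated, consistent with the observation that seeking still consumes a substantial portion of drive bandwidth in such filesystems.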
In all of these optimized conventional filesystems, a substantial portion of the disk drive bandwidth is still consumed with seeking operations. Hybrid filesystems have been proposed to further improve bandwidth utilization. These hybrid filesystems typically include a sequential log created as an integral part of the filesystem. The log file is sequentially appended to with all writes of file data and directory information. In writing data to the log structure, only an initial seek is required, if at all, before a continuous sequential transfer of data can be made. Data bandwidth, at least during log write operations, is greatly improved. Whenever the log fills or excess data bandwidth becomes available, the log contents are parsed and transferred to the filesystem proper.
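The append-then-transfer behavior of such a hybrid log can be sketched, under simplifying assumptions, as follows. The class `HybridLog`, its record layout, and the capacity trigger are hypothetical and purely in memory; an actual implementation would operate on disk blocks and directory structures.

```python
class HybridLog:
    """Toy hybrid log: sequential appends, later drained to the filesystem proper.

    Names and record layout are assumptions made for illustration.
    """

    def __init__(self, capacity):
        self.capacity = capacity   # number of records the log can hold
        self.records = []          # sequential log of (path, offset, data)
        self.files = {}            # "filesystem proper": path -> bytearray

    def write(self, path, offset, data):
        # When the log is full, its contents are transferred first.
        if len(self.records) >= self.capacity:
            self.drain()
        # The write itself is a pure append; no seek into the file layout.
        self.records.append((path, offset, bytes(data)))

    def drain(self):
        # Parse the log in order and apply each record to its final location.
        for path, offset, data in self.records:
            buf = self.files.setdefault(path, bytearray())
            end = offset + len(data)
            if len(buf) < end:
                buf.extend(b"\0" * (end - len(buf)))
            buf[offset:end] = data
        self.records.clear()


log = HybridLog(capacity=2)
log.write("a", 0, b"hello")
log.write("a", 3, b"LO!")    # partial overwrite, still only appended
log.write("b", 0, b"x")      # log is full, so the first two records drain
```

After the third call, file "a" in the filesystem proper reflects both earlier writes while the write to "b" remains only in the log. Every record is thus written twice, once to the log and once to its final location.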
A logging filesystem does not reduce, but likely increases, the total number of file seeks that must be performed by the disk drive subsystem. The log itself must be managed to invalidate logically overwritten data blocks and to merge together the product of partial overwrites. In addition, file data reads must continually evaluate the log itself to determine whether more current data resides in the log or the main portion of the filesystem. Consequently, while atomic or block file data writes may be performed quickly with a minimum of seeking, there may actually be a decrease in the overall data transfer bandwidth available from the disk drive due to the new cleaning and increased maintenance operations inherently required to support logging. For these reasons, hybrid logging filesystems, like the earlier disk caches, have not been generally accepted as a cost effective way of improving the overall performance of mass storage subsystems.
A relatively new filesystem architecture, often referred to as a log structured filesystem, has been proposed and generally implemented as a mechanism for significantly improving the effective data bandwidth available from a disk drive based mass storage subsystem. A log structured filesystem provides for permanent recording of write file data in an effectively continuous sequential log. Since data is intentionally written as received continually to the end of the active log, the effective write data bandwidth rises to approximately that of the data bandwidth of the disk drive mass storage subsystem. All seek operations are minimized as file data is written to the end of the active log. Read data, as well as cleaning and data block maintenance operations, are the main source of seek operations.
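A minimal sketch of this arrangement, assuming an in-memory list as the log and a dictionary as the block index, is given below. The class `LogStructuredStore` and its methods are hypothetical names introduced for illustration and do not correspond to any specific log structured filesystem.

```python
class LogStructuredStore:
    """Toy log structured store: the log is the permanent home of all data."""

    def __init__(self):
        self.log = []     # append-only sequence of (key, data) records
        self.index = {}   # key -> position of the live record in the log

    def write(self, key, data):
        # Every write, including an overwrite, lands at the end of the log.
        self.index[key] = len(self.log)
        self.log.append((key, data))

    def read(self, key):
        # Reads follow the index to wherever the live record happens to sit.
        return self.log[self.index[key]][1]

    def live_fraction(self):
        # Share of log records that are still current.
        return len(self.index) / len(self.log) if self.log else 1.0

    def clean(self):
        # Cleaning copies live records forward and drops superseded ones.
        live = [self.log[pos] for pos in sorted(self.index.values())]
        self.log = live
        self.index = {key: pos for pos, (key, _) in enumerate(live)}


store = LogStructuredStore()
store.write(("f", 0), b"v1")
store.write(("f", 1), b"w1")
store.write(("f", 0), b"v2")           # supersedes the first record
before_cleaning = store.live_fraction()
store.clean()
after_cleaning = store.live_fraction()
```

Writes never seek, whereas reads and the cleaning pass must visit arbitrary positions in the log, which mirrors the division of seek activity described above.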
Log structured filesystems are generally viewed as particularly cost effective in being able to use the entire drive storage space provided by the mass storage subsystem for the log structure and obviating any possible benefit to using an external disk cache. Unlike the hybrid log filesystems, the log structured filesystem is itself the ultimate destination for all write file data. Since the effective write performance of the resulting log structured filesystem is quite high, there is no benefit for pre-caching write data on one disk drive and then copying the data to another disk drive.
The general acceptance of log structured filesystems for certain, typically write intensive, computer applications reflects the significant improvement available through the use of a direct sequential write log structure filesystem. The available write data bandwidth, even in the presence of continuing log cleaning and maintenance operations, can be near or above 70 percent of the total drive bandwidth. In addition, log structured filesystems provide a number of ancillary benefits involving the reduced latency of atomic file data write operations and improved data integrity verification following computer system crashes. Particularly in network support related operations, the direct writing of write file data to a log structured filesystem, including directory related information, as an essentially atomic operation minimizes the total latency seen in completing committed network data write transfer operations. Similarly, by virtue of all write file data operations being focused at the end of the active log, as opposed to being scattered throughout the disk drive storage space, data verification operations need only be focused on evaluating just the end of the log rather than the entire directory and data structures of a conventional filesystem. Consequently, both network data write operations and the integrity of all data written are improved by writing directly to a permanent log structured filesystem.
Log structured filesystems are, however, not entirely effective in all computing environments. For example, log structured filesystems show little improvement over conventional filesystems where the computing environment is subject to a large percentage of fragmentary data writes and sequential data reads such as may occur frequently in transactional data base applications. The write data optimizations provided by log structured filesystems can also be rather inefficient in a variety of other circumstances, such as where random and small data block read accesses are dominant. Indeed, as computer systems continue to grow in power and are required to support more and different application environments concurrently with respect to a common mass storage subsystem, the tension between applications for optimal use of the disk drive bandwidth provided by the mass storage system will only continue to increase.
Therefore, a substantial need now exists for a new filesystem architecture that is optimized, including during ongoing operation, for both read and write accesses concurrent with processes for ensuring data integrity and fast crash recovery, and that addresses the many practical issues involved in providing and managing a high performance filesystem.