A storage system is a computer that provides storage services relating to the organization of information on writeable persistent storage devices, such as non-volatile memories, tapes or disks. The storage system typically includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of data containers, such as volumes and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A volume, on the other hand, may comprise a collection of disks cooperating to define a logical arrangement of volume block number space for organizing the disk blocks.
The storage system may be further configured to operate according to a client/server model of information delivery to thereby allow many clients to access files stored on the system, e.g., a file server or filer. In this model, the client may comprise an application, such as a database application, executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. Alternatively, the client may comprise one or more applications, such as database applications, executing directly on the storage system. Each client may request the services of the file system by issuing file system protocol messages to the storage operating system.
A common type of file system is a “write in-place” file system, an example of which is the conventional Berkeley fast file system. In a write in-place file system, the locations of the data structures on disk are typically fixed. That is, the disk is “viewed” as a large sequential array of blocks and changes (updates) to the data of a file stored in the blocks are made in-place, i.e., data is overwritten at the same disk locations. The write in-place file system may assume a layout such that the data is substantially contiguously arranged on disks. This disk layout results in efficient access operations, particularly for sequential read operations, directed to the disks. Updating of data in-place thus maintains efficient read access to the data, but often at the expense of write performance.
Note that any storage interconnected via, e.g., a storage area network by protocols such as Fibre Channel or direct attached storage interconnected by protocols such as IDE can be viewed as a “write in-place” file system with a very simple mapping of client logical data blocks to blocks in storage.
Another type of file system is a log-structured file system that does not overwrite data on disks. If a data block on disk is retrieved (read) from disk into memory of the storage system and “dirtied” or changed (updated) with new data provided by, e.g., an application, the data block is stored (written) to a new sequential location on disk to optimize write performance. Updates to the data of a file (hereinafter “updates”) may thus result in random relocation of the blocks on disks. Over time and after many updates, the blocks of the file may become randomly scattered over the disks such that the file can become fragmented. This, in turn, causes sequential access operations, such as sequential read operations, of the file to randomly access the disks. Random access operations to a fragmented file are generally much slower than sequential access operations, thereby adversely impacting the overall performance of those operations. Note that the invention described herein is not limited to file systems or even to updates of files, but may apply to other types of storage systems and updates, such as updates to blocks.
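The relocation behavior described above can be illustrated with a minimal sketch. It assumes a toy model in which the disk is a single sequence of numbered locations; the class name LogStructuredDisk, its methods and the helper is_contiguous are hypothetical and are not part of any particular file system.

```python
class LogStructuredDisk:
    """Toy model: every write of a file block goes to the next free disk location."""

    def __init__(self):
        self.next_free = 0    # next sequential disk location to be written
        self.block_map = {}   # (file_name, file_block_index) -> disk location

    def write(self, file_name, block_index):
        # Data is never overwritten in place; the block moves to a new location.
        self.block_map[(file_name, block_index)] = self.next_free
        self.next_free += 1

    def layout(self, file_name, num_blocks):
        # Disk locations of the file's blocks, in file order.
        return [self.block_map[(file_name, i)] for i in range(num_blocks)]


def is_contiguous(locations):
    """True if reading the file in order touches consecutive disk locations."""
    return all(b - a == 1 for a, b in zip(locations, locations[1:]))


disk = LogStructuredDisk()
for i in range(4):
    disk.write("f", i)                 # initial sequential write of a 4-block file
initial_layout = disk.layout("f", 4)

disk.write("f", 1)                     # update block 1: relocated to location 4
disk.write("f", 0)                     # update block 0: relocated to location 5
updated_layout = disk.layout("f", 4)
```

In this sketch the file starts out contiguous, and two updates are enough to make a sequential read of the file visit the disk locations out of order.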
An improvement to the log structured file system involves grouping of the updates into “batches”. Batching of updates improves the efficiency of access operations by sequentially writing long streams of updates, while also minimizing fragmentation. Batch updates may be implemented in a manner similar to page memory wherein, instead of allocating regions of memory address space, regions of disk storage space are allocated by the file system. Each region or “page” may comprise a predetermined amount of storage space. Thus, every time data is written to disk, the file system writes a “batch” of data, preferably sequentially, to an allocated page of disk space, and searches for another unallocated (“free”) page to which to write the next batch of data.
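As a rough illustration of batching, the following sketch accumulates updates in memory and writes them as one batch to the first free page. The page size, the class name PagedLog and its methods are assumptions made for the example only.

```python
PAGE_SIZE = 4  # hypothetical number of updates that fit in one page


class PagedLog:
    """Toy model: updates are buffered and written one full page at a time."""

    def __init__(self, num_pages):
        self.pages = [None] * num_pages   # None marks an unallocated ("free") page
        self.pending = []                 # updates waiting to form a batch

    def update(self, data):
        """Buffer one update; return the page written to if a batch was flushed."""
        self.pending.append(data)
        if len(self.pending) == PAGE_SIZE:
            return self.flush()
        return None

    def flush(self):
        """Write the pending batch to the first free page and return its index."""
        if not self.pending:
            return None
        page = self.pages.index(None)     # raises ValueError if no page is free
        self.pages[page] = list(self.pending)
        self.pending.clear()
        return page


log = PagedLog(num_pages=3)
results = [log.update(i) for i in range(8)]
```

Here eight updates result in only two page writes, each of which stores its batch sequentially within a single page.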
Often, the application (e.g., a database application) executing on the storage system has a requirement to store data temporarily, but in a persistent manner. In database terminology, a transaction is an arrangement to store changes to data or updates in a database (e.g., on a persistent storage subsystem) atomically, i.e., either all related updates issued by the application are stored or none are stored. When performing operations on behalf of the database application, the storage system typically executes a sequence of predefined tasks that transitions the system and its storage subsystem from one consistent state to another. This sequence is called a “consistent transaction”.
Because of the need to maintain transaction consistency, updates in such a storage system are not stored on the storage subsystem immediately, but are rather stored in a temporary, yet persistent, storage space (such as non-volatile memory or disk) of the system organized as a “log”. As used herein, the log is a record of updates used for backup and recovery of data served by a storage system, particularly in the presence of a failure in the system. Once the updates are stored in the log, where they can be recovered in light of the failure, they are moved to the persistent storage subsystem in a consistent manner. The temporary storage space of the log is persistent because the system complies with the consistency requirement that dictates that, for a transaction comprising a set of updates, either all or none of the updates are committed to the persistent storage subsystem.
For example, assume a transaction consists of a plurality of updates, each of which is processed independently by the storage system. Moreover, assume that a failure occurs after some of these updates (but not all) are committed to the persistent storage subsystem. As a result, the storage subsystem is left in an inconsistent state. Accordingly, the application initially records all updates to the log and, only when all updates are stored therein, transfers them to the storage subsystem. If a failure occurs during transfer to the persistent storage subsystem, the data can be recovered from the temporary persistent storage of the log to enable “rolling” backward or forward. That is, once all portions (i.e., individual updates of data) of the entire transaction are stored in the log, the application attempts to write the updated data to the persistent storage subsystem and, if a failure occurs, the data can be recovered from the log.
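The sequence described above (record all updates to the log, then transfer them, and recover from the log on failure) can be sketched as follows. This is a simplified model under stated assumptions: the log and the storage subsystem are in-memory dictionaries, the failure is simulated with the hypothetical fail_after parameter, and the class TransactionLog is illustrative only.

```python
class TransactionLog:
    """Toy model of logging a transaction before applying it to storage."""

    def __init__(self):
        self.log = {}       # txn_id -> {"updates": [(key, value)], "complete": bool}
        self.storage = {}   # stands in for the persistent storage subsystem

    def record(self, txn_id, updates, complete=True):
        # Record every update; mark the transaction complete only at the end.
        # complete=False simulates a failure before all updates were logged.
        self.log[txn_id] = {"updates": list(updates), "complete": complete}

    def apply(self, txn_id, fail_after=None):
        entry = self.log[txn_id]
        if not entry["complete"]:
            raise RuntimeError("transaction is not fully logged")
        for n, (key, value) in enumerate(entry["updates"]):
            if fail_after is not None and n == fail_after:
                raise OSError("simulated failure during transfer")
            self.storage[key] = value
        del self.log[txn_id]   # release the log entries once fully applied

    def recover(self):
        for txn_id in list(self.log):
            if self.log[txn_id]["complete"]:
                self.apply(txn_id)     # roll forward from the log
            else:
                del self.log[txn_id]   # roll back: nothing was applied


t = TransactionLog()
t.record("t1", [("a", 1), ("b", 2)])
t.record("t2", [("c", 3)], complete=False)
try:
    t.apply("t1", fail_after=1)        # failure after the first update
except OSError:
    pass
state_after_failure = dict(t.storage)  # only part of "t1" reached storage
t.recover()
```

After the simulated failure, storage holds only part of the transaction; recovery replays the fully logged transaction and discards the partially logged one, so the all-or-none property is restored.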
Note that the data in the log is self-contained since it must be usable following any failure. Therefore, properties of the log are provided to retrieve the ordering of the updates to the log for a given transaction and to determine whether some updates that were supposed to be applied to the log are missing. These properties ensure the integrity of the data in the log.
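One possible way to provide such properties, offered here only as an assumption, is to tag each log entry with its transaction, a sequence number within that transaction, and the total number of updates expected. The field names txn, seq, total and data and the function check_transaction are hypothetical.

```python
def check_transaction(entries, txn_id):
    """Return (data in sequence order, missing sequence numbers) for one transaction."""
    mine = [e for e in entries if e["txn"] == txn_id]
    if not mine:
        return [], []
    total = mine[0]["total"]                       # updates expected for this transaction
    by_seq = {e["seq"]: e["data"] for e in mine}
    missing = [s for s in range(total) if s not in by_seq]
    ordered = [by_seq[s] for s in sorted(by_seq)]  # restore the original ordering
    return ordered, missing


# Entries appear in the log in arrival order, interleaved across transactions.
entries = [
    {"txn": "t1", "seq": 2, "total": 3, "data": "c"},
    {"txn": "t2", "seq": 0, "total": 1, "data": "x"},
    {"txn": "t1", "seq": 0, "total": 3, "data": "a"},
]
t1_result = check_transaction(entries, "t1")
t2_result = check_transaction(entries, "t2")
```

With this tagging, the ordering of a transaction's updates can be reconstructed from the log alone, and a gap in the sequence numbers shows that an update never reached the log.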
A problem associated with the log adapted for use with a storage system involves emptying (i.e., releasing of the contents) of the log. The log typically comprises a plurality of entries, each of which stores an update. The updates are initially stored (recorded) on the log starting at a first entry and proceeding to a last entry, at which point the log “wraps around” to the first entry. As long as the entries have been released before they are needed again, new updates may be recorded to those released entries. The updates are typically recorded in entries of the log in the order in which they are received from one or more sources, e.g., one or more applications. However, the updates may not be released from the log in exactly that same order. This is typically the case if multiple updates to multiple different transactions were made at the same time. While they are stored to the log sequentially, the updates may be released at different times since the different transactions may be completed at different times. As a result, the log may become fragmented. Therefore, even if a large number of entries have been released such that the log is only partially full when wrapping around to the first entry, fragmentation of the log may inhibit the efficient recording of updates. The present invention is directed, in part, to solving this fragmentation problem.
An even more significant issue with fragmentation is the fact that once the log becomes full it “wraps around” to the first entry and continues adding entries from there. Once the log is fragmented, it is expected that sooner or later the log will reach an entry that cannot be written (because it has valid data). Therefore, while there may be a substantial amount of “free” entries, the log may not be able to utilize them. It should be noted that batching does not alleviate the fragmentation problem since updates are added to the batch based on temporal locality (when they were performed) but are released based on when a transaction is committed or aborted (which may differ from temporal locality).
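The wrap-around problem can be reproduced with a small sketch of a circular log in which entries are released per transaction rather than in recording order. The class CircularLog and its methods are hypothetical and model only the behavior described above, not any solution to it.

```python
class CircularLog:
    """Toy circular log: entries are recorded sequentially and released per transaction."""

    def __init__(self, size):
        self.entries = [None] * size   # None marks a released ("free") entry
        self.head = 0                  # next entry to record into

    def append(self, txn_id, data):
        """Record an update at the head; return False if that entry still holds valid data."""
        if self.entries[self.head] is not None:
            return False
        self.entries[self.head] = (txn_id, data)
        self.head = (self.head + 1) % len(self.entries)   # wrap around at the end
        return True

    def release(self, txn_id):
        """Release every entry of a committed or aborted transaction."""
        self.entries = [
            None if e is not None and e[0] == txn_id else e
            for e in self.entries
        ]

    def free_count(self):
        return self.entries.count(None)


log = CircularLog(size=6)
for txn in ["A", "B", "B", "B", "B", "B"]:
    log.append(txn, "update")          # the log is now full and the head has wrapped to entry 0

log.release("B")                       # transaction B completes first
free_entries = log.free_count()
accepted = log.append("C", "update")   # blocked by the still-valid entry of transaction A
```

In this sketch five of the six entries are free, yet the next update cannot be recorded because the entry at the wrap-around position still belongs to an unfinished transaction.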