This invention relates in general to data storage systems and, more particularly, to transaction log management for disk array storage systems and techniques for recovering transaction logs.
Computer systems are constantly improving in terms of speed, reliability, and processing capability. As a result, computers are able to handle more complex and sophisticated applications. As computers improve, performance demands placed on mass storage and input/output (I/O) devices increase. Thus, there is a continuing need to design mass storage systems that keep pace in terms of performance with evolving computer systems.
This invention particularly concerns mass storage systems of the disk array type. Disk array data storage systems have multiple storage disk drive devices, which are arranged and coordinated to form a single mass storage system. There are three primary design criteria for mass storage systems: cost, performance, and availability. It is most desirable to produce memory devices that have a low cost per megabyte, a high input/output performance, and high data availability. “Availability” is the ability to access data stored in the storage system and the ability to ensure continued operation in the event of some failure. Typically, data availability is provided through the use of redundancy wherein data, or relationships among data, are stored in multiple locations. Two common methods of storing redundant data are the “mirror” and “parity” methods.
One problem encountered in the design of disk array data storage systems concerns the issue of retaining accurate mapping information for the stored data in the event of a system error or failure. This is true for systems that employ either one or both methods of storing redundant data. Thus, in the course of managing disk array mapping information, it is often necessary to ensure that recently changed mapping information is stored on disk for error recovery purposes. This disk write requirement may occur for several reasons, such as (i) a time-based frequency status update, (ii) a log page full status, or (iii) a specific host request.
Generally, recent changes are accumulated at random locations in data structures that are optimized for performance of the disk array function and, in addition, are accumulated sequentially in a log which can be written to disk (posted) more quickly than the other data structures. This technique is common in the art of transaction processing. Disadvantageously, however, the posting requirement may occur concurrently with other ongoing disk read or write activity, thereby creating I/O contention in the system. Such I/O contention often exacts a significant performance penalty on the system, especially if the posting occurs frequently, because multiple I/O events must occur for a single posting of the log to disk. For example, typically, the log page is first marked as invalid (i.e., it needs to be updated). Then, the log page is copied to disk and subsequently marked valid. Finally, in a redundant system, the redundant log page is copied to disk.
In view of the foregoing, and of the ever-increasing computing speeds offered and massive amounts of information being managed, there is a constant need for improved performance in disk array systems and particularly in the recovery of such disk array systems.
This invention concerns transaction logging for a data storage system and methods for recovering log records following a system failure. The storage system has a main memory to hold a log image. The log image consists of multiple log records, with each log record being assigned a monotonically increasing sequence number that tracks the order in which the log records are written to the log image. The sequence numbers thus indicate how recently each log record was written to the log image.
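By way of illustration only, the sequence numbering described above might be sketched as follows. The class and method names (LogRecord, LogImage, append, most_recent) are hypothetical and are not taken from the described implementation; the sketch assumes sequence numbers start at 1 and increase by one per record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    seq: int        # monotonically increasing sequence number
    payload: bytes  # hypothetical record contents


class LogImage:
    """In-memory log image; names and layout are illustrative assumptions."""

    def __init__(self):
        self.records = []
        self._next_seq = 1  # assumed starting value

    def append(self, payload: bytes) -> LogRecord:
        # Each new record receives the next sequence number, so the
        # numbers track the order in which records were written.
        record = LogRecord(self._next_seq, payload)
        self._next_seq += 1
        self.records.append(record)
        return record

    def most_recent(self):
        # The highest sequence number identifies the newest record.
        return max(self.records, key=lambda r: r.seq, default=None)
```

Because the numbers only increase, comparing two sequence numbers is sufficient to decide which of two log records is the more recent.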
The storage system has multiple storage media (e.g., disks) connected to the main memory. The storage media have a reserved area made up of at least two staging buffers on each medium. In the described implementation, there is one even and one odd staging buffer on each storage medium.
The log image consists of log records kept in a page log and in a distributed log. The page log is stored on the storage media and holds entire pages of log records from the log image. As a page in the log image is filled with log records, the page is flushed to the page log. The distributed log is distributed over the storage media and resides in the staging buffers. In contrast to the page log, the distributed log contains incremental log records that are occasionally forced to the storage media prior to filling an entire page of log records. The incremental log records are written to a least busy storage medium in an alternating pattern between the two staging buffers. The distributed log typically includes log records that have been more recently written than the log records contained in the page log.
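The relationship between the page log and the distributed log might be sketched as follows. This is a minimal sketch under stated assumptions, not the described implementation: PAGE_CAPACITY, the pending_io busyness measure, and all class and method names are hypothetical, and the even/odd alternation is assumed here to toggle once per forced write across the whole system.

```python
from dataclasses import dataclass, field

PAGE_CAPACITY = 4  # hypothetical page size, in records


@dataclass
class Medium:
    pending_io: int = 0  # hypothetical measure of how busy the medium is
    staging: list = field(default_factory=lambda: [[], []])  # [even, odd]


class LogWriter:
    def __init__(self, media):
        self.media = media
        self.page_log = []    # full pages flushed from the log image
        self.page = []        # page currently being filled in main memory
        self.forced = 0       # records of the current page already forced
        self.next_seq = 1
        self.next_buffer = 0  # 0 = even, 1 = odd (assumed global toggle)

    def append(self, payload):
        self.page.append((self.next_seq, payload))
        self.next_seq += 1
        if len(self.page) == PAGE_CAPACITY:
            # A filled page is flushed to the page log as a whole.
            self.page_log.append(self.page)
            self.page, self.forced = [], 0

    def force(self):
        """Write not-yet-forced records of a partial page to a staging buffer."""
        pending = self.page[self.forced:]
        if not pending:
            return None
        # Choose the least busy medium, then alternate staging buffers.
        idx = min(range(len(self.media)), key=lambda i: self.media[i].pending_io)
        buf = self.next_buffer
        self.media[idx].staging[buf].extend(pending)
        self.forced = len(self.page)
        self.next_buffer = 1 - buf
        return idx, buf
```

In this sketch, force() returns the medium index and buffer used, which makes the alternating pattern easy to observe; a real system would issue a disk write at that point instead of extending a list.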
The storage system has a log recovery manager that recovers the log image following a failure. The log recovery manager first reads the log records from the page log. This reproduces a majority of the log image. The log recovery manager then attempts to fully restore the log image by scanning the distributed log to locate any more recent log records that may exist. Once a more recent log record is found, the log recovery manager adds it to the recaptured log image and then proceeds to find even more recent log records.
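The two recovery phases described above might be sketched as follows. The function name recover_log_image and the representation of records as (sequence number, payload) pairs are hypothetical. The sketch also assumes that recovery stops at the first missing sequence number, which the text above does not state.

```python
def recover_log_image(page_log, distributed_log):
    """Rebuild a log image from the page log, then the distributed log.

    page_log: list of pages, each a list of (seq, payload) in write order.
    distributed_log: iterable of (seq, payload) in arbitrary order.
    Both layouts are illustrative assumptions.
    """
    # Phase 1: the page log reproduces the majority of the log image.
    image = [record for page in page_log for record in page]
    last_seq = image[-1][0] if image else 0

    # Phase 2: scan the distributed log once, then repeatedly pick up
    # the record that immediately follows the newest one recovered.
    by_seq = {seq: payload for seq, payload in distributed_log}
    while last_seq + 1 in by_seq:
        last_seq += 1
        image.append((last_seq, by_seq[last_seq]))
    return image
```

Records in the distributed log whose sequence numbers are not newer than the last record of the page log are stale copies and are simply ignored by the loop.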
To speed recovery, the log recovery manager intelligently begins the search at the location in the storage system where the next log record is likely to reside. More particularly, the log recovery manager begins looking for the next log record using three criteria: (1) it looks on the same storage medium that contains the previous log record just found; (2) it looks in the other staging buffer on the storage medium rather than the buffer containing the log record just found; and (3) it begins at an offset equal to the length of the log record just found. These three criteria significantly improve the recovery time.
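The three search criteria might be sketched as follows. This is one possible reading, offered under stated assumptions: the staging layout (a mapping from offset to a sequence number and length), the Location type, and the fallback to an exhaustive scan when the first probe misses are all hypothetical, and criterion (3) is taken literally as a probe at an offset equal to the length of the record just found.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    medium: int
    buffer: int  # 0 = even, 1 = odd
    offset: int
    length: int
    seq: int


def find_next(staging, last):
    """Return (location, probes) for the record with sequence last.seq + 1.

    staging[medium][buffer] maps offset -> (seq, length); assumed layout.
    """
    want = last.seq + 1
    # Criteria (1)-(3): same medium, other staging buffer, offset equal
    # to the length of the record just found.
    m, b, off = last.medium, 1 - last.buffer, last.length
    hit = staging[m][b].get(off)
    if hit is not None and hit[0] == want:
        return Location(m, b, off, hit[1], hit[0]), 1

    # Assumed fallback: examine every staging buffer on every medium.
    probes = 1
    for m, buffers in enumerate(staging):
        for b, slots in enumerate(buffers):
            for off, (seq, length) in slots.items():
                probes += 1
                if seq == want:
                    return Location(m, b, off, length, seq), probes
    return None, probes
```

The returned probe count shows the intended benefit: when the next record sits where the criteria predict, it is found with a single probe rather than a scan of all staging buffers.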