1. Field of the Invention
The present invention relates to a data processing system with capabilities to recover its file systems, and also to a computer-readable medium storing a program designed therefor. More particularly, the present invention relates to a data processing system which can recover from system failures by using log records to restore the consistency of its file system structure, as well as to a computer-readable medium storing a program providing such failure recovery functions.
2. Description of the Related Art
A computer system fails for various reasons, often introducing some inconsistencies in its file system structure. In the event of an abnormal shutdown, the computer system has to be rebooted, and the file system is entirely scanned to test whether any inconsistent entry has been produced. If any problem is found in this test, the computer system applies an appropriate modification to the file system in question, thereby restoring its consistency.
Scanning an entire file system, however, takes a long time, hampering a prompt failure recovery of the computer system. To reduce the down time, many modern computer operating systems (OS), such as the UNIX OS, employ a mechanism to restore the file systems by using transaction logs. That is, any modifications or updates made to data in a computer file system are recorded in a log (or journal) file, and in case of a system failure, the file system is restored by scanning the log file and reapplying the recorded updates to their destinations. The use of such a transaction logging mechanism reduces the system's down time in theory, but at the same time, it poses several technical challenges as described below.
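The replay step described above can be pictured with a minimal Python sketch. The names `LogRecord` and `replay` and the sample object keys are hypothetical, chosen only for illustration; the sketch assumes that each record carries a sequence number and the complete new value of one object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    seq: int    # hypothetical transaction sequence number
    key: str    # hypothetical identifier of the updated object
    value: str  # new contents recorded for that object


def replay(log, store):
    """Reapply logged updates to `store` in sequence order and return it."""
    for rec in sorted(log, key=lambda r: r.seq):
        store[rec.key] = rec.value
    return store


# Assumed state after a crash: only the first update reached its destination.
disk = {"inode:7": "size=0"}
log = [
    LogRecord(1, "inode:7", "size=0"),
    LogRecord(2, "inode:7", "size=4096"),
    LogRecord(3, "dir:/tmp", "entries=[a.txt]"),
]
recovered = replay(log, dict(disk))
```

Only the log is read during recovery in this sketch; the rest of the file system is never scanned, which is where the reduction in down time would come from.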
Besides handling files themselves, the file systems have to manage what is called “metadata.” The term “metadata,” denoting “data about data” literally, refers herein to such data that describes the location, size, and other information about each file stored in a computer's secondary storage unit. While metadata objects are also stored in a prescribed portion of a secondary storage unit, they are normally read out to the main memory of the computer system for the purpose of faster access and manipulation. In other words, metadata is cached on the computer's main memory. Updated metadata objects are written back to the secondary storage unit at predetermined intervals, so that every modification made to the cached metadata will be reflected in their original entities in the secondary storage unit some time later. To ensure the successful recovery of file systems, it is mandatory for the logging system to save all recent records of such metadata modifications into its dedicated secondary storage subsystem before the cache contents are copied back to the secondary storage unit.
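The ordering constraint stated in the last sentence above (log records first, cache write-back second) can be sketched as follows. The class `MetadataCache` and its method names are hypothetical and serve only to illustrate the rule; secondary storage is modeled with plain Python containers.

```python
class MetadataCache:
    """Toy metadata cache that enforces log-before-write-back ordering."""

    def __init__(self):
        self.cached = {}           # object id -> updated contents in main memory
        self.log_buffer = []       # log records not yet saved
        self.log_volume = []       # log records saved to secondary storage
        self.metadata_volume = {}  # home locations on secondary storage
        self.unlogged = set()      # objects whose log records are still unsaved

    def update(self, obj_id, contents):
        self.cached[obj_id] = contents
        self.log_buffer.append((obj_id, contents))
        self.unlogged.add(obj_id)

    def flush_log(self):
        self.log_volume.extend(self.log_buffer)
        self.log_buffer.clear()
        self.unlogged.clear()

    def write_back(self, obj_id):
        if obj_id in self.unlogged:
            raise RuntimeError("log record must be saved before write-back")
        self.metadata_volume[obj_id] = self.cached[obj_id]


cache = MetadataCache()
cache.update("inode:7", "size=4096")
try:
    cache.write_back("inode:7")   # refused: the log record is not saved yet
    blocked = False
except RuntimeError:
    blocked = True
cache.flush_log()
cache.write_back("inode:7")       # allowed once the log record is saved
```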
Some systems have a plurality of secondary storage units to provide for larger file systems. In such systems, a single file system operation may manipulate metadata objects managing multiple secondary storage units. To log this file system operation, conventional logging systems record every modification made on the metadata cache memory. However, the log records collected in this way would not serve satisfactorily, because it would take much time for the computer system to search the log records for relevant metadata objects stored in different secondary storage units. This means that the conventional logging systems are not effective in reducing the down time in such environments where metadata objects are distributed in multiple secondary storage units.
Another factor that delays the file system recovery is the time required for searching the entire log storage to find the oldest log record. This issue will be discussed below.
The logging system interacts with individual transactions which constitute a file system operation, and it collects records solely for such transactions that have committed, or successfully finished. To ensure this scheme, most file systems with a logging mechanism are configured to assign a sequence number to each transaction. When restoring such file systems, the logging system attempts to identify the oldest transaction on the basis of sequence numbers affixed to the stored log records. The logging system then starts a log replay from the identified point.
Log records should be saved in a dedicated secondary storage device, in preparation for possible system failures. While log records are produced endlessly, the storage for them is limited in size. This suggests that the logging system must reuse the limited storage resource in a cyclical manner, and to do so, it has to overwrite old records with new ones. In actuality, many of the stored log records are obsolete (i.e., not to be used to restore the file system), while the others are essential for file system recovery. Scanning the entire log storage to identify the oldest transaction means reading and testing many obsolete log records. This is obviously inefficient.
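The cost of this search can be illustrated with a small Python sketch of a cyclically reused log. The class `CircularLog` and its four-slot size are assumptions made for the example; the point is only that locating the oldest record by sequence number requires reading every slot, whether or not its contents are obsolete.

```python
class CircularLog:
    """Fixed-size log storage that overwrites old records with new ones."""

    def __init__(self, slots):
        self.slots = [None] * slots  # None marks a never-written slot
        self.next_seq = 0

    def append(self, payload):
        index = self.next_seq % len(self.slots)
        self.slots[index] = (self.next_seq, payload)
        self.next_seq += 1

    def find_oldest(self):
        """Scan every slot; return (slots_read, oldest_record)."""
        written = [s for s in self.slots if s is not None]
        oldest = min(written, key=lambda s: s[0]) if written else None
        return len(self.slots), oldest


log = CircularLog(slots=4)
for n in range(6):
    log.append(f"t{n}")          # records 0 and 1 get overwritten by 4 and 5
slots_read, oldest = log.find_oldest()
```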
When searching for the oldest transaction, the system presupposes that the sequence numbers increase monotonically; they will never overflow or return to zero during logging operations. Typical logging systems prevent the sequence number from overflowing or wrapping around by reinitializing the log storage to zeros when a file system restoration process is completed, or when it is detected that the sequence number will soon overflow. Such logging systems then resume their operation, restarting the sequence number from zero. However, it takes a long time to reinitialize the entire log storage, during which the computer systems are unable to provide their services. If they are working as servers, the interruption of services would pose intolerable stress to their clients.
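The reason the search depends on monotonic sequence numbers can be seen in a short Python sketch. The 8-bit counter (`SEQ_MODULUS = 256`) is a deliberately small, hypothetical value chosen so that the wraparound is visible in four records.

```python
SEQ_MODULUS = 256  # hypothetical 8-bit sequence counter


def naive_oldest(seqs):
    """Pick the oldest record assuming sequence numbers never wrap."""
    return min(seqs)


# Four records appended in order while the counter wraps around.
appended_in_order = [(254 + i) % SEQ_MODULUS for i in range(4)]
true_oldest = appended_in_order[0]
picked = naive_oldest(appended_in_order)
```

Here the records carry the numbers 254, 255, 0, 1, so the smallest number no longer identifies the oldest record, and a replay started from it would skip two transactions.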
While the above three problems (1) to (3) relate to the restoration of a file system, the introduction of a logging system can even cause adverse effects to normal operations of the target computer systems. More specifically, there are several known techniques to realize high-speed access to secondary storage devices for logging purposes, which include log spooling on memory and sequential access optimized for specific disk structures. However, with those techniques alone, usable file recovery systems cannot be realized. Rather, to make such systems truly practical, it is necessary to develop more enhanced log collection and storage methodologies. Otherwise, computer systems would suffer from considerable penalties in throughput and storage efficiencies. The following will enumerate several specific issues that must be addressed.
It is often seen that a single transaction updates the same data object a number of times. The system may produce a log record each time an update occurs, but this log collection practice consumes more memory resources and increases input/output traffic to and from the secondary storage unit for logging.
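The overhead can be made concrete by comparing two collection strategies in Python. Both functions are hypothetical illustrations and neither is taken from the text; the second simply keeps the last value per object within one transaction, as one possible way to avoid the duplication.

```python
def collect_per_update(updates):
    """Produce one log record for every update, duplicates included."""
    return list(updates)


def collect_coalesced(updates):
    """Keep only the final contents of each object within one transaction."""
    latest = {}
    for obj_id, contents in updates:
        latest[obj_id] = contents
    return list(latest.items())


# One transaction touching the same inode three times.
updates = [
    ("inode:7", "mtime=1"),
    ("inode:7", "mtime=2"),
    ("bitmap", "0b0110"),
    ("inode:7", "mtime=3"),
]
per_update = collect_per_update(updates)
coalesced = collect_coalesced(updates)
```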
The logging system collects information on what updates have been made by individual transactions and records a set of such updates each time one transaction is completed, because the log must preserve the correct order of transaction executions. This generally means that no transactions can update a specific data object if it is being manipulated by another ongoing transaction. It may be relatively easy to implement this rule in the case of handling individual files; a plurality of transactions can proceed concurrently, while maintaining exclusive access to each file. However, the concurrent execution of transactions can be a challenge, when a plurality of transactions manipulate data controlling multiple files, such as a resource allocation map used to assign a storage space, etc.
Suppose, for example, that one transaction A was freeing up its allocated space, while another transaction B needed a free space, and as a result, the space freed by A has been reallocated to B. The logging system collects records from both transactions A and B and saves them to log storage when each transaction commits. Here, the resource allocation map, which represents the status of all storage blocks in bitmap form, is used to control allocation and deallocation of storage resources. In the present case, the log record of transaction A contains a bit indicating that the space is free, whereas the same bit in the log record of transaction B shows that the same space is in use. It is now assumed that the system has to restore the file system after an abnormal shutdown. This situation can be potentially problematic, depending on the timing of the system shutdown. Recall that, in the present example, transaction A released the space before transaction B obtained it. However, if transaction B committed before transaction A, and if the system failed without writing the record of transaction A, then the resultant log file would include a record of resource allocation to transaction B, but nothing about the releasing operation done by transaction A. When used to restore the file system, this transaction log would bring about a conflicting situation where the storage space in question is allocated to both A and B, because there is no record showing that transaction A has released it.
Another pattern of system shutdown is such that the system crashes before saving the log of transaction B. This also causes an erroneous situation in the restored file system, where the storage space in question is not allocated to either of them, because the log of transaction A frees up the space that actually has been allocated to transaction B.
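The two failure patterns above can be reproduced with a toy Python model of a single storage block. The dictionary keys (`a_holds`, `b_holds`, `bit_in_use`) are hypothetical names standing for transaction A's ownership, transaction B's ownership, and the block's bit in the resource allocation map; the model assumes each log record stores the resulting values of the fields it touched.

```python
# State before either transaction ran: A owns the block.
INITIAL = {"a_holds": True, "b_holds": False, "bit_in_use": True}

RECORD_A = {"a_holds": False, "bit_in_use": False}  # A released the block
RECORD_B = {"b_holds": True, "bit_in_use": True}    # B was given the block


def replay(saved_records):
    """Rebuild the state from whichever log records reached log storage."""
    state = dict(INITIAL)
    for record in saved_records:
        state.update(record)
    return state


only_b_saved = replay([RECORD_B])            # crash before A's record was written
only_a_saved = replay([RECORD_A])            # crash before B's record was written
both_saved = replay([RECORD_A, RECORD_B])    # no records lost
```

With only B's record saved, both owners appear to hold the block; with only A's record saved, neither does. Only the complete log yields the intended result.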
Both situations described above must be avoided. Although the problem may be solved by simply restricting the concurrent execution of multiple transactions, doing so would certainly impose a considerable penalty on the throughput of file systems running on a multi-task operating system.
As previously noted, it is a primary objective to provide a logging mechanism which recovers file systems in a short time. However, as a result of giving priority to this, quite a few file systems ignore the independence between transactions, or behave as if they were healthy in spite of their imperfect recoverability. Conventional metadata management systems use a single memory area for caching log records collected from the entire file system. Because such systems lack appropriate mechanisms to ensure the independence between transactions, a log record of one transaction may be confused with that of another transaction. The problem is serious particularly in handling of a resource allocation map as described in the previous item (5).
The lengths of resulting logs may be different from transaction to transaction. For example, a transaction that updates the timestamp of a file will only produce a tiny log record. In contrast, a transaction that creates a large data file will inevitably leave a long log record. Although a plurality of log buffers are provided to accommodate logs of different transactions, conventional logging systems do not care about the unevenness of log data sizes.
No matter how efficiently used, the main memory is limited in size. Naturally, the log cache memory created on the main memory is limited, and it is definitely smaller than the amount of log records to be produced by transactions.
The logging system and failure recovery mechanism should maintain a meaningful flow of operations, or operation semantics, when restoring file systems. This implies that every log record represents a consistent state of a file system sampled at the end of a transaction. Therefore, such a log record only containing a halfway history of a transaction would not work at all, because it fails to guarantee the operation semantics of the transaction.
A system failure occurring in the middle of a transaction would create a critical situation for a file system. As previously stated, in a computer system having logging capabilities, the cache manager cannot force out the updated metadata objects until their corresponding log records are saved to the log volume. When the log cache memory is filled with collected records, it implies that the metadata cache is also highly loaded. The trouble is that the ongoing transactions cannot be finished without enough memory resources. They could hang if memory resources were exhausted.
The problem discussed in the previous item (10) also applies to the secondary storage for log files. During transactions, newly produced log records consume this log storage capacity. However, the logging system cannot erase old log records unless their corresponding metadata cache entries are written back to their home locations. If one wishes to suppress the I/O traffic between the metadata cache and metadata storage, more records should be kept in the log storage. Valid log records can grow in this way. While even an average secondary storage device provides much larger capacity than cache memory does, it is still possible that many concurrent transactions would lead to exhaustion of log storage, in addition to shortage of metadata cache or log cache.
Taking the above into consideration, an object of the present invention is to provide a data processing system having file system recovery functions which work more efficiently.
To accomplish the above object, according to the present invention, there is provided a data processing system with a logging mechanism which stores log records for repairing an inconsistent file system. This system comprises the following elements:
(a) a primary storage subsystem;
(b) a secondary storage subsystem;
(c) a plurality of metadata volumes, created in the secondary storage subsystem, which store a plurality of metadata objects describing files;
(d) a log volume which is created in the secondary storage subsystem to store log records describing updates made to the metadata objects;
(e) a metadata cache which is created in the primary storage subsystem to temporarily store the metadata objects;
(f) a metadata loading unit which, in response to a transaction attempting to update metadata objects, loads the requested metadata objects from the metadata volumes to the metadata cache;
(g) a metadata manager which holds metadata volume identifiers associated with the metadata objects loaded to the metadata cache, where the metadata volume identifiers indicate in which of the metadata volumes the metadata objects were stored;
(h) a log collection unit which collects log records indicating what updates were made to the metadata objects in the metadata cache, where each log record contains the metadata volume identifiers corresponding to the updated metadata objects;
(i) a log buffer which stores the log records collected by the log collection unit; and
(j) a log writing unit which transfers the log records from the log buffer to the log volume.
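How elements (c) through (j) might cooperate can be sketched in Python. This is only an illustrative model under stated assumptions, not the claimed implementation: the class and method names are hypothetical, storage is modeled with dictionaries and lists, and object identifiers are assumed to be unique across metadata volumes.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    volume_id: int   # metadata volume the updated object was loaded from
    object_id: int
    contents: str


class MetadataManager:
    def __init__(self, volumes):
        self.volumes = volumes   # (c) volume_id -> {object_id: contents}
        self.cache = {}          # (e) metadata cache
        self.volume_of = {}      # (g) object_id -> metadata volume identifier
        self.log_buffer = []     # (i) log buffer
        self.log_volume = []     # (d) log volume

    def load(self, volume_id, object_id):
        # (f) load the object and remember which volume it came from
        self.cache[object_id] = self.volumes[volume_id][object_id]
        self.volume_of[object_id] = volume_id

    def update(self, object_id, contents):
        # (h) collect a log record tagged with the volume identifier
        self.cache[object_id] = contents
        self.log_buffer.append(
            LogRecord(self.volume_of[object_id], object_id, contents))

    def write_log(self):
        # (j) transfer buffered records to the log volume
        self.log_volume.extend(self.log_buffer)
        self.log_buffer.clear()


def replay(log_volume, volumes):
    """Route each record to its volume directly, without searching volumes."""
    for rec in log_volume:
        volumes[rec.volume_id][rec.object_id] = rec.contents


volumes = {0: {10: "old-a"}, 1: {20: "old-b"}}
manager = MetadataManager(volumes)
manager.load(0, 10)
manager.load(1, 20)
manager.update(10, "new-a")
manager.update(20, "new-b")
manager.write_log()
before_replay = {vid: dict(objs) for vid, objs in volumes.items()}
replay(manager.log_volume, volumes)  # simulated recovery after a crash
```

Because every record names its metadata volume, the replay in this sketch never has to search the volumes to find where an object belongs.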