The invention is generally directed to fault tolerance and automated recovery of a computer in response to an abnormal termination, and in particular, to automated recovery of a queue implemented in the same.
As the world becomes increasingly reliant upon the computer retrieval and processing of data, the integrity of the systems safeguarding that data has become paramount. Computers are used in many mission-critical applications where interruption or failure of a computer is intolerable. Given the pervasive dependency of society on computers, any interruptions, corruptions or crashes caused by power failures, faulty operations or programming errors can have devastating implications. From world financial markets to the transportation industry, the seamless recovery of processes from potentially crippling terminations is critical. For these reasons, considerable efforts have been devoted to ensuring the preservation and recoverability of vital, computer-stored information.
One area of particular criticality pertains to the reliable maintenance and recovery of computer queues. Generally, a queue is a type of first-in, first-out (FIFO) data structure for holding multiple elements of information that supports two primary operations, an enqueue operation for adding an element to the queue, and a dequeue operation for removing the oldest element (i.e., the element that was placed on the queue before any other elements currently in the queue) from the queue.
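By way of illustration only, the two primary operations described above may be sketched as follows. The class name `FifoQueue` and its method names are hypothetical and chosen for this example; the sketch relies on the standard `collections.deque` container and is not intended to represent any particular implementation.

```python
from collections import deque


class FifoQueue:
    """Minimal first-in, first-out queue (illustrative names, not from the source)."""

    def __init__(self):
        self._items = deque()

    def enqueue(self, element):
        # Add the element at the tail of the queue.
        self._items.append(element)

    def dequeue(self):
        # Remove and return the oldest element, i.e. the one at the head.
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


q = FifoQueue()
for msg in ("first", "second", "third"):
    q.enqueue(msg)

# Elements leave the queue in the same order in which they were added.
order = [q.dequeue() for _ in range(len(q))]
```

In this sketch, `order` holds the elements in their original insertion order, and the queue is empty afterwards.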
One application of a queue, for example, is in storing an ordered list of messages to be communicated between computer processes or jobs executing on a computer. Often message queues are incorporated into low-level communication mechanisms to facilitate communications between concurrently executing computer processes. In such applications, any process that wishes to convey information to another process generates an appropriate message and invokes the communication mechanism to enqueue the message on a message queue that is accessible by the recipient process for the message. Then, the recipient process is permitted to receive the message by invoking the communication mechanism to dequeue the message from the message queue. Given the FIFO nature of a queue, multiple messages may be added to the queue, and will be maintained in their proper sequence until the recipient process is able to remove all of the messages from the queue.
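A simplified sketch of such message passing is shown below. It is an assumption-laden illustration: threads stand in for the concurrently executing processes, the standard `queue.Queue` class stands in for the communication mechanism, and the names `mailbox`, `sender`, `recipient` and `SENTINEL` are hypothetical.

```python
import queue
import threading

mailbox = queue.Queue()   # message queue accessible by the recipient
received = []             # messages in the order the recipient saw them
SENTINEL = None           # hypothetical marker meaning "no more messages"


def sender():
    # The sending side enqueues each message, then signals completion.
    for i in range(3):
        mailbox.put(f"message {i}")
    mailbox.put(SENTINEL)


def recipient():
    # The receiving side dequeues messages until it sees the sentinel.
    while True:
        msg = mailbox.get()
        if msg is SENTINEL:
            break
        received.append(msg)


threads = [threading.Thread(target=sender), threading.Thread(target=recipient)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the queue is FIFO, the recipient in this sketch observes the messages in the sequence in which the sender enqueued them, regardless of how the two threads are scheduled.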
The messages stored in a queue are often stored in a linked-list data structure, where each message has a pointer to the next message in the queue, so that a particular message in a queue may be located by following the chain of pointers between the messages in the linked list until the desired message is found. Following a long chain of pointers, however, can be time consuming, and as such, in some performance-critical applications, keyed data structures such as trees are used to improve performance. With a tree data structure for a queue, messages are associated with keys, and sorted hierarchically based on their keys. The keys are used to traverse through the tree to locate a specific message meeting a specific condition. Multiple paths are defined through the data structure so that fewer steps are typically required to traverse a tree to locate a specific message. To optimize the performance of a tree, often the tree is balanced, such that messages are re-sorted in the data structure to minimize the distance from the root of the tree to each leaf. Often, when pointers are used to connect the different messages in a tree, such reordering requires only that the pointers between messages be updated.
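The difference in lookup cost may be illustrated with the following sketch, which is offered under stated assumptions rather than as a description of any actual implementation. A binary search over a sorted list of keys is used here as a stand-in for a balanced tree, since both halve the remaining search space at each step; all names (`Node`, `build_chain`, `find_in_chain`, `find_by_key`) are hypothetical.

```python
class Node:
    """One message in a singly linked list."""

    def __init__(self, key, payload):
        self.key = key
        self.payload = payload
        self.next = None


def build_chain(keys):
    # Link the messages together in key order and return the head node.
    head = tail = None
    for k in keys:
        node = Node(k, f"msg-{k}")
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    return head


def find_in_chain(head, key):
    # Follow the chain of pointers, counting every node visited.
    steps = 0
    node = head
    while node is not None:
        steps += 1
        if node.key == key:
            return node.payload, steps
        node = node.next
    return None, steps


def find_by_key(keys, payloads, key):
    # Binary search over sorted keys: a stand-in for a balanced keyed tree.
    lo, hi, steps = 0, len(keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] == key:
            return payloads[mid], steps
        if keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return None, steps


keys = list(range(1024))
payloads = [f"msg-{k}" for k in keys]

chain_payload, chain_steps = find_in_chain(build_chain(keys), 1000)
keyed_payload, keyed_steps = find_by_key(keys, payloads, 1000)
```

For 1024 messages, locating the message with key 1000 requires visiting 1001 nodes in the linked list, whereas the keyed search in this sketch needs no more than 11 comparisons.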
Recovery of message queues is often desirable to permit a computer to recover from an abnormal termination, in particular, to ensure that no messages that were stored in the queue are lost prior to delivery to their intended destinations. However, message queues and the like present a number of unique maintenance and recovery problems that are not adequately addressed by conventional computer recovery techniques.
For example, some computer designs have attempted to store back-up files of nearly every computer message and operation executed on a particular operating system. However, doing so often exceeds the storage capacity of a computer and places an immense burden on the processing capability of the computer, resulting in severe performance degradation. Other designs reduce such overhead by storing only key diagnostic parameters to a journal for the purpose of recovering memory such as database information. A journal in such an application typically includes a record of changes that have been made to particular segments of memory since those segments were last written to nonvolatile storage (e.g., an external direct access storage device (DASD)). The contents of a journal are also typically saved in nonvolatile storage, such that, upon failure, a segment of memory can be recovered by retrieving the last copy of the segment from nonvolatile storage and applying the relevant journal entries to in effect regenerate the changes that occurred to the segment.
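The recovery procedure described above for memory-based journaling may be sketched as follows. This is a minimal illustration only: the function name `recover_segment` and the representation of a journal entry as an `(offset, new_bytes)` pair are assumptions made for the example, and byte strings stand in for a segment of memory and its saved copy.

```python
def recover_segment(saved_copy: bytes, journal) -> bytes:
    """Regenerate a memory segment from its last saved copy plus a journal.

    Each journal entry is assumed to be an (offset, new_bytes) pair recording
    a change made to the segment after the copy was written to storage.
    """
    segment = bytearray(saved_copy)
    for offset, new_bytes in journal:
        # Reapply the change in the order in which it originally occurred.
        segment[offset:offset + len(new_bytes)] = new_bytes
    return bytes(segment)


saved = b"AAAAAAAA"                 # last copy written to nonvolatile storage
journal = [(2, b"BB"), (6, b"C")]   # changes recorded since that copy

recovered = recover_segment(saved, journal)   # b"AABBAACA"
```

Every modification to the segment, however small, requires its own journal entry under this approach.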
While such methods often provide sufficient safeguards for certain types of computer information, such methods are typically inadequate for use in protecting queued message data. In particular, queue operations often entail frequent pointer manipulation operations, and any attempt to record or journal all such operations would impose significant storage and processing overhead.
Therefore, a significant need exists for an improved manner of maintaining computer queues, in particular so as to facilitate the automatic recovery of corrupted queues in fault tolerant and other mission-critical applications.
The invention addresses these and other problems associated with the prior art in providing an apparatus, program product and method in which a queue is maintained and recovered utilizing element-based journaling to record changes made to logical elements in a queue. Consequently, in contrast to conventional memory-based journaling, where any changes to the memory representing an element in a queue would be journaled, only those operations that affect the logical ordering and/or placement of an element on a queue, or the logical contents of such an element, need be journaled. Often, memory management operations such as pointer manipulation operations that modify pointers, but do not otherwise modify the actual elements in a queue or their relative ordering, need not be journaled. As a consequence, the storage and processing overhead associated with journaling may be substantially reduced, thereby substantially reducing the overhead associated with maintenance and recovery of a queue.
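The potential reduction in journaling volume may be illustrated, purely as a sketch, by a linked-list queue that tallies both the pointer writes it performs and the logical operations applied to it. The class `CountingQueue`, its counters and its helper `_set` are hypothetical names introduced for this example and do not describe the claimed implementation.

```python
class Node:
    def __init__(self, element):
        self.element = element
        self.next = None


class CountingQueue:
    """Linked-list queue that tallies pointer writes and logical operations."""

    def __init__(self):
        self.head = None
        self.tail = None
        self.pointer_writes = 0     # what memory-based journaling would record
        self.logical_journal = []   # what element-based journaling records

    def _set(self, obj, attr, value):
        # Every pointer update passes through here so that it can be counted.
        setattr(obj, attr, value)
        self.pointer_writes += 1

    def enqueue(self, element):
        node = Node(element)
        if self.tail is None:
            self._set(self, "head", node)
        else:
            self._set(self.tail, "next", node)
        self._set(self, "tail", node)
        self.logical_journal.append(("enqueue", element))

    def dequeue(self):
        node = self.head
        if node is None:
            raise IndexError("dequeue from empty queue")
        self._set(self, "head", node.next)
        if self.head is None:
            self._set(self, "tail", None)
        self.logical_journal.append(("dequeue",))
        return node.element


q = CountingQueue()
for element in ("a", "b", "c"):
    q.enqueue(element)
first = q.dequeue()
```

In this sketch, three enqueue operations and one dequeue operation cause seven pointer writes but only four logical journal entries.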
Therefore, the invention provides in one aspect a method, apparatus and program product for managing a queue in a computer, in which a persistent representation of a queue is stored in nonvolatile memory, at least one logical operation performed on the queue is journaled subsequent to storing the persistent representation of the queue, and the queue is recovered by retrieving the persistent representation of the queue from nonvolatile memory and applying the journaled logical operation to the queue.
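One possible reading of this aspect is sketched below, under several assumptions that are not taken from the foregoing description: files in a directory stand in for nonvolatile memory, JSON is used as the format of the persistent representation and of the journal, and the names `JournaledQueue`, `checkpoint` and `recover` are hypothetical.

```python
import json
import os
import tempfile


class JournaledQueue:
    """Queue with a saved persistent representation and a logical journal."""

    def __init__(self, directory):
        self.snapshot_path = os.path.join(directory, "queue.snapshot")
        self.journal_path = os.path.join(directory, "queue.journal")
        self.items = []

    def checkpoint(self):
        # Store the persistent representation, then start a fresh journal.
        with open(self.snapshot_path, "w") as f:
            json.dump(self.items, f)
        open(self.journal_path, "w").close()

    def _journal(self, record):
        # Append one logical operation and force it to stable storage.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def enqueue(self, element):
        self._journal(["enqueue", element])
        self.items.append(element)

    def dequeue(self):
        if not self.items:
            raise IndexError("dequeue from empty queue")
        self._journal(["dequeue"])
        return self.items.pop(0)

    @classmethod
    def recover(cls, directory):
        # Retrieve the persistent representation, then replay the journal.
        q = cls(directory)
        with open(q.snapshot_path) as f:
            q.items = json.load(f)
        if os.path.exists(q.journal_path):
            with open(q.journal_path) as f:
                for line in f:
                    record = json.loads(line)
                    if record[0] == "enqueue":
                        q.items.append(record[1])
                    elif record[0] == "dequeue":
                        q.items.pop(0)
        return q


with tempfile.TemporaryDirectory() as d:
    q = JournaledQueue(d)
    q.enqueue("m1")
    q.enqueue("m2")
    q.checkpoint()            # persistent representation holds m1, m2
    q.enqueue("m3")           # journaled after the checkpoint
    q.dequeue()               # journaled after the checkpoint
    expected = list(q.items)
    del q                     # simulate loss of the in-memory queue
    recovered = JournaledQueue.recover(d).items
```

In this sketch the recovered queue matches the state that existed immediately before the simulated failure, and only the two operations performed after the checkpoint had to be replayed.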