1. Field of the Invention
The present invention relates to methods for sorting, storing and retrieving sorted data in computer systems. In particular, the present invention relates to sorting and storing data using sorting trees initialized in the volatile memory of a computer system.
2. Description of the Prior Art and Related Information
Handling large databases is a significant part of many applications of computer systems. For example, in a wide range of applications from financial services to retail operations and services, the handling of large databases in an efficient manner is a key requirement of the computer systems employed in these industries. Frequently, the databases of interest include a large number of separate data records, which need to be sorted in a desired order for efficient handling or searching. For example, such data records could include the pertinent information on employees in a corporation or account holders in a financial institution.
Such data records are typically stored in a high capacity nonvolatile storage medium such as disk drives associated with the computer system. As new data records are added, however, or upon initial creation of the database or database subset for storage, the records must be sorted into the desired order. This sorting is performed in the volatile working memory of the computer system, which typically has a more limited capacity than the nonvolatile memory and which may be needed for a variety of tasks other than the sorting of the data records.
Therefore, it is desirable to sort data records in the volatile memory of the computer system in as rapid and efficient a manner as possible. It is further desired to minimize the amount of input/output (I/O) between the nonvolatile storage medium and the volatile memory, since I/O operations are slow relative to the operational speed of the computer system.
One highly efficient sorting technique which has been employed in the art is the so-called tournament sort. This approach is described, for example, in Knuth, Donald E., The Art of Computer Programming, Volume 3: Sorting and Searching, Section 5.4.1, pages 251-266, Addison-Wesley Publishing Company (1973). In this approach to sorting data records, a sort tree having a number of nodes configured in a hierarchical tree structure is first created in the working memory of the computer system. Data records to be sorted are inserted into the bottom exterior nodes, or leaf nodes, of the sort tree, and the data records are compared up the tree in a tournament compare fashion until the “winners” emerge at the top of the tree in sorted order.
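The tournament compare described above can be illustrated with a minimal Python sketch. It is not taken from Knuth or from the present disclosure: the flat-list tree layout, the function name `tournament_sort`, and the use of a placeholder that loses every comparison to mark empty leaves are all assumptions made for illustration. The sketch also counts how many interior nodes are replayed, which depends on the height of the tree.

```python
import math

def tournament_sort(keys, leaf_count):
    """Ascending sort with a winner tree held in a flat list (assumed layout).

    Nodes 1..leaf_count-1 are interior nodes and nodes
    leaf_count..2*leaf_count-1 are exterior (leaf) nodes.
    leaf_count must be a power of two and at least len(keys).
    """
    if leaf_count < max(len(keys), 1) or leaf_count & (leaf_count - 1):
        raise ValueError("leaf_count must be a power of two >= len(keys)")
    empty = math.inf  # hypothetical placeholder: loses every ascending compare
    tree = [empty] * (2 * leaf_count)

    # Load the keys into the leaves from left to right.
    for i, key in enumerate(keys):
        tree[leaf_count + i] = key

    # Play the initial tournament from the bottom of the tree upwards.
    for node in range(leaf_count - 1, 0, -1):
        tree[node] = min(tree[2 * node], tree[2 * node + 1])

    result, replays = [], 0
    for _ in range(len(keys)):
        winner = tree[1]
        result.append(winner)
        # Walk down to the leaf that produced the winner.
        node = 1
        while node < leaf_count:
            node = 2 * node if tree[2 * node] == winner else 2 * node + 1
        tree[node] = empty
        # Replay the matches on the path from that leaf back to the root.
        node //= 2
        while node >= 1:
            tree[node] = min(tree[2 * node], tree[2 * node + 1])
            replays += 1
            node //= 2
    return result, replays

print(tournament_sort([5, 1, 4], 8))  # ([1, 4, 5], 9)
print(tournament_sort([5, 1, 4], 4))  # ([1, 4, 5], 6)
```

The same three keys cost nine replays in an eight-leaf tree and six in a four-leaf tree, because each extracted key is replayed over the full height of the tree.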
Prior to introducing the data records to be sorted into the sorting tree, however, the sorting tree first must be initialized. This initialization process involves introducing predetermined values into the tree structure which will always win in any comparison with real data values. For example, such initialization values may take the form of negative infinity (−∞) or positive infinity (+∞), for ascending and descending sorts, respectively. These initialization values ensure that the real data records to be sorted move through the tree in the correct order. Initialization of the sort tree further requires that a “loser attribute” be determined for the interior nodes of the sorting tree. That is, since the initialization values loaded into the sorting tree all have the same nominal value, the initial losers and winners which move up the tree must be determined arbitrarily at the outset; that is, during initialization of the sort tree.
Although the tournament sort utilizing an initialized sort tree as described above theoretically has the desired characteristics of efficient and fast sorting, significant inefficiencies are encountered when the number of records to be sorted is not known in advance. For example, if it is desired that an unknown number of data records be sorted in a single sort, the largest sort tree which can be accommodated by the volatile memory of the computer system may be selected. Such a large sort tree will have considerable computer system time overhead associated with initializing the tree, however. Additionally, after initialization and before the first sorted data records are read out of the tree, all the initialization values must first be read out since such values are always “winners” relative to the real data records. Therefore, at least a corresponding number of comparison steps will be required to read out all the initialization values from the sort tree prior to getting actual sorted data records. Also, each subsequent data record sorted must be compared up the entire height of the tree, which height is log N, where N is the number of exterior nodes. If a relatively small set of data records is actually to be sorted, it will be appreciated that creation of a large sort tree involves a considerable amount of wasted computer time and uses an unnecessarily large part of the volatile memory.
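The height penalty described above can be made concrete with a rough cost model. The sketch below assumes a binary tree whose number of exterior nodes N is a power of two, so that the height is log2 N, and it counts only the compares needed to move each record up the tree. The cost of initializing the tree and of reading out the initialization values is deliberately left out. The function name `per_record_compares` is hypothetical.

```python
def per_record_compares(record_count, leaf_count):
    """Rough model: each record is compared once per level of the tree.

    Assumes a binary sort tree with leaf_count exterior nodes, where
    leaf_count is a power of two, so the height is log2(leaf_count).
    Initialization and flushing of initialization values are ignored.
    """
    if leaf_count < 1 or leaf_count & (leaf_count - 1):
        raise ValueError("leaf_count must be a power of two")
    height = leaf_count.bit_length() - 1  # exact log2 for a power of two
    return record_count * height

# 1000 records in a tree sized for about a million records.
print(per_record_compares(1000, 2 ** 20))  # 20000
# The same records in a tree with 1024 exterior nodes.
print(per_record_compares(1000, 2 ** 10))  # 10000
```

Under these assumptions the oversized tree doubles the compare count for the same 1000 records, before any initialization overhead is considered.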
If a relatively small sort tree is selected, equally small sets of data records to be sorted will be sorted in a close to optimal manner. However, sets of data records which exceed the sort tree size will encounter inefficiencies associated with performing the sort in two or more separate runs followed by merging sorts. More specifically, undesirable I/O overhead may be associated with reading and writing data records to and from main nonvolatile storage or scratch files during the separate runs through the sort tree. Also, initializing the small sort tree multiple times followed by one or more merge sorts will inevitably waste computer time as compared to a single sort.
Accordingly, it will be appreciated that the user of the computer system is faced with a “Catch-22” when undertaking a sort of an unknown number of data records. A choice of tree size which is either too large or too small will inevitably involve inefficiencies and wasted computer time which could otherwise be devoted to sorting. Such wasted time and inefficient use of working memory may be very significant where large databases are involved or where a large number of separate sorts are required.
Accordingly, it will be appreciated that a need exists for an improved method for sorting unknown quantities of database records. It will further be appreciated that such a method is needed which can optimize the use of available volatile memory and which can minimize the I/O overhead associated with transfers between nonvolatile and volatile memory.
The present invention provides a method for optimizing volatile memory usage and minimizing sort time in sorting unknown or variable numbers of database records.
In accordance with the present invention, data records to be sorted are read into volatile memory, and a data record identifier, including a sort key and a pointer to a specific volatile memory location, is created for each data record. A sort tree having interior and exterior nodes hierarchically arranged is then created in volatile memory and initialized in a predetermined ordered fashion. The nominal sort tree size may be selected by the user or be predetermined, e.g. as the maximum size sort tree compatible with the constraints of the available volatile memory space. Then, data record identifiers, including the key and pointer, are introduced into the tree in an order which moves across the exterior nodes of the tree rather than randomly populating the exterior nodes. The sort tree is dynamically altered, during or after introduction of the data record identifiers into the tree, to optimize the effective size of the sort tree. After the data record identifiers have all been input and the tree is dynamically reconfigured, the sort proceeds, with the keys being compared up the tree and the keys and pointers shifted in volatile memory into the sorted order. The sorted pointers are then used to read the data records from volatile memory back into nonvolatile memory in sorted order.
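The role of the data record identifiers can be sketched as follows. This is an illustration only: a Python list stands in for volatile memory, a list index stands in for the pointer, and Python's built-in `list.sort` is used as a stand-in for the sort tree, which is not modeled here. The names `RecordId` and `sort_via_identifiers` are hypothetical.

```python
from typing import NamedTuple

class RecordId(NamedTuple):
    key: str      # sort key extracted from the record
    pointer: int  # assumed: index of the record in the in-memory buffer

def sort_via_identifiers(buffer, key_of):
    """Sort small (key, pointer) pairs instead of moving whole records."""
    ids = [RecordId(key_of(rec), i) for i, rec in enumerate(buffer)]
    # Stand-in for the tree sort: only keys are compared, and only
    # the identifiers are rearranged.
    ids.sort(key=lambda rid: rid.key)
    # Follow the sorted pointers to emit the records in sorted order.
    return [buffer[rid.pointer] for rid in ids]

records = [
    {"name": "Ng", "dept": "Sales"},
    {"name": "Adams", "dept": "Audit"},
    {"name": "Lee", "dept": "Legal"},
]
ordered = sort_via_identifiers(records, key_of=lambda rec: rec["name"])
print([rec["name"] for rec in ordered])  # ['Adams', 'Lee', 'Ng']
```

Sorting the identifiers leaves the records themselves in place until the final pass, so each record is moved only once.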
Since the sort tree is dynamically reconfigured to an optimized effective size, selecting the maximum nominal size of the sort tree has the advantage of minimizing the number of times which the sort tree will need to be initialized as well as minimizing inefficiencies attendant to performing sorts on separate runs and merging the results of those runs. In addition, I/O overhead may be reduced by minimizing the number of times that data must be read and written from nonvolatile memory during the separate runs through separate sort trees.
In a preferred embodiment, the sort tree is dynamically reconfigured as it is created as data record identifiers are read in. That is, the sort tree is grown as necessary to accommodate data record identifiers introduced into the nascent tree. The sort tree employs a movable root node which is always set as low as possible in the sort tree. The root node is moved upwards as needed when data records are added. After the dynamically created and initialized sort tree is completed and all data record identifiers have been loaded, the data record key values are sorted using a compare rule in which a key value at a lower level in the sort tree hierarchy will leapfrog key values of equal value when they are compared.
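One way to picture the movable root is to compute the lowest node whose subtree covers every exterior node filled so far, given that the exterior nodes are filled from left to right. The sketch below assumes a heap-style numbering (root at index 1, children of node n at 2n and 2n+1, leaves at N through 2N-1) and a nominal leaf count N that is a power of two. Neither the numbering nor the function name `movable_root` comes from the disclosure, and the leapfrog compare rule is not modeled.

```python
def movable_root(loaded, nominal_leaves):
    """Return (node_index, height) of the lowest usable root.

    loaded         -- number of exterior nodes filled so far, left to right
    nominal_leaves -- nominal leaf count N (assumed to be a power of two)
    Uses an assumed heap-style numbering with the full-tree root at 1.
    """
    if nominal_leaves < 1 or nominal_leaves & (nominal_leaves - 1):
        raise ValueError("nominal_leaves must be a power of two")
    if not 1 <= loaded <= nominal_leaves:
        raise ValueError("loaded must be between 1 and nominal_leaves")
    height = (loaded - 1).bit_length()  # ceil(log2(loaded))
    # Ancestor of the leftmost leaf, `height` levels above it.
    return nominal_leaves >> height, height

# The root climbs only when a new record no longer fits under it.
for count in (1, 2, 3, 5, 8):
    print(count, movable_root(count, 8))
# 1 (8, 0)
# 2 (4, 1)
# 3 (2, 2)
# 5 (1, 3)
# 8 (1, 3)
```

With three records loaded into a nominal eight-leaf tree, the effective root sits two levels above the leaves instead of three, so each compare path is one level shorter.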
In an alternative embodiment, a sort tree is completely initialized and data record identifiers are then read into the exterior nodes of the sort tree in the above-described ordered manner. Once all data values have been loaded, the sort tree is dynamically reduced to a more optimal size. One preferred reducing operation is to dynamically truncate, or “prune,” the tree by eliminating unused exterior nodes and corresponding interior nodes. Data sorting may then proceed in the reduced tree using the above-noted compare rule. In a further alternative embodiment, all unused nodes are changed to a value corresponding to a predetermined loser value; i.e. a value which will lose all compares. Those nodes associated with dynamically changed loser values then become a dormant background of the sort tree since these values do not advance during compares. This effectively reduces the size of the sort tree. This approach may be combined with the pruning approach where sort consistency considerations prevent pruning all unused nodes from the tree. By reducing the size of the sort tree after initialization, the number of compares required to eliminate initialization values and to remove sorted data identifiers from the tree is reduced. Therefore the present invention provides a method for sorting and storing database records using volatile and nonvolatile memory in which the size of the sort tree may be effectively reduced automatically.
Further features and advantages of the present invention will be appreciated from review of the following detailed description of the invention.