1. Technical Field
The present invention relates to data processing, and more particularly to random sampling of data items in a hierarchical data structure such as a "tree."
2. Description of the Related Art
Data items in a database are regularly sampled for statistical applications such as financial auditing, inventory control, and quality control. Although various sampling methods are used, the objective is to select a representative sample of the data items, so that conclusions about a large population of data items in the database can be drawn from a relatively small number of data items in the sample. See, for example, D. Montgomery, Introduction to Statistical Quality Control, Wiley, 1985, pp. 23-55 and 351-429.
A widely used method of sampling is known as "random" sampling. Random sampling on a set of data items numbered 1 to K is defined as a selection of a subset of data items having numbers that are randomly scattered on the [1:K] interval. Preferably, the numbers are also uniformly distributed over the [1:K] interval. This special case is known as simple or unbiased random sampling, and it prevents any correlation of the sampled subset with any meaningful data characteristic.
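The definition above can be illustrated with a minimal sketch, assuming the data items really are numbered 1 to K. The function name `simple_random_sample` and the parameters `k` and `n` are hypothetical names chosen for this illustration only.

```python
import random

def simple_random_sample(k, n, rng=random):
    """Pick n distinct item numbers, uniformly distributed over [1:K].

    Assumes the items are numbered consecutively from 1 to k, which the
    following paragraphs explain is rarely true in a real database.
    """
    return rng.sample(range(1, k + 1), n)

# Example: draw 5 item numbers out of a population of 1000.
sample = simple_random_sample(1000, 5, random.Random(0))
```

Every subset of size n is equally likely under this scheme, which is what keeps the sample uncorrelated with any characteristic of the data.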
In most database systems, data items are indexed by keys to permit retrieval of specified data items, but the data items are not assigned consecutive numbers. If consecutive numbers were initially assigned as keys, for example, then the deletion of a single data item, except the one assigned the highest key, would require re-assignment of some of the keys to preserve consecutive numbering. For complex data structures, re-assignment of consecutive numbers to the data items also interferes with activities such as compression, clustering, and multi-table intermix.
One kind of complex data structure in widespread use is a hierarchical data structure that looks like an inverted tree. For this reason it is known simply as a tree. The MS-DOS (Trademark) operating system for personal computers written by Microsoft Corporation, for example, uses a tree for indexing files stored in disk memory. The tree is known as a directory for a disk drive. The tree includes a main directory corresponding to the "root" of the tree, a subdirectory for each "branch" or intermediate node of the tree, and a file for each "leaf" of the tree. The root directory, each subdirectory, and each file is assigned a name. Although the file name is a "key" for a specified file, it does not uniquely specify the file, because the same file name can be assigned to different files in different directories. A file is accessed by searching a specified path in the tree, beginning with the root directory. Therefore a file is specified by a "path name" including the root name followed by any subdirectory names along the path, followed by the file name. A backslash (\) symbol is used as a delimiter between the various names. See, for example, MS-DOS Version 3.3 Reference Guide, Compaq Computer Corporation (February 1988), pp. 2-1 to 2-16.
Trees are also widely used in relational database systems for sorted indexed retrieval of records. Each record is a row in a table, and a key is assigned to each record. Each record is also a leaf in a hierarchical index in the form of an inverted tree, but in this case each leaf in the tree has the same depth. In other words, each leaf is connected to the root by the same number of branches.
Sorted indexed retrieval is an efficient method of retrieving exact matches in response to the complete value of the key (corresponding to the "path name" described above combined with the key assigned to the row in the table). Sorted indexed retrieval is also useful for finding out whether a row with a certain key value exists in a table. The search for a record having a specified complete key value is facilitated by maintaining the "branch names" in a sorted list in the root directory and each sub-directory of the tree.
To define a tree index for a database, a database administrator may issue a "CREATE INDEX" statement to the database system. In this "CREATE INDEX" statement, the administrator specifies a string of identifiers of attribute fields in each table, and in response the database system creates a tree having a level of branches corresponding to each attribute field, and at each level, the database system creates a branch corresponding to each distinct value in the attribute field appearing in the tables. The "CREATE INDEX" command may also indicate that the indexes of keys are to be maintained in a sorted condition.
Further details about sorted tree indexing are found in Hobbs & England, Rdb/VMS--A Comprehensive Guide, Digital Press, Digital Equipment Corporation, Maynard, Mass. (1991). Design considerations for database management systems using tree indexing are also described in Carey et al., "Object and File Management in the EXODUS Extensible Database System," Proceedings of the Twelfth International Conference on Very Large Data Bases, Kyoto, (August, 1986) pp. 91-100.
Current methods for random sampling of data items from trees are either highly biased or expensive. Some of these methods are compared and contrasted in Olken & Rotem, "Random Sampling from B+ trees," Proceedings of the Fifteenth International Conference on Very Large Data Bases, Amsterdam, 1989, pp. 269-277.
An easy method of random sampling from a tree is to perform a random walk from the root to a leaf. In other words, while descending from the root, a branch is selected randomly at each level. This method, however, may deliver an extremely biased sampling, because the nodes at each level may have different numbers of branches, leading to preferred selection of paths having fewer branches. Experiments with indexes in the Rdb/VMS (Trademark) database system, for example, have shown that a small region of an index can contain a majority of the samples, making this easy method highly biased.
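The bias of the random walk can be made concrete with a small sketch. It assumes a toy tree written as nested Python lists, where a list is a node and a string is a leaf. The names `random_walk`, `walk_probabilities`, and `tree` are hypothetical and are not taken from any database system.

```python
import random
from fractions import Fraction

def random_walk(node, rng=random):
    """Descend from `node`, picking one branch uniformly at each level."""
    while isinstance(node, list):
        node = rng.choice(node)
    return node

def walk_probabilities(node, p=Fraction(1)):
    """Return {leaf: exact probability that random_walk reaches it}."""
    if not isinstance(node, list):      # a leaf
        return {node: p}
    probs = {}
    for child in node:
        # Each branch of this node is taken with probability 1/len(node).
        probs.update(walk_probabilities(child, p / len(node)))
    return probs

# Toy tree: all leaves at the same depth, but unequal fanout.
tree = [["a"], ["b", "c", "d", "e"]]
probs = walk_probabilities(tree)
# Leaf "a" sits under a node with one branch and is reached with
# probability 1/2; each of "b".."e" is reached with probability 1/8.
```

In this toy tree the leaf under the sparsely populated node is four times as likely to be sampled as any other leaf, even though every leaf lies at the same depth.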
For a tree having all of its leaves at the same bottom level, it is known to eliminate bias by performing an acceptance/rejection test at each level during descent from the root so that each branch at each level has an equal chance of being selected, and therefore each leaf will have an equal chance of being selected. In other words, the sampling of a leaf will be simple and unbiased. Acceptance/rejection sampling in such a fashion, however, requires a probability of selecting any one branch at each node to be less than or equal to the probability of selecting the branch at the node having the maximum number of branches. The maximum number of branches in a tree is referred to as the "maximum fanout." In most systems the maximum fanout is either very large or limited only by the memory capacity of a single index page of the particular data processing system, but many nodes will have only a few branches. These conditions dictate that acceptance/rejection sampling in the above fashion requires a high average rate of rejection. In some typical Rdb/VMS (Trademark) database systems, rejection rates have varied from 99% for a large 913K index to 99.999993% for a small 8K index, based on the maximum fanout actually found in these databases. Acceptance/rejection sampling in the above fashion therefore requires an inordinate amount of processing time and is impractical.
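The following sketch shows one common formulation of acceptance/rejection sampling on a tree. It is not necessarily the exact procedure of the cited reference. It assumes a toy tree of nested Python lists with string leaves, and all function names are hypothetical. At each node a slot is drawn uniformly from the maximum fanout; a slot beyond the node's actual branches rejects the whole descent.

```python
import random
from fractions import Fraction

def max_fanout(node):
    """Largest number of branches at any node of the tree."""
    if not isinstance(node, list):
        return 0
    return max([len(node)] + [max_fanout(c) for c in node])

def try_descent(root, fmax, rng):
    """One descent; returns a leaf, or None if the attempt is rejected."""
    node = root
    while isinstance(node, list):
        slot = rng.randrange(fmax)      # uniform over fmax slots
        if slot >= len(node):           # empty slot: reject
            return None
        node = node[slot]
    return node

def sample_leaf(root, rng=random):
    """Repeat descents until one is accepted; leaves are equally likely."""
    fmax = max_fanout(root)
    while True:
        leaf = try_descent(root, fmax, rng)
        if leaf is not None:
            return leaf

def acceptance_rate(root):
    """Exact probability that a single descent is accepted."""
    fmax = max_fanout(root)
    def accepted(node):
        if not isinstance(node, list):
            return Fraction(1)
        return sum(accepted(c) for c in node) / fmax
    return accepted(root)

tree = [["a"], ["b", "c", "d", "e"]]
rate = acceptance_rate(tree)            # 5 leaves / 4**2 slots = 5/16
```

Even in this two-level toy tree, 11 of every 16 descents are rejected on average. Because the acceptance rate is the number of leaves divided by the maximum fanout raised to the depth of the tree, it falls quickly when the maximum fanout is large and most nodes are sparsely filled.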
Another method of random sampling on trees provides simple or unbiased sampling, but at the expense of high overhead in both processing time and data structure maintenance. It is based on the conceptually simple method of identifying the number (K) of leaves, numbering the leaves 1 to K, and then randomly picking a number i within the range [1:K] to select the leaf associated with the number i. The simplicity of this method depends upon storing, for each node in the tree (except the leaves), the number r of leaves that are descendants of the node. Such a tree is known as a "ranked" tree.
In such a ranked tree, it is possible to select an ith leaf during a straight descent from the root by computing a running sum of the numbers of leaves that are descendants at each node along the descent, and selecting branches dependent upon a comparison of the number i to the running sum. To select a leaf at random, the number i is selected at random from the range [1:K], where K is the number of leaves. Although this method provides simple and unbiased sampling, it is impractical, because when a leaf is inserted or deleted, the cardinalities r along the entire path from the root to the leaf have to be updated.
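Selection of the ith leaf in a ranked tree can be sketched as follows. The `Node` class, its field `r`, and the functions `select_leaf` and `sample_leaf` are hypothetical names used only for illustration. The sketch assumes leaves are plain strings and that each internal node stores the count r of leaves below it.

```python
import random

class Node:
    """Internal node of a ranked tree; leaves are plain strings."""
    def __init__(self, children):
        self.children = children
        # r: number of leaves that are descendants of this node.
        self.r = sum(c.r if isinstance(c, Node) else 1 for c in children)

def select_leaf(root, i):
    """Return the i-th leaf (1-based) in a single descent from the root."""
    if not 1 <= i <= root.r:
        raise IndexError("leaf number out of range")
    node = root
    while isinstance(node, Node):
        running = 0                     # leaves skipped at this level
        for child in node.children:
            size = child.r if isinstance(child, Node) else 1
            if i <= running + size:     # the i-th leaf lies under child
                i -= running            # renumber relative to child
                node = child
                break
            running += size
    return node

def sample_leaf(root, rng=random):
    """Unbiased sample: pick i uniformly from [1:K], then select it."""
    return select_leaf(root, rng.randint(1, root.r))

root = Node([Node(["a"]), Node(["b", "c", "d", "e"])])
leaves = [select_leaf(root, i) for i in range(1, root.r + 1)]
```

The sketch builds the counts once and never changes the tree. In a live index, each insertion or deletion of a leaf would have to adjust `r` at every node on the path back to the root, which is the maintenance cost described above.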