A database generally comprises a plurality of records for storing information on some topic. For example, a database may have records for each city in the United States. Each record of a database generally comprises a predetermined number of fields. In general, each record has the same distribution of fields as other records. Each field contains information on a particular aspect of the objects of the database. For example, the database of cities in the United States may have a field containing the population of each city. Additionally, the database may have a field for indicating the state which contains the city in question. Some fields in a database may have values that range in a continuum of values, e.g., the population of the city. Other fields in a database may have values which range in a set of predetermined values, e.g., the state in which the city is located.
One common form for representing the records in a database is the relational database format. The relational database comprises a table having rows and columns. Each row represent a particular record, and each column row represents a data field. All rows share the same columns and type of information. The above example database of cities may be formulated into a relational database. In such a database, each row might represent a city and the relational database would have a row for every city of interest. The relational database might include columns for storing the name of the city, the county in which it is located, the state in which the county is located, and the population of the city.
Data is added to the database by adding rows at the end of the database to represent added records. Information is extracted from the database in the form of queries. A query typically consists of a set of requirements which the data in one or more of the columns must satisfy for the record to be returned. A simple query takes the form of finding all rows in the relational table which have the same value in a given field, e.g., all records having the county field equal to a particular value. More complex queries to the database involve examining a plurality of columns, e.g., all rows in which the county has a particular value and the city has a particular value. For example, one might wish to view the population data for all cities in a particular county of a particular state.
If the number of records in the database is N, N examinations will have to be performed. While this is a relatively simple task for a small database, the examination of each record in the database for a large database constitutes a great expense of time. First, the search time for a simple query is related to N numbers of records in a database. Second, large databases often exceed the memory capacity of the computer and require the storage of records on some secondary medium such as disks and tapes. The access time for secondary medium are often orders of magnitude greater than access times in computer memory. The use of secondary storage in this manner leads a dramatic increase in time for simple queries.
One prior art method for reducing search times in such large databases is to optimize the order of the records in the database for one particular type of query. The optimization is accomplished by ordering the rows of the database according to the value of some variable calculated from the query. For example, suppose one often wishes to search the database for the data on a particular city by giving the name of the city. The rows would then be ordered based on the alphabetical order of the city names. A particular name could then be found by performing a binary search on the names in city column.
Consider a database having N records. Without ordering the records, the maximum time needed to search the database for a record having a particular value in a particular column is proportional to N. If the records are known to ordered, maximum number of records that need to be searched is reduced to approximately Log.sub.2 N. If N were approximately one million, the ordering of the database would reduce the maximum number of records to be searched from approximately one million to 20.
Although the ordering of the database bestows significant advantages, there is a limit to the benefit obtained. First, the database must be ordered. The work to perform the initial ordering is roughly equal to N Log.sub.2 N. In addition, each time new data is added to the database, the entire database must be reordered. Finally, the method only improves the search time for records having a specified value or range of values of the variable used to order the records.
In principle, one could maintain multiple copies of the database, each copy being ordered by different search variables. However, the additional storage space renders this approach unworkable.
A second prior art method addresses the limitation of one optimized search variable while avoiding the unacceptably large memory requirements of storing multiple copies of the database. In this method, a separate table is constructed for each search variable. This table will be referred to as a search table in the following discussion. Each search table includes a plurality of records, one such record corresponding to each record in the database. Each record has two fields. Hence, the search table has two columns. The first column is a pointer which specifies the physical location of the corresponding database record. The second column contains a value which is monotonically related to the search variable in question. The search table is then ordered by the search variable. That is, the records of the search table are rearranged such that the values in the second column appear in a monotonically increasing or decreasing order.
For example, consider the case in which three columns in the database consisted of the year, month, and day, respectively, of a date. To setup a search table for searching the database by specifying a range of dates, a search variable might be constructed by multiplying the year by 10000, adding the result to the month by 100 and then adding in the day of the month. This value would then be stored in the second field of the search record corresponding to each database record. After ordering the search the table, all records having a date between two values could be found by calculating the search variable value of the first date and performing a binary search on the second column of the search table. The search would return the first instance in which two records in the search table have second column values in which one of the values equals the search variable value, or the search variable value lies between the two variables. The records in question could then be identified by searching the records on each side of the pair of records in question.
In this second prior art method, a search table is constructed for each search variable on which searches are to be performed. Typically, the search tables require less memory space than the entire database and usually may be kept entirely in main memory. This allows for faster access with respect to the first prior art method. It should be noted that the computational effort to construct a search table is significantly greater than that needed to search the entire database once for a record satisfying some criterion based on the search variable in question. Hence, this approach represents a savings only when multiple searches are anticipated on the search variable in question.
This method has several other disadvantages. First, a large amount of redundant information is still maintained in the search table. The search table for each search variable comprises redundant information that must be maintained in addition to the database. Second, it requires storing and maintaining a search table for each possible search variable. All desirable search variables are seldom known at the time the database is being constructed. In addition, even if all search variables were known, the storage required by the corresponding search tables is often too large.
As a result, most systems utilizing search tables store a few search variable tables and utilize complex queries which reference more than one search variable. A complex query usually comprises a selection criterion based on a first variable AND a second variable. For example, assume that the geographic database discussed above included three columns for the month, day, and year on which the population data was valid, i.e., the census date.
Consider a request for all records for a given state in which the census date was after some predetermined date. A search table system might have two search tables, one for a date variable as described above and one for the state in which each city was located. The request in question would then be processed by finding all entries in the first search table which had a date after the predetermined date to obtain a first "result list" which is stored in a temporary memory buffer. The second search table would then be searched for all entries having the state variable equal to the state to produce a second result list which is stored in a second temporary memory buffer. The entries in the two result lists would then be compared against one another to find all entries that were present in both lists. If the results lists had L and M entries, respectively, this last comparison would require approximately L times M compare operations. If either L or M is large, the computational workload may actually exceed the workload needed to sequentially search the original database.
In addition to the above described problems, the relational databases itself tends to be inefficient as a mechanism for storing information. As noted above, the records must all contain the same fields. This leads to a significant amount of redundant information being stored. Consider the geographical database described above in which each record includes fields for state, county, and city. For each city in a given county, the field information for state must be repeated even though the combination of county and city uniquely defines the state.
The problem of this type of redundant information storage is addressed by a third prior art method of database system storing information which will be referred to as hierarchical databases, or tree-structure databases. A hierarchical database consists of a series of sets and subsets which store information. For storing geographical information in such a database, a country comprises a set of states. Each state includes a set of counties, and each county includes a set of cities.
The above example consist of four levels, or classes: country, states, counties, and cities. The example further comprises three relationships, or links: country to state, state to county, and county to city. The hierarchical database makes use of the observation that there is a natural progression from one level to the next level in the above relationships. The data relationships can be structured as a "tree" with the top, e.g., country in the above example, at the root node and the bottom level, e.g. city, at the "leaf" node level.
Due to the hierarchical or set-subset nature of the data in such a database, less space is required to store the information in a database. In progressing from the root node to the leaves, each successive level "inherits" the information of all levels above it. Hence, there is no need to repeat this information. The information is contained in the information which links the node in question to the node above it in the tree. The node from which a given node inherits is often referred to as its "parent".
In addition to providing more efficient storage for hierarchically ordered data, queries based on this database often may be answered without searching through all of the leaf nodes or keeping separate search tables. For example, all the cities contained within a state may be easily found by tracing from the entry in the state set for the state in question, through the county sets linked to that state, and then collecting all the cities linked to the counties found.
The advantages of the hierarchical database, however, are realized only when the information within the records of the database lends itself to a natural hierarchical ordering. Although some of the information in a typical database is hierarchical in nature, it is rarely found that all of the information may be organized into a single tree structure. If multiple trees are needed to represent the data, then complex queries must be processed in a manner analogous to that described above and the multiple result lists be combined. The combining of the results requires approximately the same computational workload as that obtained in search table ordered databases.