As is well known, a database system is a collection of interrelated data files and a set of programs that allow one or more users to add, retrieve and modify the data stored in these files. The fundamental concept of a database system is to provide users with a so-called "abstract" and simplified view of the data (also referred to as a data model or conceptual structure), which exempts a conventional user from dealing with details such as how the data is physically organized and accessed.
Some of the well known data models (i.e. the "Hierarchical model", "Network model" and "Relational model") will now be briefly reviewed. A more detailed discussion can be found, for example, in: Henry F. Korth, Abraham Silberschatz, "Database System Concepts", McGraw-Hill International Editions, 1986, Chapters 3-5, pp. 45-172.
Generally speaking, all the models discussed below have a common property in that they represent each "entity" as a "record" having one or more "fields", each being indicative of a given attribute of the entity (e.g. a record of a given book may have the following fields: "BOOK ID", "BOOK NAME", "TITLE"). Normally one or more attributes constitute a "key", i.e. they uniquely identify the record. In the latter example "BOOK ID" serves as a key. The various models are distinguished from one another, inter alia, in the way that these records are organized into a more complex structure.
Relational Model--The relational model, introduced by Codd, is a landmark in the history of database development. In relational databases an abstract concept has been introduced, according to which the data is represented by tables (referred to as "relations") in which the columns represent the fields and rows represent the records.
The association between tables is only conceptual. It is not a part of the database definition. Two tables can be implicitly associated by the fact that they have one or more columns whose values are taken from the same set of values (called "domain").
Other concepts introduced by the relational model are high level operators that operate on tables (i.e., both their parameters and results are tables) and comprehensive data languages (now called 4th generation languages) in which one specifies what the required results are rather than how these results are to be produced. Such non-procedural languages (e.g. SQL, Structured Query Language) have become an industry standard. Furthermore, the relational model suggests a very high level of data independence: programs written in these languages should not be affected by changes in the manner in which data are organized, stored, indexed and ordered. The relational model has become a de-facto standard for data analysts.
Network Model--In the relational model, data (and the relationships between data) are regarded as a collection of tables. In distinction therefrom, in the network model data are represented as a collection of records, whereas the relationships between the records are represented as links.
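By way of illustration only, the implicit association of two tables through a shared domain, and the non-procedural style of query, can be sketched with Python's built-in sqlite3 module. The table names, column names and sample rows below are hypothetical and do not come from the cited references. Note that no link between the two tables is declared; they are associated only because both have a book_id column drawn from the same set of values.

```python
import sqlite3

# Hypothetical tables and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (book_id INTEGER, book_name TEXT);
    CREATE TABLE loan (book_id INTEGER, borrower TEXT);
    INSERT INTO book VALUES (10, 'Book A'), (11, 'Book B'), (12, 'Book C');
    INSERT INTO loan VALUES (10, 'Reader 1'), (11, 'Reader 2');
""")

# The query states WHAT is required (borrowed books and their borrowers);
# HOW the rows are located is left to the engine.
# Both the inputs and the result of the join are tables.
rows = conn.execute("""
    SELECT book.book_name, loan.borrower
    FROM book JOIN loan ON book.book_id = loan.book_id
    ORDER BY book.book_id
""").fetchall()

print(rows)
```

The same query would remain valid if an index were later added to either table, which is the data independence referred to above.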
A record in the network model is similar to an "entity" in the sense that it is a collection of fields each holding one type of data. The links may be effectively viewed as pointers. A collection of records and the relation therebetween constitutes a collection of arbitrary graphs.
Hierarchical Model--The Hierarchical Model resembles the network model in the manner that data and relations between data are treated, i.e. as records and links. However, in distinction from the network model, the records and the relations between them constitute a collection of trees rather than of arbitrary graphs. The structure of the Hierarchical Model is simple and straightforward, particularly where the data to be organized in a database are of an inherently hierarchical nature. Consider, for example, a basic entity "Employee" with the subordinated attributes "Employee_Salary" and "Employee_Attendance". The latter may also have subordinated attributes, e.g. "Employee_Entries" and "Employee_Exits". In this scenario the data is inherently hierarchical and therefore should preferably be organized in the hierarchical model. This, however, is typically not the case. Consider, for example, a scenario where an "Employee" is assigned to several "Project" entities and the time he/she spends ("Time_Spent") in each project is an attribute that belongs to both the "Employee" and the "Project" entities. Such an arrangement of data cannot be easily organized in the hierarchical model, and one possible solution is to duplicate the item "Time_Spent" and hold it separately in the hierarchies of "Employee" and "Project". This approach is cumbersome and error prone in the sense that it is now required to assure that the two instances of "Time_Spent" are kept identical at all times. Since arrangements of data that do not have an inherently hierarchical structure are very common in real life, the hierarchical model is inappropriate for serving as a database in many real-world scenarios.
As mentioned in the foregoing, data models deal with the conceptual or logical level of data representation and "hide" details such as how the data are physically arranged and accessed. The latter characteristics are normally dealt with by a so-called "database file management system". The main goal of the database file management system (occasionally also referred to as a "database engine") is to enhance database performance in terms of time (i.e. from the user's standpoint, fast response time of the database) and space (i.e. minimizing the storage volume that is allocated for the database files). As is well known in the art, there is normally a trade off between the time and space requirements. The performance of the database depends on the efficiency of the data structures that are used to represent the data and on how efficiently the system can operate on these data. A detailed discussion of conventional file management systems is given, for example, in Chapters 7 (file system structure) and 8 (indexing and hashing) of "Database System Concepts", ibid.
A database engine maps the logical structure into physical files and affords access paths to the database records. The following techniques are typically utilized by known database engines in order to facilitate access to data.
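The "Time_Spent" difficulty can be made concrete with a minimal sketch. The Record class, its attribute names and the sample values are hypothetical and serve only to contrast a tree, where each record has a single parent, with a graph, where one record may be linked from several others.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    name: str
    fields: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # links act as pointers to other records

# Hierarchical layout: every record has exactly one parent,
# so Time_Spent must be stored once under Employee and once under Project.
employee = Record("Employee", {"id": "E1"})
project = Record("Project", {"id": "P1"})
employee.links.append(Record("Time_Spent", {"hours": 12}))
project.links.append(Record("Time_Spent", {"hours": 12}))

# Updating only one copy leaves the two hierarchies inconsistent.
employee.links[0].fields["hours"] = 15
tree_copies_agree = (
    employee.links[0].fields["hours"] == project.links[0].fields["hours"]
)

# Network layout: a single Time_Spent record is linked from both owners.
shared = Record("Time_Spent", {"hours": 12})
emp_n = Record("Employee", {"id": "E1"}, [shared])
proj_n = Record("Project", {"id": "P1"}, [shared])
emp_n.links[0].fields["hours"] = 15
graph_views_agree = (
    emp_n.links[0].fields["hours"] == proj_n.links[0].fields["hours"]
)

print(tree_copies_agree, graph_views_agree)  # False True
```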
Hashing--This technique is usually a very quick method for locating a record once the value of its key is known. It involves the translation of the key into a pointer by some formula and then a direct access. Its drawbacks are that only one access key can be used on the same record and that a good translation ("hashing") formula is not always available. The access usually requires one I/O operation, but there are cases when the formula maps more than one record to the same position. This situation requires additional operations to resolve the conflicts. If the "hashing" formula is good these cases are rare, and therefore the average number of I/O operations is somewhat greater but not much greater than one.
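The behaviour described above can be sketched as follows. This is an illustrative toy, not any particular engine's implementation: the HashFile class, the modulo "formula" and the use of in-memory lists in place of disk blocks are all assumptions, and each entry examined in a bucket stands in for one access.

```python
class HashFile:
    """Toy hashed file: a key is translated to a bucket address by a formula."""

    def __init__(self, n_buckets):
        self.buckets = [[] for _ in range(n_buckets)]

    def _address(self, key):
        # Hypothetical hashing formula: key modulo the number of buckets.
        return key % len(self.buckets)

    def insert(self, key, record):
        self.buckets[self._address(key)].append((key, record))

    def find(self, key):
        """Return (record, probes); probes counts the entries examined."""
        probes = 0
        for stored_key, record in self.buckets[self._address(key)]:
            probes += 1
            if stored_key == key:
                return record, probes
        return None, probes


f = HashFile(8)
f.insert(3, "a")
f.insert(11, "b")   # 11 % 8 == 3, so this collides with key 3
f.insert(5, "c")

results = {k: f.find(k) for k in (3, 11, 5)}
average_probes = sum(p for _, p in results.values()) / len(results)
print(results, average_probes)
```

With one collision among three keys, the average number of probes is 4/3, i.e. somewhat greater than one, as stated above.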
Full indexing--This technique can be used to create a virtually unlimited number of access paths to the same data. The index is a search pattern, which ultimately locates the data. Its main disadvantages are that it requires space (usually all the keys to the records plus some pointers) and maintenance (addition and/or deletion of keys whenever a record is added and/or deleted respectively, or when its key is updated). Normally, the nature of the indexing technique as well as the volume of the data held in the files determine the number of I/O operations that are required to retrieve, insert, delete or modify a given data record.
Various types of indexing schemes have been developed but, normally, an indexing implementation is more costly than the aforementioned techniques. On the other hand, indexing is the simplest and most common method for acquiring multiple access paths to the same data. One of the most widely used indexing algorithms is the B-TREE (under various commercial product names) in which the keys are kept in a balanced tree structure and the lowest level points at the data itself.
A detailed explanation of the B+-Tree indexing algorithm (and of the closely related B-Tree) can be found in "Database System Concepts", ibid., pp. 275-282. The number of I/O operations obeys the expression $\log_K N + 1$, where K is an implementation dependent constant and N is the total number of records. This means that the number of I/O operations, and hence the response time, keeps growing (logarithmically in N) as the number of records increases, which for sufficiently large files may at some point cause unacceptably slow response time.
It is possible, of course, to use a combination of the above or other techniques, e.g. an indexing technique in combination with a linked list of records (i.e. records that are serially linked by means of uni-directional or bi-directional pointers). Normally, the beginning of the list (i.e. the first record in the list) is accessed by an indexing technique and thereafter the pointers are followed until the sought record is found.
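The expression above can be evaluated for a few file sizes. The sketch below assumes K = 100 and rounds the logarithm up to a whole number of tree levels; both choices, and the function name, are illustrative assumptions rather than properties of any particular product.

```python
def estimated_io(n_records, fanout):
    """Evaluate ceil(log_K N) + 1 using integer arithmetic only."""
    levels = 0
    capacity = 1
    while capacity < n_records:
        capacity *= fanout   # each additional level multiplies the reach by K
        levels += 1
    return levels + 1        # +1 for the access to the data record itself


estimates = {
    n: estimated_io(n, fanout=100)
    for n in (1_000, 1_000_000, 1_000_000_000)
}
print(estimates)  # {1000: 3, 1000000: 4, 1000000000: 6}
```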
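A minimal sketch of such a combination is given below, assuming a Python dictionary as the index and singly linked records. The names Node, build_chain and find, and the sample data, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    key: int
    payload: str
    next: Optional["Node"] = None   # uni-directional pointer to the next record

def build_chain(items):
    """Link (key, payload) pairs into a list and return its first record."""
    head = None
    for key, payload in reversed(items):
        head = Node(key, payload, head)
    return head

# The index locates only the FIRST record of each list.
index = {
    "A": build_chain([(1, "a1"), (2, "a2"), (3, "a3")]),
    "B": build_chain([(7, "b7")]),
}

def find(index, list_id, key):
    """Return (payload, hops): index lookup, then pointer following."""
    node = index.get(list_id)
    hops = 0
    while node is not None:
        if node.key == key:
            return node.payload, hops
        node = node.next
        hops += 1
    return None, hops

print(find(index, "A", 3))  # ('a3', 2)
```

The cost of a lookup is one index access plus one pointer traversal per record that precedes the sought record in its list.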
One of the significant drawbacks of the aforementioned popular B+-Tree indexing algorithm is that the indices portion of the data is not only held as an integral portion of the data of the leaves of the tree, but is also held in the interim nodes of the tree serving as a search path for realizing "FIND", "INSERT" and/or "DELETE" record actions. This results, of course, in the undesired inflation of the database size and the latter drawback is further aggravated when indexes of large size are utilized (i.e. when a relatively large number of bits is required for representing the index).
One possible approach to cope with this problem is to exploit the tries (pronounced "try-S") indexing technique discussed, for example, in G. Wiederhold, "File Organization for Database Design", McGraw-Hill, 1987, pp. 272-273.
Generally speaking, the tries indexing technique enables a rapid search whilst avoiding the duplication of indexes as manifested, for example, by the B+-Tree technique. The tries indexing file has the general structure of a tree wherein the search is partitioned according to search key portions (e.g. a search key digit or bit). Thus, for example, each node in the tries indexing file represents a digit position of a search key and the link to any one of its children represents the digit's value. The tries structure affords an efficient data structure in terms of the memory space that is allocated therefor, since the search key is not held, as a whole, in interim nodes and hence the duplication that is exhibited, for example, in the B+-Tree indexing technique is avoided.
In order to achieve enhanced performance in terms of response time, a tries indexing file should be built by selecting the digits (or bits) from the search key such that the best possible partition of the search space is obtained, or in other words, so as to obtain a tree which is as balanced as possible.
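The structure just described can be sketched with nested dictionaries, assuming keys that are strings of decimal digits. The function names and the "$" end-of-key marker are hypothetical conveniences. The depth of a node corresponds to a digit position, each child link is labelled with a digit value, and no node stores a whole key.

```python
END = "$"  # hypothetical marker: a node holding END also holds a record pointer

def trie_insert(root, key, record_ptr):
    node = root
    for digit in key:                    # one level per digit position
        node = node.setdefault(digit, {})
    node[END] = record_ptr

def trie_find(root, key):
    node = root
    for digit in key:
        node = node.get(digit)           # follow the link labelled with this digit
        if node is None:
            return None
    return node.get(END)

root = {}
trie_insert(root, "1203", 0)
trie_insert(root, "1270", 1)
trie_insert(root, "45", 2)

print(trie_find(root, "1270"))   # 1
print(trie_find(root, "12"))     # None: "12" is only a shared prefix
print(sorted(root))              # ['1', '4']
```

Keys "1203" and "1270" share the path for the digits "1" and "2", so the common prefix is held only once.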
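One simple way to choose a partitioning bit, shown here only as an illustrative heuristic and not as the method of any cited reference, is to pick the bit position that splits the known keys most evenly. The function name and the sample keys are hypothetical. Note that the function needs the full set of keys as input.

```python
def best_split_bit(keys, width):
    """Return the bit position (0 = least significant) whose 0/1 split of
    `keys` is most even; ties go to the lowest position."""
    best_pos, best_imbalance = None, None
    for pos in range(width):
        ones = sum((k >> pos) & 1 for k in keys)
        imbalance = abs(len(keys) - 2 * ones)   # |zeros - ones|
        if best_imbalance is None or imbalance < best_imbalance:
            best_pos, best_imbalance = pos, imbalance
    return best_pos

keys = [0b0001, 0b0011, 0b0101, 0b0111]
# Bit 0 is set in every key and bit 3 in none, so neither partitions anything;
# bits 1 and 2 each split the keys two against two.
chosen = best_split_bit(keys, width=4)
print(chosen)  # 1
```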
Hitherto known tries indexing file structures have inherent drawbacks. Thus, for example, as is well known to those versed in the art and as explained in "File Organization for Database Design", ibid., the goal of obtaining a balanced tree necessitates prior knowledge of the index values (which necessarily entails prior knowledge of the data records in the file). Normally, however, there is no prior knowledge of the contents of a database. Consider, for example, a database that holds an inventory of items stored in a warehouse: the inventory changes dynamically as new shipments of items are received from suppliers or delivered to clients. The drawback of having to know the contents of the database in advance in order to accomplish an efficient tries structure is therefore obvious. Another clear drawback of the conventional tries structure is that the data is not kept in a sorted form, which hinders an efficient search for related items (this difficulty is exhibited, for example, when responding to the query: in a database that holds particulars of suppliers, retrieve the full name and address of all suppliers having a surname that starts with `A`). Accordingly, the tries indexing file of the kind specified remains only a theoretical concept which, from a commercial standpoint, is practically infeasible.
It is therefore the object of the present invention to reduce the drawbacks of data processing systems that exploit hitherto known database file management systems. Specifically, it is the object of the present invention to provide for a data processing system that exhibits enhanced database performance by utilizing an efficient database file management system.