As is well known, a database system is a collection of interrelated data files, indexes and a set of programs that allow one or more users to add data retrieve and modify the data stored in these files. The fundamental concept of a database system is to provide users with a so called "abstract" and simplified view of the data (referred to also as data model or conceptual structure) which exempts a conventional user from dealing with details such as how the data is physically organized and accessed.
Some of the well known data models (i.e. the "Hierarchical model", "Network model", "Relational model" and "Object Relational Model" will now be briefly reviewed. A more detailed discussion can be found for example in: Henry F. Korth, Abraham Silberschatz, "Database System Concepts", McGRAW-Hill International Editions, 1986 (or the 3.sup.rd edition (1997))., Chapters 3-5 pp. 45-172.
Generally speaking, all the models to be discussed below have a common property in that they represent each "entity" as a "record" having one or more "fields" each being indicative of a given attribute of the entity (e.g. a record of a given book may have the following fields "BOOK ID", "BOOK NAME", "TITLE"). Normally one or more attributes constitute a "key" i.e. it identifies the record. In the latter example "BOOK-ID" serves as a key. The various models are distinguished one from the other, inter alia, in the way that these records are organized into a more complex structure:
Relational Model--The relational model, introduced by Codd, is a landmark in the history of database development. In relational databases an abstract concept has been introduced, according to which the data is represented by tables (referred to as "relations") in which the columns represent the fields and rows represent the records.
The association between tables is only conceptual. It is not a part of the database definition. Two tables can be implicitly associated by the fact that they have one or more columns whose values are taken from the same set of values (called "domain").
Other concepts introduced by the relational model are high level operators that operate on tables (i.e., both their parameters and results are tables) and comprehensive data languages (now called 4th generation languages) in which one specifies what are the required results rather than how these results are to be produced. Such non-procedural languages (SQL--Structured Query Language) have become an industry standard. Furthermore, the relational model suggests a very high level of data independence. There should not be any effect on the programs written in these languages due to changes in the manner data are organized, stored, indexed and ordered. The relational model has become a de-facto standard for data analysts.
Network Model--In the relational model, data (and relationship between data) are regarded as a collection of tables. In distinction therefrom in the network model data are represented as a collection of records whereas relationship between the records (data) are represented as links.
A record in the network model is similar to an "entity" in the sense that it is a collection of fields each holding one type of data. The links may be effectively viewed preferably (but not necessarily) as pointers. A collection of records and the relation therebetween constitutes a collection of graphs.
Hierarchical Model--The Hierarchical Model resembles the network model in the manner that data and relations between data are treated, i.e. as records and links. However, in distinction from the network model, the records and the relations between them constitute a collection of trees rather than of arbitrary graphs. The structure of the Hierarchical Model is simple and straightforward particularly in the case that the data that needs to be organized in a database are of inherent hierarchical nature. The hierarchical model has some inherent shortcomings, e.g. in many real life scenarios data cannot be easily arranged in hierarchical manner. Moreover, even if data may be organized in hierarchical manner, it may require larger volumes as compared to other database models.
Consider for example a basic entity "Employee" with the following subordinated attributes "Employee_Salary" and "Employee_Attendance". The latter may also have subordinated attributes e.g. "Employee_Entries" and "Employee_Exits". In this scenario the data is of inherent hierarchical nature and therefore should preferably be organized in the hierarchical model. Consider, for example, a scenario where "Employee" is assigned to several "Projects" and the time he/she spends ("Time_Spent") in each project is an attribute that is included in both the "Employee" and "Projects" entities. Such arrangement of data cannot be easily organized in the hierarchical model and one possible solution is to duplicate the item "Time_Spent" and hold it separately in the hierarchies of "Employee" and "Project". This approach is cumbersome and error prone in the sense that it is now required to assure that the two instances of "Time_Spent" are kept identical at all times.
Object Oriented Model--A comprehensive explanation can be found in "Object Oriented Modeling and Design", James Rumbaugh, Michael Blaha, William Premerlani, Fredrick Eddi and William Lorensen.
The object-oriented approach views all entities a objects. Each object belongs to a class, with each class there are associated methods and fields. To enable encapsulation some the fields are private, accessible only to methods of the class while others are public accessible to all. Thus "Joe Smith" belongs to the class of persons. For that class, the private fields age can be defined. Applying the class method update_age() to the object Joe will change his age. The methodology allows to define sub-classes which inherit all the methods and fields of the super-class. Thus, for example, the employee class can be defined as a subclass of the person class. In addition one may define additional fields and methods to the subclass. Thus, the employee class could support a salary field, and the get raise() method.
Object Relational Model allows an object view on relational-organized data. Thus, one is able to operate on the data as if it is organized as objects and at the same time, support the relational approach. As mentioned in the foregoing, data models deal with the conceptual or logical level of data representation and "hide" details such as how the data are physically arranged and accessed. The latter characteristics are normally dealt with by a so-called database file management system.
The database file management system maps the logical structure (in terms of database model) to a data structure, pertinent operations and possibly other data. The data structure includes index and data records. The index enables accessing or updating the data records by a key. In the context of search, the term search key is used. Database file management system should preferably operate on the data records so as to accomplish enhanced performance in terms of time (i.e. from the user's standpoint fast response time of the database), and space (i.e. to minimize the storage volume that is allocated for the database files). As is well known in the art, normally, there is a trade off between the time and space requirements. The performance of the database depends on the efficiency of the data structures that are used to represent the data and how efficiently the system can operate on these data. A detailed discussion on conventional file and management systems is given for example in Chapters 7 (file system structure) and 8 (indexing) in "Database System Concepts", ibid.
Known database file management systems typically utilize the following indexing schemes, which fall into the following main categories that include: Multi-way trees indexes and others.
Multi-way trees indexes--These techniques can be used to create a one or more access paths (referred to also as search paths) to the same data record. The search paths form a multi-way tree. Its main disadvantages are that it requires space (usually all the keys to the records plus some pointers) and maintenance (addition and/or deletion of keys whenever an update transaction (see definition below) occurs i.e. record is added and/or deleted. Normally, the nature of the indexing scheme as well as the volume of the data held in the files determine the number of accesses that are required to find or update (update encompasses, insert, delete or modify) a given data record. In the case that the storage medium under consideration is an external memory, the number of accesses is effectively the number of I/O accesses. As will be explained below, in each access to the storage medium a block of data is loaded into the memory.
Various types of tree indexing schemes have been developed but, normally, an indexing implementation is more costly than the specified direct access indexing techniques. On the other hand, tree indexing allows sequential and sub-range processing. One of the most widely used indexing schemes is the B-tree (under various commercial product names and implementation variants such as B.sup.+ tree) in which the keys are kept in a balanced tree structure and the lowest level points at the data itself. Detailed explanation of the B-tree indexing scheme is found in "Database System Concepts" ibid. pp. 275-282. The number of I/O accesses obeys the algorithmic expression Log.sub.K N+1 where K is an implementation dependent constant and N is the total number of records. This means that the performance slows down logarithmically as the number of records increases.
It is possible, of course, to use a combination of the above or other techniques, e.g. an indexing scheme which is implemented in accordance with two or more of the above techniques.
One of the significant drawbacks of the aforementioned popular B-tree indexing scheme is that the keys are not only held as part of the data records, but also as part of the index.
This results, of course, in the undesired inflation of the index size and the latter drawback is further aggravated when indexes of large size are utilized (i.e. when a relatively large number of bits is required for representing the key).
One possible approach to cope with this problem is to exploit the Trie indexing scheme. An example of the latter is the trie discussed in G. Wiederhold, "File organization for Database design"; Mcgraw-Hill, 1987, pp. 272, 273, or in D. E. Knuth, "The Art of Computer Programming"; Addison-Wesley Publishing Company, 1973, pp. 481-505, 681-687.
Generally speaking, the trie indexing scheme enables a rapid search whilst avoiding the duplication of keys as manifested for example by the B tree technique. The trie indexing scheme has the general structure of a tree wherein the search is based on partitioning the search according to search key portions (e.g. search key digit or bit). Thus, for example each node in the trie indexing file represents an offset of the search key and the link to any one of its children represents the character's value at said offset. The trie structure affords efficient data structure in terms of the memory space that is allocated therefor, since, as specified before, the search-key is not held, as a whole, in internal nodes and hence the duplication that is exhibited for example in the B-tree indexing technique is avoided.
In a specific variant of the trie such as the trie described in "File organization for Database design" ibid., in order to achieve enhanced performance in terms of response time, a trie indexing file should be built by selecting the digits (or bits) from the search key such that the best possible partition of the search space in obtained, or in other words so as to accomplish a tree which is as balanced as possible. This, however, requires a priori knowledge of the data records of the trie and is accomplished at the penalty of obtaining an unsorted data, which in many real-life scenarios is inapplicable. It is noteworthy that if sorted data is mandatory, a balanced structure can not be guaranteed even if there is sufficient a priori knowledge of the data records of the trie. It should be noted that the specified trie does not support sequential sub-range processing.
When considering a large amount of data, it is of particular importance to maintain a so-called balanced structure of the tree index in order to avoid long paths for accessing a given data record from the root node to the leaf node that is associated with the sought data record. The specified B-tree indexing scheme, constitutes an inherent balanced tree structure, even after the tree has been subject to update transactions. The inherent balanced (or essentially balanced) structure is accomplished, however, and as explained above, at the penalty of inflating the contents of the blocks in the tree and, consequently, unduly increasing the file size that holds the index, particularly insofar as large trees which hold multitude of data records are concerned. The large volume of the files adversely affects the performance of the data management system in terms of number of accesses (and consequently in terms of accessing time) to the storage medium in order to reach a sought data record, which is obviously undesired.
Turning now to the "others" category of index schemes it includes for example the so called Skip list index: A skip list is a randomized data structure: It consists of levels, the lowermost level, level 0, consist of a list of all records ordered by non-decreasing order. Each node of level i (i=0, . . . , h) chooses, with probability p, whether to be a representative of level i+1. The representatives of level i constitute the nodes of level i+1. These representatives, too, are organized as an ordered list. Level h+1 is the first empty level.
Having discussed the major drawbacks of hitherto known index schemes i.e. inflated data volumes (e.g. B-tree and variants thereof) and susceptibility to unbalanced structure (e.g. trie), there follows a discussion in another aspect which pertain to various characteristics including subordination of data records and multi-dimensional characteristics.
Thus, consider for example, two types of data records represented as two entities (tables), i.e. Books and borrowers, each being associated with respective unique key, e.g. borrower is identified by Borrower_Id and book is identified by Book_Id. In real life scenario, such as in a public library, one is interested to view for example all books borrowed by a given borrower. The latter transaction exemplifies subordination of data records, where "books" are subordinated to "borrower". In order to resolve this query, one should apply two queries--one for the borrower information and another for the books borrowed by him (according to the composite key--book borrower).
Insofar as B-tree indexing scheme is concerned, in order to support the subordination of data in the manner specified, several separate index files are requires, as follows:
Books index file, accessible via book-Id key; PA1 Borrowers index file, accessible via borrower-Id key; PA1 Transactions via borrowers, accessible via the composite key (borrower-Id book-Id). PA1 1. The sequence of blocks of a file ordered by increasing key value of the primary key. The index leads the search to the block containing the key. To allow searches by a key that is not the primary key, a partitioned index is constructed such that for each record the partitioned index contains its key and its link. These pairs are ordered by non-decreasing value of the key. The index leads to the block containing the address of the desired record. PA1 2. A trie arranged in blocks. PA1 3. Other types of index schemes that meet the provision of partitioned index. PA1 1. at least one short link among the short links of a node (hereon split node) in the block is deleted (hereon split link) in a way that at least two tries exist in the block. PA1 2. each of the sub-trees is moved to a separate block. PA1 3. If the block of B.sub.i does not exist, B.sub.i is created and a copied node of the split node is created in B.sub.i. PA1 4. If the block of B.sub.i exists and a copied node of the split node does not exist in B.sub.i, then a copied node of the split node is created in B.sub.i and connected to the trie of B.sub.i such that B.sub.i-1 ' (at the end of the split process) is accessible in a search path that includes the root node in B.sub.i and the copied node and its labeled links according to the representative key of B.sub.i-1 '. PA1 5. If the copied node has no direct link, add a direct link from the copied node to the block B.sub.i-1. PA1 6. Add a far link from the copied node to the block B.sub.i-1 ' or if the copied node has a short link to a child node in the direction of the far link, the far link can be replaced by a direct link from the child node to block B.sub.i-1 '. PA1 (One) providing said basic partitioned index; PA1 (Two) constructing a representative index over the representative keys of said basic partitioned index; said layered index enables accessing or updating the data records by key or keys and constitutes a balanced structure of blocks. PA1 (1) providing said basic partitioned index; PA1 (2) constructing an index over the representative keys of said basic partitioned index; said index enables accessing or updating the data records by key or keys and constitutes a balanced structure of blocks. PA1 (1) providing a trie; PA1 (2) constructing an index over the representative keys of said trie; said index enables accessing or updating the data records by key or keys and constitutes a balanced structure of blocks. PA1 The data are held inherently in sorted form according to search key. Namely, One can navigate in the tree by the order of the keys of the data records. The layered index inherently supports sequential operations like "get next" and "get previous". In this respect, the proposed layered index constitutes an advantage over e.g. hashing scheme and some implementations of digital trees. PA1 There is no requirement for in advance knowledge of the contents of the database, in order to maintain balanced index. PA1 A balanced layered index is retained and the depth of the layered index is relatively small, thereby minimizing the number of accesses (normally slow I/O operations) that are required to perform update transaction or access data record. According to one embodiment, practically one I/O (and no more than two I/O) operation (constituting one or two access) is required in order to access a given data record from among billions data records. PA1 The data structure that includes designated index and designated data can maintain the relations between different data items. PA1 The data structure that includes designated index and designated data can link logically related items. PA1 The data structure that includes designated index and designated data can support several data models simultaneously and efficiently. PA1 The data structure that includes designated index and designated data allows high efficiency in maintaining data integrity. PA1 The data structure that includes designated index and designated data allows high efficiency in retrieving relating data.
Accordingly, the index scheme includes here three index files. This obviously poses undesired overhead insofar as data volumes and additional integrity maintenance and checking are concerned. Thus, for example, removal of a given book from the book file requires a preliminary test to inquire whether it exists in the borrower-book index file.
Having discussed the drawbacks of hitherto known techniques insofar as subordination of data records are concerned, the cumbersome representation and manner of operation thereof becomes even worth considering implementations of the so called multidimensional data records.
Reverting now to the latter example, the tables Books and borrowers are now regarded as multi-dimensional tables, which can be reached from several views. Thus, in addition to the above mentioned borrower.fwdarw.book view (books borrowed by borrower(s) which is implemented by an index over the borrower-book composite key, the database should support the alternative view of borrowers that borrowed a given book(s), which requires, of course, to utilize the alternative composite key (book-borrower).
In the Btree representation, it is accordingly required to add another index file accessible via the composite key (book-Id borrower-Id), giving rise to a total of four index files.
The pertinent drawbacks are self explanatory and become even worth for n dimensional tables (n&gt;2).
There is accordingly a need in the art to reduce the drawbacks of data processing systems that exploit hitherto known database file management system. Specifically, there is a need in the art to provide for a data processing system that exhibits database performance by utilizing an efficient database file management system.
There is yet further need in the art to provide for a database file management system that utilizes an index which inherently being not susceptible to unbalanced structure in the manner specified above.
There is still further need in the art to provide for an index which inherently supports representation of multiple types of data, subordination of data records and/or multi-dimensions.