As is well known, a database system is a collection of interrelated data files, indexes and a set of programs that allow one or more users to add data retrieve and modify the data stored in these files. The fundamental concept of a database system is to provide users with a so called "abstract" and simplified view of the data (referred to also as data model or conceptual structure) which exempts a conventional user from dealing with details such as how the data is physically organized and accessed.
Some of the well known data models (i.e. the "Hierarchical model", "Network model", "Relational model" and "Object Relational Model" will now be briefly reviewed. A more detailed discussion can be found for example in: Henry F. Korth, Abraham Silberschatz, "Database System Concepts", McGRAW-Hill International Editions, 1986 (or the 3.sup.rd edition (1997))., Chapters 3-5 pp. 45-172
Generally speaking, all the models to be discussed below have a common property in that they represent each "entity" as a "record" have one or more "fields" each being indicative of a given attribute of the entity (e.g. a record of a given book may have the following fields "BOOK ID", "BOOK NAME", "TITLE"). Normally one or more attributes constitute a "key" i.e. it identifies the record. In the latter example "BOOK-ID" serves as a key. The various models are distinguished one from the other, inter alia, in the way that these records are organized into a more complex structure:
Relational Model
The relational model, introduced by Codd, is a landmark in the history of database development. In relational databases an abstract concept has been introduced, according to which the data is represented by tables (referred to as "relations") in which the columns represent the fields and rows represent the records.
The association between tables is only conceptual. It is not a part of the database definition. Two tables can be implicitly associated by the fact that they have one or more columns whose values are taken from the same set of values (called "domain").
Other concepts introduced by the relational model are high level operators that operate on tables (i.e., both their parameters and results are tables) and comprehensive data languages (now called 4th generation languages) in which one specifies what are the required results rather than how these results are to be produced. Such non-procedural languages (SQL--Structured Query Language) have become an industry standard. Furthermore, the relational model suggests a very high level of data independence. There should not be any effect on the programs written in these languages due to changes in the manner data are organized, stored, indexed and ordered. The relational model has become a de-facto standard for data analysts.
Network Model
In the relational model, data (and relationship between data) are regarded as a collection of tables. In distinction therefrom in the network model data are represented as a collection of records whereas relationship between the records (data) are represented as links.
A record in the network model is similar to an "entity" in the sense that it is a collection of fields each holding one type of data. The links may be effectively viewed preferably (but not necessarily) as pointers. A collection of records and the relation therebetween constitutes a collection of graphs.
Hierarchical Model
The Hierarchical Model resembles the network model in the manner that data and relations between data are treated, i.e. as records and links. However, in distinction from the network model, the records and the relations between them constitute a collection of trees rather than of arbitrary graphs. The structure of the Hierarchical Model is simple and straightforward particularly in the case that the data that needs to be organized in a database are of inherent hierarchical nature. The hierarchical model has some inherent shortcomings e.g. in many real life scenarios data cannot be easily arranged in hierarchical manner. Moreover, even if data may be organized in hierarchical manner, it may require larger volumes as compared to other database models.
Consider for example a basic entity "Employee" with the following subordinated attributes "Employee_Salary" and "Employee_Attendance". The latter may also have subordinated attributes e.g. "Employee_Entries" and "Employee_Exits". In this scenario the data is of inherent hierarchical nature and therefore should preferably be organized in the hierarchical model. Consider, for example, a scenario where "Employee" is assigned to several "Projects" and the time he/she spends ("Time_Spent") in each project is an attribute that is included in both the "Employee" and "Projects" entities. Such arrangement of data cannot be easily organized in the hierarchical model and one possible solution is to duplicate the item "Time_Spent" and hold it separately in the hierarchies of "Employee" and "Project". This approach is cumbersome and error prone in the sense that it is now required to assure that the two instances of "Time_Spent" are kept identical at all times.
Object Oriented Model
A comprehensive explanation can be found in "Object Oriented Modeling and Design", James Rumbaugh, Michael Blaha, William Premerlani, Frederick Eddi and William Lorensen.
The object-oriented approach views all entities a objects. Each object belongs to a class, with each class there are associated methods and fields. To enable encapsulation some the fields are private, accessible only to methods of the class while others are public accessible to all. Thus "Joe Smith" belongs to the class of persons. For that class, the private fields age can be defined. Applying the class method update_age() to the object Joe will change his age. The methodology allows to define sub-classes which inherit all the methods and fields of the super-class. Thus, for example, the employee class can be defined as a subclass of the person class. In addition one may define additional fields and methods to the subclass. Thus, the employee class could support a salary field, and the get_raise() method.
Object Relational Model allows an object view on relational-organized data. Thus, one is able to operate on the data as if it is organized as objects and at the same time, support the relational approach.
As mentioned in the foregoing, data models deal with the conceptual or logical level of data representation and "hide" details such as bow the data are physically arranged and accessed. The latter characteristics are normally dealt with by a so-called database file management system.
The database file management system maps the logical structure (in terms of database model) to a data structure, pertinent operations and possibly other data. The data structure includes index and data records. The index enables accessing or updating the data records by a key. In the context of search, the term search key is used. Database file management system should preferably operate on the data records so as to accomplish enhanced performance in terms of time (i.e. from the user's standpoint fast response time of the database), and space (i.e. to minimize the storage volume that is allocated for the database files). As is well known in the art, normally, there is a trade off between the time and space requirements. The performance of the database depends on the efficiency of the data structures that are used to represent the data and how efficiently the system can operate on these data. A detailed discussion on conventional file and management systems is given for example in Chapters 7 (file system structure) and 8 (indexing) in "Database System Concepts", ibid.
Known database file management systems typically utilize the following indexing schemes, which fall into the following main categories that include: Multi-way trees indexes and others.
Multi-way trees indexes
These techniques can be used to create a one or more access paths (referred to also as search paths) to the same data record. The search paths form a multi-way tree. Its main disadvantages are that it requires space (usually all the keys to the records plus some pointers) and maintenance (addition and/or deletion of keys whenever an update transaction (see definition below) occurs i.e. record is added and/or deleted. Normally, the nature of the indexing scheme as well as the volume of the data held in the files determine the number of accesses that are required to find or update (update encompasses, insert, delete or modify) a given data record. In the case that the storage medium under consideration is an external memory, the number of accesses is effectively the number of I/O accesses. As will be explained below, in each access to the storage medium a block of data is loaded into the memory.
Various types of tree indexing schemes have been developed but, normally, an indexing implementation is more costly than the specified direct access indexing techniques. On the other hand, tree indexing allows sequential and sub-range processing. One of the most widely used indexing schemes is the B-tree (under various commercial product names and implementation variants such as B.sup.+ tree) in which the keys are kept in a balanced tree structure and the lowest level points at the data itself. Detailed explanation of the B-tree indexing scheme is found in "Database System Concepts" ibid. pp. 275-282. The number of I/O accesses obeys the algorithmic expression Log.sub.K N+1 where K is an implementation dependent constant and N is the total number of records. This means that the performance slows down logarithmically as the number of records increases.
It is possible, of course, to use a combination of the above or other techniques, e.g. an indexing scheme which is implemented in accordance with two or more of the above techniques.
One of the significant drawbacks of the aforementioned popular B-tree indexing scheme is that the keys are not only held as part of the data records, but also as part of the index
This results, of course, in the undesired inflation of the index size and the latter drawback is further aggravated when indexes of large size are utilized (i.e. when a relatively large number of bits is required for representing the key).
One possible approach to cope with this problem is to exploit the Trie indexing scheme. An example of the latter is the trie discussed in G. Wiederhold, "File organization for Database design"; Mcgraw-Hill, 1987, pp. 272, 273, or in D. D. Knuth, "The Art of Computer Programming"; Addison-Wesley Publishing Company, 1973, pp. 481-505, 681-687.
Generally speaking, the trie indexing scheme enables a rapid search whilst avoiding the duplication of keys as manifested for example by the B tree technique. The trie indexing scheme has the general structure of a tree wherein the search is based on partitioning the search according to search key portions (e.g. search key digit or bit). Thus, for example each node in the trie indexing file represents an offset of the search key and the link to any one of its children represents the character's value at said offset. The trie structure affords efficient data structure in terms of the memory space that is allocated therefor, since, as specified before, the search-key is not held, as a whole, in internal nodes and hence the duplication that is exhibited for example in the B-tree indexing technique is avoided.
In a specific variant of the trie such as the trie described in "File organization for Database design" ibid., in order to achieve enhanced performance in terms of response time, a trie indexing file should be built by selecting the digits (or bits) from the search key such that the best possible partition of the search space in obtained, or in other words so as to accomplish a tree which is as balanced as possible. This, however, requires a priori knowledge of the data records of the trie and is accomplished at the penalty of obtaining an unsorted data, which in many real-life scenarios is inapplicable. It is noteworthy that if sorted data is mandatory, a balanced structure can not be guaranteed even if there is sufficient a priori knowledge of the data records of the trie. It should be noted that the specified trie does not support sequential sub-range processing.
When considering a large amount of data, it is of particular importance to maintain a so-called balanced structure of the tree index in order to avoid long paths for accessing a given data record from the root node to the leaf node that is associated with the sought data record. The specified B-tree indexing scheme, constitutes an inherent balanced tree structure, even after the tree has been subject to update transactions. The inherent balanced (or essentially balanced) structure is accomplished, however, and as explained above, at the penalty of inflating the contents of the blocks in the tree and, consequently, unduly increasing the file size that holds the index, particularly insofar as large trees which hole multitude of data records are concerned. The large volume of the files adversely affects the performance of the data management system in terms of number of accesses (and consequently in terms of accessing time) to the storage medium in order to reach a sought data record, which is obviously undesired.
Turning now to the "others" category of index schemes it includes for example the so called Skip list index: A skip list is a randomized data structure: It consists of levels, the lowermost level, level 0, consist of a list of all records ordered by non-decreasing order. Each node of level i (i=0, . . . , h) chooses, with probability p, whether to be a representative of level i+1. The representatives of level i constitute the nodes of level i+1. These representatives, too, are organized as an ordered list. Level h+1 is the first empty level.
Having discussed the major drawbacks of hitherto known index schemes i.e. inflated data volumes (e.g. B-tree and variants thereof) and susceptibility to unbalanced structure (e.g. trie), there follows a discussion in another aspect which pertain to various characteristics including subordination of data records and multi-dimensional characteristics.
Thus, consider for example, two types of data records represented as two entities (tables), i.e. Books and borrowers, each being associated with respective unique key, e.g. borrower is identified by Borrower_Id and book is identified by Book_Id. In real life scenario, such as in a public library, one is interested to view for example all books borrowed by a given borrower. The latter transaction exemplifies subordination of data records, where "books" are subordinated to "borrower". In order to resolve their query, one should apply two queries--one for the borrower information and another for the books borrowed by him (according to the composite key--book borrower)
Insofar as B-tree indexing scheme is concerned, in order to support the subordination of data in the manner specified, several separate index files are requires, as follows:
Books index file, accessible via book-Id key; PA1 Borrowers index file, accessible via borrower-Id key; PA1 Transactions via borrowers, accessible via the composite key (borrower-Id book-Id). PA1 1. The sequence of blocks of a file ordered by increasing key value of the primary key. The index leads the search to the block containing the key. To allow searches by a key that is not the primary key, a partitioned index is constructed such that for each record the partitioned index contains its key and its link. These pairs are ordered by non-decreasing value of the key. The index leads to the block containing the address of the desired record. PA1 2. A trie arranged in blocks. PA1 3. Other types of index schemes that meet the provision of partitioned index. PA1 1. at least one short link among the short links of a node (hereon split node) in the block is deleted (hereon split link) in a way that at least two tries exist in the block. PA1 2. each of the sub-trees is moved to a separate block. PA1 3. If the block B.sub.i does not exist, B.sub.i is created and a copied node of the split node is created in B.sub.1. PA1 4. If the block of B.sub.i exists and a copied node of the split node does not exist in B.sub.1, then a copied node of the split node is created in B.sub.l and connected to the trie of B.sub.i such that B.sub.i-1 ' (at the end of the split process) is accessible in a search path that includes the root node in B.sub.i and the copied node and its labeled links according to the representative key of B.sub.i-1 '. PA1 5. If the copied node has no direct link, add a direct link from the copied node to the block B.sub.i-1. PA1 6. Add a far link from the copied node to the block B.sub.i-1 ' or if the copied node has a short link to a child node in the direction of the far link, the far link can be replaced by a direct link from the child node to block B.sub.i-1 '. PA1 a layered index arranged in blocks; the layered index includes a basic partitioned index that is associated with data records; the basic partitioned index enables accessing or updating the data records by key or keys, and being susceptible to an unbalanced structure of blocks; PA1 said layered index enables accessing or updating the data records by key or keys and constitutes a balances structure of blocks. PA1 an index arranged in blocks and being constructed over the keys of data records; the index includes a basic partitioned index that is associated with the data records; the basic partitioned index enables accessing or updating the data records by key or keys, and being susceptible to an unbalanced structure of blocks; PA1 said index enables accessing or updating the data records by key or keys and constitutes a balanced structure of blocks. PA1 an index arranged in blocks and being constructed over the keys of data records; the index includes a trie that is associated with the data records; the trie enables accessing or updating the data records by key or keys, and being susceptible to an unbalanced structure of blocks; PA1 said index enables accessing or updating the data records by key or keys and constitutes a balanced structure of blocks. PA1 a method for constructing a layered index arranged in blocks, comprising the steps of: PA1 a method for constructing an index over the keys of the data records, the index being arranged in blocks, comprising the steps of: PA1 a method for constructing an index over the keys of the data records, the index being arranged in blocks, comprising the steps of: PA1 the leave nodes of said PAIF are associated each with at least one data record accessible to said user application program and wherein at least portion of said data record constitutes at least one search-key; PA1 selected nodes in said PIAF represent, each, a given offset of a search key portion within said inset search key; link(s) originated from each given node from among said selected nodes, represent, each, a unique value of said search key portion; PA1 the PIAF having at least two sub-PIAF's being arranged, each, in a block; PA1 said data base file management system is further capable of arranging said blocks as a balanced structure of blocks. PA1 The data are held inherently in sorted form according to search key. Namely, One can navigate in the tree by the order of the keys of the data records. The layered index inherently supports sequential operations like "get next" and "get previous". In this respect, the proposed layered index constitutes an advantage over e.g. hashing scheme and some implementations of digital trees. PA1 There is no requirement for in advance knowledge of the contents of the database, in order to maintain balanced index. PA1 A balanced layered index is retained and the depth of the layered index is relatively small, thereby minimizing the number of accesses (normally slow I/O operations) that are required to perform update transaction or access data record. According to one embodiment, practically one I/O (and no more than two I/O) operation (constituting one or two access) is required in order to access a given data record from among billions data records. PA1 a data structure that includes an index over the keys of the data records; the index is arranged in blocks; such that for one billion data records substantially no more than two accesses to said external memory are required in order to access a block that is associated with any one of said billion data records, irrespective of the size of the key of said data records. PA1 a designated index over designated keys of data records; the data records, constituting designated data records, being of at least two types where designated data records of the second type are subordinated to the designated data records of the first type. PA1 The data structure that includes designated index and designated data can maintain the relations between different data items. PA1 The data structure that includes designated index and designated data can link logically related items. PA1 The data structure that includes designated index and designated data allows high efficiency in retrieving relating data. PA1 an index being stored in the storage medium and constructed over the keys of said data records that are stored in blocks; the index being arranged in blocks with the leaf blocks being linked to data records by means of links; PA1 said index is characterized in that at least one of said links is shared by at least two data records stored in the same block. PA1 an index being stored in a storage medium and constructed over the keys of said data records that are stored in blocks; the index being arranged in blocks with the leaf blocks being linked to data records by means of links; PA1 said index is characterized in that at least one of said links is shared by at least two data records stored in the same block; PA1 said index constituting a layered index according to claim 1, and blocks of said basic partitioned index are linked to said data records.
Accordingly, the index scheme includes here three index files. This obviously poses undesired overhead insofar as data volumes and additional integrity maintenance and checking are concerned. Thus, for example, removal of a given book from the book file requires a preliminary test to inquire whether it exists in the borrower-book index file.
Having discussed the drawbacks of hitherto known techniques insofar as subordination of data records are concerned, the cumbersome representation and manner of operation thereof becomes even worth considering implementations of the so called multi-dimensional data records.
Reverting now to the latter example, the table Books and borrowers are now regarded as multi-dimensional tables, which can be reached from several views. Thus, in addition to the above mentioned borrower.fwdarw.book view (books borrowed by borrower(s) which is implemented by an index over the borrower-book composite key; the database should support the alternative view of borrowers that borrowed a given book(s), which requires, of course, to utilize the alternative composite key (book-borrower).
In the Btree representation, it is accordingly required to add another index file accessible via the composite key (book-Id borrower-Id), giving rise to a total of four index files.
The pertinent drawbacks are self explanatory and become even worth for n dimensional tables (n&gt;2).
There is accordingly a need in the art to reduce the drawbacks of data processing systems that exploit hitherto known database file management system. Specifically, there is a need in the art to provide for a data processing system that exhibits database performance by utilizing an efficient database file management system.
There is yet further need in the art to provide for a database file management system that utilizes an index which inherently being not susceptible to unbalanced structure in the manner specified above.
There is still further need in the art to provide for an index which inherently supports representation of multiple types of data, subordination of data records and/or multi-dimensions.