The present invention relates generally to database systems, and in particular to using a hierarchical index to access hierarchically organized information in a relational database system.
Humans tend to organize information in categories. The categories in which information is organized are themselves typically organized relative to each other in some form of hierarchy. For example, an individual animal belongs to a species, the species belongs to a genus, the genus belongs to a family, the family belongs to an order, and the order belongs to a class.
With the advent of computer systems, techniques for storing electronic information have been developed that largely reflected this human desire for hierarchical organization. Conventional computer file systems, for example, are typically implemented using hierarchy-based organization principles. Specifically, a typical file system has directories arranged in a hierarchy, and documents stored in the directories. Ideally, the hierarchical relationships between the directories reflect some intuitive relationship between the meanings that have been assigned to the directories. Similarly, it is ideal for each document to be stored in a directory based on some intuitive relationship between the contents of the document and the meaning assigned to the directory in which the document is stored.
FIG. 1 shows an example of a typical file system. The illustrated file system includes numerous directories arranged in a hierarchy. Two documents 118 and 122 are stored in the directories. Specifically, documents 118 and 122, both of which are entitled xe2x80x9cExample.docxe2x80x9d, respectively stored in directories 116 and 124, which are respectively entitled xe2x80x9cWordxe2x80x9d and xe2x80x9cApp4xe2x80x9d.
In the directory hierarchy, directory 116 is a child of directory 114 entitled xe2x80x9cWindowsxe2x80x9d, and directory 114 is a child of directory 110. Similarly, directory 124 is a child of directory 126 entitled xe2x80x9cVMSxe2x80x9d, and directory 126 is a child of directory 110. Directory 110 is referred to as the xe2x80x9crootxe2x80x9d directory because it is the directory from which all other directories descend. In many systems, the symbol xe2x80x9c/xe2x80x9d is used to refer to the root directory.
When electronic information is organized in a hierarchy, each item of information may be located by following a xe2x80x9cpathxe2x80x9d through the hierarchy to the entity that contains the item. Within a hierarchical file system, the path to an item begins at the root directory and proceeds down the hierarchy of directories to eventually arrive at the directory that contains the item of interest. For example, the path to file 118 consists of directories 110, 114 and 116, in that order.
Hierarchical storage systems often allow different items to have the same name. For example, in the file system shown in FIG. 1, both of the documents 118 and 122 are entitled xe2x80x9cExample.docxe2x80x9d. Consequently, to unambiguously identify a given document, more than just the name of the document is required.
A convenient way to identify and locate a specific item of information stored in a hierarchical storage system is through the use of a xe2x80x9cpathnamexe2x80x9d. A pathname is a concise way of uniquely identifying an item based on the path through the hierarchy to the item. A pathname is composed of a sequence of names. In the context of a file system, each name in the sequence of names is a xe2x80x9cfilenamexe2x80x9d. The term xe2x80x9cfilenamexe2x80x9d refers to both the names of directories and the names of documents, since both directories and documents are considered to be xe2x80x9cfilesxe2x80x9d.
Within a file system, the sequence of filenames in a given pathname begins with the name of the root directory, includes the names of all directories along the path from the root to the item of interest, and terminates in the name of the item of interest. Typically, the list of directories to traverse is concatenated together, with some kind of separator punctuation (e.g., xe2x80x98/xe2x80x99, xe2x80x98 xe2x80x99, or xe2x80x98;xe2x80x99) to make a pathname. Thus, the pathname for document 118 is /Windows/Word/Example.doc, while the pathname for document 122 is /NVMS/App4/Example.doc.
The relationship between directories (files) and their contained content varies significantly between different types of hierarchically organized systems. One model, employed by various implementations, such as Windows and DOS file systems, requires each file to have exactly one parent, forming a tree. In a more complicated model, the hierarchy takes the form of a directed graph, where files can have multiple parents, as in the UNIX file system in which hard links are used.
In contrast to hierarchical approaches to organizing electronic information, a relational database stores information in tables comprised of rows and columns. Each row is identified by a unique RowID. Each column represents an attribute of a record, and each row represents a particular record. Data is retrieved from the database by submitting queries to a database management system (DBMS) that manages the database.
Each type of storage system has advantages and limitations. A hierarchically organized storage system is simple, intuitive, and easy to implement, and is a standard model used by most application programs. Unfortunately, the simplicity of the hierarchical organization does not provide the support required for complex data retrieval operations. For example, the contents of every directory may have to be inspected to retrieve all documents created on a particular day that have a particular filename. Since all directories must be searched, the hierarchical organization does nothing to facilitate the retrieval process.
A relational database system is well suited for storing large amounts of information and for accessing data in a very flexible manner. Relative to hierarchically organized systems, data that matches even complex search criteria may be easily and efficiently from a relational database system. However, the process of formulating and submitting queries to a database server is less intuitive than merely traversing a hierarchy of directories, and is beyond the technical comfort level of many computer users.
In the past, hierarchically organized systems and relationally organized systems have been implemented in different ways that were not compatible. With some additional processing, however, a relationally organized system can emulate a hierarchically organized system. This type of emulation is especially desirable when the storage capability and flexibility of a relational system is needed, but the intuitiveness and ubiquity of the hierarchical system is desired.
Such emulation may be implemented through the use of two relational tables: a xe2x80x9cFilexe2x80x9d table and a xe2x80x9cDirectory13 linksxe2x80x9d table. The File table stores information relating to each file in the emulated hierarchical system. For files that are documents, the File table further stores either the body of the file (in the form of a large binary object (BLOB)), or a pointer to the body of the document. The Directory13 links table stores all of the link information that indicates the parent-child relationships between files.
To illustrate how these two tables may be used to emulate a hierarchical storage system, suppose that a file system having the hierarchical structure of FIG. 1 is implemented in a database. The file system of FIG. 1 can be illustrated as follows (a unique ID, shown in parentheses, is assigned by the system to uniquely identify each file):
/(X1)
Windows (X2)
Word (X3)
Example.doc (X4)
Access (X5)
Unix (X6)
App1 (X7)
App2 (X8)
VMS (X9)
App3 (X10)
App4 (X11)
Example.doc (X12)
FIG. 2 shows a files table 210, and FIG. 3 shows a directory links table 310, which may be used by a computer system to emulate the file system of FIG. 1 in a relational database system. Files table 210 contains an entry for each file in the system. Each entry includes a RowID, a file ID, a name, a body column, and a modification date column (plus other system-maintained information such as creation date, access permission information, etc.).
The FileID is a unique ID assigned to each file by the system, the name is the name assigned to the file, which does not need to be unique, and the body is the field in which the contents of a file are stored. The body field may store the actual contents of a file in the form of a binary large object (BLOB), or a pointer to the contents of the file. Where the entry is for a file having no content (e.g. a directory), the body field is null. In the above example, only the two documents entitled Example.doc have content; thus, the body field for all of the other entries is null.
In directory links table 310, an entry is stored for each link. between files in the file system of FIG. 1. Each entry includes a parent ID, a child ID, and a child13 name field. For each link, the parent ID field specifies the file which is the parent file for the link, the child ID field specifies the file which is the child file for the link, and the child13 name field specifies the name of the child file in the link. Thus, for example, in the entry for the link between root directory 110 and Windows directory 114, directory links table 310 specifies that X1 (the FileID of the root directory) is the Parent ID, X2 (the FileID of the Windows directory) is the child ID, and xe2x80x9cWindowsxe2x80x9d is the child13 name.
To illustrate how the information in these two tables may be used to implement the file system of FIG. 1, suppose that it is necessary to access document 118. As explained above, document 118 has the path: /Windows/Word/Example.doc. To access this file, the makes an initial scan of directory links table 310 to find the entry where root directory 110 is the parent file and Windows directory 114 is the child file. To do this, the DBMS executes something like the following SQL statement:
Select ChildID
from directory13 links
Where ParentID=xe2x80x9cX1xe2x80x9d
child13 name=xe2x80x9cWindowxe2x80x9d.
This query returns the ID of the child file, which in this case is X2 (for Windows directory 114). After obtaining the ID of the child file, the DBMS makes a second scan of the directory links table 310, this time looking for the entry where the parent file is Windows directory 114, and the child file is Word directory 116. This is achieved by executing the following Select statement:
Select ChildID
from directory13 links
Where ParentID=xe2x80x9cX2xe2x80x9d and
Child13 name=xe2x80x9cWordxe2x80x9d.
This query returns the ID of Word directory 116, which in this example is X3. With this information, the DBMS makes a third scan of directory links table 310, this time searching for the entry where the parent file is Word directory 116 and the child file is Example.doc document 118. This is achieved with the following Select statement:
Select ChildID
from directory13 links
Where ParentID=xe2x80x9cX3xe2x80x9d and
Child13 name=xe2x80x9cExample.docxe2x80x9d
At the end of this process, the ID of document 118 will have been determined. Using this ID as the primary key, the proper entry in files table 210 is located, and the contents of 118 are accessed from the body field. Thus, using this technique, files that are actually stored in a relational structure, such as table 210, may be located and accessed using pathnames just as if they were stored in a hierarchically organized structure. The user submitting the pathname to locate a file need not understand the complexity of a relational system. Conversely, because the files are stored in a relational system, the files may be efficiently accessed in more sophisticated ways by users that is familiar with relational systems.
At the end of this process, the ID of document 118 will have been determined. Using this ID as the primary key, the proper entry in files table 210 is located, and the contents of document 118 are accessed from the body field. Thus, using this technique, files that are actually stored in a relational structure, such as table 210, may be located and accessed using pathnames just as if they were stored in a hierarchically organized structure. The user submitting the pathname to locate a file need not understand the complexity of a relational system. Conversely, because the files are stored in a relational system, the files may be efficiently accessed in more sophisticated ways by users that are familiar with relational systems.
For purposes of simplifying the performance evaluation, assume that the performance of the pathname resolution is directly proportional to the number of disk blocks that must be accessed. (This is because disk access is an order of 100 slower than memory access, and for large file systems, we can assume that most of the data will be on disk.) Assuming that a conventional BTREE index has been built on the directory13 links table 310, the number of disk block accesses that a DBMS has to perform in one scan of a table is log(n)/m where n is the number of entries in the table and m is the number of entries per block. Thus, using the pathname resolution technique described above, the average number of accesses that the DBMS will have to perform to access a file is i*log(n)/m.
For complex file systems having a large number of links and a large number of link levels, the number of disk accesses can become quite large. Since I/O operations are very slow relative to other types of operations, having a large number of disk accesses can significantly degrade system performance. In general, the pathname resolution method described is quite slow when applied to large, complex file systems.
While the data structure of the table is appropriate for easy access to the file system data from relational applications, the standard types of indexes used by databases are too slow for efficient pathname-based access to data organized in this manner. What is needed, then, is an apparatus, system or method that emulates a hierarchically organized system using a relationally organized system, but which reduces the number of disk accesses necessary to locate the items in the system when they are accessed based on their pathnames.
A method and mechanism are provided for creating, maintaining, and using a hierarchical index to efficiently access information in a relational system based on a pathnames, thus emulating a hierarchically organized system.
According to one aspect of the invention, each item that has any children in the emulated hierarchical system has an index entry in the index. The index entries in the index are linked together in a way that reflects the hierarchical relationship between the items associated with the index entries. Specifically, if a parent-child relationship exists between the items associated with two index entries, then the index entry associated with the parent item has a direct link to the index entry associated with the child item.
According to another aspect of the invention, pathname resolution is performed by following direct links between the index entries associated with the items in a pathname, according to the sequence of the filenames within the pathname. By using an index whose index entries are linked in this manner, the process of accessing the items based on their pathnames is significantly accelerated, and the number of disk accesses performed during that process is significantly reduced.