1. Field of Invention
The present invention relates generally to computer database systems and particularly to relational database systems and methods for storing and retrieving hierarchical data structures.
2. Description of Prior Art
A DataBase is a collection of stored operational data used by the application systems of some particular enterprise. [Date, 1986] A DataBase Management System (DBMS) is computer software that provides secure, reliable, shared access to databases whose integrity is assured via transaction processing and mechanisms for backup and recovery of databases in the presence of accidental or intentional disruption.
Note: The term “database” has become widely used and in many cases associated with any collection of data, organized or not. In this application, I use the term written as “DataBase” to distinguish between imprecise notions of a database as any collection of data and a DataBase as a collection of data with an organization that reflects an operational model for the applications which use it. The more formal notion will be regarded as consistent with usage by persons skilled in the pertinent art of database design and applications development.
As is well understood in the art, the interactive DBMS user or programmer of DBMS applications interacts with the DBMS through a logical data model. That model represents an organization of all data into structured records, structured records into collections, and defined relationships between and among the records collections. The DBMS maintains a single, unified physical model of data storage and maps actions on one or more logical data models to this underlying physical model.
The Prior Art includes DBMSs which support logical data models of at least four types: Hierarchical, Network, Relational and Object-Oriented.
The building block of the Hierarchical Data Model is a two level one-to-many hierarchy between one parent record of a parent record class P and zero or more child records of a child record class C. A child record in a two level P-C hierarchy may be the parent record in another two level C-G hierarchy with parent class C and child class G. When viewed in this way, the composition of these two level hierarchies is a three level hierarchy over the classes P, C and G. By extension, hierarchies of any number of levels may be defined and populated with data. For example, a hierarchy over the classes State, County, Municipality, District, Street and Address could be used to create a six level hierarchy over every addressable building in the United States. Hierarchical DataBase Management Systems (HDBMS) implement the Hierarchical Data Model via languages for defining and populating logical hierarchies and for logically navigating over hierarchies to retrieve, modify and delete existing records. IBM's IMS, one of the first commercial DBMS products conforms to a Hierarchical Data Model.
The Network Model is similar to the Hierarchical Model, but uses the equivalent construction of a Set [CODASYL, 1969, 1971] to represent the simplest one-to-many relationship with Set Owner and Set Member(s) in the roles of hierarchical parent and child(ren). A DBMS based on the Network Model provides language constructs for defining, populating and navigating over these Sets. The commercial product ADABAS is based on the Network Model.
The Relational Model presents an entirely different logical representation as a collection of relational tables, also called relations. A relation is a collection of related objects, each object represented by one record (row) of a relation (table). In relational terminology, the terms relation and table are used interchangeably. However the definition of a relation is derived from a more formal, mathematical characterization, and relations in Relational DataBases are more restricted in their specification and use than tables. The rows of a relational table are referred to as tuples, rows or records and represent objects. Unlike the hierarchical and network models, relationships between records in different relations are not explicitly represented, but are materialized by the Relational DBMS (RDBMS) in response to requests for service called queries. Queries are expressed in a query language. The most popular and standardized for the Relational Model is Structured Query Language, or SQL. An SQL query defines a response relation in terms one or more relations in the Relational DataBase. The RDBMS treats the query as a request and materializes the response relation by accessing the physical storage of the referenced relations. In effect, an SQL query defines a relation and a RDBMS materializes the defined relation. An RDMBS typically provides both an interactive end user interface and a programmatic (or “call”) interface. The interactive interface simply materializes all response records. The call interface also includes programming language functions for navigating backwards and forward through the set of response records. Note that individual relations act as containers, and that relationships between these containers are not explicitly represented.
Object-Oriented (OO) DataBase Management Systems (OODBS) support the same generic modeling capability that characterizes and has popularized Object-Oriented Programming Languages like ADA [Booch, 1983] and C++ [Lippman, 1991]. From the perspective of an application written in an OO programming language, an OODB (Object-Oriented DataBase) provides storage for data that are activated (retrieved) as needed and passivated (stored back to the DataBase) when no longer needed. Navigation is implicit in the relationships between and among object types, so that actual OODB access requires very little additional skill beyond the ability to program in these languages. This natural fit between Object-Oriented Programming Languages and Object-Oriented DataBase Management Systems makes the latter especially valuable in environments characterized by a heavy software development workload and the requirement of large volumes of shared, persistent data.
Historically, the first DBMSs commercially available were based on the Hierarchical and Network models. These products were introduced in the late 1950s and early 1960s. The Relational Model of Data was introduced in a seminal paper in 1970 by E. F. Codd [Codd, 1970]. The Relational approach to databases touched off a revolution in database research and development. The first experimental RDBMSs were developed in the middle 1970's: IBM's System R [Astrahan,1976] and INGRES [Stonebraker,1976]. By the late 1970's, commercial RDBMS products like Oracle, Sybase and IBM's DB2 became available. Sales growth for RDBMSs exploded and relational technology became the dominant technology in new business, scientific and engineering applications. Hierarchical and Network DBMSs continued to be used, though mainly in legacy systems.
In the mid 1980's, growth in the popularity of Object-Oriented Programming Languages for applications development leveraged growth in the Object-Oriented segment of the DBMS market. It appeared that OODBMS technology might displace RDBMS technology just as RDBMS technology had supplanted the earlier hierarchical and network technologies. That did not in fact occur. For enterprises that have developed large DataBases and devote major expense to applications development, Object-Oriented Programming Languages have proven to be more cost effective than earlier software development approaches, and many of these organization continue to use OODBMSs. However, for reliable data management, RDBMSs and the relational approach consistently dominate the market for commercial DBMSs. Some OODBMS vendors implement an Object-Oriented data model using an RDBMS, thus offering the best of both worlds: Efficient software development against an OO logical model and the reliability and flexibility of the relational model for database management and preservation of these valuable assets.
The value of large data collections can almost always be enhanced through use of a DBMS, and the Relational DBMS is the overwhelming choice for business, science and engineering. Many data management applications are well matched to the tabular view of data provided by the relational model of data provided by an RDBMS. But many others are far easier to approach when the data can be organized hierarchically. Because RDBMSs are widely available and provide such high levels of security and reliability, several innovators have attempted to develop techniques for representing hierarchies and other related data structures using Relational DataBases and the SQL query language. The Prior Art for such techniques includes the following:
Goldberg et. Al. [U.S. Pat. No. 5,201,046, 1993] describe a “method for storing, retrieving and modifying directed graph data structures” using an RDBMS. A hierarchy is a restricted case of a directed graph, so that a directed graph technique could also be used to represent hierarchies in a RDB. The approach extends the SQL query language with two new languages constructs (“EXPAND” and “DEPTH <N>”) and a new data type (“REFERENCE”) to represent a pointer from a record in one relation to a record in a second (possibly the same) relation.
Simonetti [U.S. Pat. No. 5,295,261, 1994] describes a “Hybrid database structure linking navigational fields having a hierarchical database structure to informational fields having a relational database structure”. In this method, that portion representing the hierarchical database structure is contained in a topological map stored as a file external to the relational database.
Sacks [U.S. Pat. No. 5,974,407, 1999] describes a “Method and apparatus for implementing a hierarchical database management system (HDBMS) using a relational database management system (RDBMS) as the implementing apparatus”. The method employs the Relational DataBase as a virtual (mechanical) means for implementing a Hierarchical DataBase Management System (HDBMS). The schema of the individual hierarchical tables, the permissible parent-child relationships in the Hierarchical DataBase, the definitions of subset views and the actual representation of an individual hierarchy is captured in five relations. A sixth relation makes it possible to store multiple hierarchies in the same Hierarchical DataBase. Access to the underlying Relational DataBase Management Systems and other Relational DataBases are allowed, but the hierarchical data itself cannot be correctly interpreted via the SQL mechanism without the interface specified in the patent implemented in an appropriate programming language.
Jagadish [Jagadish, 1989] describes a method for “Incorporating Hierarchy in a Relational Model of Data,” but his method requires extensions to the SQL query language and a new data type.
Millett [Millett, 2001] provides two methods for “Accommodating Hierarchies in Relational Databases”. The first (Path approach) involves the use of a “navigation bridge table” relation that stores all pairwise parent-child links of a hierarchy. The second (Denormalized Unit Table approach) is suitable only where the maximum number of levels in the hierarchy is known a priori. In this method, a data record in a relation that participates in a hierarchy includes pointers to each of its ancestors.
Finally, note the industry standard for the SQL query language, ANSI/ISO/IEC 9075-2-1999 [ISO/SQL, 1999]. The most recent update to this standard includes a “WITH RECURSIVE ORG” statement which defines an operation for retrieving the transitive closure of a directed graph over a hierarchy of relations and the computation of aggregate values over elements of that hierarchy. This standardized extension to the SQL query language indicates interest in representing directed graphs (digraphs) in Relational DataBases, but the feature is not widely available in commercial RDBMS products and is restricted to a style of hierarchical representation in which pointers to parents are stored within data records. Implicitly, this limits a record in a relation to participate only in as many hierarchies as the number of fields defined for that purpose. The reason for this limit is as in Millett, above: Because one record field is required for each hierarchy within which the record participates, the number of such hierarchies must be known when the schema is defined, before any records are inserted into the hierarchy.
All Prior Art techniques suffer from one or more of the following limitations:
Limitation 1: Use of non-SQL language extensions, or non-native data types, or both. Modifications to SQL or addition of non-trivial data types is not permitted by commercial products. This is because non-SQL language extensions imply the addition of internal search mechanisms, and, non-native data types cannot be correctly interpreted by a standard RDBMS. Goldberg introduces the “REFERENCE” data type and the EXPAND and DEPTH <N>language constructs to SQL. Sacks employs composite keys and a front end interface to interpret them. Applications which conform to SQL standards are portable; applications which introduce language extensions and/or non-native data types are not. This limitation eliminates one of the primary advantages for using a standard RDBMS to implement hierarchies.
Limitation 2: Use of external data structures to represent the hierarchies. RDBMSs only support relations as containers for data records. There are no other storage structures. The use of external data structures (typically in files) implies that the security, protection, backup/recovery and overall integrity that the RDBMS provides for relations is not available. Simonetti uses an external topological map to represent the hierarchical structure.
Limitations 3: Prohibition against arbitrary hierarchies limits applicability. The Millett Denormalized Unit Table Approach requires a priori knowledge of the maximum number of levels in a hierarchy. The Millett Path Table Approach can only represent a single hierarchy. For multiple hierarchies, each requires a separate Path Table. In applications where entire hierarchies are manipulated (combined or deleted for example), there are no obvious methods for treating subhierarchies as attachable/detachable units. Any technique in which parent- and/or child pointers are included in data relations implicitly limits the number of hierarchies in which the records of that relation can participate and reduces the generality of the approach.
Limitation 4: The Relational representation of records in the hierarchies is proprietary. That is, it is subject to correct interpretation only by specialized interface software. As a consequence, such data cannot also be accessed and correctly interpreted by the interactive user or programmed application via the conventional SQL interface to the RDBMS. The Sacks method suffers from this limitation.
Limitation 5: The implementation is too complex for all but the most skilled practitioners of relational database art. Simonetti and Sacks both require extensive front-end development. The Miller/Path approach involves the INNER JOIN operator which is not understood by most SQL programmers.
In contrast, the current invention is entirely implemented within a standard Relational DataBase Management System with no language extensions to the standard SQL query language, no new data types and no specialized front end interface. Unlike Goldberg, Sacks and Jagadish, the current invention is portable across all SQL compliant RDBMSs. The current invention employs no external data structures or files, so that all benefits of the DBMS approach are obtained. This is in contrast to Simonetti, who uses an external topological map. In the current invention, there are no artificial limits on the numbers of hierarchies, the number of levels in a hierarchy, or the number of hierarchies in which a record in any relation may simultaneously participate. In addition, a record may participate multiple times as a child in a hierarchy and at any number of levels within any hierarchy. This is in contrast to Millett who limits the number of levels in the hierarchy or the number of hierarchies. The current invention permits relational access to the underlying data: All relational data records participating in hierarchies are accessible via the standard SQL relational query mechanism. This is in contrast to Sacks, where a Hierarchical DataBase may coexist with a Relational DataBase since both are supported by a Relational DataBase Management System, but access to the Relational DataBase via Hierarchical views or access to the Hierarchical DataBase via Relational views is not possible. In an applications environment that employs Relational DataBases, the Hierarchical views provided by the current invention may be introduced without disruption to or reprogramming of existing applications. And finally, the present invention implementation is simple, straightforward and robust.