1. Field of the Invention
The present invention relates to storage and access of data in a database, and, more particularly, to a method and apparatus and data structures used therein such that wherein data elements with dissimilar structures may be stored in one table, with minimal wasted space, to allow for efficient data access, each data element being stored in a record of the table and having a category designation to associate the record with a set of attributes that are stored in another location, a category hierarchy is provided and allows for attribute inheritance of a parent's attributes by a child category.
2. Description of the Related Art
A DBMS (Database Management System) is used to manage data and is comprised of computer executable code that may be used to define the structure of data and to access data within the defined structure. One example of a DBMS is a relational DBMS, or RDBMS. An RDBMS manages tables that make up a relational database as well as the data contained in the tables. In an RDBMS, data is organized in rows (or records) and columns (or fields) of the tables, and two or more tables may be related based on like data values. The intersection of a row and column in a table is referred to as a cell and contains the data value for a particular field of a particular record.
A DML (data manipulation language) such as SQL (Structured Query Language) is typically used to store, retrieve and modify data in a table. A schema defines the structure of a database, i.e., each table and the fields within a record of a table. A schema is itself considered data that is stored in one or more tables. Therefore, like other data in a database, a DML may be used to store, retrieve and modify the data in the database as well as the structure of a database.
In addition to an RDBMS, other examples of data management approaches include file management systems, flat files and hierarchical database management systems. There are, however, shortcomings with the existing data management approaches.
One such disadvantage has to do with the use of hierarchically related data in an RDBMS. A hierarchy of data necessarily involves parent/child relationships. For example, a data item can be a parent and/or a child of another data item. In a conventional RDBMS, a row in one table may be related to a row in another table. However, the relationship represents a peer (or same level) relationship rather than a hierarchical (or parent/child) relationship. Consequently, complicated and/or nonstandard mechanisms are used in the conventional DBMS to represent a parent/child hierarchical relationship using the simple row/column tabular structure of a relational database.
FIGS. 1A through 1C provide an example of a hierarchy in the form of a tree 140 with nodes 100 to 105. Node 100 is a root node, or node with no parent node in the hierarchy. Node 100 and nodes 101 to 103 are internal nodes (i.e., nodes with at least one child). Nodes 104 and 105 are leaf nodes (i.e., a node that has no children).
In an RDBMS, a “single table” approach may be used to represent the hierarchy depicted in FIG. 1A wherein each node is a record of a table and each record contains one or more fields to store the hierarchical relationships with other nodes. A disadvantage of this approach is that it is very difficult to dynamically alter the structure of the database once it is populated with data. Thus, it is necessary to pre allocate a specific number of fields that may be needed to store the hierarchical relationships which may be inaccurate. If there are not enough fields allocated for storing hierarchical relationships, the structure of the database(e.g., the structure of the table containing the fields that store the hierarchical relationships) must be altered before storing additional relationships in the database. Conventional approaches to re structuring a populated database requires that existing data be moved to new tables and introduces the possibility of lost data. Thus, it has been thought to be better to err on the side of too many fields. However, this approach results in wasted space as not all records will use the maximum number of fields allocated to store the hierarchical relationships.
Referring to FIG. 1B, database 141 includes a record (in one or more tables of the database) for each of nodes 100 through 105. Specifically, record 110 corresponds to node 100 and contains a field 130 that is used to identify record 110 (i.e., a record identification, or record ID) in the database. Field 130 is also included in records 111 through 113 and contains the record ID of record 110 indicating that records 111 through 113 are child records of record 110. Records 111 through 113 each include a record ID in fields 121 through 123, respectively. Since record 114 is a child of record 111, it includes field 121 that contains the record ID of record 111.
Node 105 of tree 140 has two parents (i.e., nodes 102 and 103). To reflect this structure, record 115 of database 141 must include two fields (i.e., fields 122 and 123) that each refer to one of the parent records of record 115. Since there is no easy mechanism for dynamically altering the structure, or schema, of the table, an attempt must be made to accurately predict the number of parents a record may have when creating the database schema. As discussed above, there are tradeoffs and drawbacks with both a low and high prediction. Further, it is virtually certain that there will be wasted space no matter what the prediction.
Another approach used in an RDBMS borrows from the file management approach that is used in most operating systems and is hierarchical in nature. In a file management system, the nodes consist of a directory and a file. Each directory is an internal node of the hierarchy, while each file is a leaf node of the hierarchy. Where a file (or directory) has multiple parents, multiple copies are needed such that a copy of the file (or directory) is stored in each of the parent directories. Using this approach in an RDBMS, each node of the hierarchy is represented as a table in the RDBMS. However, the number of tables increases with the number of internal nodes, which tends to be proportional to the size of the hierarchy. It is also extremely difficult and complex to navigate between the nodes of the hierarchy since each node is located in a separate table.
Using this approach, for example, records 110 through 115 in FIG. 1B become single record tables with two tables created for record 115. FIG. 1C illustrates the new structure used for record 115. Each of single record tables 155 and 156 contain a copy of the data contained in record 115 with the exception of fields 122 and 123. Table 155 includes field 122 that identifies Node 102 as its parent, and table 156 includes field 123 to identify node 103 as its parent. Tables 155 and 156 are duplicates with the only difference between the two tables being the reference to the parent table.
Thus, a disadvantage of the file management approach is that when, as in the case of node 105, a node with multiple parents, multiple copies of the node need to be maintained as separate tables. This is wasteful of space and difficult to manage, and data integrity can easily be violated if an update is not reflected in all of the copies of the node it affects.
Hierarchical databases are an efficient approach to storing records that relate to one another in a hierarchical manner. However, records are not stored in tables and thus hierarchical databases do not enjoy the benefits of the relational model. For example, hierarchical databases do not have the ability to relate a table to one or more other tables based on like cell values.
In addition to the inefficient storage of hierarchical structures in an RDBMS, another disadvantage with existing DBMSs is in the manner in which data is stored and searched. Relational databases are designed to store tables of data where each record in the table consists of exactly the same set of fields. A problem arises when records contain fields that pertain only to a subset of the records. For example, where a table stores product information for a diverse set of products, a field used for one product may not be needed for another product. To illustrate, a printer may have printing speed (number of pages per minute) data while a monitor does not. A single record structure used to store data associated with both requires that the record include a printing speed field even though a monitor record does not have printing speed data and therefore would not use the field.
There are various approaches that have been used to address the problem of dissimilar data structures.
A “single table” approach stores all the records in a single table. Since a relational database requires that the set of fields must be the same for all records in a table, all of the attributes (i.e., a data element that is not common to all records) must be treated as if they were common to all the records. The result is a wide table that contains a very large number of fields. These fields must include not only the data elements common to all of the records but also the data elements for the entire set of attributes. In this approach, a record that does not possess a particular attribute simply does not store a data value in the corresponding cell.
The benefit of this approach is that all the records are stored in a single table, which supports efficient searching of the database. However, there is a lot of wasted space used by empty cells that do not and should not have data. Furthermore, it is difficult to distinguish between empty cells that have no data because the data is missing and those that have no data because they don't apply to the category to which the record belongs. Conversely, this approach permits the storage of erroneous data in attribute cells that do not apply to a particular record and should be left empty.
A “table per category” approach partitions the records into different categories based on the attributes used in each record. A table is defined for each category that includes the fields that comprise the category as well as the fields that are common to all the records and the fields common only to the records of that particular category.
This approach does provide for efficient storage of the data. However, as the number of categories increases, so does the number of tables. Data management and searching for records then become increasingly complex and time-consuming because data-based operations require that many tables be accessed.
Conventional data management systems allow a user to change the database schema after the database has been created, but changing table structures on a populated database can often cause loss of existing data. Most systems also have tools to update existing data, but these updates are typically limited to changing individual data values of individual records.
Data management systems lack the tools necessary to perform more extensive restructuring of data, such as bulk updates to large groups of records or changing data types and interrelationships between database records, while still allowing the data to be accessed (e.g., for retrieval and modification). To perform extensive restructuring, the traditional approach is either to define a correct structure prior to importing the date, or export the data to another location (e.g., a file and/or another application), edit and restructure the database and then reimport the data into the modified database structure.
However, the traditional approaches used to edit and restructure database data have several disadvantages. First, the database may be too large to practically export the entire contents into another application. Second, very few applications even provide a comprehensive set of tools to allow data to be edited and structured, and interrelationships between data records to be defined. Finally, the correct structures and interrelationships between the data records may not be immediately obvious, or may change over time, so that structuring the data may be an iterative process that is done on a continuous and ongoing basis, in which case constantly exporting and reimporting the data can become difficult or even impossible.
Thus, it would be beneficial to have a mechanism to represent hierarchies and relationships (e.g., parent/child relationships) within a hierarchy, the mechanism providing a standard approach for such representation. Further, it would be beneficial to have a mechanism for storing data records having dissimilar record structures in the same database table in an efficient manner, and for facilitating in place schema and data manipulation