1. Field Of The Invention
The present invention relates to storage and access of data in a database, and, more particularly, to a method and apparatus and data structures used therein such that wherein data elements with dissimilar structures may be stored in one table, with minimal wasted space, to allow for efficient data access, each data element being stored in a record of the table and having a category designation to associate the record with a set of attributes that are stored in another location, a category hierarchy is provided and allows for attribute inheritance of a parent""s attributes by a child category.
2. Description Of The Related Art
A DBMS (Database Management System) is used to manage data and is comprised of computer-executable code that may be used to define the structure of data and to access data within the defined structure. One example of a DBMS is a relational DBMS, or RDBMS. An RDBMS manages tables that make up a relational database as well as the data contained in the tables. In an RDBMS, data is organized in rows (or records) and columns (or fields) of the tables, and two or more tables may be related based on like data values. The intersection of a row and column in a table is referred to as a cell and contains the data value for a particular field of a particular record.
A DML (data manipulation language) such as SQL (Structured Query Language) is typically used to store, retrieve and modify data in a table. A schema defines the structure of a database, i.e., each table and the fields within a record of a table. A schema is itself considered data that is stored in one or more tables. Therefore, like other data in a database, a DML may be used to store, retrieve and modify the data in the database as well as the structure of a database.
In addition to an RDBMS, other examples of data management approaches include file management systems, flat files and hierarchical database management systems. There are, however, shortcomings with the existing data management approaches.
One such disadvantage has to do with the use of hierarchically-related data in an RDBMS. A hierarchy of data necessarily involves parent/child relationships. For example, a data item can be a parent and/or a child of another data item. In a conventional RDBMS, a row in one table may be related to a row in another table. However, the relationship represents a peer (or same-level) relationship rather than a hierarchical (or parent/child) relationship. Consequently, complicated and/or nonstandard mechanisms are used in the conventional DBMS to represent a parent/child hierarchical relationship using the simple row/column tabular structure of a relational database.
FIGS. 1A through 1C provide an example of a hierarchy in the form of a tree 140 with nodes 100 to 105. Node 100 is a root node, or node with no parent node in the hierarchy. Node 100 and nodes 101 to 103 are internal nodes (i.e., nodes with at least one child). Nodes 104 and 105 are leaf nodes (i.e., a node that has no children).
In an RDBMS, a xe2x80x9csingle tablexe2x80x9d approach may be used to represent the hierarchy depicted in FIG. 1A wherein each node is a record of a table and each record contains one or more fields to store the hierarchical relationships with other nodes. A disadvantage of this approach is that it is very difficult to dynamically alter the structure of the database once it is populated with data. Thus, it is necessary to pre-allocate a specific number of fields that may be needed to store the hierarchical relationships which may be inaccurate. If there are not enough fields allocated for storing hierarchical relationships, the structure of the database (e.g., the structure of the table containing the fields that store the hierarchical relationships) must be altered before storing additional relationships in the database. Conventional approaches to re-structuring a populated database requires that existing data be moved to new tables and introduces the possibility of lost data. Thus, it has been thought to be better to err on the side of too many fields. However, this approach results in wasted space as not all records will use the maximum number of fields allocated to store the hierarchical relationships.
Referring to FIG. 1B, database 141 includes a record (in one or more tables of the database) for each of nodes 100 through 105. Specifically, record 110 corresponds to node 100 and contains a field 130 that is used to identify record 110 (i.e., a record identification, or record ID) in the database. Field 130 is also included in records 111 through 113 and contains the record ID of record 110 indicating that records 111 through 113 are child records of record 110. Records 111 through 113 each include a record ID in fields 121 through 123, respectively. Since record 114 is a child of record 111, it includes field 121 that contains the record ID of record 111.
Node 105 of tree 140 has two parents (i.e., nodes 102 and 103). To reflect this structure, record 115 of database 141 must include two fields (i.e., fields 122 and 123) that each refer to one of the parent records of record 115. Since there is no easy mechanism for dynamically altering the structure, or schema, of the table, an attempt must be made to accurately predict the number of parents a record may have when creating the database schema. As discussed above, there are tradeoffs and drawbacks with both a low and high prediction. Further, it is virtually certain that there will be wasted space no matter what the prediction.
Another approach used in an RDBMS borrows from the file management approach that is used in most operating systems and is hierarchical in nature. In a file management system, the nodes consist of a directory and a file. Each directory is an internal node of the hierarchy, while each file is a leaf node of the hierarchy. Where a file (or directory) has multiple parents, multiple copies are needed such that a copy of the file (or directory) is stored in each of the parent directories.
Using this approach in an RDBMS, each node of the hierarchy is represented as a table in the RDBMS. However, the number of tables increases with the number of internal nodes, which tends to be proportional to the size of the hierarchy. It is also extremely difficult and complex to navigate between the nodes of the hierarchy since each node is located in a separate table.
Using this approach, for example, records 110 through 115 in FIG. 1B become single-record tables with two tables created for record 115. FIG. 1C illustrates the new structure used for record is 115. Each of single-record tables 155 and 156 contain a copy of the data contained in record 115 with the exception of fields 122 and 123. Table 155 includes field 122 that identifies Node 102 as its parent, and table 156 includes field 123 to identify node 103 as its parent. Tables 155 and 156 are duplicates with the only difference between the two tables being the reference to the parent table.
Thus, a disadvantage of the file management approach is that when, as in the case of node 105, a node with multiple parents, multiple copies of the node need to be maintained as separate tables. This is wasteful of space and difficult to manage, and data integrity can easily be violated if an update is not reflected in all of the copies of the node it affects.
Hierarchical databases are an efficient approach to storing records that relate to one another in a hierarchical manner. However, records are not stored in tables and thus hierarchical databases do not enjoy the benefits of the relational model. For example, hierarchical databases do not have the ability to relate a table to one or more other tables based on like cell values.
In addition to the inefficient storage of hierarchical structures in an RDBMS, another disadvantage with existing DBMSs is in the manner in which data is stored and searched. Relational databases are designed to store tables of data where each record in the table consists of exactly the same set of fields. A problem arises when records contain fields that pertain only to a subset of the records. For example, where a table stores product information for a diverse set of products, a field used for one product may not be needed for another product. To illustrate, a printer may have printing speed (number of pages per minute) data while a monitor does not. A single record structure used to store data associated with both requires that the record include a printing speed field even though a monitor record does not have printing speed data and therefore would not use the field.
There are various approaches that have been used to address the problem of dissimilar data structures.
A xe2x80x9csingle tablexe2x80x9d approach stores all the records in a single table. Since a relational database requires that the set of fields must be the same for all records in a table, all of the attributes (i.e., a data element that is not common to all records) must be treated as if they were common to all the records. The result is a wide table that contains a very large number of fields. These fields must include not only the data elements common to all of the records but also the data elements for the entire set of attributes. In this approach, a record that does not possess a particular attribute simply does not store a data value in the corresponding cell.
The benefit of this approach is that all the records are stored in a single table, which supports efficient searching of the database. However, there is a lot of wasted space used by empty cells that do not and should not have data. Furthermore, it is difficult to distinguish between empty cells that have no data because the data is missing and those that have no data because they don""t apply to the category to which the record belongs. Conversely, this approach permits the storage of erroneous data in attribute cells that do not apply to a particular record and should be left empty.
A xe2x80x9ctable per categoryxe2x80x9d approach partitions the records into different categories based on the attributes used in each record. A table is defined for each category that includes the fields that comprise the category as well as the fields that are common to all the records and the fields common only to the records of that particular category.
This approach does provide for efficient storage of the data. However, as the number of categories increases, so does the number of tables. Data management and searching for records then become increasingly complex and time-consuming because data-based operations require that many tables be accessed.
Conventional data management systems allow a user to change the database schema after the database has been created, but changing table structures on a populated database can often cause loss of existing data. Most systems also have tools to update existing data, but these updates are typically limited to changing individual data values of individual records.
Data management systems lack the tools necessary to perform more extensive restructuring of data, such as bulk updates to large groups of records or changing data types and interrelationships between database records, while still allowing the data to be accessed (e.g., for retrieval and modification). To perform extensive restructuring, the traditional approach is either to define a correct structure prior to importing the date, or export the data to another location (e.g., a file and/or another application), edit and restructure the database and then reimport the data into the modified database structure.
However, the traditional approaches used to edit and restructure database data have several disadvantages. First, the database may be too large to practically export the entire contents into another application. Second, very few applications even provide a comprehensive set of tools to allow data to be edited and structured, and interrelationships between data records to be defined. Finally, the correct structures and interrelationships between the data records may not be immediately obvious, or may change over time, so that structuring the data may be an iterative process that is done on a continuous and ongoing basis, in which case constantly exporting and reimporting the data can become difficult or even impossible.
Thus, it would be beneficial to have a mechanism to represent hierarchies and relationships (e.g., parent/child relationships) within a hierarchy, the mechanism providing a standard approach for such representation. Further, it would be beneficial to have a mechanism for storing data records having dissimilar record structures in the same database table in an efficient manner, and for facilitating in-place schema and data manipulation.
The present invention addresses the foregoing problems and provides for the storage of data items with dissimilar structures in a set of tables while minimizing the wasted space typical of the prior art single table approach. The number of tables in the set of tables may be set, or fixed.
In one aspect of the invention, a method of efficiently storing data items in a database management system (DBMS) is provided wherein each data item has at least one attribute associated with it, the method comprising identifying a set of attributes associated with each data item, creating, for use with all of the data items, a common data structure that excludes those attributes unique to any one of the data items, and associating each data item with attributes excluded from the data structure.
That is, the elements of data (e.g., price, description, print speed and resolution) for each of the data items (e.g., monitor, printer, and scanner) to be stored in the database are identified. The data elements that all of the data items have in common are used to create a record structure for a table in which each of the common data elements is a field in the record. Further, the common record structure contains a reference that associates a data item with those data elements, or attributes, that are not common to all other of the data items to be stored using the common record structure.
Thus, it is possible using the present invention to store data items in a database table using a single record structure while eliminating wasted space typical of the convention single table approach. Contrary to prior single table approaches, there is no need to include the fields from all of the data items in each record of the table regardless of whether a field is used to store data for the data item. Rather, zero or more attributes are associated with each category, and a category is associated with each data item with a dissimilar structure. By virtue of the association with a category, it is possible to identify the fields that arc needed to store values associated with the data item in the database.
A single record structure is used to store a data item regardless of the actual fields needed to store data values associated with a data item. Accessing records in a single table is more efficient than accessing multiple tables. Further, there is no need to maintain fields that are not being used by a particular data element.
In addition to the table that contains the data item, one or more tables are used to define the categories, the attributes (or fields), the relationships between the attributes and the categories and the values associated with the attributes of the data item.
Properties may be defined for each attribute such as type. Examples of attribute type are feature and characteristic. An attribute of type feature has enumerated values from which a value (or values) for the attribute may be selected. Examples of feature attributes are color and material. The set of enumerated values is known as the feature domain. A feature can also be defined as multi-valued, which allows a feature to store multiple values for a single attribute.
A characteristic attribute is used when the possible values are not enumerated values but are continuous or numeric values, for example. A characteristic attribute is an attribute that stores continuous, numeric values. Attributes such as temperature and weight are characteristic attributes. They can further be broken down into different ratings that describe what the characteristic attribute is measuring. There can be multiple ratings such as nominal, minimum, maximum, average and typical. Other properties such as value and physical dimension (such as length or speed) with particular units of measure (such as meter or inch) may also be associated with an attribute.
In another aspect of the invention, properties may be associated with a category such as whether or not the category is an alias of another category or is hidden. Category aliases allow a category to exist in multiple places within the category hierarchy. As is discussed in more detail below, an alias designation may effect whether or not a category inherits attributes from a parent category.
Yet another aspect of the present invention provides a mechanism for representing hierarchical relationships between nodes in the hierarchy separate from the nodes themselves. In so doing, it is possible to modify the interrelationships between nodes in a hierarchy without actually modifying the nodes. For example, using the present invention a hierarchy of categories may be created such that the category hierarchy is stored separate from the category nodes. Further, the attributes associated with a parent category may be inherited by a child category.
At least one table is provided wherein each record in the table corresponds to a node in the hierarchy. The relationship(s) between data items (or nodes) in records of the table are stored separately in a relationship table. That is, the nodes and the relationships between the nodes are stored separately in one or more tables and together comprise a hierarchical structure. As with all tables in an RDBMS, node and relationship tables are relatable using the relational capabilities of the RDBMS.
Since hierarchical relationships are stored separate from the data items, it is very efficient to navigate between the data items in a hierarchy. Furthermore, if additional information needs to be associated with a hierarchical relationship, it can easily be added to the relationship table, without effecting the data item table. In addition, multiple parentage is possible without the need for duplicate copies of data items by separating the storage of the data items from the relationships among the data items.
In another aspect of the invention, the relationship table contains position information associated with the children which is stored as part of a parent/child link. Advantageously, a node""s position among other nodes in the same level of the hierarchy may be determined from the information stored in the relationship table.
In yet another aspect of the invention, a matching set defines a relationship between two or more nodes by identifying one or more matching attributes whose values may be compared to determine whether a match exists. Thus, given one node, the matching data items in another node may be identified.
In yet another aspect of the invention, hierarchy aliases provide an alternative to multiple parents that also allow a leaf node to exist in multiple locations in the hierarchy. Whereas a single node has multiple parents, a hierarchy alias is a reference to another node where both the reference and the original each have only a single parent.
In another aspect of the invention, hidden nodes are provided to support the ability to have both a detailed view and a consolidated view of the underlying data hierarchy.
This brief summary has been provided so that the nature of the invention may be understood quickly. A more complete understanding of the invention can be obtained by reference to the following detailed description of the preferred embodiment(s) thereof in connection with the attached drawings.