1. Field of the Invention
Embodiments of the invention described herein pertain to the field of computer databases. More particularly, but not by way of limitation, one or more embodiments of the invention enable a method of representing an XML schema definition and data within a relational database management system using a reusable custom-defined nestable compound data type.
2. Description of the Related Art
Early computer database models required designers to access data through low level pointer operations that linked records through physical addresses. Querying the database required an understanding of the physical structure of the tables. To create a school registration system, for example, a designer might create a large table containing a row for every student. Such a table would require columns for data about the students (such as ID number), classes registered for, dates and times of the student's classes, class locations and teacher for each class. Such a table is easy to understand, but would waste space repeating data items such as a teacher's name in row after row of students registered for the same class.
Relational database management systems (RDBMS) were developed in the 1970's to provide database designers with a method of efficiently organizing data in an abstract form that takes advantage of the natural relationship of data items to each other. E. F. Codd's idea to organize data by relationships separate from its physical storage method was revolutionary. A relational database provides a more flexible solution in part because the abstraction created a more maintainable and efficient data model. For example, a database designer could create a database design (schema) that represents a school registration system by placing data regarding students in one table, teachers in another table, classes offered in a third table and class registrations in a fourth table. Each table would have columns (attributes) representing only the kinds of information relevant to the data stored in that table. The data itself would be stored in table rows (records). In a given record, the attributes may be referred to as fields of the record. Thus, the student table would have one row for each student; the teacher table would have one row for each teacher; and so on. Each student record would have fields for a student ID number and student name.
FIG. 1 illustrates a relational database schema for a simplified school registration system. The design reduces storage and highlights the relationship between the various data items. Because multiple students will register for the same class, information about the class is not required in each registration record; instead, only a reference (pointer) to the class information is required. Computer memory need not be wasted on placeholders or repeated data. Each table in the relational model includes a small primary key field that makes each row of the table unique, even if a student, for example, unintentionally registered for the same class twice. In FIG. 1, attributes 115, 125, 135, 145, and 155 represent primary key fields.
During design or maintenance it may become apparent that a one-to-many relationship is required between data items in a record. Typically, the solution is to break up such data into separate tables. Unfortunately, this solution may cause the relationship between the records to be become obscured The RDBMS designer solves this problem by including “foreign keys” as well as primary keys in the record of the new table. While a primary key makes a row unique, a foreign key relates the data in a row of one table to data in a row of another table. In an RDBMS a table is also referred to as a relation. Thus, in Registration Table 130, RID 135 is a primary key, while COID 156 and SID 116 are foreign keys that relate each registration record to a student (a record in Student Table 110) and a class offering (a record in Class Offering Table 150). Class Offering Table 150 contains primary key COID 155, and foreign keys CID 146 (from Class Table 140) and TID 126 (Teachers table 120). This schema, when populated with data, can produce a course catalog report (using Classes Table 140 and Class Offering Table 150), a Teacher schedule (using Teachers Table 120 and Class Offering Table 150), a student schedule showing classes, class titles, credits, class location, and teacher's name (using all tables 110, 120, 130, 140 and 150), all with a minimum of duplicated data and wasted space.
A database designer generally creates a database schema, such as that illustrated by FIG. 1, manually. The designer must gather information regarding all the required reports, data available, import and export requirements and the relationships between the data items and then must synthesis the information into a database schema such as that depicted in FIG. 1. For large databases with complex relations, schema design may require a team of designers with sophisticated design tools and a significant amount of design time. Each relation and attribute will be described in a data dictionary and individually instantiated.
Traditionally, the definition of a set of related fields for a conventional database table is neither reusable nor portable. Neither does current schema design systems support flexible or extensible user defined data constructs. When a designer creates a database schema, creates tables to support the schema, and then populates the tables with data, the design becomes fixed. If the database design or organization must later be changed, it will often require export of all the data, redesign of the database, re-creation of the tables and importation of the data into the new schema—if such an operation is even practical at all. Thus, RDBMS schemas for large database systems are complex to reuse and complex to port from one system to another. Even mining data from a large system to import into an alternative schema can be an overwhelming task. Yet today, database evolution and sharing of data between unrelated databases is an essential part of the way we work and do business.
Various approaches to managing data import for large databases into RDBMS systems have been attempted, but all suffer from an inherent drawback of the RDBMS concept—that the data relationships in an RDBMS are not exposed from the source to the destination. FIG. 1A illustrates the direction of relationships in the example of FIG. 1. A designer maintaining Class Offering Table 150 is not exposed to the fact that COID attribute 155 is relied upon by Registration Table 130 through reference 157. The same can be said for relations 117, 127, and 147. Thus, the dependency of one table on another is obscured. Further, the view presented to a designer is a “decomposed” view of all the records of one table or all the records of another table. When designs include teams of designers and hundreds of tables, this lack of visibility becomes a disadvantage.
Another disadvantage of a traditional RDBMS schema approach is that, in many instances, data relationships in the real world are nested by nature. In a contact management system, a single contact may have more than one address (business, home, mailing, billing, and shipping, for example) and phone number, email account and reminder date associated with the single entry. FIG. 2 illustrates a simple contact management system in an RDBMS. Each Customer has an identifier (ID 220), a name, street, city, state, zip and country. However, Customers Table 210 cannot support allowing Customer 220, Joe's Pizza, to have two addresses—one for delivery and one for billing, for example.
A traditional design solution would break the relation into two tables to allow the data to have a one-to-many relationship between some information and other information. FIG. 2A illustrates this traditional solution. Customers Table 210′ now has two attributes (columns), a primary key and a customer name. Address Table 230′ contains the fields removed from Customer Table 210 of FIG. 2, with the addition of the foreign key attribute, CustID. Now, a search for Joe's Pizza (CustID 260) will return two entries from Address Table 230′: Row 250 (Addr ID 901) and Row 260 (Addr ID 904), because both have a matching CustID 220 (containing 1001.) Ideally, at least one additional attribute would be added to Address Table 230′, not shown in the Figure, to indicate the purpose of the address stored in each row of Address Table 230′.
At least one disadvantage of this traditional approach is that there is no exposure for an observer of Customers Table 210′ to the relationship of customer name and customer address because of the RDBMS primary key-foreign key design approach. While this lack of visibility seems trivial in this example, in a database with hundreds of tables the lack of exposure may cause errors in the maintenance of the database. Furthermore, the designer viewing either table sees a “decomposed” view of all the records of one table or all the records of the other table.
For an RDBMS containing millions of records, the problems of expansion and maintenance are only amplified. Traditional RDBMS systems and techniques may be insufficient to produce a manageable and maintainable design for very large databases. Yet many businesses would like to be able to automatically add new supplier, customer, and product data, for example, to a very large RDBMS. Such data is often available in formats such as eXtensible Markup Language (XML), Unified Modeling Language (UML), Hypertext Markup Language (HTML), comma separated variable format, tab-delineated text, Data Interchange Format (DIF), electronic spreadsheet formats, and many others. However, this data generally reflects the structure of the source database, which may be different from that of the destination. It would be advantageous to be able to insulate the importer from the schema of the destination database. This may be challenging in very large systems, however, because it places the burden of understanding and establishing the hierarchical nature of the data on the importing program. The importing program must decompose the source data into tables and discern and establish the “backward referencing” relationships (parent record primary key as a foreign key in the child) required of an RDBMS database, such as those like Relationship 240. This could require multiple passes over the data. With very large amounts of data, such limitations may be a significant obstacle to correct data import, schema expansion and/or maintenance of the database.
Thus, for at least the limitations described above, there is a need for a reusable custom-defined nestable data construct for use in database schema design. Ideally, such a construct would comprise a customizable compound user-define data type. A system embodying such a construct would allow a database designer to define a customizable compound record template that groups together and names a set of fields into a nestable, reusable object or type definition that describes or reflects the relation between the fields, without requiring the type to be instantiated into a table. The construct would not require the designer to redesign or rebuild existing tables when the compound data type is changed, but would instead propagate the change transparently. Finally, the construct would be nestable to increase the ability to decompose data relationships and reuse definitions.
It would also be advantageous to be able to automatically generate and/or expand a very large database using imported data using a customizable compound user-defined data type to more naturally represent hierarchical data relationships. However, automating such tasks requires tools capable of automatically interpreting data relationships, generating an appropriate database schema, and then populating that schema with the nested data. Thus, for at least the limitations described above, there is a need for a method of representing an XML schema definition and data within a relational database management system using a reusable custom-defined nestable compound data type. Such a method would enable a clean mapping between XML data utilized outside the context of an RDBMS and XML data stored within an RDBMS. Furthermore, a method of representing an XML schema definition and data within a relational database management system using a reusable custom-defined nestable compound data type would also be advantageous.