1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to an optimized method and system for decomposing markup based documents, such as XML documents, into a relational database wherein multiple items are decomposed into the same table-column pair without dedicated mapping constructs.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
Extensible Markup Language (XML) is a standard data-formatting mechanism used for representing data on the Internet in a hierarchical data format and for information exchange. An XML document consists of nested element structures, starting with a root element.
Decomposition of an XML document is the process of breaking the document into component pieces and storing those pieces in a database. The specification of the pieces and where they are to be stored is accomplished by means of a mapping document. One format of mapping document is the Document Access Definition (DAD), used by the DB2 XML Extender v7 and v8 to provide the decomposition function. Another type of mapping document takes the form of a set of XML schema documents that describe the structure and data types used in conforming XML instance documents. These XML schema documents are augmented with annotations that describe the mapping of XML components into tables/columns in a relational database. Annotations are a feature of XML schema that allows application-specific information to be supplied to programs processing the schema or instance documents.
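The idea of an annotated XML schema serving as a mapping document can be sketched as follows. This is a minimal illustration only: the annotation namespace `urn:example:mapping`, the attributes `map:table` and `map:column`, the column names, and the helper `read_mappings` are all hypothetical and do not reproduce the annotation vocabulary of any particular product. Only the table name "branches" is taken from the example discussed in this section.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
MAP = "urn:example:mapping"  # hypothetical annotation namespace (assumption)

# A tiny schema whose element declarations carry hypothetical mapping
# annotations naming the target table and column.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:map="urn:example:mapping">
  <xs:element name="name" type="xs:string"
              map:table="branches" map:column="name"/>
  <xs:element name="phone" type="xs:string"
              map:table="branches" map:column="phone"/>
</xs:schema>
"""

def read_mappings(schema_text):
    """Return {element name: (table, column)} for annotated declarations."""
    root = ET.fromstring(schema_text)
    mappings = {}
    for decl in root.iter(f"{{{XS}}}element"):
        table = decl.get(f"{{{MAP}}}table")
        column = decl.get(f"{{{MAP}}}column")
        if table and column:
            mappings[decl.get("name")] = (table, column)
    return mappings

MAPPINGS = read_mappings(SCHEMA)
```

A decomposition program could consult such a dictionary to decide where the content of each element in an instance document is to be stored.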
At least one conventional decomposition product using XML schemas is limited because it can only map a single item into a table-column pair. The problem is best described by the example of FIGS. 1A-1B, which illustrate an XML document.
The XML document of FIGS. 1A-1B contains branches of a company. Each branch has a name, phone number and address. Branches in the USA are allowed to have sub-branches under them. This is done by the use of element “sub-branches” as a child element of branches and as the next sibling of the element “phone”. In addition, provision is made to accommodate companies that have branches in countries other than the USA, by putting such branches under the element “other-countries”.
The aim is to create an address book of all the branches and sub-branches in the company. The desired result of decomposing the above XML document into a table “branches” of a relational database is shown in FIG. 2. It is clear from the expected output that items from various parts of the XML document, with the same and/or different element names, such as “name”, “address”, “address1”, and “phone”, are being mapped into the same table-column pair, although they belong to different branch types, namely, USA branches, USASubBranches or NonUSABranches.
For the XML document of FIGS. 1A-1B, when multiple items are mapped into the same table-column pair, care has to be taken to associate the correct branch with the correct address and phone number, as there are multiple names, phone numbers and addresses in the document. However, it is not guaranteed that a related name, address and phone number will appear sequentially, as is shown in the case of a branch having sub-branches, where the sub-branch address appears before the parent branch's address. Therefore, in conventional systems there is a problem of identifying the items in the XML document that belong to the same row of the database table, as we do not want to put the phone number of a branch and the address of its sub-branch in the same row. More generally stated, there is a problem in conventional methods for decomposition of XML documents, where multiple items are being mapped into the same table-column pair, in identifying the items in the XML document that belong to the same row.
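The row-association problem described above can be illustrated with a short sketch. The document below is a hypothetical fragment, not the document of FIGS. 1A-1B: the element names "company" and "branch" and all data values are assumptions made for illustration. It keeps the one structural property that matters, namely that "sub-branches" follows "phone", so the sub-branch address precedes the parent branch's address. Grouping by enclosing element is shown only to state the desired result; it is not presented as the method of the invention.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document (element names and values are assumptions).
DOC = """
<company>
  <branch>
    <name>Main</name>
    <phone>555-0100</phone>
    <sub-branches>
      <branch>
        <name>Annex</name>
        <phone>555-0101</phone>
        <address>2 Side St</address>
      </branch>
    </sub-branches>
    <address>1 Main St</address>
  </branch>
</company>
"""

root = ET.fromstring(DOC)

# Naive approach: collect each column's values in document order and pair
# them up by position. This assumes related items appear sequentially.
names = [e.text for e in root.iter("name")]
phones = [e.text for e in root.iter("phone")]
addresses = [e.text for e in root.iter("address")]
naive_rows = list(zip(names, phones, addresses))

# Desired association: take each row's items from the direct children of
# one enclosing "branch" element.
grouped_rows = [
    (b.findtext("name"), b.findtext("phone"), b.findtext("address"))
    for b in root.iter("branch")
]
```

Here `naive_rows` pairs the parent branch "Main" with the sub-branch address "2 Side St", which is exactly the kind of mixed row that must be avoided, while `grouped_rows` keeps each branch's name, phone number and address together.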
While various techniques have been developed for decomposing markup based documents, such as XML documents, and storing them in a database, there is a need for a simple, optimized, transparent and generic method which will allow decomposition of multiple information items from an XML document into the same table-column pair, without needing dedicated mapping constructs.