1. Field of the Invention
The present invention generally relates to relationship building from XML and, more particularly, to extracting relationships from XML documents and creating corresponding relationships for a relational database.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A distributed database is one that can be dispersed or replicated among different points in a network. An object-oriented programming database is one that is congruent with the data defined in object classes and subclasses.
A relational database management system (RDBMS) is a computer database management system that uses relational techniques for storing and retrieving data. An RDBMS can be structured to support a variety of different types of operations for a requesting entity (e.g., an application, the operating system or an end user). Such operations can be configured to retrieve, add, modify and delete information being stored and managed by the RDBMS. Standard database access methods support these operations using high-level query languages, such as the Structured Query Language (SQL). The term “query” denominates a set of commands that cause execution of operations for processing data from a stored database. For instance, SQL supports four types of query operations, i.e., SELECT, INSERT, UPDATE and DELETE. A SELECT operation retrieves data from a database, an INSERT operation adds new data to a database, an UPDATE operation modifies data in a database and a DELETE operation removes data from a database.
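As a minimal sketch of the four query operations named above, the following Python snippet runs INSERT, UPDATE, DELETE and SELECT against an in-memory SQLite database. The table name `gene`, its columns and the sample values are assumptions chosen for illustration only; they are not taken from any particular RDBMS schema.

```python
import sqlite3


def run_four_operations():
    """Run INSERT, UPDATE, DELETE and SELECT on a throwaway in-memory table."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical table used only for this illustration.
    cur.execute(
        "CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT, expression REAL)"
    )

    # INSERT adds new data to the database.
    cur.execute("INSERT INTO gene VALUES (?, ?, ?)", (1, "geneA", 0.5))
    cur.execute("INSERT INTO gene VALUES (?, ?, ?)", (2, "geneB", 1.5))

    # UPDATE modifies data already in the database.
    cur.execute("UPDATE gene SET expression = ? WHERE id = ?", (0.75, 1))

    # DELETE removes data from the database.
    cur.execute("DELETE FROM gene WHERE id = ?", (2,))

    # SELECT retrieves data from the database.
    rows = cur.execute(
        "SELECT id, name, expression FROM gene ORDER BY id"
    ).fetchall()

    conn.close()
    return rows


if __name__ == "__main__":
    print(run_four_operations())
```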
One advantage of an RDBMS is its capacity to process great volumes of data, such as micro array data obtained from micro array-based experiments. Micro array data describes information related to the manufacturing and design of micro arrays as well as information related to experiment setup and execution. Furthermore, micro array data can describe gene expression data and analysis results. Description and communication of micro array data can be performed using a standardized text-based markup language: MAGE-ML (MicroArray Gene Expression Markup Language). Specifically, MAGE-ML is based on XML (Extensible Markup Language) and defines all required elements for supporting gene expression data. Accordingly, MAGE-ML represents gene expression data by corresponding XML documents. The XML documents can be mapped to the underlying relational model of an RDBMS. Thus, the micro array data in the RDBMS can be queried for data extraction and navigation using SQL.
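The mapping of an XML document to relational rows, followed by an SQL query, might be sketched as below. The element and attribute names (`Experiment`, `Array`, `Gene`, `expression`) are hypothetical and are not actual MAGE-ML elements, and the single-table layout is an assumption made only to keep the example short.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; the tag names are NOT real MAGE-ML elements.
SAMPLE_XML = """
<Experiment name="exp1">
  <Array id="a1">
    <Gene id="g1" expression="0.5"/>
    <Gene id="g2" expression="1.5"/>
  </Array>
</Experiment>
"""


def load_and_query(xml_text, min_expression):
    """Map Gene elements to rows, then query them with SQL."""
    root = ET.fromstring(xml_text.strip())
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (id TEXT PRIMARY KEY, array_id TEXT, expression REAL)"
    )

    # Each Gene element becomes one row; the enclosing Array id is kept
    # as a column so that the parent-child link is not discarded.
    for array in root.iter("Array"):
        for gene in array.findall("Gene"):
            conn.execute(
                "INSERT INTO gene VALUES (?, ?, ?)",
                (gene.get("id"), array.get("id"), float(gene.get("expression"))),
            )

    rows = conn.execute(
        "SELECT id, array_id, expression FROM gene "
        "WHERE expression > ? ORDER BY id",
        (min_expression,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(load_and_query(SAMPLE_XML, 1.0))
```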
One difficulty when mapping MAGE-ML documents, and more generally XML documents, to a relational database is representing the relationships from the XML documents in the relational database. For instance, assume micro array data described in MAGE-ML by XML files that are hundreds of megabytes long and contain tremendous amounts of gene data. These XML files contain hierarchical tree structures in which the various nodes of the tree structures are related. However, during an automated process of mapping the XML files to a relational database, the hierarchical relationships are lost. Further, in the resulting relational database there can be multiple relationship paths between the relational tables. As a result, a given SQL query may traverse any one of a variety of relationship paths in order to access and return the requested data. Unfortunately, the result set returned to the user depends on the particular relationship path traversed. By way of example, assume that the relational database includes seven tables A, B, C, D, E, F and G and that it is possible to get from A to C via a first path A->B->C and via a second path A->G->F->E->D->C. Assume now that a first result set is returned for the given SQL query if the first path is taken and that a second result set is returned if the second path is taken. Assume further that, according to the relationships defined in the XML files, the second path should be taken to determine a correct result set and that the first path leads to an incorrect result set. Thus, in order to guarantee that the correct result set is returned in response to the given SQL query, the relational database needs to represent the relationships defined by the XML files.
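The seven-table example above can be illustrated with a small sketch in which the same starting row in table A reaches different rows in table C depending on the join path. The table names A through G come from the example; the foreign-key columns (`b_id`, `g_id` and so on), the labels and the single-row data are assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical schema: each table holds a foreign key to the next table
# on its path. A links to both B (first path) and G (second path).
SCHEMA_AND_DATA = """
CREATE TABLE C (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE B (id INTEGER PRIMARY KEY, c_id INTEGER REFERENCES C(id));
CREATE TABLE D (id INTEGER PRIMARY KEY, c_id INTEGER REFERENCES C(id));
CREATE TABLE E (id INTEGER PRIMARY KEY, d_id INTEGER REFERENCES D(id));
CREATE TABLE F (id INTEGER PRIMARY KEY, e_id INTEGER REFERENCES E(id));
CREATE TABLE G (id INTEGER PRIMARY KEY, f_id INTEGER REFERENCES F(id));
CREATE TABLE A (id INTEGER PRIMARY KEY,
                b_id INTEGER REFERENCES B(id),
                g_id INTEGER REFERENCES G(id));
INSERT INTO C VALUES (1, 'reached via B');
INSERT INTO C VALUES (2, 'reached via G-F-E-D');
INSERT INTO B VALUES (1, 1);
INSERT INTO D VALUES (1, 2);
INSERT INTO E VALUES (1, 1);
INSERT INTO F VALUES (1, 1);
INSERT INTO G VALUES (1, 1);
INSERT INTO A VALUES (1, 1, 1);
"""

# First path: A->B->C
FIRST_PATH = """
SELECT C.label FROM A
JOIN B ON A.b_id = B.id
JOIN C ON B.c_id = C.id
WHERE A.id = ?
"""

# Second path: A->G->F->E->D->C
SECOND_PATH = """
SELECT C.label FROM A
JOIN G ON A.g_id = G.id
JOIN F ON G.f_id = F.id
JOIN E ON F.e_id = E.id
JOIN D ON E.d_id = D.id
JOIN C ON D.c_id = C.id
WHERE A.id = ?
"""


def query_both_paths(a_id=1):
    """Return the C labels reached from row a_id of A along each path."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA_AND_DATA)
    first = [row[0] for row in conn.execute(FIRST_PATH, (a_id,))]
    second = [row[0] for row in conn.execute(SECOND_PATH, (a_id,))]
    conn.close()
    return first, second


if __name__ == "__main__":
    first, second = query_both_paths()
    print("A->B->C:", first)
    print("A->G->F->E->D->C:", second)
```

Both queries are valid SQL against the same data, yet they return different result sets; nothing in the schema alone indicates which path reflects the relationships defined in the original XML files.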
Therefore, there is a need for an efficient technique for extracting relationships from XML files and creating corresponding relationships for a relational database.