International Publication No. WO 98/34179 (PCT/AU98/00050) in the name of Time Base Pty Ltd and published on 6 Aug. 1998 and counterpart U.S. Pat. No. 6,233,592 issued on 15 May 2001 to Schnelle et al. are incorporated herein by cross reference. In these documents, an electronic publishing system is disclosed that provides a sparse multidimensional matrix of data using a set of flat file records. In particular, the computer-implemented system publishes an electronic publication using text-based data. Predefined portions of the text-based data are stored and used for the publication. At least one of the predefined portions is modified, and the modified version is stored as well. The predefined portion is typically a block of text, greater in size than a single word, but less than an entire document. Thus, for example, in the case of legislation, the predefined portion may be a section of the Act. Each predefined portion and the modified portion(s) are marked up with one or more links using a markup language, preferably SGML or XML. The system also has attributes, each being a point on an axis of a multidimensional space for organising the predefined portions and the modified portion(s) of the text-based data. This system is simply referred to as the Multi Access Layer Technology or “MALT” system hereinafter.
Australian Patent Application No. 65470/00 filed on 12 Oct., 2000 in the name of TimeBase Pty Ltd, Canadian Patent Application No. 2323245 filed on 12 Oct., 2000 in the name of TimeBase Pty Ltd, New Zealand Patent Application No. 507510 filed on 12 Oct., 2000 in the name of TimeBase Pty Ltd and U.S. patent application Ser. No. 09/689927 filed on 12 Oct., 2000 in the names of Lessing et al. are incorporated herein by cross reference.
U.S. patent application entitled “Resilient Data Links” filed on 18 Jul., 2001 in the names of Schnelle and Nolan is also incorporated herein by cross reference. In this document, a method, an apparatus and a computer program product for providing one or more resilient links in an electronic document are described. The methodology disclosed is referred to as “MALTlink” hereinafter.
Large or complex text-based datasets are typically hierarchical in nature. In the storage, maintenance and publication of such data, it is common to use a markup language capable of describing such hierarchies. XML is one such markup language and is commonly used, particularly in the print, electronic or online publishing industries, and for government or public records or technical documentation. XML data is typically stored either in “flat” text files encoded in ASCII, Unicode, or another standard text encoding, or in a “native” XML database.
The flat text files may be part of a document management system. Such a document management system may be based on a relational database. Document management systems deal with a document as a whole and are able to store relevant data about each document. However, document management systems are typically not designed to operate on data (XML elements) within such documents. Consequently, a document management system does not typically operate on all (or even a substantial number of the) XML elements contained in the flat text files on which the document management system is operating. An XML database, in contrast, operates on all XML elements of the XML data that the XML database is storing and, consequently, XML databases must manage large amounts of data and detail. As a result, document management systems have limited usefulness resulting from a lack of precision, and XML databases are overwhelmed by the multiplicity of XML elements that are to be managed.
Attempts have been made to transform XML data into a set of SQL relational database tables. SQL is a database technology that provides a user with powerful query functionality and powerful data management tools. SQL possesses the stability of a mature technology, whereas XML databases are still a relatively immature technology, and thus possess a degree of instability. SQL is a fast and efficient technology, and a wide choice of software and hardware manufacturers offer or support SQL databases.
Tree mapping techniques are typically used to convert XML data into relational databases. Conventional tree mapping techniques, however, often attempt to capture all of the document hierarchy. This is almost never necessary and can result in substantial size and performance penalties in the resulting SQL tables. Such tree mapping techniques typically result in a far larger number of SQL tables than is necessary.
As an example, consider the XML fragment shown in FIG. 1. A classical approach to conversion is to represent the element tree with one table per element type, possibly with an added table to store the tree structure. A correct, and possibly even reversible, outcome results. However, the performance and management advantages (which prompted the conversion in the first place) can be diminished or even lost entirely, because of the size and complexity of the resulting tables.
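By way of illustration only, the classical approach described above can be sketched as follows. The sketch assumes a small hypothetical XML fragment (not the fragment of FIG. 1), Python's standard `xml.etree.ElementTree` and `sqlite3` modules, and hypothetical table names (`tree`, `elem_<name>`). It is intended only to show how quickly the number of tables grows when every element type receives its own table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration; NOT the fragment of FIG. 1.
SAMPLE_XML = (
    "<act><title>Example Act</title><part>"
    "<section><para>First <b>bold</b> text.</para></section>"
    "<section><para>Second text.</para></section>"
    "</part></act>"
)


def classical_tables(xml_text):
    """Map every element type to its own table, plus one structure table."""
    root = ET.fromstring(xml_text)
    conn = sqlite3.connect(":memory:")
    # Added table that stores the tree structure (assumed layout).
    conn.execute(
        "CREATE TABLE tree (id INTEGER PRIMARY KEY, parent_id INTEGER, element TEXT)"
    )
    created = set()
    next_id = 1

    def visit(elem, parent_id):
        nonlocal next_id
        my_id = next_id
        next_id += 1
        table = f'"elem_{elem.tag}"'
        if elem.tag not in created:
            # One table per element type, created on first encounter.
            conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, text TEXT)")
            created.add(elem.tag)
        conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (my_id, elem.text))
        conn.execute("INSERT INTO tree VALUES (?, ?, ?)", (my_id, parent_id, elem.tag))
        for child in elem:
            visit(child, my_id)

    visit(root, None)
    conn.commit()
    return conn


def table_names(conn):
    """List the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return sorted(name for (name,) in rows)
```

In this sketch, a fragment with only six distinct element types already yields seven tables, including one for an inline element such as `b` that carries no hierarchical significance.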
Thus, a need exists for providing an efficient method for converting an XML document to a set of SQL tables.
According to a first aspect of the invention, there is provided a computer implemented method for converting an XML encoded dataset into a minimal set of SQL tables including the steps of:
identifying at least one hierarchical structure in said XML encoded dataset; and
converting an XML encoded dataset associated with each identified hierarchical structure, wherein for each identified hierarchical structure said converting step includes the further steps of:
    determining a node element set for said identified hierarchical structure of said XML encoded dataset, wherein each node element in said node element set is a discrete level of said identified hierarchical structure of said dataset;
    determining one or more nodes of said XML encoded dataset each node being an instance of a node element;
    allocating to each node a unique node identifier; and
    generating an SQL node table containing one or more records, each record corresponding to a respective one of said allocated node identifiers.
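The steps of the first aspect may be sketched, in a non-limiting way, as shown below. The sketch assumes a hypothetical XML fragment, a hypothetical node element set (`act`, `part`, `section`) and Python's standard `xml.etree.ElementTree` and `sqlite3` modules. The `parent_id` and `element` columns are assumptions added for illustration; the method as stated requires only that each record correspond to an allocated node identifier.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration; NOT the fragment of FIG. 1.
SAMPLE_XML = """
<act>
  <title>Example Act</title>
  <part>
    <section><para>First <b>bold</b> text.</para></section>
    <section><para>Second text.</para></section>
  </part>
</act>
"""

# Assumed node element set: one element name per discrete hierarchy level.
NODE_ELEMENTS = ("act", "part", "section")


def build_node_table(xml_text, node_elements=NODE_ELEMENTS):
    """Return an in-memory SQLite connection holding a single node table."""
    root = ET.fromstring(xml_text)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node ("
        " node_id INTEGER PRIMARY KEY,"  # unique node identifier
        " parent_id INTEGER,"            # assumed column: enclosing node, if any
        " element TEXT NOT NULL)"        # assumed column: node element name
    )
    next_id = 1

    def visit(elem, parent_id):
        nonlocal next_id
        if elem.tag in node_elements:
            # This element is a node: allocate an identifier and record it.
            node_id = next_id
            next_id += 1
            conn.execute(
                "INSERT INTO node VALUES (?, ?, ?)", (node_id, parent_id, elem.tag)
            )
            parent_id = node_id
        # Elements outside the node element set get no record of their own,
        # but their descendants are still searched for nodes.
        for child in elem:
            visit(child, parent_id)

    visit(root, None)
    conn.commit()
    return conn
```

Under these assumptions, elements such as `title`, `para` and `b` produce no records, so a single node table suffices for the hierarchy regardless of how many other element types the dataset contains.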
According to a second aspect of the invention, there is provided an apparatus for converting an XML encoded dataset into a minimal set of SQL tables, the apparatus including:
a device for identifying at least one hierarchical structure in the XML encoded dataset; and
a device for converting an XML encoded dataset associated with each identified hierarchical structure, the device including:
    a device for determining a node element set for the identified hierarchical structure of the XML encoded dataset, wherein each node element in the node element set is a discrete level of the identified hierarchical structure of the dataset;
    a device for determining one or more nodes of the XML encoded dataset, each node being an instance of a node element;
    a device for allocating to each node a unique node identifier; and
    a device for generating an SQL node table containing one or more records, each record corresponding to a respective one of the allocated node identifiers.
According to another aspect of the invention there is provided a computer program product including a computer readable medium having recorded thereon a computer program for implementing the method described above.