1. Field of the Invention
The present invention relates to creating and loading data warehouses from a semi-structured document.
2. Description of the Related Art
A data warehouse can contain data from documents that include a vast quantity of structured data. When a user wants to create and load a data warehouse, the user accesses an initial set of data in a structured format, e.g., XML. Often, a single XML file is actually a collection of several individual documents containing the data that the user must process and store. For example, a single XML file might contain all of the patents filed in 1994. Within this XML file might be sub-documents that represent the patents themselves.
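By way of non-limiting illustration, the arrangement of sub-documents within a single XML file can be sketched as follows. The element names (`patents`, `patent`, `number`, `title`), the sample values, and the helper `iter_subdocuments` are hypothetical and are assumed here only for the example; the sketch streams through the file and visits one sub-document at a time rather than holding every sub-document's content in memory.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: one XML file holding several patent sub-documents.
SAMPLE = b"""<patents year="1994">
  <patent><number>X1</number><title>Widget</title></patent>
  <patent><number>X2</number><title>Gadget</title></patent>
</patents>"""


def iter_subdocuments(source, tag):
    """Yield each element named `tag` as soon as it has been fully parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Once the caller is done with it, drop the sub-document's
            # children and text so a large file does not accumulate in memory.
            elem.clear()


# Each sub-document is consumed before the generator resumes and clears it.
titles = [p.findtext("title") for p in iter_subdocuments(io.BytesIO(SAMPLE), "patent")]
print(titles)
```

In this sketch the caller must finish with each sub-document before requesting the next one, because the element is cleared as soon as iteration resumes.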
As recognized herein, an XML file ordinarily is accompanied by an XML Schema file or a DTD file explaining the XML structure. While these files are beneficial to have, they are often missing. Even with a Schema or DTD file, it is not a simple task to create and load a data warehouse having, e.g., a star schema. There are no tools that integrate creating a schema and “shredding” documents, i.e., populating the schema with data in the documents. This is especially true when no DTD or XML Schema is available.
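For purposes of illustration only, "shredding" into a small star schema might look like the following sketch. It is not the method of the invention; the table names (`patent_fact`, `assignee_dim`), the element names, the sample values, and the function `shred` are all assumptions made for the example, and the schema is written by hand here rather than derived from the documents.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input: element names and values are assumed for illustration.
SAMPLE = """<patents>
  <patent><number>X1</number><title>Widget</title><assignee>Acme</assignee></patent>
  <patent><number>X2</number><title>Gadget</title><assignee>Acme</assignee></patent>
  <patent><number>X3</number><title>Gizmo</title><assignee>Initech</assignee></patent>
</patents>"""


def shred(xml_text, conn):
    """Create one dimension table and one fact table, then populate both."""
    conn.execute(
        "CREATE TABLE assignee_dim ("
        "assignee_id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE patent_fact ("
        "number TEXT PRIMARY KEY, title TEXT, "
        "assignee_id INTEGER REFERENCES assignee_dim(assignee_id))"
    )
    for patent in ET.fromstring(xml_text).iter("patent"):
        name = patent.findtext("assignee")
        # Insert the dimension row only if this assignee has not been seen.
        conn.execute("INSERT OR IGNORE INTO assignee_dim (name) VALUES (?)", (name,))
        (assignee_id,) = conn.execute(
            "SELECT assignee_id FROM assignee_dim WHERE name = ?", (name,)
        ).fetchone()
        # The fact row refers to the dimension row by its surrogate key.
        conn.execute(
            "INSERT INTO patent_fact VALUES (?, ?, ?)",
            (patent.findtext("number"), patent.findtext("title"), assignee_id),
        )


conn = sqlite3.connect(":memory:")
shred(SAMPLE, conn)
rows = conn.execute(
    "SELECT d.name, COUNT(*) FROM patent_fact f "
    "JOIN assignee_dim d USING (assignee_id) "
    "GROUP BY d.name ORDER BY d.name"
).fetchall()
print(rows)
```

The final query is the kind of aggregate an analysis tool might issue once the documents have been shredded into relational tables.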
Current solutions to the above problem of loading a data warehouse with documents whose structure might not be known create a new data type for XML and allow users to execute XQuery (or a similar query language) over that data type. As understood herein, this approach has performance drawbacks, particularly when, instead of many small files, a large file must be loaded into the data warehouse. For example, in the case of a single large XML file containing all the issued patents in a given year, and thus containing data on which the user might want to operate, e.g., by using an online analytical processing (OLAP) tool, the above-summarized native data type approach is not sufficient.
Another problem that arises from working with semi-structured files like XML is that two files about the same subject might have somewhat different structures. Typically this is handled by reformatting the files into a standard format. However, this plainly entails effort on the part of the user and, hence, is less than optimal.
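As a hypothetical illustration of this structural divergence, the following sketch compares the element paths found in two small documents about the same subject. The documents, the element names, and the helper `element_paths` are assumed for the example only.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the set of slash-separated element paths present in a document."""
    paths = set()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


# Two hypothetical files on the same subject with different nesting.
FILE_A = "<patent><title>Widget</title><inventor>Ann</inventor></patent>"
FILE_B = (
    "<patent><title>Gadget</title>"
    "<inventors><inventor>Bob</inventor></inventors></patent>"
)

only_in_a = element_paths(FILE_A) - element_paths(FILE_B)
only_in_b = element_paths(FILE_B) - element_paths(FILE_A)
print(sorted(only_in_a))
print(sorted(only_in_b))
```

Here both files describe an inventor, yet the paths differ, so a loader that expects one layout would miss the data in the other unless the files are first reformatted.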
Accordingly, as understood herein, it would be beneficial to provide a user with the ability to create a data warehouse schema, build the needed tables, and load those tables into a data warehouse from one or more XML documents of any size, where the structure of the given files may or may not be known in advance. It would be desirable to accomplish this in relatively few, relatively simple steps without requiring excessive reading of the XML files.