1. Technical Field
The present invention relates to finding partition boundaries in markup language documents and, more specifically, to finding such partition boundaries to achieve efficient parallel processing of the markup language documents for extract, transform, load (ETL) processes.
2. Discussion of the Related Art
Typically, large extensible markup language (XML) documents, which can be on the order of several gigabytes (GB) in size, are used to store information for further processing. The process for extracting data from an XML document for storing in a database (e.g., a data warehouse) is referred to as an extract, transform, load (ETL) process or ETL job. In particular, an ETL process involves extracting data from one or more sources, transforming the data to fit the operational needs of the database, and loading the data into the database.
The processing of such large XML documents can be very time consuming when carried out by a single processor. Parallel processing of a large XML document (i.e., simultaneous processing or processing in parallel of portions of an XML document by one or more processors) can be utilized to process the document more efficiently.
Two known techniques for parallel processing of large XML documents are: (1) direct splitting of an XML document into multiple parts at fixed locations; and (2) parsing an XML document (using either a full parse or a shallow parse) to determine appropriate partition points within the document, and then partitioning the XML document at such points for parallel processing of the partitioned portions.
The first technique becomes nonfunctional in scenarios in which an XML document has a character data (CDATA) section, a comment section, a nested node definition and/or some other section that must remain continuous and must not be split or partitioned. In particular, direct splitting of such an XML document at arbitrarily fixed locations (i.e., locations that have not been predetermined as appropriate partition points) can split a section that must remain continuous, which would result in incorrect or inaccurate processing of data or a failure to read the markup language in an ETL process.
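The failure mode of the first technique can be illustrated with a minimal Python sketch. The helper names (`fixed_split_offsets`, `cuts_inside_protected`) and the sample document are hypothetical and introduced only for illustration; the regular expression is a simplification that recognizes only CDATA sections and comments, not every construct that must remain continuous.

```python
import re

# Simplified pattern (an assumption for illustration): matches CDATA
# sections and comments only, scanning left to right.
PROTECTED = re.compile(r"<!\[CDATA\[.*?\]\]>|<!--.*?-->", re.DOTALL)


def fixed_split_offsets(length, parts):
    """Return the cut offsets that divide `length` characters into `parts` pieces."""
    return [length * i // parts for i in range(1, parts)]


def cuts_inside_protected(text, offsets):
    """Return the offsets that fall strictly inside a CDATA section or comment."""
    spans = [m.span() for m in PROTECTED.finditer(text)]
    return [o for o in offsets if any(start < o < end for start, end in spans)]


# Hypothetical sample document whose midpoint lies inside a CDATA section.
doc = "<r><a><![CDATA[" + "x < y & z " * 10 + "]]></a><b>ok</b></r>"
offsets = fixed_split_offsets(len(doc), 2)
bad = cuts_inside_protected(doc, offsets)
print(offsets, bad)
```

In this sample the single fixed cut lands inside the CDATA section, so the second portion would begin with raw text such as `x < y & z` that a parser would misread as markup once the `<![CDATA[` opener has been separated from it.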
The second technique requires parsing of the XML document in order to obtain precise and accurate partitions. Such parsing can be very time consuming, depending upon the size of the document, and this limits the benefits of parallel processing.
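The cost of the second technique can be seen in a minimal Python sketch that uses the standard library `xml.parsers.expat` module. The function name `child_start_offsets` and the choice of partition points (the start of each child of the root element) are assumptions made for illustration; the point is that the whole document must be read serially before any partition point is known.

```python
import xml.parsers.expat


def child_start_offsets(data):
    """Return the byte offset at which each child of the root element starts.

    The parser must consume the entire document to find these offsets,
    which is the serial cost attributed to the second technique.
    """
    offsets = []
    depth = 0
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        nonlocal depth
        if depth == 1:
            # CurrentByteIndex is the byte position of the current event,
            # here the "<" that opens the start tag.
            offsets.append(parser.CurrentByteIndex)
        depth += 1

    def on_end(name):
        nonlocal depth
        depth -= 1

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(data, True)
    return offsets


# Hypothetical sample: "<a>" also appears inside a comment and a CDATA
# section, where it must not be treated as a partition point.
sample = b"<r><a>1</a><!-- <a> --><a><![CDATA[<a>]]></a></r>"
points = child_start_offsets(sample)
print(points)
```

Because the parser tracks comments and CDATA sections, the occurrences of `<a>` inside them are not reported as partition points; this accuracy is obtained only by scanning the full document.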