The size of the extensible markup language (XML) documents used in applications is growing, and XML file sizes in the gigabyte (GB) range are fairly common. This includes data integration, where source administrators and external providers can generate large XML file sets in order to isolate and batch the processing of data. The wide availability of multi-core processors presents a natural setting for processing these large XML files in parallel.
Parallelism can be achieved via pipeline parallelism and partitioned parallelism. Pipeline parallelism occurs when different operators of an extract, transform, load (ETL) job work on different parts of an XML stream simultaneously. It is a natural technique whenever multiple operators in an ETL job operate on an XML document stream in a serial manner. Partitioned parallelism is achieved when multiple instances of the same operator of an ETL job work on different parts of an XML stream simultaneously. Each instance of the operator can run on a different processor. However, due to the hierarchical structure of XML, processing XML in parallel by partitioning is inherently a complex task.
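By way of illustration only, the two forms of parallelism can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular ETL product: the operator names `parse` and `transform`, the sample `RECORDS`, and the helper functions are all hypothetical. The fragments are assumed to be already split into independent records, which sidesteps the partitioning difficulty noted above, and Python threads are used only to show the structure of each scheme rather than to demonstrate a measured speed-up.

```python
import queue
import threading
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record fragments, assumed to be already split out of a larger document.
RECORDS = [f"<item><id>{i}</id><qty>{i * 2}</qty></item>" for i in range(6)]

def parse(fragment):
    """Operator 1 (hypothetical): parse one XML fragment into a row."""
    elem = ET.fromstring(fragment)
    return {"id": int(elem.findtext("id")), "qty": int(elem.findtext("qty"))}

def transform(row):
    """Operator 2 (hypothetical): derive an extra column from a row."""
    return {**row, "double_qty": row["qty"] * 2}

def run_pipeline(fragments):
    """Pipeline parallelism: two different operators run as concurrent stages."""
    handoff = queue.Queue()
    done = object()  # sentinel marking the end of the stream
    out = []

    def stage_parse():
        for fragment in fragments:
            handoff.put(parse(fragment))
        handoff.put(done)

    def stage_transform():
        while True:
            row = handoff.get()
            if row is done:
                break
            out.append(transform(row))

    stages = [threading.Thread(target=stage_parse),
              threading.Thread(target=stage_transform)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return out

def run_partitioned(fragments, workers=3):
    """Partitioned parallelism: several instances of the same operator chain,
    each handling a different part of the input."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order.
        return list(pool.map(lambda f: transform(parse(f)), fragments))
```

In this sketch both schemes produce the same rows; they differ only in how the work is divided, either between different operators (pipeline) or between copies of the same operator (partitioned).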
Additionally, shredding of large XML documents (one of the key operations of an ETL job) is a very slow and expensive operation. XML shredding is the process of relationalizing XML documents, for example, by taking data from XML documents and storing it in a relational database. Many existing approaches and/or products cannot scale to such large input data, and shredding of large documents is inherently a serial task. Schema validation of such large documents adds to the cost of shredding, and because shredding is typically the first step in an ETL job, large documents thereby affect the entire ETL process.
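As a concrete illustration of shredding, the following minimal sketch maps a small XML document onto two relational tables. The document layout, the table names `orders` and `order_lines`, and the function `shred` are assumptions made for this example only. The sketch parses the whole document in memory and performs no schema validation, so it shows the serial mapping step itself and not an approach that scales to gigabyte-sized input.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document used only for this example.
XML_DOC = """<orders>
  <order id="1"><customer>Ann</customer>
    <line sku="A1" qty="2"/><line sku="B7" qty="1"/></order>
  <order id="2"><customer>Bob</customer>
    <line sku="A1" qty="5"/></order>
</orders>"""

def shred(xml_text, conn):
    """Serially shred <order> elements into a parent table and a child table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER)")
    root = ET.fromstring(xml_text)
    for order in root.findall("order"):
        order_id = int(order.get("id"))
        conn.execute("INSERT INTO orders VALUES (?, ?)",
                     (order_id, order.findtext("customer")))
        # Nested elements become rows in a child table keyed by the parent id.
        for line in order.findall("line"):
            conn.execute("INSERT INTO order_lines VALUES (?, ?, ?)",
                         (order_id, line.get("sku"), int(line.get("qty"))))
    conn.commit()

conn = sqlite3.connect(":memory:")
shred(XML_DOC, conn)

# Once shredded, the hierarchical data can be queried relationally.
rows = conn.execute(
    "SELECT o.customer, SUM(l.qty) FROM orders o "
    "JOIN order_lines l ON o.order_id = l.order_id "
    "GROUP BY o.order_id, o.customer ORDER BY o.order_id"
).fetchall()
```

The parent-child nesting in the XML becomes a foreign-key relationship between the two tables, which is why each child row depends on context from its enclosing element and why the traversal is naturally sequential.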
Existing approaches do not provide techniques that enable the shredding and schema validation process to run in parallel. Existing approaches also do not provide a modified ETL job definition at job design time or a speed-up over serial execution.