1. Field of the Invention
The present invention is directed to parallel processing of input data and serial loading of the processed hierarchical data.
2. Description of the Related Art
Data in Extensible Markup Language (XML) format (“XML data”) allows applications to exchange data, for example, for integration purposes. Frequently, there is a need for XML data to be stored in relational tables in databases (i.e., an example of a data store). Because parsing XML data and extracting data elements is resource intensive, any loading of XML data may be very slow. In many cases, the parsing of XML data causes a bottleneck when a high volume of XML data needs to be processed.
Current database systems and applications rely on external scripts and/or Extract, Transform and Load (ETL) programs to load XML data into a database. Unfortunately, external scripts and ETL programs are unable to interact with the core database system through which the database may be accessed. Therefore, the external scripts and ETL programs work through established programming interfaces to load data into the database. These external programs are also serial processors of XML data, and hence are likely to show poor (e.g., slow) performance in high volume situations. Moreover, current solutions are unlikely to be easily customizable. In addition, current solutions typically provide minimal support for error correction and restartability of operations. The fact that XML parsing tends to be resource (e.g., time and memory) consuming adds to the poor performance of current solutions, especially in bulk data load situations.
A “shredding” process is a process of identifying data elements present in XML data and of assembling the data elements into flat tuples (i.e., “records” or “rows”) that may be inserted into a table of a database. Current solutions use client side shredding processes, which have poor performance (e.g., they are very slow). The current solutions generally involve the generation of Structured Query Language (SQL) INSERT statements by client programs, and then the SQL INSERT statements are executed to insert tuples into a database through a client Application Programming Interface (API), such as JAVA® Database Connectivity (JDBC) or Open Database Connectivity (ODBC).
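By way of illustration only, the client side shredding approach described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular product: the XML document, the element names, and the `orders` table are hypothetical, and the Python standard library `sqlite3` module stands in for a client API such as JDBC or ODBC.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document; element and attribute names are illustrative only.
XML_DATA = """
<orders>
  <order id="1"><customer>Ann</customer><total>12.50</total></order>
  <order id="2"><customer>Bob</customer><total>7.25</total></order>
</orders>
"""

def shred(xml_text):
    """Identify data elements in the XML data and assemble them into flat tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.findall("order"):
        rows.append((
            int(order.get("id")),
            order.findtext("customer"),
            float(order.findtext("total")),
        ))
    return rows

def load(conn, rows):
    """Insert the tuples with SQL INSERT statements issued through the client API."""
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

rows = shred(XML_DATA)
conn = sqlite3.connect(":memory:")  # stand-in for a JDBC/ODBC connection
load(conn, rows)
```

In this sketch the document is parsed and the rows are inserted serially by a single client, which is the pattern the passage above identifies as slow for high volumes of XML data.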
In UPSERT type operations, SQL generation becomes more difficult for client programs because the client programs have to query the database for an object's existence and then generate either an UPDATE SQL statement or an INSERT SQL statement. UPSERT operations may be performed to load data (“input rows”) into a table. In a typical UPSERT operation, when an input row matches a primary key of an existing row in a table, that input row is designated as an update row and is used to update the matched existing row, and when the input row has a new primary key, the input row is designated as an insert row and is inserted into the table. Again, these client side solutions do not work very well in bulk loads, especially with respect to error handling and load restartability.
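For illustration, a client side UPSERT of the kind described above may be sketched as follows. The sketch is hedged accordingly: the `orders` table, its columns, and the function name `client_upsert` are hypothetical, and the Python standard library `sqlite3` module again stands in for a client API such as JDBC or ODBC.

```python
import sqlite3

def client_upsert(conn, row):
    """Query for the primary key, then issue either an UPDATE or an INSERT."""
    key, customer, total = row
    exists = conn.execute(
        "SELECT 1 FROM orders WHERE id = ?", (key,)
    ).fetchone()
    if exists:
        # The input row matches an existing primary key: it is an update row.
        conn.execute(
            "UPDATE orders SET customer = ?, total = ? WHERE id = ?",
            (customer, total, key),
        )
        return "update"
    # The input row has a new primary key: it is an insert row.
    conn.execute(
        "INSERT INTO orders (id, customer, total) VALUES (?, ?, ?)", row
    )
    return "insert"

conn = sqlite3.connect(":memory:")  # stand-in for a JDBC/ODBC connection
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
conn.execute("INSERT INTO orders VALUES (1, 'Ann', 12.50)")

input_rows = [(1, "Ann", 15.00), (2, "Bob", 7.25)]
actions = [client_upsert(conn, r) for r in input_rows]
```

Each input row costs at least two statements in this sketch (one existence query plus one UPDATE or INSERT), which suggests why such client side approaches scale poorly in bulk loads.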
On the other hand, many database systems implement a special program referred to as a “database loader” to transfer large volumes of data into a database. For example, one loader program is a Red Brick® Table Management Utility (TMU) for the IBM® Red Brick® Warehouse, a relational database optimized for dimensional analysis. For more information on the TMU, see the IBM® RedBrick® Table Management Utility (TMU) Reference Guide Version 6.2 available from International Business Machines Corporation.
A typical database loader has knowledge of the internal structures of the database and has direct access to the physical storage areas of the database. A database loader typically allows data to be loaded into a database in flat or delimited formats. Delimited formats are those in which field values in each row of an input file are separated by a special character (e.g., ‘|’) and each row is separated by another special character (e.g., carriage return/line feed). Flat formats are those in which the field values are of exact lengths and, hence, the entire row is of an exact length. A database loader also provides other functionalities, such as duplicate handling, optimized index building, enforcing referential integrity, and maintaining materialized views defined on the table. These other functionalities are not easily available to existing client side solutions. Additionally, most database loaders also run in parallel configurations.
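The difference between the delimited and flat input formats described above may be illustrated with the following minimal sketch. The function names, the sample rows, and the field widths are hypothetical and are not taken from any particular database loader.

```python
def parse_delimited(line, delimiter="|"):
    """Split one row whose field values are separated by a special character."""
    return tuple(line.rstrip("\r\n").split(delimiter))

def parse_flat(line, widths):
    """Slice one row whose field values each occupy an exact, known length."""
    fields = []
    pos = 0
    for width in widths:
        fields.append(line[pos:pos + width].rstrip())  # drop padding spaces
        pos += width
    return tuple(fields)

# Delimited row: fields separated by '|', row ended by carriage return/line feed.
delimited_row = parse_delimited("1|Ann|12.50\r\n")

# Flat row: widths of 4, 10, and 5 characters, so every row is 19 characters long.
flat_line = "0001" + "Ann".ljust(10) + "12.50"
flat_row = parse_flat(flat_line, widths=(4, 10, 5))
```

Neither format requires the structural parsing that hierarchical XML data does, which is one reason such inputs are comparatively inexpensive for a loader to consume.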
Thus, there is a need in the art for improved loading of hierarchically structured data (e.g., XML data) into a database.