The Extensible Markup Language (XML) has become a popular and useful means of representing and exchanging information of any kind, for example among computer program applications and services, because XML data is self-descriptive (i.e., it contains tags along with data). Consequently, effective and efficient storage and manipulation of XML data have likewise become useful and necessary. Thus, some databases have been augmented to support storing, manipulating, and accessing XML data.
In recent years, many database systems have come to support the storage and querying of XML data. Though there are many evolving standards for querying XML, all of them include some variation of XPath. However, database systems are usually not optimized to handle XPath queries, and their query performance leaves much to be desired. A mechanism for indexing paths, values and order information in XML documents is described in U.S. patent application Ser. No. 10/884,311 filed by SIVASANKARAN CHANDRASEKARAN et al., entitled “INDEX FOR ACCESSING XML DATA” (“the Chandra application”), the entire content of which is incorporated by reference for all purposes as if fully disclosed herein. Such an index can be large, however, and should be loaded efficiently, especially when the set of XML documents being indexed is also large. For example, it is not uncommon for an XML database to store and manage millions of XML documents whose sizes could be on the order of megabytes.
With most database systems, the loading of these indexes is not optimized to take advantage of parallelism techniques. As a result, node information is extracted from the XML documents and the XML index is populated in a serial fashion, an approach that does not scale well when the document set is large. Even in systems that do take limited advantage of parallelism techniques, there is no parallelism among different XML nodes within the same XML document, which matters for large XML documents. For example, various parallelism techniques may be employed to parallelize scanning the base data structures in which XML documents are stored (e.g., as part of the base structure creation process) and to parallelize inserting entries into the XML index (e.g., as part of the index creation process). However, such approaches are limited in scope and effectiveness because they still do not completely overcome processing bottlenecks in the index loading procedure.
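For illustration only, the serial loading pattern described above might be sketched in Python as follows. The function names, the tuple layout, and the dotted order key are assumptions made for this example; they are not taken from the Chandra application or from any particular database system.

```python
import xml.etree.ElementTree as ET


def extract_entries(doc_id, xml_text):
    """Yield one (doc_id, path, order_key, value) tuple per element.

    The tuple layout and the dotted order key (e.g. "1.2.1") are
    hypothetical, chosen only to show path, order, and value information.
    """
    root = ET.fromstring(xml_text)

    def walk(elem, path, order_key):
        # Record this node, then visit its children in document order.
        value = (elem.text or "").strip()
        yield (doc_id, path, order_key, value)
        for position, child in enumerate(elem, start=1):
            yield from walk(child, f"{path}/{child.tag}", f"{order_key}.{position}")

    yield from walk(root, f"/{root.tag}", "1")


def load_index_serially(documents):
    """Populate an in-memory stand-in for the index, one node at a time."""
    index = []
    for doc_id, xml_text in documents:  # documents are processed one after another
        for entry in extract_entries(doc_id, xml_text):  # and so are their nodes
            index.append(entry)
    return index


documents = [
    (1, "<po><id>7</id><item><name>pen</name></item></po>"),
    (2, "<po><id>8</id></po>"),
]
index = load_index_serially(documents)
for entry in index:
    print(entry)
```

In this sketch, total loading time grows with the total number of nodes across all documents, and the nodes of a single large document are never processed concurrently.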
Hence, based on the foregoing, there is a need for techniques for efficiently and scalably loading XML indexes.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.