1. Field of the Invention
The present invention relates to an index structure that allows users to efficiently retrieve data represented in Extensible Markup Language (XML) (see T. Bray, J. Paoli, C. M. Sperberg-McQueen, and E. Maler. Extensible Markup Language (XML) 1.0 (second edition). W3C), which is the de facto standard for the representation and exchange of data on the Internet, and to a method of managing the index structure. More particularly, the invention relates to a method of processing XML queries using an adaptive path index that exploits frequently used paths, whereby query performance for XML data can be significantly improved and storage space can be reduced compared with conventional path indexes.
2. Description of the Prior Art
The emergence of the Internet has dramatically increased the amount of data of all kinds available electronically. To overcome the limitations of HyperText Markup Language (HTML) and to reduce the complexity of Standard Generalized Markup Language (SGML), Extensible Markup Language (XML) was proposed by the World Wide Web Consortium (W3C) in 1998.
XML can describe a wide range of data, from regular to irregular, from flat to deeply nested, and from tree-shaped to graph-shaped structures. Owing to this flexibility, XML is rapidly emerging as the de facto standard for the representation and exchange of data in next-generation Web applications.
Various query languages for retrieving data represented in XML have been proposed. XML query languages, such as XPath (see J. Clark and S. DeRose. XML path language (XPath) version 1.0. W3C Recommendation) and XQuery (see S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu. XQuery 1.0: An XML query language. Working draft), use path expressions to traverse an irregular structure consisting of XML elements. Thus, the navigation of irregularly structured XML data is one of the essential components of XML query processing. Since the elements may be scattered across different locations on disk, such navigation can significantly degrade the performance of XML query processing. Furthermore, when an XML query containing a partial matching path expression is processed, all elements constituting the XML data must be traversed, which is very inefficient.
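By way of illustration only, the cost of evaluating a partial matching path expression without an index can be sketched as follows. The sample document, the function name descendants_named, and the visit counter are assumptions introduced for this sketch and do not appear in the cited works; the sketch evaluates an expression of the form //author by visiting every element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document used only for this illustration.
doc = ET.fromstring(
    "<lib>"
    "<book><title>XML</title><author>Kim</author></book>"
    "<article><author>Lee</author></article>"
    "</lib>"
)

def descendants_named(root, tag):
    """Evaluate a '//tag' style expression by visiting every element.

    Returns the matching elements in document order together with the
    number of elements that had to be visited.
    """
    visited = 0
    matches = []
    stack = [root]
    while stack:
        elem = stack.pop()
        visited += 1
        if elem.tag == tag:
            matches.append(elem)
        # Push children in reverse so they are popped in document order.
        stack.extend(reversed(list(elem)))
    return matches, visited

matches, visited = descendants_named(doc, "author")
print([m.text for m in matches], visited)
```

In this sketch every one of the six elements is visited even though only two of them match, which is the inefficiency described above.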
Meanwhile, a structural summary and a path index speed up XML query evaluation by allowing only the relevant portions of XML data to be traversed for a given path expression. As a result, structural summary extraction and path index generation techniques have received considerable attention as schemes for improving XML query performance. Goldman and Widom developed a path index called “strong DataGuide” (see R. Goldman and J. Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. VLDB 1997). This index was proposed to extract the structural information of semistructured data such as XML data, and records each simple path starting from a root element exactly once, without duplication. The algorithm for building the strong DataGuide emulates the algorithm for converting a Non-deterministic Finite Automaton (NFA) into a Deterministic Finite Automaton (DFA) (see J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. 1979). Accordingly, this scheme is problematic in that, where the given XML data has a complicated graph structure, the strong DataGuide may become larger than the original XML data.
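As a minimal sketch of the NFA-to-DFA style construction mentioned above, and not the implementation of Goldman and Widom, the following groups data nodes into target sets by following identically labeled edges. The edge-list representation, the function name build_strong_dataguide, and the sample data are assumptions made for illustration.

```python
from collections import deque

def build_strong_dataguide(edges, root):
    """Subset-construction style sketch of a strong DataGuide.

    edges maps a data node to a list of (label, child) pairs.
    Each guide node is the frozenset of data nodes reachable by one
    label path from the root, so each label path is recorded once.
    """
    start = frozenset([root])
    guide = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in guide:
            continue
        by_label = {}
        for node in state:
            for label, child in edges.get(node, []):
                by_label.setdefault(label, set()).add(child)
        guide[state] = {lbl: frozenset(t) for lbl, t in by_label.items()}
        for target in guide[state].values():
            if target not in guide:
                queue.append(target)
    return start, guide

# Hypothetical data graph: two books share the label path "book".
edges = {
    "r": [("book", "b1"), ("book", "b2")],
    "b1": [("title", "t1"), ("author", "a1")],
    "b2": [("title", "t2")],
}
start, guide = build_strong_dataguide(edges, "r")
books = guide[start]["book"]
print(sorted(books), sorted(guide[books]))
```

Because guide nodes are sets of data nodes, a data graph with many shared and cyclic references can, in the worst case, yield many distinct sets, which is the source of the size problem noted above.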
Milo and Suciu proposed another index called “1-Index” (see T. Milo and D. Suciu. Index structures for path expressions. ICDT. 1999). 1-Index is identical to strong DataGuide in that information on all paths is maintained. 1-Index generates indexes using backward bisimulation and backward simulation, which originate from the field of graph verification. In this method, two nodes are merged into a single node when the two path sets starting from the two nodes are identical to each other. Unlike strong DataGuide, this method yields a graph resembling a non-deterministic finite automaton. However, when the given input graph takes the form of a tree, 1-Index and strong DataGuide are identical to each other. Accordingly, 1-Index can be considered a non-deterministic variant of strong DataGuide.
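The node-merging idea can be illustrated with a naive partition refinement. This is a simplified sketch under stated assumptions rather than the algorithm of Milo and Suciu: it assumes that nodes are grouped by backward bisimilarity, that is, two nodes share a block when they have the same label and their parents fall into the same blocks. The names backward_bisimulation, labels, and parents, and the sample tree, are hypothetical.

```python
def backward_bisimulation(labels, parents):
    """Naive partition refinement sketch.

    labels maps node -> tag; parents maps node -> iterable of parents.
    Returns a dict node -> block id; nodes with equal ids are merged
    into one index node.
    """
    block = dict(labels)  # initial partition: group by label only
    while True:
        # Signature: own label plus the set of blocks of the parents.
        sig = {
            n: (labels[n], frozenset(block[p] for p in parents.get(n, ())))
            for n in labels
        }
        ids = {}
        new_block = {n: ids.setdefault(sig[n], len(ids)) for n in labels}
        # Each round only splits blocks, so an unchanged block count
        # means the partition is stable.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block

# Hypothetical tree-shaped data.
labels = {"r": "lib", "b1": "book", "b2": "book", "ar": "article",
          "t1": "title", "t2": "title", "a1": "author", "a3": "author"}
parents = {"b1": ["r"], "b2": ["r"], "ar": ["r"],
           "t1": ["b1"], "t2": ["b2"], "a1": ["b1"], "a3": ["ar"]}
blocks = backward_bisimulation(labels, parents)
print(len(set(blocks.values())))
```

On this tree-shaped input the resulting blocks correspond one-to-one with the distinct label paths from the root, which is consistent with the observation that 1-Index and strong DataGuide coincide on trees.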
In the field of object-oriented databases, the Access Support Relation (ASR) has been used to support frequently used reference chains between two object instances (see A. Kemper and G. Moerkotte. Access support relations: An indexing method for object bases. Information Systems, 17(2): 117-145, 1992). However, although ASR materializes reference chains of arbitrary length, it is problematic in that it can support only a predefined subset of paths.
Cooper et al. proposed Index Fabric, which is conceptually similar to strong DataGuide in that all paths starting from a root element are maintained (see B. Cooper, N. Sample, M. J. Franklin, G. R. Hjaltason, and M. Shadmon. A fast index for semistructured data. VLDB. 2001). Index Fabric encodes the path of each element having a data value as a string, and maintains the resulting strings using a string index such as a Patricia trie. However, Index Fabric is disadvantageous in that the parent-child relationships among elements are not maintained. Accordingly, Index Fabric is ineffective for partial matching path expressions, which depend on parent-child relationships.
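The path-encoding approach can be sketched as follows, with the caveat that the encoding shown is an assumption for illustration and not the actual Index Fabric format. A sorted list searched with the standard bisect module stands in for the Patricia trie, and the names encode, prefix_lookup, and partial_lookup are hypothetical.

```python
import bisect

def encode(path, value):
    """Encode a root-to-leaf path and its data value as one string key."""
    return "/".join(path) + "=" + value

# Hypothetical sample data; the sorted list stands in for a string index.
keys = sorted(
    encode(p, v)
    for p, v in [
        (("lib", "book", "title"), "XML"),
        (("lib", "book", "author"), "Kim"),
        (("lib", "article", "author"), "Lee"),
    ]
)

def prefix_lookup(keys, prefix):
    """Full-path query: a prefix search touching only matching keys."""
    lo = bisect.bisect_left(keys, prefix)
    out = []
    for key in keys[lo:]:
        if not key.startswith(prefix):
            break
        out.append(key)
    return out

def partial_lookup(keys, tag):
    """Partial matching query such as '//tag': every key is scanned,
    because the string keys carry no separate element hierarchy."""
    return [k for k in keys if k.rsplit("=", 1)[0].split("/")[-1] == tag]

print(prefix_lookup(keys, "lib/book/"))
print(partial_lookup(keys, "author"))
```

A query that starts at the root reduces to a prefix search, whereas a query that starts in the middle of a path cannot use the string ordering and falls back to scanning all keys in this sketch.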
Many queries on XML data contain partial matching path expressions, either because users of XML data do not know the structure of the data or because they intentionally use a partial matching path expression to obtain the desired results. The above-described XML path indexes, such as strong DataGuide, 1-Index, and Index Fabric, record all paths starting from a root element, so the index must be exhaustively traversed to process such expressions, which degrades performance. Furthermore, these path indexes are generated from the data alone. Thus, they do not take advantage of the query workload to process frequently used path expressions effectively.