This application claims the priority of Korean Patent Application No. 10-2002-0065026 filed on Oct. 23, 2002, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Invention
The present invention relates to a method for searching XML data, and more particularly, to a query processing method for searching XML data by which performance of XML query processing can be improved by use of equivalence classes and a path expression reduction algorithm.
2. Description of the Related Art
The Extensible Markup Language (XML) is a markup language adopted as a recommendation by the W3C (World Wide Web Consortium) in 1998 to compensate for weaknesses of HTML (HyperText Markup Language) (See XML 1.0 W3C Recommendation, Feb. 10, 1998). As shown in FIG. 4, when a document is defined in XML format, the document structure, content and output format thereof are separated, so that XML can provide characteristics related to document structuring, such as reusability of the document structure, flexibility of the output format, and a search function for the document structure.
As a result, the spread of XML documents will likely accelerate in the future. Accordingly, research for efficiently storing, searching and retrieving information in documents using information in the XML document structure is in progress.
FIG. 5a shows an example of an XML document. The document shown in FIG. 5a is constructed in the form of a tree as shown in FIG. 5b. Thus, it is possible to search or retrieve the document structure. To support the search for XML documents, a database (for example, a Relational DBMS, an Object-Oriented DBMS, or an XML-dedicated DBMS) is required. In the related art, an XML query language is used for searching or retrieving XML data.
The XML query languages include, for example, XQL (XML Query Language) and XQuery (A Query Language for XML). Each XML query language uses a specification called XPath (XML Path Language), which indicates paths of elements or text in the XML document, when representing a query. In the XPath specification, relationships between elements or nodes are represented using the operators ‘/’ and ‘//’, in which the ‘/’ operator represents a child node of a specific element and the ‘//’ operator represents the specific element itself or its descendant node.
In the example of the XML document shown in FIG. 5a, if the query “A//B” is performed, respective nodes of the tree shown in FIG. 5b are traversed to search for nodes having a B element among descendant nodes of an A element. Likewise, if the query “A/C” is performed, respective nodes of the tree are traversed to search for nodes having a C element among child nodes of an A element. Herein, “A//B” or “A/C” is defined as a path expression contained in a query.
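The difference between the ‘/’ and ‘//’ operators described above may be illustrated by the following minimal Python sketch. The sample document and the helper names children and descendants are assumptions introduced only for illustration; the sample is not the document of FIG. 5a.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (NOT the document of FIG. 5a).
DOC = "<A><B><A><C>deep</C></A></B><C>shallow</C></A>"

root = ET.fromstring(DOC)  # root is the outermost A element


def children(node, tag):
    """'/' step: direct child elements of node that are named tag."""
    return [child for child in node if child.tag == tag]


def descendants(node, tag):
    """'//' step: elements named tag at any depth below node."""
    found = []
    for child in node:
        if child.tag == tag:
            found.append(child)
        found.extend(descendants(child, tag))
    return found


# "A/C": only C elements that are children of the outermost A.
a_child_c = [c.text for c in children(root, "C")]
# "A//C": C elements anywhere below the outermost A, in document order.
a_desc_c = [c.text for c in descendants(root, "C")]
# "A//B": B elements anywhere below the outermost A.
a_desc_b = len(descendants(root, "B"))
```

In this sketch both steps are evaluated by traversing the tree in memory, whereas a database-backed system evaluates each step against stored element data.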
Specifically, in a method in which XML documents are divided and stored according to the respective nodes of the elements as shown in FIG. 6, processing costs increase in proportion to the number of nodes on the path expression (i.e. the length of the path expression). In other words, to search for a specific element, complex join operations for joining data from two or more respective tables in which the elements are stored are required. Furthermore, the join operations are expensive and significantly impact the performance of the database. Therefore, reducing the number of join operations as far as possible is useful in improving system performance.
For example, the path expression /A/B/A/C may be represented as other path expressions, //B/A/C, /A//B//C and //B//C, which designate the same node using the operators ‘/’ or ‘//’. When processing each of the path expressions, the path expression /A/B/A/C requires three join operations, as described in (((A⋈B)⋈A)⋈C), and the path expression //B/A/C requires two join operations, as described in ((B⋈A)⋈C). On the other hand, the path expression //B//C requires only one join operation, as described in (B⋈C). Thus, the expression //B//C reduces costs for path expression processing as compared with the cases where two or three join operations are required. Herein, it should be noted that ⋈ is a symbol indicating a join operation.
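The relationship between the length of a path expression and the number of join operations may be sketched as follows, assuming that one join operation is required for each adjacent pair of elements on the path. The function names steps, join_count and join_plan are hypothetical, and ⋈ denotes a join operation.

```python
import re


def steps(path):
    """Split a path expression into (operator, element) pairs."""
    # '//' is listed first so that it is not read as two '/' operators.
    return re.findall(r"(//|/)([^/]+)", path)


def join_count(path):
    """Assumed cost model: one join per adjacent pair of elements."""
    return max(len(steps(path)) - 1, 0)


def join_plan(path):
    """Render the left-deep join order, e.g. ((B⋈A)⋈C)."""
    names = [name for _, name in steps(path)]
    if not names:
        return ""
    plan = names[0]
    for name in names[1:]:
        plan = f"({plan}⋈{name})"
    return plan


# Join counts for the three path expressions discussed above.
counts = {p: join_count(p) for p in ("/A/B/A/C", "//B/A/C", "//B//C")}
```

The sketch only counts join operations; it does not decide whether two path expressions designate the same node, which depends on the structure of the document.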
As described above, it will be understood that the processing of path expressions for XML data is very important when queries are performed. Related art for processing path expressions is disclosed in the following technical documents:
1. C. Zhang, J. Naughton, D. Dewitt, Q. Luo, and G. Lohman, “On supporting containment queries in relational database management systems”, In Proceedings of 2001 ACM SIGMOD conference, Santa Barbara, Calif., 2001;
2. Quanzhong Li, Bongki Moon, “Indexing and querying XML data for regular path expressions”, In Proceedings of 2001 VLDB conference, pp. 361–370, Roma, Italy, 2001; and
3. Divesh Srivastava, Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, Jignesh M. Patel, Yuqing Wu, “Structural Joins: A Primitive for Efficient XML Query Pattern Matching”, In Proceedings of 2002 IEEE Conference on Data Engineering (ICDE), San Jose, Calif., 2002.
To support the search function for large volumes of XML documents, the prior technical documents 1 and 2 parse an XML document, store each element of the XML document in a database in the form of a tuple serving as an independent unit, and process the path expression using the containment relationships between the elements. In addition, the prior technical document 3 proposes improved join algorithms for efficiently performing path join operations based on the containment relationship, where the join operations are performed by tree-merge join and stack-tree algorithms.
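A join based on the containment relationship may be sketched as follows. The sketch assumes that each element is stored as a tuple holding a start position, an end position and a nesting level, and that one element contains another when its positions enclose those of the other. The names Region, number_regions, table and containment_join and the sample document are hypothetical, and the simple nested-loop join shown here stands in for, and is not, the tree-merge join or stack-tree algorithms.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Assumed tuple layout for one stored element.
Region = namedtuple("Region", "tag start end level")


def number_regions(xml_text):
    """Assign (start, end, level) to every element by a depth-first walk."""
    counter = 0
    regions = []

    def walk(node, level):
        nonlocal counter
        counter += 1
        start = counter
        for child in node:
            walk(child, level + 1)
        counter += 1
        regions.append(Region(node.tag, start, counter, level))

    walk(ET.fromstring(xml_text), 1)
    return sorted(regions, key=lambda r: r.start)


def table(regions, tag):
    """Stand-in for the per-element table: all tuples with the given tag."""
    return [r for r in regions if r.tag == tag]


def containment_join(ancestors, descendants, child_only=False):
    """Nested-loop join: pair each ancestor with the elements it contains."""
    pairs = []
    for a in ancestors:
        for d in descendants:
            contained = a.start < d.start and d.end < a.end
            if contained and (not child_only or d.level == a.level + 1):
                pairs.append((a, d))
    return pairs


# Hypothetical sample document (not taken from the figures).
regions = number_regions("<A><B><A><C/></A></B><C/></A>")
b_desc_c = containment_join(table(regions, "B"), table(regions, "C"))
a_child_c = containment_join(table(regions, "A"), table(regions, "C"),
                             child_only=True)
```

Under these assumptions, b_desc_c corresponds to the path expression B//C and a_child_c corresponds to A/C; each additional step on a path expression would require a further call to containment_join.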
The prior technical documents have laid the groundwork for techniques of processing the path expressions using the join operations. However, the related art has a problem in that search performance can be critically degraded, since more join operations must be performed as the path expressions become longer or more complex.