1. Field of the Invention
The present invention relates to a structured document searching apparatus that stores structured document data including hierarchized elements and searches through the structured document data in accordance with search criteria, as well as a method and a computer program product therefor.
2. Description of the Related Art
Several systems have been suggested for structured document management by which structured document data described in the Extensible Markup Language (XML) or the like is stored and searched. A system of the first type manages the structured document data as text files without making any changes. With this system, the data storing efficiency decreases as the number of data items and the size of the data increase. In addition, with this system, a search that takes full advantage of the structure of the documents becomes difficult. A system of the second type stores and manages the structured document data in an RDB (relational database). This system is widely used in backbone systems. A system of the third type adopts an object-oriented database (OODB), which has been developed for structured document data management. The database of this system adopts an extended RDB, such as an XML-compliant RDB. Because the data is stored in the RDB in the form of a flat table, complex mapping is required to associate the hierarchical structure of the XML data or the like with the table. Due to this mapping, the performance is lowered unless the structure (schema) of the table is well designed in advance. Recently, an alternative to the above three systems has been suggested. The system of the fourth type performs native structured document data management. In accordance with this system, XML data of various hierarchical structures is stored without any particular mapping, and thus overhead is reduced at the time of data storing or retrieving. Furthermore, costly preliminary schema designing becomes unnecessary, and the XML data structure can be freely modified in accordance with changes in the business environment.
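As an illustration of the mapping burden noted for RDB storage, the following minimal sketch flattens a small XML document into rows and then resolves one path. The row layout (node_id, parent_id, tag, text), the function names shred and find_by_path, and the sample document are assumptions made for this example; they do not come from any of the systems discussed.

```python
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Flatten an XML tree into (node_id, parent_id, tag, text) rows.

    The row layout is a hypothetical flat-table mapping chosen for illustration.
    """
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1
        rows.append((node_id, parent_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return rows

def find_by_path(rows, path):
    """Resolve an absolute path; each step acts like one self-join on parent_id."""
    steps = path.strip("/").split("/")
    current = [r for r in rows if r[1] is None and r[2] == steps[0]]
    for step in steps[1:]:
        parent_ids = {r[0] for r in current}
        current = [r for r in rows if r[1] in parent_ids and r[2] == step]
    return [r[3] for r in current]

doc = "<book><title>XML</title><author><name>Ann</name></author></book>"
rows = shred(doc)
names = find_by_path(rows, "/book/author/name")
```

Each additional path step costs another pass over the table in this sketch, which is one way to see why the table design matters for performance.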
Even when the structured document data is efficiently stored, the stored data is of little use without a means for retrieving it. A query language offers such a data retrieving means. In the field of RDB, it is the Structured Query Language (SQL). In the same manner, in the field of XML, the XML Query Language (XQuery) has been developed. XQuery is a language that treats XML data as a database. The language offers a means for retrieving a set of data items that match a criterion and for compiling and analyzing the data. In addition, because XML data has a hierarchical structure in which parent-child and sibling elements are arranged, a means for tracing this structure is also offered. Reference documents that disclose a technology of searching the stored structured document data, by tracing elements in the hierarchical structure, for structured document data that contains specific elements and a specific structure designated by search criteria include JP-A 2001-034618 (KOKAI) and JP-A 2000-057163 (KOKAI).
XML data has a problem, however, in that the hierarchical structure of the data, which contains parent-child and sibling elements, lowers the storage efficiency. Furthermore, as the structured document data becomes larger in structure, as the number of structured document data items stored in the database increases, or as the search criteria become more complex, it takes longer to trace the elements that constitute the hierarchical structure of the structured document data. In addition, as the number of structured document data items or the size of the data increases, it becomes more difficult to load the stored structured document data into the main memory, and thus in most cases the data has to be stored in a secondary storage device such as a hard disk. Especially in a system of native structured document data management, the structured document data is stored with the hierarchical structure of its elements intact. For this reason, when checking whether an element or structure designated as a search criterion is present, the elements of the structured document data stored in the secondary storage device have to be accessed frequently. The frequency of the accesses increases further as the search criteria become more complex. With a means of tracing the hierarchical structure as disclosed in JP-A 2001-034618 (KOKAI) and JP-A 2000-057163 (KOKAI), a structured document data item that contains an element or a structure designated by the search criteria is searched for by tracing the element data of the hierarchical structure of each structured document data item in the database. This prevents the search from being performed at high speed. Especially when the size of the structured document data or the number of search target structured document data items is large, or when the query data (search criteria) is complex, a high-speed search process becomes difficult. This is explained in more detail below.
(1) A complex XQuery includes multiple path patterns. When the multiple path patterns are checked, the same structured document data item is traversed repeatedly. Especially when the data is too large to be loaded into the memory, disk input/output to and from the same page occurs intermittently, and the performance deteriorates significantly.
(2) With an XPath, which is a subset of XQuery, the performance is lowered when the hit rate is high. If a large portion of the structured document set is traversed, a great amount of disk input/output is caused.
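The repeated traversal described in item (1) can be made concrete with a small sketch. It assumes a naive evaluator that walks the tree once per path pattern and counts element visits; the function name count_visits and the sample document are hypothetical and are not taken from the cited references.

```python
import xml.etree.ElementTree as ET

def count_visits(root, paths):
    """Evaluate each absolute path independently and count element visits."""
    visits = 0

    def walk(elem, steps):
        nonlocal visits
        visits += 1                      # every inspected element counts as one access
        if elem.tag != steps[0]:
            return 0
        if len(steps) == 1:
            return 1
        return sum(walk(child, steps[1:]) for child in elem)

    hits = {p: walk(root, p.strip("/").split("/")) for p in paths}
    return hits, visits

doc = ("<lib><book><title>A</title><year>1</year></book>"
       "<book><title>B</title><year>2</year></book></lib>")
root = ET.fromstring(doc)
element_count = sum(1 for _ in root.iter())   # 7 elements in the sample document
hits, visits = count_visits(root, ["/lib/book/title", "/lib/book/year"])
# With two path patterns, the 7 elements are visited 14 times in total.
```

When the elements reside on disk rather than in memory, each of these repeated visits may translate into a page access, which is the cost the passage above points to.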
As a technique of reducing the scanning of the same structured document data item, a structured document stream process has been developed. For example, Y. Diao, P. Fischer, and M. J. Franklin, "YFilter: Efficient and Scalable Filtering of XML Documents", in the 18th International Conference on Data Engineering, San Jose, February 2002; and I. Avila-Campillo, D. Raven, T. Green, A. Gupta, Y. Kadiyska, M. Onizuka, and D. Suciu, "An XML Toolkit for Light-weight XML Stream Processing", 2002, disclose such a technique. According to these reference documents, a query such as an XPath is processed without storing the entire structured document data in the main memory. A system of processing a query by performing state transitions on multiple path patterns that appear in multiple XPaths is also suggested. In reality, however, the following problems arise.
(3) With an XPath whose hit rate is not high, the performance deteriorates. Because of the backtracking algorithm, overhead is increased in the CPU processing. Due to the characteristics of the processing, a query that uses an index is difficult to process.
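The stream-processing approach discussed above, in which several path patterns are checked during a single scan, can be sketched as follows. This is a deliberately simplified illustration that keeps a stack of open element names and looks up the current path in a set; it is not the automaton used in the cited references, and the function name stream_match and the sample document are assumptions.

```python
import io
import xml.etree.ElementTree as ET

def stream_match(xml_bytes, paths):
    """Count matches for several absolute paths in one pass over the input."""
    wanted = set(paths)
    counts = {p: 0 for p in paths}
    stack = []                                   # names of the currently open elements
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            stack.append(elem.tag)
            current = "/" + "/".join(stack)
            if current in wanted:
                counts[current] += 1
        else:
            stack.pop()
            elem.clear()                         # discard content that was already scanned
    return counts

doc = (b"<lib><book><title>A</title><year>1</year></book>"
       b"<book><title>B</title><year>2</year></book></lib>")
counts = stream_match(doc, ["/lib/book/title", "/lib/book/year"])
```

Because every element is examined exactly once regardless of how many patterns are registered, the scan cost does not grow with the number of path patterns in this sketch. It also shows why an index is hard to exploit: the whole input is read even when few elements match.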
As discussed above, it is difficult to process multiple path patterns in a database that holds the structured document data with the minimum disk input/output and with a small amount of calculation. A technology developed in light of the above problems is disclosed in JP-A 2007-226452 (KOKAI). With this technology, the syntax of the structured document data is analyzed, and structural information included in the structured document data is stored after being converted, by use of structure guide data, to structure stream data that is one-dimensional array data. In this manner, the structured document data can be compressed to about 1/20 the size of the original, and thereby the disk input/output can be largely reduced. This increases the storage efficiency of the database. The technology of JP-A 2007-226452 (KOKAI) does not use backtracking but repeats basic deterministic operations, which means that overhead is reduced in the CPU processing. As a result, the search process using query data such as a complex XQuery and multiple XPaths, which has been difficult to speed up, can be performed at dramatically enhanced speed. With the technology of JP-A 2007-226452 (KOKAI), the structural data and text data are persisted under a concept of streams, while the order of elements is maintained. The ordered structural data can be easily compressed and encoded, and therefore higher speed and lighter weight are expected.
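The general idea of converting structural information into a one-dimensional array by use of a structure guide can be sketched as below. The encoding shown here is an assumption made purely for illustration (each distinct root-to-element path receives an integer ID in the guide, and the stream lists those IDs in document order); the actual format of JP-A 2007-226452 (KOKAI) is not described in this section and may differ. The names encode and count_path are hypothetical.

```python
import xml.etree.ElementTree as ET

def encode(xml_text):
    """Return (guide, stream): path -> ID mapping and document-order list of IDs."""
    guide = {}     # assumed structure guide: one entry per distinct path
    stream = []    # assumed structure stream: one integer per element

    def visit(elem, prefix):
        path = prefix + "/" + elem.tag
        path_id = guide.setdefault(path, len(guide))
        stream.append(path_id)
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")
    return guide, stream

def count_path(guide, stream, path):
    """Answer a path query by a linear scan of the array, with no tree traversal."""
    path_id = guide.get(path)
    return 0 if path_id is None else stream.count(path_id)

doc = ("<lib><book><title>A</title><year>1</year></book>"
       "<book><title>B</title><year>2</year></book></lib>")
guide, stream = encode(doc)
title_hits = count_path(guide, stream, "/lib/book/title")
```

In this sketch the structure of the sample document reduces to seven small integers, and a path lookup becomes a sequential scan without backtracking, which is consistent with the compression and CPU benefits the passage attributes to a stream representation.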
To process an XQuery at high speed, the scanning range of the text index and the XML data should be narrowed down as much as possible by use of text conditions and structural conditions. However, with an in-line relay such as in JP-A 2007-226452 (KOKAI), it is difficult to narrow down the scanning range of the text index by use of a structural condition, and therefore all the text indexes related to a text condition need to be scanned. This may increase the disk input/output cost. In addition, when the hit rate of the text index is high, a large amount of intermediate data needs to be held, which may increase the memory cost.