1. Field of the Invention
The present invention relates to an apparatus, a program product and a method for structured document management that store and search for structured document data having a hierarchical logic structure.
2. Description of the Related Art
Several types of structured document management systems have been considered for storing and searching structured document data described in XML (Extensible Markup Language) or the like.
The first system manages structured document data directly as text files. The first system has the problem that storage efficiency deteriorates as the number and size of the data become large. Also, in the first system, a search that utilizes the properties of the structured documents is difficult.
The second system manages structured document data in an RDB (Relational Database). The second system is widely used in backbone systems and the like.
The third system manages structured document data using an OODB (Object Oriented Database) developed for managing structured document data. The third system is, for example, an XML-compliant RDB in which the RDB is extended.
Since the RDB stores data in the format of a flat table, a complicated mapping that relates the hierarchical structure of XML data to tables is necessary. Because of this mapping, performance deteriorates unless the table structure (schema) is designed adequately in advance.
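By way of a non-limiting illustration, the mapping burden described above may be sketched as follows. The sketch assumes a simple "edge table" layout of (node_id, parent_id, tag, text) rows; the layout, the function names `shred` and `path_of`, and the sample document are hypothetical and are not taken from any particular RDB product.

```python
import xml.etree.ElementTree as ET


def shred(xml_text):
    """Flatten an XML tree into rows of (node_id, parent_id, tag, text)."""
    root = ET.fromstring(xml_text)
    rows = []

    def visit(elem, parent_id):
        node_id = len(rows) + 1  # ids are assigned in document order
        rows.append((node_id, parent_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            visit(child, node_id)

    visit(root, None)
    return rows


def path_of(rows, node_id):
    """Rebuild the path of a node by following parent ids upward.

    In a relational store, each upward step corresponds to a self-join.
    """
    by_id = {row[0]: row for row in rows}
    parts = []
    while node_id is not None:
        _, parent_id, tag, _ = by_id[node_id]
        parts.append(tag)
        node_id = parent_id
    return "/" + "/".join(reversed(parts))


SAMPLE = "<book><title>XML</title><author><name>Ann</name></author></book>"
ROWS = shred(SAMPLE)
print(ROWS)
print(path_of(ROWS, 4))
```

In this sketch, recovering the path of a single element requires one lookup per level of the hierarchy, which suggests why a table design that does not anticipate the document structure can degrade performance.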
In recent years, therefore, a fourth system has been proposed as an alternative to the first to third systems. The fourth system manages structured document data natively. In the fourth system, since XML data having various hierarchical structures are stored without executing a special mapping process, no special overhead is present at the time of storage and acquisition. Also, a costly advance schema design is not necessary, and thus the structure of the XML data can be changed freely according to changes in the business environment.
Even when structured document data are stored efficiently, the storage has little value unless a means for fetching the stored data is provided. A query language is used as the means for fetching the stored data. Just as SQL (Structured Query Language) is used in the RDB world, XQuery (XML Query Language) has been established for XML. XQuery is a language for treating XML data like a database, and provides a means for fetching, aggregating and analyzing a data set that matches predetermined conditions.
Since XML data are described in a hierarchical structure in which parent-child and sibling relationships between elements are combined, a means for tracing this hierarchical structure is provided. A technique for searching for structured document data that include specific elements and a specific structure specified by search conditions, while tracing the hierarchical structure of the stored structured document data, is disclosed, for example, in JP-A 2001-034618 and 2000-057163 (Kokai).
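As a minimal sketch of what tracing the hierarchical structure involves, the following example follows a sequence of child steps from the root element. It is illustrative only and is not the technique of the cited publications; the function name `match_path`, the list-of-tags form of the search condition, and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET


def match_path(root, steps):
    """Return the elements reached by following child steps from the root.

    `steps` is a list of tag names, the first of which must match the root.
    """
    if not steps or root.tag != steps[0]:
        return []
    current = [root]
    for tag in steps[1:]:
        # Descend one level: keep every child whose tag matches this step.
        current = [child for node in current for child in node if child.tag == tag]
    return current


DOC = ET.fromstring(
    "<book>"
    "<author><name>Ann</name></author>"
    "<author><name>Bob</name></author>"
    "<title>XML</title>"
    "</book>"
)
NAMES = [elem.text for elem in match_path(DOC, ["book", "author", "name"])]
print(NAMES)
```

Every step of the search condition requires access to the child elements of the elements matched so far, so the cost of the trace grows with both the size of the document and the length of the path.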
Since XML data have a hierarchical structure in which parent-child and sibling relationships between elements are combined, however, the storage efficiency is low.
As the structure of the structured document data becomes larger, as the number of structured document data stored in the database increases, or as the search conditions become more complicated, it takes longer to execute the process of tracing between the elements that compose the hierarchical structures of the respective structured document data. When the number or size of the structured document data becomes larger, the stored structured document data cannot all be loaded into memory, and most of them are kept in secondary storage such as a hard disk.
Particularly, in the system that manages structured document data natively, the structured document data are stored with the hierarchical structure between the elements unchanged. For this reason, the elements of the structured document data stored in the secondary storage must be accessed frequently in order to check whether an element or structure specified by the search condition is present. When the search condition is complicated, the elements are accessed even more frequently.
That is, the hierarchical structure tracing means disclosed in JP-A 2001-034618 and 2000-057163 (Kokai) searches for the structured document data having the element and structure specified by the search condition while tracing the element data that compose the hierarchical structures of the respective structured document data in the database. For this reason, the search cannot be conducted at a high speed. Particularly, as the size of the structured document data becomes larger, as the number of structured document data to be searched increases, or as the query data (search condition) becomes more complicated, it becomes more difficult to speed up the search process. More concretely, the problems are as follows.
(1) A complicated XQuery query includes a plurality of path patterns. When the plural path patterns are verified, the same structured document is traversed repeatedly. Particularly, when large structured document data that cannot be held in memory are treated, disk I/O to the same page occurs repeatedly, and the performance deteriorates severely.
(2) In the case of XPath, which is a subset of XQuery, the performance deteriorates when the hit rate is high. That is, when most of the structured document collection must be traversed, a large amount of disk input/output (I/O) occurs.
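Problem (1) above may be illustrated by the following sketch, in which each path pattern is verified by its own traversal and the number of element visits is tallied. The sketch operates on a small in-memory tree, so the visit count merely stands in for the page accesses that would occur on secondary storage; the function name `count_matches`, the pattern representation, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_matches(root, steps, counter):
    """Count elements matching a root-anchored child path.

    Every element that is examined increments counter[0].
    """
    def walk(elem, depth):
        counter[0] += 1
        if elem.tag != steps[depth]:
            return 0
        if depth == len(steps) - 1:
            return 1
        return sum(walk(child, depth + 1) for child in elem)

    return walk(root, 0)


DOC = ET.fromstring(
    "<lib>"
    "<book><title/><author/></book>"
    "<book><title/><author/></book>"
    "</lib>"
)
PATTERNS = [["lib", "book", "title"], ["lib", "book", "author"]]

visits = [0]
results = [count_matches(DOC, pattern, visits) for pattern in PATTERNS]
element_count = sum(1 for _ in DOC.iter())
print(results, visits[0], element_count)
```

Here the document contains seven elements, yet verifying two path patterns independently examines fourteen, because the second pattern revisits the elements already examined for the first.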
As an approach to suppressing repeated scanning of the same structured document data, techniques for structured document stream processing exist. Examples include the following references.
(Reference 1) Y. Diao, P. Fischer, and M. J. Franklin. YFilter: Efficient and Scalable Filtering of XML Documents. In The 18th International Conference on Data Engineering, San Jose, February 2002.
(Reference 2) I. Avila-Campillo, D. Raven, T. Green, A. Gupta, Y. Kadiyska, M. Onizuka, and D. Suciu. An XML Toolkit for Light-weight XML Stream Processing, 2002.
The stream process evaluates XPath queries or the like without storing all the structured document data in main storage. A system that converts a plurality of path patterns appearing in plural XPath expressions into state transitions and processes them has also been proposed. Under present circumstances, however, the following problem arises.
(3) The performance deteriorates notably for XPath queries that do not have a high hit rate. Because the process is backtracking-based, the CPU overhead is large. By the nature of the process, query processing that uses indexes is also difficult.
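For illustration only, the idea of converting plural path patterns into state transitions and verifying them in one pass over a document stream may be sketched as below. This is a deliberately simplified sketch restricted to root-anchored child steps; it is not the algorithm of Reference 1 or Reference 2, and the function name `stream_match` and the (pattern index, steps matched) state representation are assumptions.

```python
import io
import xml.etree.ElementTree as ET


def stream_match(xml_bytes, patterns):
    """Count matches for several root-anchored child paths in a single pass.

    Each pattern is a non-empty list of tag names.
    """
    counts = [0] * len(patterns)
    # Each stack entry lists (pattern index, number of steps matched so far)
    # for the patterns that are still alive at that depth.
    stack = [[(i, 0) for i in range(len(patterns))]]
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    for event, elem in events:
        if event == "start":
            active = []
            for i, pos in stack[-1]:
                if patterns[i][pos] == elem.tag:
                    if pos + 1 == len(patterns[i]):
                        counts[i] += 1  # final step matched
                    else:
                        active.append((i, pos + 1))
            stack.append(active)
        else:
            stack.pop()
            elem.clear()  # drop the content of the finished element
    return counts


DATA = (
    b"<lib>"
    b"<book><title/><author/></book>"
    b"<book><title/><author/></book>"
    b"</lib>"
)
COUNTS = stream_match(DATA, [["lib", "book", "title"], ["lib", "book", "author"]])
print(COUNTS)
```

In this sketch every element is read once regardless of the number of patterns, but every element must still be read even when few of them match, and no index can be consulted to skip portions of the stream.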
As described above, it is difficult to process a plurality of path patterns against a database that stores structured document data with minimum disk I/O and a small amount of computation.