1. Field of the Invention
The present invention relates to an apparatus, a method, and a computer program product for managing a structured document by creating an index used for retrieval and retrieving data with the created index.
2. Description of the Related Art
In recent years, structured document databases for storing and retrieving structured document data described in eXtensible Markup Language (XML) and the like have been developed. In general, queries to the structured document databases are performed using a query language called XML Query (XQuery) that is currently under standardization by the World Wide Web Consortium (W3C).
In XQuery, node-level information in a document object model (DOM), such as elements and attributes, is the object of retrieval. For example, JP-A 2001-147933(KOKAI) proposes the technology described below for performing information retrieval at the node level in a structured document.
In the method disclosed in JP-A 2001-147933(KOKAI), first, when a structured document is stored in a database, the data structure of the document to be stored is analyzed, and analysis information concerning the structure (node) is embedded in vocabulary index information or the like to create an index. In this analysis information, nodes that share a path expressible in XML Path Language (XPath) are regarded as having identical structure information (a structure template). During retrieval, a retrieval query is analyzed to create a query graph and, after cost calculation is performed, a plan for executing the query is created. In creating the plan, the query is analyzed beforehand, the constraints on the structure that the respective variables should satisfy are calculated in advance, and the search range is limited when retrieval is performed using the index, thereby reducing the number of intermediate candidates.
In general, a vocabulary index uses a system that divides a text to be registered into a plurality of vocabularies and manages index information in an inverted list format with the divided vocabularies set as units. This method has long been used in the field of full text retrieval. By recording document identifiers and occurrence position information as index information, it makes high-speed document retrieval by keyword possible. In JP-A 2001-147933(KOKAI), to extend this method to a structured document, element identifiers and structure information (structure identifiers) are added as index information.
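The inverted list extended with element identifiers and structure identifiers can be sketched as follows. This is a minimal illustration only, not the implementation of JP-A 2001-147933(KOKAI): the names `Posting` and `build_index`, the whitespace tokenizer, the sample paths, and the policy of numbering structure identifiers in order of first appearance are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Posting(NamedTuple):
    """One piece of index information for a single word occurrence."""
    doc_id: int        # document identifier
    element_id: int    # element identifier within the document
    structure_id: int  # identifier shared by all elements with the same path
    position: int      # occurrence position of the word inside the element


def build_index(documents):
    """Build an inverted list: word -> list of postings.

    documents: {doc_id: [(element_id, path, text), ...]} (hypothetical layout)
    """
    structure_ids = {}         # path -> structure identifier
    index = defaultdict(list)  # word -> postings, in ascending doc_id order
    for doc_id in sorted(documents):
        for element_id, path, text in documents[doc_id]:
            # Elements sharing a path share one structure identifier.
            sid = structure_ids.setdefault(path, len(structure_ids) + 1)
            # Assumption: words are separated by whitespace.
            for position, word in enumerate(text.lower().split()):
                index[word].append(Posting(doc_id, element_id, sid, position))
    return index, structure_ids


docs = {
    1: [(1, "/book/title", "XML databases"),
        (2, "/book/body", "Indexing XML data")],
    2: [(1, "/book/title", "Query languages")],
}
index, structure_ids = build_index(docs)
```

Looking up `index["xml"]` in this sketch yields two postings for document 1, one under the structure identifier of "/book/title" and one under that of "/book/body", so a query can filter by structure without reopening the document.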
Respective pages in the inverted list are often managed in block size units in which efficiency of disk I/O and the like is high. A plurality of pieces of index information are stored in the respective pages. To efficiently arrange the retrieval information, compression and the like of the retrieval information are performed.
The greatest advantage of the method of managing index information with an inverted list is retrieval speed. In particular, from the viewpoint of disk I/O, a disk cache effect can be expected because index information is arranged contiguously in a list format, so pages can be read out faster than with a random arrangement of index information. Therefore, compared with a system that manages index information in a tree format such as a B tree, retrieval performance is high, although update performance is inferior.
JP-A 2006-73035(KOKAI) proposes a technique for executing retrieval with an inverted list at high speed. One characteristic of the method disclosed in JP-A 2006-73035(KOKAI) is that the retrieval space is narrowed down by preferentially processing vocabularies with low frequencies, using frequency information for each vocabulary. Another characteristic is that, during page arrangement, the range of document identifiers arranged in a page is recorded as a heading of the page. During retrieval, the document identifier to be retrieved is compared with the range and, when the document identifier is not included in the range, unnecessary retrieval for the page can be skipped.
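The two characteristics described above, page skipping by a recorded identifier range and processing of low-frequency vocabularies first, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of JP-A 2006-73035(KOKAI): the names `Page`, `make_pages`, `contains`, and `intersect`, the fixed page size, and the sample identifiers are all hypothetical.

```python
from typing import NamedTuple


class Page(NamedTuple):
    min_doc: int   # heading: smallest document identifier in the page
    max_doc: int   # heading: largest document identifier in the page
    doc_ids: list  # document identifiers stored in the page


def make_pages(doc_ids, page_size):
    """Split a list of document identifiers into fixed-size pages."""
    pages = []
    for i in range(0, len(doc_ids), page_size):
        chunk = doc_ids[i:i + page_size]
        pages.append(Page(min(chunk), max(chunk), chunk))
    return pages


def contains(pages, doc_id):
    """Return (found, pages_read), skipping pages whose range excludes doc_id."""
    pages_read = 0
    for page in pages:
        if not (page.min_doc <= doc_id <= page.max_doc):
            continue  # the heading rules this page out, so it is not read
        pages_read += 1
        if doc_id in page.doc_ids:
            return True, pages_read
    return False, pages_read


def intersect(index, words):
    """Find documents containing every word, rarest vocabulary first."""
    def frequency(word):
        return sum(len(page.doc_ids) for page in index[word])

    ordered = sorted(words, key=frequency)
    candidates = [d for page in index[ordered[0]] for d in page.doc_ids]
    for word in ordered[1:]:
        candidates = [d for d in candidates if contains(index[word], d)[0]]
    return candidates


index = {
    "xml": make_pages([1, 2, 3, 5, 8, 9, 12, 15], page_size=4),
    "xquery": make_pages([3, 12, 20], page_size=4),
}
result = intersect(index, ["xml", "xquery"])
```

In this example the less frequent word "xquery" supplies the initial candidates, and checking identifier 20 against the pages of "xml" reads no page at all because both recorded ranges exclude it.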
However, in the method of recording a range of document identifiers in a vocabulary index as in JP-A 2006-73035(KOKAI), the values of document identifiers in a page are likely to vary widely, because document identifiers often fluctuate from vocabulary to vocabulary. When the range of document identifiers in a page is excessively large, it becomes difficult to narrow down the retrieval space in page units, and the speed-up effect diminishes.
For example, assuming that the number of indexes present in a page is identical, a range of document identifiers of 10 to 100 and a range of document identifiers of 10 to 10000 are compared. In the former range, it is considered to be possible to skip unnecessary information collation in page units at a higher probability. On the other hand, because the latter range is large, it is highly likely that a document identifier to be retrieved is included in the range and the effect of high-speed retrieval by the skip of collation is not usually obtained.
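The comparison above can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that the document identifier to be retrieved is drawn uniformly from an identifier space of 1 to 10000; neither the uniform distribution nor the size of the identifier space comes from the cited publication, and `skip_probability` is a hypothetical helper.

```python
def skip_probability(low, high, id_space):
    """Probability that a uniformly chosen identifier falls outside [low, high].

    Assumption: identifiers are integers 1..id_space, each equally likely.
    """
    covered = high - low + 1
    return 1 - covered / id_space


narrow = skip_probability(10, 100, id_space=10000)   # range of 10 to 100
wide = skip_probability(10, 10000, id_space=10000)   # range of 10 to 10000
print(f"narrow range: {narrow:.4f}, wide range: {wide:.4f}")
```

Under these assumptions, the page with the narrow range can be skipped for about 99% of the identifiers looked up, whereas the page with the wide range can be skipped for fewer than 0.1% of them.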
When a structured document is the object of retrieval, it is necessary to take into account both structure information and vocabulary information. However, when a method that does not take structure information into account, as in JP-A 2006-73035(KOKAI), is applied to creating a vocabulary index of a structured document and to retrieval with that index, solution spaces (pages) that the structural constraints make unnecessary to retrieve are more often retrieved.
For example, assume that one thousand elements are present in a page and that, among them, only ten pieces of index information concern a specific structure “/title”. In this case, when the specific structure is designated and retrieved, if the structure information is not taken into account at all, the other nine hundred ninety elements are also retrieved. Therefore, wasteful reading occurs.
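The wasteful reading in this example can be counted with a short sketch. The page layout, the function names `build_page` and `scan`, and the placeholder path "/body" for the other elements are assumptions made for illustration only.

```python
def build_page(total=1000, title_count=10):
    """Build one page of (element_id, structure path) index entries.

    Assumption: every entry that is not "/title" uses the placeholder "/body".
    """
    return [(i, "/title" if i < title_count else "/body") for i in range(total)]


def scan(entries, path):
    """Scan a page without any structure information in its heading.

    Every entry has to be examined, because nothing tells the reader in
    advance which entries belong to the designated structure.
    """
    examined = 0
    hits = []
    for element_id, entry_path in entries:
        examined += 1
        if entry_path == path:
            hits.append(element_id)
    return hits, examined


page = build_page()
hits, examined = scan(page, "/title")
wasted = examined - len(hits)
print(f"examined={examined}, matched={len(hits)}, wasted={wasted}")
```

Here all one thousand entries are examined to obtain ten matches, so nine hundred ninety of the comparisons contribute nothing to the result.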