FIG. 2 illustrates an example of a tagged document. The term “document” herein refers to data that contains at least a document number, which is a unique identifier, and a character string to be searched (body text). A “tag” refers to data that is attached to one or more words within a document. FIG. 2 illustrates a document example which contains a (Japanese) character string “” (in English, “Taro Yamada, the president of ABC Industry, . . . ”), illustrating an example of a “company name” tag attached to the first to fifth characters “” and an example of a “person's name” tag attached to the seventh to tenth characters “”. A character string that describes a tag, such as “company name” and “person's name”, is referred to herein as a tag name. A “word” herein refers to a partial character string of body text that is created in accordance with some fixed rule such as morphological analysis or N-gram (in which a character string is broken into N-character fragments).
A document management & retrieval system that performs document management and retrieval on such a tagged document is provided with a function of attaching/detaching a tag to/from a partial character string within a document and a function of searching a document by phrase that uses a tag. The document search by phrase that uses a tag means a function with which a string of sequential characters containing a tag name and a character string is input and a set of documents containing the phrase is output. As an example of the phrase that uses a tag, “[company name]  [person's name] ([person's name] of [company name])” is given. In this syntax, a character string enclosed by “[” and “]” is regarded as a tag name. When regarded as a search query, this phrase means that a document in which an arbitrary word with a “company name” tag attached thereto, a word “ (of)”, and an arbitrary word with a “person's name” tag attached thereto appear in succession is to be returned.
As a method of implementing such tag-based document management & retrieval, one is known in which a tagged document is expressed in a description format that has a hierarchical structure, such as Extensible Markup Language (XML), so that a search device for hierarchically structured documents, namely an XML database (XMLDB), can be utilized (see, for example, Japanese Unexamined Patent Application Publication (JP-A) No. 2005-18811, which is hereinafter referred to as Patent Document 1).
An example of XML is described with reference to FIGS. 3 to 5. FIG. 3 illustrates an example in which a tagged document is expressed in XML, FIG. 4 expresses a part of this document as a tree structure based on the inclusive relation between tags, and FIG. 5 illustrates a table for managing hierarchical information.
In FIG. 4, elliptic nodes and rectangular leaf nodes signify tags and text items, respectively, and edges between those nodes signify the presence of an inclusive relation between tags or between a tag and a text item. Information called a path hierarchy layer is also written in FIG. 4 under each node. The path hierarchy layer of each node is information indicating the position of the node within the document. Numbers indicating the node position are written along with delimiters (“.”) as the path hierarchy layer. For example, the path hierarchy layer “1.1.3” is attached to the “person's name” node of FIG. 4, which means that, viewed from the root, this node is the third node under the first node (the “body text” node) under the first node (the “document” node).
The hierarchical information is managed in a table such as the one illustrated in FIG. 5. This table, however, shows logical relations, and the information may actually be expressed with the use of a plurality of tables. In the table illustrated in FIG. 5, node IDs, document numbers, text items, tag names, and path hierarchy layers are managed for nodes within a document set. A node ID is an identifier unique among all nodes within the document set. A document number is an ID indicating a document that contains the node in question. A text item is a character string contained in a leaf node and, for a node that is not a leaf node, “NULL” is input. A tag name is the tag name of each node and, for a leaf node, “#text” is input. A path hierarchy layer means the path hierarchy layer of each node.
A method of searching such information is described taking as an example the operation of the search device disclosed in Patent Document 1.
For instance, when a phrase “[company name]  [person's name] ([person's name] of [company name])” is given as a query, the search device first breaks up the query into a plurality of search criteria. This query is broken into three criteria: A) that a “company name” tag is contained; B) that a word “ (of)” is contained; and C) that a “person's name” tag is contained. The search device next refers to the table illustrated in FIG. 5 with each of the criteria as a key, to thereby obtain a list of nodes whose tag name is “company name” (List A), a list of nodes whose text item is “ (of)” (List B), and a list of nodes whose tag name is “person's name” (List C). The search device subsequently compares the nodes on List A, List B, and List C, picks up combinations of nodes that have the same document number, and picks up a combination in which the positional relation of the nodes is such that a “company name” node on List A, a “ (of)” node on List B, and a “person's name” node on List C appear sequentially in the same order as in the query. The positional relation is determined by comparing path hierarchies. In the case of this query, a “company name” node, a “ (of)” node, and a “person's name” node are sibling nodes, and the search device creates a search result from nodes that meet the following three criteria:
Criterion 1) the path hierarchy layer of the “company name” node, the path hierarchy layer of the “ (of)” node, and the path hierarchy layer of the “person's name” node match except their final numbers;
Criterion 2) the final number of the path hierarchy layer of the “ (of)” node equals the final number of the path hierarchy layer of the “company name” node plus 1; and
Criterion 3) the final number of the path hierarchy layer of the “person's name” node equals the final number of the path hierarchy layer of the “ (of)” node plus 1.
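The search procedure and Criteria 1 to 3 above can be sketched as follows. This is a minimal illustration, not the implementation of Patent Document 1: the table rows, the English stand-in word "of" for the Japanese word, and the names `Node`, `follows`, and `search` are assumptions made for the example.

```python
from typing import NamedTuple, Optional, Tuple

class Node(NamedTuple):
    node_id: int
    doc_no: int
    text: Optional[str]      # None stands for NULL (non-leaf nodes)
    tag: str                 # "#text" for leaf nodes
    path: Tuple[int, ...]    # path hierarchy layer, e.g. (1, 1, 3) for "1.1.3"

# Hypothetical rows in the spirit of the FIG. 5 table.
NODES = [
    Node(1, 1, None, "document", (1,)),
    Node(2, 1, None, "body text", (1, 1)),
    Node(3, 1, None, "company name", (1, 1, 1)),
    Node(4, 1, "of", "#text", (1, 1, 2)),
    Node(5, 1, None, "person's name", (1, 1, 3)),
    Node(6, 2, None, "person's name", (1, 1, 1)),
    Node(7, 2, "of", "#text", (1, 1, 2)),
    Node(8, 2, None, "company name", (1, 1, 3)),
]

def follows(a: Node, b: Node) -> bool:
    """True if b is the sibling immediately after a in the same document."""
    return (a.doc_no == b.doc_no
            and a.path[:-1] == b.path[:-1]       # Criterion 1: same parent path
            and b.path[-1] == a.path[-1] + 1)    # Criteria 2 and 3: final number + 1

def search(nodes):
    list_a = [n for n in nodes if n.tag == "company name"]   # List A
    list_b = [n for n in nodes if n.text == "of"]            # List B
    list_c = [n for n in nodes if n.tag == "person's name"]  # List C
    # Every combination is compared, which is why general criteria are slow.
    return [(a.node_id, b.node_id, c.node_id)
            for a in list_a for b in list_b for c in list_c
            if follows(a, b) and follows(b, c)]

print(search(NODES))
```

Only document 1 matches here, because document 2 contains the same three nodes in the reverse order.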
However, this method has two problems. A first problem is that adding a tag requires an update of the path hierarchy, which prolongs the processing time. FIG. 6 illustrates an example of a change made to the path hierarchy due to the addition of a tag. In FIG. 6, which is about an example of adding a “person's name” tag to a document, the document structure before the addition is illustrated on the left-hand side whereas the document structure after the addition and the range of the resultant path hierarchy update are illustrated on the right-hand side. The update range on the right-hand side shows that the nodes within a range indicated by the dotted line need a path hierarchy update. A change to even a part of a document thus requires great changes in path hierarchy because a path hierarchy uses the overall hierarchical structure of the document to express a node position.
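The cost of the update can be illustrated with a simplified sketch. It assumes that adding a tag amounts to inserting one new sibling node, which is a simplification of what FIG. 6 depicts; the path values and the function name `insert_sibling` are hypothetical.

```python
def insert_sibling(paths, new_path):
    """Insert a node at new_path and renumber later siblings and their subtrees.

    Returns the new sorted path list and the number of existing nodes whose
    path hierarchy layer had to be rewritten.
    """
    depth = len(new_path) - 1
    parent, index = new_path[:-1], new_path[-1]
    updated, changed = [], 0
    for p in paths:
        # A node is affected if it lies under the same parent at or after
        # the insertion position, including all of its descendants.
        if len(p) > depth and p[:depth] == parent and p[depth] >= index:
            p = p[:depth] + (p[depth] + 1,) + p[depth + 1:]
            changed += 1
        updated.append(p)
    updated.append(new_path)
    return sorted(updated), changed

# Hypothetical document: paths are tuples, so (1, 1, 3) stands for "1.1.3".
PATHS = [(1,), (1, 1), (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 1, 3, 1), (1, 2)]
new_paths, changed = insert_sibling(PATHS, (1, 1, 2))
print(changed, new_paths)
```

One inserted node forces three existing nodes to be renumbered in this small example. The number grows with the count of later siblings and the size of their subtrees.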
A second problem is that a search takes time when a phrase that consists solely of common terms and frequently appearing tag names is used as a search query. With common terms and frequently appearing tag names as search criteria, a large number of nodes are found in a node search conducted for each of the criteria separately, and the document numbers and positional relations of the large number of nodes have to be checked, which lowers the search speed. For instance, in the case of a query “[company name]  [person's name] ([person's name] of [company name])”, the query is broken into a criterion that a “company name” tag should be contained, a criterion that a word “ (of)” should be contained, and a criterion that a “person's name” tag should be contained and, for each criterion, a list of nodes that meet the criterion is created. However, because each criterion is too general, a large number of nodes are found and checking positional relations takes very long.
A document management & retrieval system using XMLDB indexes the hierarchical structure of a document as well and thus takes time to update a tag (addition or removal) or to finish a search. Accordingly, as an alternative method of implementing tag-based phrase search, using an inverted index which is utilized in a full-text search index, instead of indexing the hierarchical structure, is considered.
FIG. 7 illustrates an example of an inverted index. In a data structure indicated by (a) of FIG. 7, inputting a word as a key yields a list holding the number (frequency) of documents that contain the word, the document numbers of the documents that contain the word, and where in the documents the word appears (appearance position, expressed as the number of characters counted from the top of the document) (hereinafter referred to as document list). To accomplish tag-based phrase search with the use of an inverted index, an inverted tag index indicated by (b) is used in addition to the normal inverted index indicated by (a). In the index of (b), as in the case of a word, inputting a tag of a tag name yields a list holding the number (frequency) of documents that contain the tag, the document numbers of the documents that contain the tag, and information indicating where in the documents the tag appears (start point and end point, expressed as the number of characters counted from the top of the document) (hereinafter referred to as tag document list).
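The two structures of FIG. 7 can be sketched as plain dictionaries. This is an illustrative model only: the documents are assumed to be already split into words, English stand-in words replace the Japanese text, positions are 1-based character counts, and the name `build_indexes` is hypothetical.

```python
from collections import defaultdict

def build_indexes(docs, tags):
    """Build (a) a word inverted index and (b) an inverted tag index.

    docs: {doc_no: [word, ...]} with words in order of appearance
    tags: [(doc_no, tag_name, start, end), ...] in character positions
    """
    word_index = defaultdict(dict)   # word -> {doc_no: [appearance positions]}
    for doc_no, words in docs.items():
        pos = 1
        for w in words:
            word_index[w].setdefault(doc_no, []).append(pos)
            pos += len(w)
    tag_index = defaultdict(dict)    # tag name -> {doc_no: [(start, end)]}
    for doc_no, name, start, end in tags:
        tag_index[name].setdefault(doc_no, []).append((start, end))
    return word_index, tag_index

# Hypothetical two-document collection.
DOCS = {1: ["Acme", "of", "Yamada"], 2: ["Yamada", "of", "Acme"]}
TAGS = [(1, "company name", 1, 4), (1, "person's name", 7, 12),
        (2, "person's name", 1, 6), (2, "company name", 9, 12)]

word_index, tag_index = build_indexes(DOCS, TAGS)
# The frequency is the number of documents in the list for a key.
print(len(word_index["of"]), dict(word_index["of"]))
print(dict(tag_index["company name"]))
```

In this model, attaching or detaching a tag touches only the entry for that tag name in `tag_index`, and `word_index` is left unchanged.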
With this index, attaching or detaching a tag requires adding or removing only the relevant part of the inverted tag index, so a tag update stays local.
However, this method, too, has the issue of processing time in a search where the search query used is a phrase that consists only of common terms and frequently appearing tag names. For instance, when a phrase “[company name]  [person's name] ([person's name] of [company name])” is given as a query, a retrieval system that has this index breaks up the query into A) that a “company name” tag is contained, B) that a word “ (of)” is contained, and C) that a “person's name” tag is contained, as the device described in Patent Document 1 does, and refers to each inverted index. As in the case of XMLDB, because each criterion is too general, a very long document list is found for each criterion and checking positional relations takes time.
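A tag-based phrase search over such indexes can be sketched as below. The index contents, the stand-in word "of", and the function name `tagged_phrase_search` are assumptions for illustration; positions are 1-based character counts.

```python
# Hypothetical index contents.
WORD_INDEX = {"of": {1: [5], 2: [7]}}                 # word -> {doc_no: [positions]}
TAG_INDEX = {                                         # tag -> {doc_no: [(start, end)]}
    "company name": {1: [(1, 4)], 2: [(9, 12)]},
    "person's name": {1: [(7, 12)], 2: [(1, 6)]},
}

def tagged_phrase_search(first_tag, word, second_tag):
    """Find documents where first_tag, word, second_tag appear in succession."""
    first = TAG_INDEX.get(first_tag, {})
    middle = WORD_INDEX.get(word, {})
    last = TAG_INDEX.get(second_tag, {})
    hits = []
    # All three lists are read in full before any position is compared.
    for doc_no in sorted(first.keys() & middle.keys() & last.keys()):
        word_starts = set(middle[doc_no])
        tag_starts = {start for start, _ in last[doc_no]}
        for start, end in first[doc_no]:
            # The word must start right after the first tag ends, and the
            # second tag must start right after the word ends.
            if end + 1 in word_starts and end + 1 + len(word) in tag_starts:
                hits.append((doc_no, start))
    return hits

print(tagged_phrase_search("company name", "of", "person's name"))
```

Each of the three lists is as long as the number of documents containing that tag or word, so general criteria make the position check the dominant cost.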
A method called nextword index is one way to speed up phrase search where the search query consists of common terms by cutting the length of a document list (see H. E. Williams, J. Zobel and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems, 22(4), pp. 573-594, 2004, hereinafter referred to as Non-patent Document 1). A nextword index has a data structure in which a document list of a common term high in frequency is broken up based on what word appears next (to the “right” on the premise that the documents are written horizontally).
FIG. 8 illustrates a data structure example of a nextword index. In the nextword index, a word is used as a key, a set of words that appear to the right of the key word (nextwords) is stored, the key word is paired with one of the nextwords to obtain a set of documents in which the two words appear next to each other, and a document list of the set of documents is referred to.
FIG. 9 illustrates an example of an index. In this example, “ (Yamada: a surname)” and “” are registered as nextwords of a word “ (of)”, and a document list of documents that contain “” and a document list of documents that contain “” are registered for the respective nextwords. In the following description, a key including two words (or criteria) as described above is expressed as “A→B” (for example, “”), and A and B are referred to as a primary key and a secondary key, respectively.
A retrieval system disclosed in Non-patent Document 1 improves the search speed by using this nextword index for a word that is high in frequency. For instance, when a phrase “ (Yamada of abc Industry)” is input as a search query, and “ (abc Industry)” is a low-frequency word whereas “ (of)” is a high-frequency word, this retrieval system performs a search as follows. First, a normal inverted index is referred to with respect to the low-frequency word to obtain a document list for “”. Next, a nextword index is referred to with the use of a key “” to obtain a document list for the high-frequency word. Those two document lists are compared to output the set of documents that are common to the two and in which the words appear in the same positional relation as in the query. With a nextword index, document lists can thus be read with the adjacency relation between two words as a key, with the result that the search speed is improved.
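The lookup sequence described above can be sketched as follows. This is a simplified model rather than the system of Non-patent Document 1: positions are word offsets instead of character counts, English stand-in words are used, and the index contents and the name `phrase_search` are hypothetical.

```python
# Normal inverted index for low-frequency words: word -> {doc_no: [positions]}
INVERTED = {"abc": {1: [1], 3: [4]}}

# Nextword index for high-frequency words:
# primary key -> secondary key -> {doc_no: [positions of the primary key]}
NEXTWORD = {"of": {"Yamada": {1: [2], 2: [5]}, "Suzuki": {3: [5]}}}

def phrase_search(first, primary, secondary):
    """Search for the three-word phrase `first primary secondary`."""
    low = INVERTED.get(first, {})                         # low-frequency word
    high = NEXTWORD.get(primary, {}).get(secondary, {})   # key "primary→secondary"
    hits = []
    for doc_no in sorted(low.keys() & high.keys()):
        primary_positions = set(high[doc_no])
        for pos in low[doc_no]:
            # The primary key must directly follow the low-frequency word.
            if pos + 1 in primary_positions:
                hits.append((doc_no, pos))
    return hits

print(phrase_search("abc", "of", "Yamada"))
print(phrase_search("abc", "of", "Suzuki"))
```

Only the short list stored under the key "of→Yamada" is read, not the full document list for "of", which is where the speedup comes from.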
However, this method is intended for simple phrase search and, when tagged documents are the target, has a problem in that tag update processing takes a long time.
FIG. 10 is a diagram illustrating that tag update processing takes time in a retrieval system that uses a nextword index. Illustrated here is a range in which an update is necessitated when a tag is added to or removed from a phrase “ (Yamada of abc Industry)”.
In FIG. 10, (a) illustrates the character string “” with “noun” and “company name” tags attached to “ (abc Industry)”, a “jyoshi” tag (jyoshi is a particle in Japanese grammar) attached to “ (of)”, and a “person's name” tag attached to “ (Yamada)”. The eight dotted-line arrows of (a) each signify an adjacency relation key created in the nextword index. In FIG. 10, “” is low in frequency and stored in a normal inverted index.
Consider a case where an “affiliation” tag is added to the word “” out of this phrase. This newly generates relations that are indicated by solid-line arrows of (b) and, accordingly, parts corresponding to a key “noun→affiliation”, a key “company name→affiliation”, a key “affiliation→”, and a key “affiliation→noun” have to be updated.
Consider another case where the “jyoshi” tag attached to “” is removed. Then, similarly, relations indicated by solid-line arrows of (c) have to be removed. Specifically, document lists for a key “noun→jyoshi”, a key “company name→jyoshi”, a key “jyoshi→”, and a key “jyoshi→proper noun” are referred to, and relevant parts need to be modified.
The nextword index is thus designed without taking tag attachment into account and, when simply applied to tagged documents, has a problem in that many places need to be updated, which prolongs tag update processing. This is because, when a tag is used as a secondary key, the entries that refer to that tag are scattered across the index rather than stored in one place.
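The spread of the update can be illustrated with a simplified model of FIG. 10. The model assumes that every tag and every high-frequency word acts as a key, that "Yamada" is treated as high in frequency so that eight adjacency keys result, and that English stand-in words replace the Japanese text. The function names are hypothetical.

```python
def labels(token, high_freq):
    """Keys contributed by one token: its tags, plus the word if high-frequency."""
    word, tags = token
    return set(tags) | ({word} if word in high_freq else set())

def adjacency_keys(tokens, high_freq):
    """All 'A→B' keys created for adjacent tokens, as (A, B) pairs."""
    keys = set()
    for left, right in zip(tokens, tokens[1:]):
        for a in labels(left, high_freq):
            for b in labels(right, high_freq):
                keys.add((a, b))
    return keys

def keys_touched_by_tag(tokens, index, tag, high_freq):
    """Keys that must be updated when `tag` is added to or removed from tokens[index]."""
    touched = set()
    if index > 0:
        touched |= {(a, tag) for a in labels(tokens[index - 1], high_freq)}
    if index + 1 < len(tokens):
        touched |= {(tag, b) for b in labels(tokens[index + 1], high_freq)}
    return touched

# Tags follow the description of (a) of FIG. 10; "abc Industry" is low-frequency.
TOKENS = [("abc Industry", {"noun", "company name"}),
          ("of", {"jyoshi"}),
          ("Yamada", {"person's name"})]
HIGH_FREQ = {"of", "Yamada"}   # assumption made for this sketch

print(len(adjacency_keys(TOKENS, HIGH_FREQ)))
print(sorted(keys_touched_by_tag(TOKENS, 1, "jyoshi", HIGH_FREQ)))
```

In this model, removing the single “jyoshi” tag touches four separate keys, two in which the tag is a secondary key and two in which it is a primary key.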