A description of a traditional text retrieval and summarization system containing a taxonomy and tagged text is provided below. First of all, the definitions of “taxonomy”, “tagged text”, and “text retrieval and summarization system” will be given.
A taxonomy is a directed acyclic graph (DAG) comprising multiple semantic classes. Each semantic class is composed of a label and a class identifier and, in addition, has parent-child relationships with other semantic classes. A parent class is a semantic class serving as a superordinate concept relative to a certain semantic class. A child class is a semantic class serving as a subordinate concept relative to a certain semantic class. A label is a character string that represents its semantic class. It should be noted that in the discussion below a semantic class labeled ‘X’ may be represented as “X-Class”.
A class identifier is a unique value indicating a specific semantic class within a taxonomy. Here, an example of a taxonomy will be described with reference to FIG. 18. FIG. 18 illustrates an exemplary taxonomy. In the example of FIG. 18, thirteen semantic classes are represented by ovals, with the labels of the semantic classes noted inside the ovals and, furthermore, class identifiers noted next to the ovals. In addition, in FIG. 18, the arrows denote parent-child relationships between the semantic classes. For example, the class “electric appliance manufacturer” has “C002” as a class identifier, the class “enterprise” as a parent class, and the class “Company A” as a child class. It should be noted that in the description that follows, semantic classes that are at the lowermost level within the taxonomy and do not have a child class are referred to as “leaf classes”.
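The taxonomy described above can be sketched as a small data structure. The following is a minimal illustration only, written in Python. The names `SemanticClass`, `Taxonomy`, `add_edge`, and `leaf_classes` are hypothetical. Only the identifier "C002" and the three labels are taken from the description of FIG. 18; the identifiers "X001" and "X003" are placeholders, and treating "Company A" as a leaf class is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticClass:
    class_id: str   # unique identifier within the taxonomy
    label: str      # character string representing the class
    parents: set = field(default_factory=set)    # identifiers of parent classes
    children: set = field(default_factory=set)   # identifiers of child classes


class Taxonomy:
    """A directed acyclic graph of semantic classes (illustrative sketch)."""

    def __init__(self):
        self.classes = {}

    def add_class(self, class_id, label):
        self.classes[class_id] = SemanticClass(class_id, label)

    def _reachable(self, class_id, attr):
        # Collect every class reachable by following `attr` links.
        seen, stack = set(), list(getattr(self.classes[class_id], attr))
        while stack:
            cid = stack.pop()
            if cid not in seen:
                seen.add(cid)
                stack.extend(getattr(self.classes[cid], attr))
        return seen

    def ancestors(self, class_id):
        return self._reachable(class_id, "parents")

    def descendants(self, class_id):
        return self._reachable(class_id, "children")

    def add_edge(self, parent_id, child_id):
        # Refuse any parent-child link that would make the graph cyclic.
        if parent_id == child_id or parent_id in self.descendants(child_id):
            raise ValueError("edge would create a cycle")
        self.classes[parent_id].children.add(child_id)
        self.classes[child_id].parents.add(parent_id)

    def leaf_classes(self):
        # Leaf classes are those without any child class.
        return sorted(cid for cid, c in self.classes.items() if not c.children)


taxonomy = Taxonomy()
taxonomy.add_class("X001", "enterprise")                       # placeholder id
taxonomy.add_class("C002", "electric appliance manufacturer")  # id from FIG. 18
taxonomy.add_class("X003", "Company A")                        # placeholder id
taxonomy.add_edge("X001", "C002")
taxonomy.add_edge("C002", "X003")
```

Because the parent links are stored as sets, a class may have more than one parent, which is what distinguishes a DAG from a simple tree.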
A tagged text is information that includes at least a body text composed of character strings and a set of tags attached in arbitrary locations within the character strings. It should be noted that in the description below, a tagged text may be described simply as a “document”. FIG. 19 illustrates an exemplary tagged text. FIG. 19 shows an example of two tagged texts, i.e. Document 001 and Document 002. Of the above, Document 001 is composed of body text, i.e. “Company A, a major electric appliance manufacturer, announces a net profit of 10 Billion Yen in its March 2008 financial results”, and tags “Company A”, “March 2008”, and “10 Billion Yen” attached in three places.
Each of the tags in the documents contains three information items, namely, a class pointer, a start position, and an end position. A class pointer is a class identifier indicating a leaf class within the taxonomy. The start position and end position constitute information representing the location where the tag is attached. The start position and end position are typically represented by the number of characters from the beginning of the sentence, with the beginning of the sentence counted as “0”. For example, the start position of the tag attached to “Company A” is the 9th character, and its end position is the 11th character, from the beginning of the sentence.
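A tagged text of this kind can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the names `Tag`, `TaggedText`, `attach`, and `tagged_string` are hypothetical, the class pointers such as "LEAF_COMPANY" are placeholders rather than identifiers from FIG. 18, and the end position is assumed to be inclusive. The offsets are computed against the English rendering of the body text, so they need not coincide with the character positions quoted above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tag:
    class_pointer: str  # class identifier of a leaf class in the taxonomy
    start: int          # 0-based offset of the first tagged character
    end: int            # 0-based offset of the last tagged character (inclusive; assumed)


@dataclass
class TaggedText:
    doc_id: str
    body: str
    tags: list = field(default_factory=list)

    def attach(self, class_pointer, phrase):
        # Locate the phrase in the body; str.index raises ValueError if absent.
        start = self.body.index(phrase)
        tag = Tag(class_pointer, start, start + len(phrase) - 1)
        self.tags.append(tag)
        return tag

    def tagged_string(self, tag):
        # Recover the character string covered by a tag.
        return self.body[tag.start:tag.end + 1]


doc_001 = TaggedText(
    "Document 001",
    "Company A, a major electric appliance manufacturer, announces a net "
    "profit of 10 Billion Yen in its March 2008 financial results",
)
doc_001.attach("LEAF_COMPANY", "Company A")      # placeholder class pointer
doc_001.attach("LEAF_MONTH_YEAR", "March 2008")  # placeholder class pointer
doc_001.attach("LEAF_AMOUNT", "10 Billion Yen")  # placeholder class pointer
```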
A text retrieval and summarization system is a system that uses search terms represented by keywords and the like to assemble a collection of tagged text associated with the search terms and summarizes the search results based on the tags contained in the collection of tagged text.
An example in which a traditional text retrieval and summarization system generates a type of summary called a tabular summary will be described next. For example, let us assume that a user has entered a query ““financial results” AND “announces””. At such time, first of all, the text retrieval and summarization system collects tagged text containing the two expressions, i.e. “financial results” and “announces”, in the body text. Here, it is assumed that Document 001 and Document 002 illustrated in FIG. 19 have been assembled into a collection of matching documents. It should be noted that, as used herein, collections of tagged text that match user-entered queries are referred to as “matching document collections”. On the other hand, collections of tagged text that do not match user-entered queries are referred to as “non-matching document collections”.
Next, based on the tags attached to the collected tagged text, the text retrieval and summarization system selects multiple semantic classes as a point of view for summarization. For example, let us assume that the text retrieval and summarization system has selected “enterprise”, “net profit”, and “Month/Year”. At such time, the text retrieval and summarization system generates the results illustrated in FIG. 20. FIG. 20 shows an example of output from a traditional text retrieval and summarization system. In the example of FIG. 20, a table having rows assigned respectively to Document 001 and Document 002 is created based on the character strings of the tagged portions of Document 001 and Document 002.
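The retrieval and tabulation steps described above can be sketched as follows. This is a simplified Python illustration, not the system of FIG. 20. The function names `is_a`, `search`, and `tabular_summary` are hypothetical, tags are reduced to (leaf class label, tagged string) pairs, and it is assumed that the tags for “10 Billion Yen” and “March 2008” point directly at classes labeled “net profit” and “Month/Year”. Document 001 follows FIG. 19 as described; "Document X" is invented for illustration.

```python
# Parent links of a small taxonomy fragment (child label -> parent labels).
PARENTS = {
    "Company A": ["electric appliance manufacturer"],
    "electric appliance manufacturer": ["enterprise"],
    "enterprise": [],
    "net profit": [],   # assumed to be a class that tags point at directly
    "Month/Year": [],   # assumed to be a class that tags point at directly
}

DOCUMENTS = {
    "Document 001": {
        "body": "Company A, a major electric appliance manufacturer, announces "
                "a net profit of 10 Billion Yen in its March 2008 financial results",
        "tags": [("Company A", "Company A"),
                 ("Month/Year", "March 2008"),
                 ("net profit", "10 Billion Yen")],
    },
    "Document X": {  # hypothetical non-matching document
        "body": "Company A opens a new office",
        "tags": [("Company A", "Company A")],
    },
}


def is_a(label, target):
    """True if `label` is `target` or has `target` among its ancestors."""
    if label == target:
        return True
    return any(is_a(parent, target) for parent in PARENTS.get(label, []))


def search(documents, terms):
    """Collect documents whose body contains every term (an AND query)."""
    return [doc_id for doc_id, doc in documents.items()
            if all(term in doc["body"] for term in terms)]


def tabular_summary(documents, doc_ids, viewpoint):
    """Build one row per document and one column per selected semantic class."""
    rows = []
    for doc_id in doc_ids:
        row = {"document": doc_id}
        for cls in viewpoint:
            hits = [text for leaf, text in documents[doc_id]["tags"]
                    if is_a(leaf, cls)]
            row[cls] = ", ".join(hits)
        rows.append(row)
    return rows


matches = search(DOCUMENTS, ["financial results", "announces"])
table = tabular_summary(DOCUMENTS, matches,
                        ["enterprise", "net profit", "Month/Year"])
```

In this sketch the “enterprise” column is filled from the tag on “Company A” because the leaf class is resolved upward through its parent classes.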
In this manner, the text retrieval and summarization system selects several semantic classes from a collection of tagged text obtained based on the search terms and summarizes the search results from the point of view represented by the selected semantic classes.
In order to build such a text retrieval and summarization system, it is necessary to decide what set of semantic classes to retrieve as a point of view from the collection of tagged texts selected based on the search terms. In other words, the problem is to determine the criteria to be used in identifying the semantic classes specific to a collection of user-selected texts. In this Specification, this problem is treated as the problem of semantic class identification.
For example, in connection with the problem of semantic class identification, Non-Patent Document 1 has disclosed a system of facet identification in multi-faceted search. The term “multi-faceted search” refers to a technology in which tag information called “facets” is appended to data based on various points of view (time, place name, enterprise name, etc.) and only specific data is retrieved when the user specifies the terms for the facets. The system of facet identification disclosed in Non-Patent Document 1 ranks facets based on several evaluation scores in a data set obtained via a user search and selects the data to which the top K facets are appended.
It is believed that using this facet identification system disclosed in Non-Patent Document 1 can solve the above-described semantic class identification problem. For example, it is contemplated to rank semantic classes attached to texts extracted as search results based on certain evaluation scores in accordance with the facet identification system and retrieve the top K semantic classes with high evaluation scores as a point of view.
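The top-K selection contemplated above can be sketched as follows. This Python sketch is not the method of Non-Patent Document 1: it assumes a single evaluation score, namely the number of search results in which a semantic class appears, and the function name `top_k_classes` and the sample data are hypothetical.

```python
from collections import Counter


def top_k_classes(tagged_results, k):
    """Rank semantic classes by document frequency and keep the top k.

    `tagged_results` is a list with one collection of semantic class
    labels per search result. Ties are broken alphabetically so that
    the ranking is deterministic.
    """
    frequency = Counter()
    for classes in tagged_results:
        frequency.update(set(classes))  # count each class once per result
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ranked[:k]]


# Hypothetical search results, each reduced to its semantic classes.
results = [
    {"enterprise", "net profit", "Month/Year"},
    {"enterprise", "net profit"},
    {"enterprise"},
]
viewpoint = top_k_classes(results, k=2)
```

Note that the value of `k` has to be supplied by the caller, and each class is scored without regard to the other classes selected alongside it.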
However, when the facet identification system disclosed in Non-Patent Document 1 is used, the number K of semantic classes retrieved as a point of view needs to be specified by the user. In addition, semantic classes are assessed only on an individual basis, and combinations of multiple semantic classes are not assessed. Accordingly, when the facet identification system disclosed in Non-Patent Document 1 is used, there is a chance that unsuitable combinations of semantic classes may be retrieved. This will be illustrated with reference to FIG. 21, using an exemplary situation in which the frequencies obtained in search results are utilized as evaluation scores for individual semantic classes.
FIG. 21 is a diagram illustrating an exemplary situation in which tagged texts are categorized using tags. The distribution of the tags in the search results is as shown in FIG. 21. In FIG. 21, each row designates a tagged text in the search results. In addition, the columns, except for the first column of FIG. 21, designate semantic classes. Furthermore, the cells of FIG. 21 indicate whether or not the semantic classes are included in each tagged text. In each cell of FIG. 21, “1” is listed when a semantic class is included and “0” is listed when a semantic class is not included.
In the example of FIG. 21, individual semantic classes with high frequencies include “net profit”, “enterprise”, and “name”. However, among these, “name” rarely appears in conjunction with other semantic classes, and retrieving these 3 classes as a point of view would not be efficient. Thus, when semantic classes are assessed on an individual basis, there is a chance that undesirable semantic classes may be retrieved depending on the semantic class combinations.
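The difference between individual and combined assessment can be illustrated with a small sketch. The Python code below uses an invented 0/1 distribution in the style of FIG. 21; the actual values of FIG. 21 are not reproduced here, and the names `ROWS`, `frequency`, and `joint_count` are hypothetical.

```python
# Hypothetical tag distribution: each row lists the semantic classes
# present in one tagged text (a "1" cell in a FIG. 21 style table).
ROWS = {
    "T1": {"net profit", "enterprise", "Month/Year"},
    "T2": {"net profit", "enterprise", "Month/Year"},
    "T3": {"net profit", "enterprise", "Month/Year"},
    "T4": {"net profit", "enterprise"},
    "T5": {"net profit", "enterprise", "name"},
    "T6": {"name"},
    "T7": {"name"},
    "T8": {"name"},
}


def frequency(rows, semantic_class):
    """Number of tagged texts that contain the given semantic class."""
    return sum(1 for classes in rows.values() if semantic_class in classes)


def joint_count(rows, combination):
    """Number of tagged texts that contain every class in the combination."""
    return sum(1 for classes in rows.values() if set(combination) <= classes)


all_classes = sorted(set().union(*ROWS.values()))
by_frequency = sorted(all_classes, key=lambda c: (-frequency(ROWS, c), c))
top_three = by_frequency[:3]

rows_with_top_three = joint_count(ROWS, top_three)
rows_with_alternative = joint_count(
    ROWS, ["enterprise", "net profit", "Month/Year"])
```

With this invented data, “name” ranks among the three most frequent classes, yet only one tagged text contains all three of them, whereas three tagged texts contain the combination in which “Month/Year” replaces “name”.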
In addition, Non-Patent Document 2 and Non-Patent Document 3 have disclosed generalized association rule mining as a method for semantic class combination assessment. Generalized association rule mining is a technique in which a taxonomy and a record set are accepted as input, a set of nodes in the taxonomy that are frequently encountered in the record set is selected, and a set of semantic classes with a high correlation between the semantic classes is outputted in the “if X, then Y” format. It should be noted that generalized association rule mining is computationally intensive because assessment is performed for every contemplated combination of semantic classes. For this reason, in generalized association rule mining, enumeration trees are created in order to efficiently enumerate the combinations.
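The general idea can be sketched as follows. This Python sketch is a brute-force illustration, not the algorithms of Non-Patent Document 2 or Non-Patent Document 3: it enumerates combinations with `itertools.combinations` instead of an enumeration tree, which is exactly the costly step those techniques are designed to avoid. The records, the class “Company B”, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical taxonomy fragment (child label -> parent labels).
PARENTS = {
    "Company A": ["electric appliance manufacturer"],
    "electric appliance manufacturer": ["enterprise"],
    "Company B": ["enterprise"],
}

# Hypothetical record set: the semantic classes tagged in each record.
RECORDS = [
    {"Company A", "net profit"},
    {"Company B", "net profit"},
    {"Company A", "Month/Year"},
    {"name"},
]


def ancestors(label):
    """All superordinate classes of `label` in the taxonomy."""
    found = set()
    for parent in PARENTS.get(label, []):
        found.add(parent)
        found |= ancestors(parent)
    return found


def extend(record):
    """Add the ancestors of every item so that parent classes are counted too."""
    extended = set(record)
    for item in record:
        extended |= ancestors(item)
    return extended


def frequent_itemsets(records, min_count, max_size=2):
    """Count every combination up to `max_size` and keep the frequent ones."""
    extended = [extend(r) for r in records]
    items = sorted(set().union(*extended))
    frequent = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            # Skip combinations pairing a class with its own ancestor.
            if any(a in ancestors(b) for a in combo for b in combo):
                continue
            count = sum(1 for r in extended if set(combo) <= r)
            if count >= min_count:
                frequent[combo] = count
    return frequent


def confidence(records, lhs, rhs):
    """Confidence of the rule "if lhs, then rhs" over the extended records."""
    extended = [extend(r) for r in records]
    lhs_count = sum(1 for r in extended if set(lhs) <= r)
    both = sum(1 for r in extended if set(lhs) | set(rhs) <= r)
    return both / lhs_count if lhs_count else 0.0


frequent = frequent_itemsets(RECORDS, min_count=2)
rule_confidence = confidence(RECORDS, ["net profit"], ["enterprise"])
```

In this invented record set, neither “Company A” nor “Company B” co-occurs with “net profit” often enough on its own, but their common ancestor “enterprise” does, which is the kind of pattern that generalizing over the taxonomy is meant to surface.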
Therefore, it is believed that using generalized association rule mining, as disclosed in Non-Patent Document 2 and Non-Patent Document 3, in the facet identification system disclosed in Non-Patent Document 1 will make it possible to determine whether a combination of semantic classes is undesirable.