In recent years, large volumes of information have come to be converted to text. To make use of such large volumes of text information, it is effective to extract or retrieve information in accordance with the intended use. For example, if the intended use is decision-making when purchasing goods, or marketing support, it is desirable to extract or retrieve comments and suggestions regarding goods and services from the text information.
In order to extract or retrieve text information in accordance with the intended use of the information, it is important to identify the sentences that contain target information. This is because extracting information, or creating an index for retrieval, from sentences that do not contain target information introduces noise. One conceivable conventional method for identifying sentences that contain the target information is to classify text information according to whether or not it includes the target information. A specific example is the method for classifying arbitrary text data described in Patent Literature (PLT) 1.
The classification method disclosed in PLT 1 extracts a partial character string having an arbitrary fixed length from text information, generates a feature vector from a feature quantity of the partial character string, and uses the feature vector to determine whether or not the text information is classified under a target category. This method thus classifies, in units of sentences, whether or not text information is target information. The term “sentence” as used herein refers to text generated by separating a character string in text information at a fixed length or at a sentence-end symbol.
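As a rough illustration of this kind of sentence-level classification, the following minimal Python sketch extracts fixed-length partial character strings, counts them into a feature vector, and applies a decision rule. It is not the procedure of PLT 1 itself: the function names, the vocabulary `VOCAB`, the hand-set `WEIGHTS`, and the linear scoring rule are all assumptions made for illustration, and a practical classifier would learn its parameters from labeled data.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every partial character string of fixed length n in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def feature_vector(text, vocabulary, n=3):
    """Count how often each vocabulary n-gram occurs in the text."""
    counts = Counter(char_ngrams(text, n))
    return [counts[gram] for gram in vocabulary]


def is_target(text, vocabulary, weights, bias=0.0, n=3):
    """Assumed linear decision rule: a positive score means 'target category'."""
    vector = feature_vector(text, vocabulary, n)
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return score > 0


# Hypothetical vocabulary and hand-set weights, for illustration only.
VOCAB = ["goo", "ood", "bad"]
WEIGHTS = [1.0, 1.0, 1.0]
```

With these assumed values, `is_target("good food", VOCAB, WEIGHTS, bias=-0.5)` evaluates the sentence as a whole, which matches the sentence-unit classification described above.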
Another conceivable method determines whether or not text information is target information not in units of sentences, but in units of topics (hereinafter referred to as “topic units”), each composed of a plurality of sentences about the same topic. The term “topic unit” as used herein refers to text that contains a plurality of sentences and is generated by separating text at a position where the topic changes.
One example of such a method for classification in topic units is the classification method disclosed in PLT 2. This method creates, for each sentence, a topic vector that represents the importance of the content words in the sentence, obtains the degree of similarity between the topic vectors of two adjacent sentences, and detects a boundary position between topics based on a change in the degree of similarity. Classification is then made based on the detected boundary positions.
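The boundary detection described above can be sketched in Python as follows. This is only an illustrative approximation, not the method of PLT 2: it assumes that word importance is a raw term count after removing a small hypothetical stop-word list, that similarity is cosine similarity, and that a boundary is placed wherever the similarity between adjacent sentences falls below a fixed `threshold`. All names and values here are assumptions.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used to keep only content words.
STOPWORDS = {"the", "a", "is", "and", "of", "it", "was"}


def topic_vector(sentence):
    """Map each content word to its count (an assumed measure of importance)."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_boundaries(sentences, threshold=0.1):
    """Return indices i such that a new topic is assumed to start at sentence i."""
    vectors = [topic_vector(s) for s in sentences]
    return [
        i + 1
        for i in range(len(vectors) - 1)
        if cosine(vectors[i], vectors[i + 1]) < threshold
    ]


def split_into_topic_units(sentences, threshold=0.1):
    """Group consecutive sentences into topic units at the detected boundaries."""
    if not sentences:
        return []
    units, start = [], 0
    for boundary in topic_boundaries(sentences, threshold):
        units.append(sentences[start:boundary])
        start = boundary
    units.append(sentences[start:])
    return units
```

Each resulting topic unit could then be passed to a classifier as a single piece of text, instead of classifying its sentences one by one.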