In recent years, the amount of information on the Internet has been increasing explosively, and the number of businesses using big data has also been increasing. Due to the increase in big data, a high-speed search technology is desired, and in particular, it has become important to search for a semantic structure in a text document.
Morphological analysis, semantic analysis, or the like is used to analyze a natural sentence used for text search. Morphological analysis is processing for dividing a character string into morphemes and adding information such as a part of speech or an attribute to each of the morphemes. The morphemes obtained as a result of morphological analysis may be treated as words.
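As an illustration only, the division of a character string into morphemes with attached part-of-speech information can be sketched as follows. The lexicon, the part-of-speech labels, and the greedy longest-match strategy are assumptions made for this sketch; practical morphological analyzers use much larger dictionaries and statistical disambiguation.

```python
# Hypothetical toy lexicon (an assumption for this sketch):
# surface form of a morpheme -> part-of-speech label.
LEXICON = {
    "un": "prefix",
    "break": "verb-stem",
    "able": "suffix",
    "search": "verb-stem",
    "ing": "suffix",
}


def analyze(text):
    """Divide text into morphemes by greedy longest match and tag each one."""
    morphemes = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in LEXICON:
                morphemes.append((piece, LEXICON[piece]))
                i = j
                break
        else:
            # No lexicon entry starts here: emit a single unknown character.
            morphemes.append((text[i], "unknown"))
            i += 1
    return morphemes


print(analyze("unbreakable"))
```

Each returned pair holds a morpheme and the information added to it, which corresponds to the morphological analysis result described above.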
Semantic analysis is processing for obtaining a semantic structure of a natural sentence by using a morphological analysis result of the natural sentence. By using the semantic structure that is a semantic analysis result, what the natural sentence means can be expressed as data to be handled by a computer.
The semantic structure includes a plurality of semantic codes that respectively indicate the meanings of a plurality of words included in the morphological analysis result, and information indicating the type of a relationship between two semantic codes. One semantic code may correspond to a plurality of words. The semantic structure can be expressed, for example, by a digraph that is configured of a plurality of nodes indicating a plurality of semantic codes and arcs that each indicate the type of a relationship between two nodes. A minimum elemental structure of the semantic structure is referred to as a semantic minimum unit, and is configured of two nodes and an arc between these nodes.
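The digraph representation described above can be sketched, for illustration, as a set of nodes and a set of labeled arcs. The class names, the semantic codes (SEARCH, DOCUMENT, SIMILAR), and the relationship types (OBJECT, ATTRIBUTE) are hypothetical and are not taken from any particular semantic analyzer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    """One semantic minimum unit: two nodes and the arc between them."""
    source: str    # semantic code of the node the arc leaves
    relation: str  # type of the relationship between the two nodes
    target: str    # semantic code of the node the arc enters


class SemanticStructure:
    """A digraph whose nodes are semantic codes and whose arcs are typed."""

    def __init__(self):
        self.nodes = set()
        self.arcs = set()

    def add_relation(self, source, relation, target):
        self.nodes.update((source, target))
        self.arcs.add(Arc(source, relation, target))

    def minimum_units(self):
        # Each arc, together with its two end nodes, is one minimum unit.
        return set(self.arcs)


# Hypothetical structure for a sentence such as "search for similar documents".
structure = SemanticStructure()
structure.add_relation("SEARCH", "OBJECT", "DOCUMENT")
structure.add_relation("DOCUMENT", "ATTRIBUTE", "SIMILAR")
```

In this sketch the structure contains three nodes and two semantic minimum units, one per arc.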
By performing morphological analysis and semantic analysis on text data included in a plurality of documents, similar document search is realized. In similar document search, the semantic structure of a searching query sentence, which is a search request expressed as a natural sentence, is used to search for a plurality of documents that have a meaning similar to that of the searching query sentence.
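One conceivable way to compare the semantic structure of a searching query sentence with those of documents is sketched below. It is an assumption of this sketch, not a method stated above, that each semantic structure is reduced to a set of semantic minimum units written as (node, relationship type, node) tuples and that similarity is scored by the Jaccard coefficient of those sets. The function names, document identifiers, and semantic codes are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_documents(query_units, document_units):
    """Return (document id, score) pairs sorted by descending similarity."""
    scores = {
        doc_id: jaccard(query_units, units)
        for doc_id, units in document_units.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical semantic minimum units of the searching query sentence.
query = {
    ("SEARCH", "OBJECT", "DOCUMENT"),
    ("DOCUMENT", "ATTRIBUTE", "SIMILAR"),
}

# Hypothetical semantic minimum units of three documents.
documents = {
    "doc1": {("SEARCH", "OBJECT", "DOCUMENT"), ("DOCUMENT", "ATTRIBUTE", "SIMILAR")},
    "doc2": {("SEARCH", "OBJECT", "DOCUMENT"), ("DOCUMENT", "ATTRIBUTE", "LARGE")},
    "doc3": {("BUY", "OBJECT", "BOOK")},
}

ranking = rank_documents(query, documents)
```

In this sketch, a document that shares more semantic minimum units with the searching query sentence is ranked higher, so doc1 is ranked first and doc3 last.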
A technology is known in which, in similar document search, search keys acting as noise are determined according to the number of documents that match the search keys, and evaluation values of documents that correspond to the search keys are recalculated (see, for example, Patent Document 1). A technology is also known for searching for similar documents according to a degree of similarity in a feature vector or a relevance ratio of vocabulary between a search word and documents to be searched (see, for example, Patent Documents 2 and 3).
Patent Document 1: Japanese Laid-open Patent Publication No. 2015-138351
Patent Document 2: Japanese Laid-open Patent Publication No. 2014-153744
Patent Document 3: Japanese Laid-open Patent Publication No. 2012-3603