This invention relates to text information extraction devices and methods by which information is extracted from texts, such as abstracts of technical papers, in order to classify the information or to obtain a database therefrom. This invention further relates to text similarity matching devices and to text search systems and methods by which the semantic similarity of texts contained in a database is matched or collated, such that similar information can be searched for in the database to realize reliable and efficient textual information searches.
Examples of text databases in which information is searched for include patent literature, technical books, and papers. Such information searches are generally effected by one of three methods: (1) search by means of keywords; (2) search by means of pattern matching of the words of texts; and (3) search by means of the semantic similarity of texts.
Keyword search and pattern matching search are well known. In these word-based search methods, synonyms and near synonyms are also searched for in order to prevent oversights. In the search method based on judgments of the semantic similarity of texts, texts may be subjected to morphological analysis (analysis of morphemes) and parsing (syntactic analysis), as taught by Japanese Laid-Open Patent (Kokai) No. 64-21624, such that the words and the syntactic relationships therebetween, as well as synonyms, near synonyms, and the conceptual information of words obtained via such analysis, are also searched for. Further, although it is not a search method based on semantic similarity judgment, an article by Takamatsu, Kusaka, and Nishida, "Automatic Extraction of Relational Information from Technical Abstracts", Journal of Information Processing Society of Japan, vol. 25, No. 2, March 1984, discloses a relevant method for extracting relations between terms from patent abstracts.
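By way of illustration only, a word-based search with synonym expansion of the kind described above might be sketched as follows. The synonym table, the sample texts, and the function names `tokenize`, `expand`, and `keyword_search` are hypothetical and are not taken from any of the cited references; a practical system would rely on a curated thesaurus rather than a hand-written table.

```python
import re

# Hypothetical synonym table (an assumption for illustration only).
SYNONYMS = {
    "car": {"car", "automobile", "vehicle"},
    "engine": {"engine", "motor"},
}

# Hypothetical sample texts standing in for database entries.
TEXTS = [
    "An automobile with a hybrid motor.",
    "A method for brewing tea.",
    "The car engine is cooled by water.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def expand(keyword):
    """Return the keyword together with any synonyms listed for it."""
    return SYNONYMS.get(keyword, {keyword})


def keyword_search(keyword, texts):
    """Return the indices of texts containing the keyword or a listed synonym."""
    terms = expand(keyword.lower())
    return [i for i, text in enumerate(texts) if terms & set(tokenize(text))]


print(keyword_search("car", TEXTS))  # indices of texts mentioning a car or a synonym
```

In this sketch the first text is found for the keyword "car" only because "automobile" appears in the synonym table, which is the kind of oversight that synonym expansion is meant to prevent.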
Conventional search methods, however, have the following disadvantages.
Keyword searches tend to produce superfluous search results and, on the other hand, to overlook essential results. Thus, analysts who are versed in the keyword system are required to devise ingenious logical formulae for the keyword search. This is a heavy burden on the analysts.
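The two failure modes of keyword search can be illustrated with a minimal sketch. The logical formula, the three sample abstracts, and the relevance labels in the comments are all assumptions made up for this example; they are not drawn from any actual database or search system.

```python
import re

# Hypothetical abstracts; the relevance labels in the comments are assumed.
ABSTRACTS = [
    "A text search device using semantic similarity.",  # relevant
    "A search lamp with a text label on its housing.",  # irrelevant
    "Document lookup based on meaning of sentences.",   # relevant
]


def words(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(text):
    """Hypothetical formula: (search OR retrieval) AND text AND NOT image."""
    w = words(text)
    return bool(w & {"search", "retrieval"}) and "text" in w and "image" not in w


hits = [i for i, abstract in enumerate(ABSTRACTS) if matches(abstract)]
print(hits)
```

Under these assumptions the formula returns the second abstract, a superfluous result, and fails to return the third, an oversight. Repairing both cases requires the analyst to add further terms and exclusions to the formula by hand.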
The search method based on semantic similarity judgment is meant to reduce the burden on the analysts. However, this search method has hitherto tried to judge semantic similarity on the basis of the conceptual meanings of words. The conceptual meanings or concepts of words, however, can be understood only by a small number of people, and clear definitions of concepts are difficult to give. In addition, it is not clear how such concepts of words should affect the semantic similarity judgments.
In order to obtain good semantic similarity judgments, the concepts appearing in the process must be clarified one by one by human analysts. Thus, such a search method can in practice be implemented only for a small amount of text. For a large database such as patent literature, an inordinate amount of time and labor is required to practice such a search method, since the search system must usually be constructed by a small number of system developers. Thus, this search method is not practical.