1. Field of the Invention
The present invention relates to a technique for creating correspondences between words or terms included in documents, on the basis of existing document information provided as computer-readable information. More particularly, the present invention relates to a technique for creating correspondences between words or terms included in documents written in different languages.
2. Related Art
Heretofore, for the purpose of translating a document between different languages or utilizing data in multiple languages, it has been necessary to understand appropriate translations and related expressions in accordance with the purpose. For this reason, it is necessary to find correspondences between words or terms in different languages. To achieve this, existing dictionaries can be utilized for frequently used words or terms.
However, although many dictionaries have been prepared that show correspondences between general terms in different languages or between technical terms in the same language, it is quite often difficult to find dictionaries of technical terms between different languages. In the automobile industry, for example, “handle” in Japanese corresponds to “steering wheel” in English in automobile data, but corresponds to “handle” in English in some other data. Such translation words and related words need to be prepared not only between Japanese and English but also between other pairs of languages.
The reason dictionaries of technical terms between different languages remain underdeveloped is that few individuals have the skill set needed to prepare such a dictionary, because the work requires knowledge of the specialized field in addition to knowledge of the languages.
In addition, since a merely understandable translation is not sufficient, a translation must be selected from expressions actually used in the target data in order to bring the translation to a practical level. Creating such correspondences requires considerable cost and time, and creating translation words and related words in this way in every case is extremely inefficient.
The following patent literature is cited as prior art in this field.
Japanese Patent Application Publication No. 2002-91965 relates to a dictionary device provided to a natural language processing system used by multiple users and discloses a system including: a dictionary main body in which multiple technical term dictionaries for respective categories are arranged in a hierarchical tree structure with a general term dictionary as its root node; user dictionary registration means for setting a user dictionary in association with a technical term dictionary desired by the user; and applicable dictionary determination means for determining, when a category targeted for natural language processing is designated, that all technical term dictionaries on a path of the tree structure from the technical term dictionary of the category to the general term dictionary, and all of user dictionaries of a process-requesting user associated with the technical term dictionaries are applicable dictionaries.
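The applicable-dictionary determination described above can be illustrated with a minimal sketch. The category names, the user dictionary names, and the function `applicable_dictionaries` are hypothetical and are not taken from the cited publication; the sketch only assumes that each category dictionary has a single parent and that the general term dictionary is the root.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical category tree: each dictionary maps to its parent.
# The general term dictionary is the root and has no parent.
PARENT: Dict[str, Optional[str]] = {
    "general": None,
    "engineering": "general",
    "automobile": "engineering",
    "medicine": "general",
}

# Hypothetical user dictionaries, keyed by (user, associated dictionary).
USER_DICTS: Dict[Tuple[str, str], str] = {
    ("alice", "automobile"): "alice-auto",
    ("alice", "general"): "alice-general",
    ("bob", "engineering"): "bob-eng",
}


def applicable_dictionaries(category: str, user: str) -> List[str]:
    """Walk from the designated category up to the root dictionary.

    Every dictionary on the path is applicable, together with any user
    dictionary that the requesting user has associated with it.
    """
    result: List[str] = []
    node: Optional[str] = category
    while node is not None:
        result.append(node)
        user_dict = USER_DICTS.get((user, node))
        if user_dict is not None:
            result.append(user_dict)
        node = PARENT[node]
    return result


print(applicable_dictionaries("automobile", "alice"))
```

In this sketch, dictionaries closer to the designated category appear first in the returned list, which is one possible way to express priority; the cited publication is not stated here to use this ordering.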
Japanese Patent Application Publication No. 2002-269085 relates to a machine translation device having a word graph creation unit, a word graph memory and a search selection unit. For a sentence in an original language that is formed of an inputted character string, the word graph creation unit refers to a translation dictionary including multiple pairs of at least one expression in the original language and expressions in at least one target language, checks the expressions against the expression in the original language, extracts the expression in the target language corresponding to the matched language expression, creates a combination of expressions in the target language in a word graph format and stores the combination in the word graph memory. The search selection unit refers to corpus data in the target language, checks a word string on the word graph stored in the word graph memory against the corpus data and counts the appearance frequencies of the words on the word graph in the corpus data and thereby calculates a score of a translation sentence in the target language that corresponds to the sentence in the original language. The search selection unit thus selects an optimum translation sentence in the target language on the basis of the calculated score.
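The general idea of scoring candidate translations on a word graph against a target-language corpus can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of the cited publication: the word graph is reduced to one list of alternatives per source token, word order is assumed to be already aligned, and the score is assumed to be the number of times adjacent word pairs of a candidate appear in the corpus. The toy dictionary, the toy corpus, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple


def bigram_counts(corpus: Sequence[str]) -> Counter:
    """Count adjacent word pairs in a whitespace-tokenized target corpus."""
    counts: Counter = Counter()
    for sentence in corpus:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def best_translation(
    source: Sequence[str],
    dictionary: Dict[str, List[str]],
    corpus: Sequence[str],
) -> Tuple[str, ...]:
    """Pick the candidate word string with the highest corpus score."""
    # Simplified word graph: one list of target alternatives per source token.
    lattice = [dictionary[token] for token in source]
    counts = bigram_counts(corpus)

    def score(path: Tuple[str, ...]) -> int:
        # Assumed score: total corpus frequency of the path's adjacent pairs.
        return sum(counts[pair] for pair in zip(path, path[1:]))

    return max(product(*lattice), key=score)


# Toy data (hypothetical, romanized source tokens).
TOY_DICTIONARY = {"mawasu": ["turn", "rotate"], "handoru": ["handle", "wheel"]}
TOY_CORPUS = ["turn wheel slowly", "turn wheel left", "rotate handle"]

print(best_translation(["mawasu", "handoru"], TOY_DICTIONARY, TOY_CORPUS))
```

Enumerating every path is exponential in the sentence length, so a practical system would search the graph rather than enumerate it; the enumeration here only keeps the sketch short.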
Japanese Patent Application Publication No. 2004-280316 discloses a language processing system for determining a field to which document data belongs and further performing language processing for the document data by using a technical term dictionary and learning data in the determined field. The language processing system includes a basic dictionary including general language information in multiple fields, and technical term dictionaries including language information in specialized fields. In this language processing system, upon input of document data, an analysis unit calculates a word vector of words included in description contents from the inputted document data with reference to the basic dictionary. A field determination unit calculates similarities between field vectors each characterizing a field and the calculated word vector and thereby determines that the field having the largest similarity is a field to which the inputted document data belongs. Then, a language processing unit performs language processing for the inputted document data with reference to the technical term dictionary in the determined field.
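The field determination step described above can be sketched with cosine similarity between a word vector and field vectors. The choice of cosine similarity, the raw word counts used as the word vector, the field vectors, and the function names are assumptions made for illustration; the cited publication is only stated to compute similarities between the vectors.

```python
import math
from collections import Counter
from typing import Dict, Mapping


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def determine_field(text: str, field_vectors: Dict[str, Dict[str, float]]) -> str:
    """Return the field whose vector is most similar to the document's words."""
    # Assumed word vector: raw counts of lowercased, whitespace-split words.
    word_vector = Counter(text.lower().split())
    return max(field_vectors, key=lambda f: cosine(word_vector, field_vectors[f]))


# Hypothetical field vectors characterizing two fields.
FIELD_VECTORS = {
    "automobile": {"engine": 1.0, "brake": 1.0, "wheel": 1.0},
    "medicine": {"patient": 1.0, "dose": 1.0, "heart": 1.0},
}

print(determine_field("The brake and the wheel", FIELD_VECTORS))
```

Once the field is determined, the technical term dictionary of that field would be selected for the subsequent language processing.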
Japanese Patent Application Publication No. 2008-146218 discloses a language analysis technique for achieving precise morphological analysis by correctly dividing technical terms, which are difficult to divide, to extract morphemes and thereby creating a morphological analysis dictionary. From registration data of a translation dictionary between Japanese and a foreign language, this language analysis technique extracts a translation tuple registered not as a pair of one Japanese word and one foreign language word but as a tuple of one Japanese word and multiple foreign language words. The Japanese word in the extracted translation tuple is morphologically analyzed and divided into sub-words or sub-word-strings. Then, a foreign language word corresponding to each sub-word or sub-word-string is identified and the sub-word or sub-word-string corresponding to the found foreign word is registered as a morpheme in the morphological analysis dictionary. Thus, the technical terms are morphologically analyzed based on the registered morpheme information.
Japanese Patent Application Publication No. 2010-55298 discloses a system that addresses the demand for text mining or search on document data written in a language other than a native or proficient language. The system includes: a first extraction unit configured to extract, from a first language corpus, co-occurring terms that co-occur with a concerned term in a first language; an output unit configured to output translation words in a second language corresponding to at least one of the extracted co-occurring terms; a second extraction unit configured to extract, from a second language corpus corresponding to the first language corpus, translation word candidates co-occurring with at least one of the outputted translation words in the second language; a weighting unit configured to weight each of the extracted translation word candidates; and a creation unit configured to optimize the weights and to create a translation pair list for the concerned term in the first language in accordance with the optimized weights.
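The extraction and weighting flow described above can be sketched as follows. This is a simplified illustration, not the system of the cited publication: co-occurrence is assumed to mean appearing in the same sentence, the weight of a candidate is assumed to be its raw co-occurrence count, and the weight optimization step is omitted. The corpora, the seed dictionary, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def cooccurring(term: str, corpus: Sequence[str]) -> Counter:
    """Count words that appear in the same sentence as `term`."""
    counts: Counter = Counter()
    for sentence in corpus:
        words = set(sentence.split())
        if term in words:
            counts.update(words - {term})
    return counts


def translation_candidates(
    term: str,
    corpus_l1: Sequence[str],
    corpus_l2: Sequence[str],
    seed_dict: Dict[str, str],
) -> List[Tuple[str, int]]:
    """List second-language candidates for `term`, heaviest first."""
    # Step 1: terms co-occurring with the concerned term in the first language.
    context_l1 = cooccurring(term, corpus_l1)
    # Step 2: translate those terms with a seed dictionary of general terms.
    pivots = {seed_dict[w] for w in context_l1 if w in seed_dict}
    # Steps 3 and 4: collect and weight words co-occurring with the pivots.
    weights: Counter = Counter()
    for pivot in pivots:
        for candidate, count in cooccurring(pivot, corpus_l2).items():
            if candidate not in pivots:
                weights[candidate] += count
    return weights.most_common()


# Toy data (hypothetical, romanized first-language tokens).
CORPUS_L1 = ["handoru kuruma unten", "handoru kuruma"]
CORPUS_L2 = ["steering car drive", "steering car", "door car"]
SEED_DICT = {"kuruma": "car", "unten": "drive"}

print(translation_candidates("handoru", CORPUS_L1, CORPUS_L2, SEED_DICT))
```

In this toy example the word that shares the most context with the translated co-occurring terms receives the largest weight and therefore heads the candidate list.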
Furthermore, the following are cited as non-patent literature.
A technique to list, for a query term, a set of similar terms in a different language by a random walk on a directed graph in which nodes represent terms is disclosed in Guihong Cao, Jianfeng Gao, Jian-Yun Nie, Jing Bai, “Extending query translation to cross-language query expansion with markov chain models,” CIKM '07 Proceedings of the sixteenth ACM conference on Conference on information and knowledge management.
A technique to create a feature vector of each word (technical term) from general terms that co-occur with the word with a high frequency and a high degree of association, and thereby to list similar words, is disclosed in Daniel Andrade, Tetsuya Nasukawa, Jun'ichi Tsujii, “Robust measurement and comparison of context similarity for finding translation pairs,” COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics.
The technique using a random walk described in the literature by Guihong Cao et al. appears promising as a technique to list, for a query term, a set of similar terms in a different language. However, because it performs the random walk without taking the structure of the graph into consideration, it requires inefficient calculation to create a graph for each query term.
In this respect, if an attempt is made to reduce the computational complexity by stopping the random walk after a small number of steps, the problem arises that the technique is no longer appropriate for a term or keyword having a low appearance frequency.
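The effect of stopping a random walk early can be illustrated with a generic sketch. It is not the Markov chain model of Cao et al.; it only assumes a directed graph with weighted edges, transition probabilities proportional to the edge weights, and a walk truncated after a fixed number of steps. The toy graph and the function `random_walk` are hypothetical.

```python
from typing import Dict

Graph = Dict[str, Dict[str, float]]


def random_walk(graph: Graph, start: str, steps: int) -> Dict[str, float]:
    """Return the probability distribution over nodes after `steps` steps."""
    dist: Dict[str, float] = {start: 1.0}
    for _ in range(steps):
        nxt: Dict[str, float] = {}
        for node, mass in dist.items():
            edges = graph.get(node, {})
            total = sum(edges.values())
            if total == 0:
                # Node without outgoing edges: the walker stays in place.
                nxt[node] = nxt.get(node, 0.0) + mass
                continue
            for neighbor, weight in edges.items():
                nxt[neighbor] = nxt.get(neighbor, 0.0) + mass * weight / total
        dist = nxt
    return dist


# Toy directed graph (hypothetical): the query term has a single outgoing
# edge, as might be the case for a term observed only rarely.
TOY_GRAPH: Graph = {
    "handoru": {"kuruma": 1.0},
    "kuruma": {"car": 1.0},
    "car": {"steering": 1.0, "wheel": 1.0},
}

print(random_walk(TOY_GRAPH, "handoru", 1))
print(random_walk(TOY_GRAPH, "handoru", 3))
```

In this toy graph, a walk of one step never reaches any term in the other language, whereas a walk of three steps does. A term with few edges may therefore need a longer walk before related terms receive any probability, which is one way to read the trade-off between step count and coverage noted above.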