Multiple descriptions that are written in the same language and share the same content frequently employ different terms, depending on the degree of specialized knowledge the authors have about the topic and on the social strata, such as sex or age groups, to which the authors belong. Even when the descriptions concern a common topic, the terms used by a non-expert and the terms used by an expert, each in their respective expression domain, can be quite different.
It is an object of the present invention to provide a new and improved method, apparatus, and other necessary technologies for detecting, between such different domains, terms used by a non-expert that correspond in meaning to terms used by an expert and, conversely, terms used by an expert that correspond in meaning to terms used by a non-expert.
A typical example of technology for converting documents between different domains is a translation machine. Technology that makes a computer perform the task of a translation machine has been known for some time. A translation machine automatically translates a document written in one natural language into another natural language with a computer program using a term database, a program for processing grammatical rules, databases of usage and sentence examples, and other system-specific components. Such technology has already been put to practical use, and there are commercial language translation software products for personal computers. Some translation services are also provided on the Internet. In addition, small hand-held devices for word-by-word translation are widely available. A word-by-word translation machine converts a word in one language into a word in another language with an identical meaning. Basically, precompiled dictionaries are stored in a storage device, and an input word is converted into a corresponding word in another language. These conventional technologies have a precondition for converting documents from one domain to another domain; namely, a sentence in one domain must be known to correspond to a sentence in the other domain, and a word in one domain must be known to correspond to a word in the other domain.
Paraphrasing research on converting a difficult expression into an easier expression in the same language has already been published. For example, there is reported research by Atsushi Fujita, et al. (2003) and Masahiro Murayama, et al. (2003). In the research concerning “paraphrasing,” the basic technique is to find expression patterns and replace them with predetermined expression patterns according to pattern matching rules.
Other approaches in language translation utilize statistical and/or probabilistic models. These model-based approaches initially prepare a pair of data sets, in different languages, having contents that are known to be the same. Next, based on information such as the sentence lengths in each data set, corresponding sentences in language A and language B are determined. Finally, the correspondences between words are determined based on their co-occurrence relations in the data sets. In this and the other prior art cases, there is a premise that there is a word Wb in language B that corresponds to a word Wa in language A with reasonable semantic accuracy.
[Patent Document 1] “Daily Language Computing and its Method” JP 2002-236681 A
[Patent Document 2] “Association Method for Words in Paginal Translation Sentences” JP 2002-328920 A
[Non-Patent Document 1] http://www2.crl.go.jp/it/al33/kuma/mrs_li/midisearch.htm
[Non-Patent Document 2] Atsushi Fujita, Kentaro Inui, Yuji Matsumoto. “Text Correction Processing necessary for Paraphrasing into Plain Expressions”. Collection of Lecture Theses in 65th National General Meeting of Information Processing Society of Japan, 5th Separate Volume, 1T6-4, pp. 99-102, 2003.03.
[Non-Patent Document 3] Masahiro Murayama, Masahiro Asaoka, Masanori Tsuchiya, Satoshi Sato. “Normalization of Terms and Support for Paraphrasing Declinable words based on the Normalization”. Language Processing Society, 9th Annual General Meeting, pp. 85-88, 2003.03.
[Non-Patent Document 4] Dunning, T. (1993). “Accurate methods for the statistics of surprise and coincidence”. Computational Linguistics, 19(1), pp. 61-74.
As described above, conventional machine translation assumes that corresponding words exist in the two languages in question and that corresponding document sets are available for translating from one language to the other.
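As a rough illustration of the co-occurrence step in the model-based approaches described above, the following Python sketch scores candidate word pairs over sentence pairs that are already known to correspond. It is not taken from any of the cited documents. The function names `llr` and `best_correspondences`, the toy sentence pairs, and the choice of the log-likelihood ratio statistic of Non-Patent Document 4 as the association score are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import product


def _xlogx(k):
    # k * ln(k), with the usual convention that 0 * ln(0) = 0.
    return k * math.log(k) if k > 0 else 0.0


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(n) - rows - cols)


def best_correspondences(aligned_pairs):
    """For each word Wa in language A, pick the word Wb in language B that
    co-occurs with it most strongly across the aligned sentence pairs.

    aligned_pairs: list of (sentence_a, sentence_b) strings that are assumed
    to be known in advance to correspond to each other.
    """
    n = len(aligned_pairs)
    count_a, count_b, count_ab = Counter(), Counter(), Counter()
    for sent_a, sent_b in aligned_pairs:
        words_a, words_b = set(sent_a.split()), set(sent_b.split())
        count_a.update(words_a)
        count_b.update(words_b)
        count_ab.update(product(words_a, words_b))

    best = {}
    for (wa, wb), k11 in count_ab.items():
        # Keep only pairs that co-occur more often than chance would predict.
        if k11 * n <= count_a[wa] * count_b[wb]:
            continue
        k12 = count_a[wa] - k11        # Wa present, Wb absent
        k21 = count_b[wb] - k11        # Wb present, Wa absent
        k22 = n - k11 - k12 - k21      # neither present
        score = llr(k11, k12, k21, k22)
        if wa not in best or score > best[wa][1]:
            best[wa] = (wb, score)
    return best


if __name__ == "__main__":
    # Hypothetical toy data: four sentence pairs known to correspond.
    pairs = [
        ("the dog", "le chien"),
        ("the cat", "le chat"),
        ("a dog", "un chien"),
        ("a cat", "un chat"),
    ]
    for wa, (wb, score) in sorted(best_correspondences(pairs).items()):
        print(wa, "->", wb, round(score, 3))
```

The sketch can only run because `aligned_pairs` is supplied as input, which is the premise noted above: corresponding document sets, and a word Wb for each word Wa, are assumed to exist before any correspondence is computed.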
An object of the present invention is to provide a new and improved method of and apparatus for detecting a term used in one domain that approximately corresponds to a term in another domain, and/or vice versa, even in cases where there are no (1) known word pairs that correspond to each other in the target domains, (2) document set pairs that are known in advance to correspond to each other, and/or (3) dictionaries or thesauri to aid the mappings between the domains in question.