Documents that describe the same content in the same language frequently use different terms, depending on the degree of specialized knowledge the authors have about the topic and on the social strata, such as sex or age group, to which the authors belong. Even if the descriptions are about a common topic, the terms used by a non-expert and the terms used by an expert in their respective expression domains can be quite different.
It is an object of the present invention to provide a new and improved method, apparatus, and other necessary technologies for detecting, between such different domains, terms used by a non-expert that correspond in meaning to terms used by an expert and, conversely, terms used by an expert that correspond in meaning to terms used by a non-expert.
A typical example of technology for converting documents between different domains is a translation machine. Technology that makes a computer perform the task of a translation machine has been known for some time. A translation machine automatically translates a document written in one natural language into another natural language with a computer program using a term database, a program for processing grammatical rules, databases of usage and sentence examples, and other system-specific components. Such technology has already been put to practical use, and there are commercial language translation software products for personal computers. Some translation services are also provided on the Internet. In addition, small hand-held devices for word-by-word translation are widely available. A word-by-word translation machine converts one word in a certain language into a word in another language with an identical meaning. Basically, precompiled dictionaries are stored in a storage device, and an input word is converted into a corresponding word in another language. These conventional technologies have a precondition for converting documents from one domain to another domain; namely, a sentence in one domain must be known to correspond to a sentence in the other domain, and a word in one domain must be known to correspond to a word in the other domain.
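The precondition described above can be illustrated with a minimal sketch of word-by-word conversion. The dictionary contents and the function name `translate_word_by_word` are hypothetical and chosen only for illustration; they are not taken from any particular translation product.

```python
# Hypothetical precompiled dictionary, standing in for the dictionaries
# stored in the storage device of a word-by-word translation machine.
EN_TO_FR = {"wine": "vin", "red": "rouge", "dry": "sec"}


def translate_word_by_word(words, dictionary):
    """Convert each input word via dictionary lookup.

    A word with no known counterpart yields None, which shows the
    precondition: every correspondence must be known in advance.
    """
    return [dictionary.get(word) for word in words]


if __name__ == "__main__":
    print(translate_word_by_word(["red", "wine"], EN_TO_FR))
    print(translate_word_by_word(["tannic", "wine"], EN_TO_FR))
```

In this sketch a word that is absent from the dictionary cannot be converted at all, which is the limitation the conventional technologies share.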
Research on paraphrasing, which converts a difficult expression into an easier expression in the same language, has already been published. For example, research has been reported by Atsushi Fujita, et al. (2003) and Masahiro Murayama, et al. (2003). In the research concerning “paraphrasing,” the basic technique is to find expression patterns to be replaced by predetermined expression patterns according to pattern matching rules. Other approaches in language translation utilize statistical and/or probabilistic models. These model-based approaches initially prepare a pair of data sets, in different languages, having contents that are known to be the same. Next, based on information such as the sentence lengths in each data set, corresponding sentences in language A and language B are determined. Finally, the correspondences between words are determined based on their co-occurrence relations in the data sets. In this and the other prior art cases, there is a premise that there is a word Wb in language B that corresponds to a word Wa in language A with reasonable semantic accuracy.
[Patent Document 1] “Daily Language Computing and its Method” JP 2002-236681 A
[Patent Document 2] “Association Method for Words in Paginal Translation Sentences” JP 2002-328920 A
[Non-Patent Document 1] http://www2.crl.go.jp/it/a133/kuma/mrsJilmidisearch.htm.
[Non-Patent Document 2] Atsushi Fujita, Kentaro Inui, Yuji Matsumoto. “Text Correction Processing necessary for Paraphrasing into Plain Expressions”. Collection of Lecture Theses in 65th National General Meeting of Information Processing Society of Japan, 5th Separate Volume, 1 T6-4, pp. 99-102, 2003.03.
[Non-Patent Document 3] Masahiro Murayama, Masahiro Asaoka, Masanori Tsuchiya, Satoshi Sato. “Normalization of Terms and Support for Paraphrasing Declinable words based on the Normalization”, Language Processing Society, 9th Annual General Meeting, pp. 85-88, (2003.3).
[Non-Patent Document 4] Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1): 61-74
As described above, conventional machine translation assumes that corresponding words exist in the two languages in question and that corresponding document sets are available for translation from one language to the other.
An object of the present invention is to provide a new and improved method of and apparatus for detecting a term used in one domain that approximately corresponds to a term in another domain. Documents are classified into expert documents and naive documents, which contain different types of language expressions, based on the term frequencies and other information in the documents. Since terms appearing in a target expert document and a target naive document are not always identical, correlations between the terms in the two different domains are calculated next. The basic concept is as follows: associations between a term or a set of terms that appears in either the expert or the naive domain and a term or a set of terms that appears in the other domain are obtained on the basis of co-occurrence relations among the terms in an expert document set and a naive document set that are known to be written about an identical object.
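The basic concept of associating terms through co-occurrence can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the association measure, so the Dice coefficient is used here as a stand-in, and the function names `associate_terms` and `best_naive_terms` as well as the toy term lists are hypothetical.

```python
from collections import Counter
from itertools import product


def associate_terms(pairs):
    """Score expert/naive term pairs by co-occurrence.

    pairs: list of (expert_terms, naive_terms), where both term lists
    come from documents known to describe the same object.
    Returns {(expert_term, naive_term): Dice coefficient}.
    The Dice coefficient is an assumed measure, not the source's.
    """
    expert_df = Counter()  # number of objects whose expert text has the term
    naive_df = Counter()   # number of objects whose naive text has the term
    co = Counter()         # number of objects where both terms appear
    for expert_terms, naive_terms in pairs:
        e_set, n_set = set(expert_terms), set(naive_terms)
        expert_df.update(e_set)
        naive_df.update(n_set)
        co.update(product(e_set, n_set))
    return {
        (e, n): 2 * c / (expert_df[e] + naive_df[n])
        for (e, n), c in co.items()
    }


def best_naive_terms(scores, expert_term, top_k=3):
    """Return the naive terms most strongly associated with expert_term."""
    ranked = sorted(
        ((s, n) for (e, n), s in scores.items() if e == expert_term),
        key=lambda item: (-item[0], item[1]),
    )
    return [n for _, n in ranked[:top_k]]


if __name__ == "__main__":
    # Hypothetical toy data: each tuple describes one object twice.
    pairs = [
        (["tannin", "oak"], ["bitter", "woody"]),
        (["tannin", "cabernet"], ["bitter", "heavy"]),
        (["oak", "chardonnay"], ["woody", "smooth"]),
    ]
    scores = associate_terms(pairs)
    print(best_naive_terms(scores, "tannin", top_k=1))
```

Because the scores are keyed by (expert term, naive term) pairs, the same table can be read in the opposite direction to find expert terms for a given naive term. No dictionary of known word pairs is required, only document sets known to be about the same object.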
An example of an application of the present invention is a recommendation system for customers who are about to buy some products or goods. Even if documents are written about an identical object, such as merchandise, there are usually considerable differences between terms used by an expert with deep knowledge about the object and terms used by a non-expert with little knowledge about it. The expert often describes an object using technical terms and knowledge specific to the object, but the non-expert, without such knowledge, cannot but describe the object with expressions based on perceptions or by way of similar objects or examples. The expert tries to explain the product in detail with his/her knowledge about where it was made and/or what material it is made from, while the non-expert tries to describe the same product using perception-based terms that come to mind. It is almost impossible for a general consumer to have detailed knowledge of products and proper names concerning products in all fields of interest. Thus, even if an expert explains and recommends, to a non-expert, a particular product, which in fact requires specialized knowledge to choose wisely, it is conceivable that the non-expert may not understand the explanation sufficiently before the purchase.
By applying the present invention, the seller is able to provide sufficient information about the product to consumers in a vocabulary the consumers understand, and conversely, the general consumer can easily understand the information about products and select the information that suits his/her preferences and tastes.