1. Field of the Invention
The present invention relates in general to a phonetic distance calculation method for similarity comparison between phonetic transcriptions of foreign words, and more particularly to an improved phonetic distance calculation method which is capable of applying an edit distance measure, generally used for word spelling comparison, to phonetic transcriptions of foreign words, so that the phonetic transcriptions can effectively be retrieved in a document retrieval system.
2. Description of the Prior Art
As computer technology has developed, it has become common to convert documents into data and store them in computers rather than keep them on paper, so that document storage space is used efficiently.
To this end, document retrieval systems have been proposed for rapidly retrieving a desired document from the stored documents. Given keywords, such a system presents all documents with similar contents, which is convenient for the user.
On the other hand, as exchanges with foreign countries have increased, phonetic transcriptions of many foreign words have come into use in Korean documents. Most of these phonetic transcriptions are of proper nouns or technical terms originally expressed in English. Scientific and technological fields in particular commonly have no choice but to employ phonetic transcriptions, because such English technical terms have no Korean translation. However, the phonetic transcription of a given foreign word varies considerably from writer to writer, which makes it difficult to retrieve Korean document texts on the basis of such phonetic transcriptions.
For example, three Korean phonetic transcriptions such as “z,1”, “z,2” and “z,3” may be used together for the English technical term “digital”. Among these Korean phonetic transcriptions, “z,4” has been proposed as a standard, but “z,2” has actually been used more frequently and, occasionally, “z,3” has been used according to personal preference.
For this reason, documents using different phonetic transcriptions often cannot be retrieved unless the diversity of the phonetic transcriptions is taken into account in document retrieval.
In order to overcome such a problem, there has been proposed a method for grouping various Korean phonetic transcriptions derived from the same foreign word into an equivalence class and automatically expanding them upon document retrieval [see: Jeong, K. S., Kwon, Y. H., and Myaeng, S. H., “The Effect of a Proper Handling of Foreign and English Words in Retrieving Korean Text”, In Proceedings of the 2nd International Workshop on Information Retrieval with Asian Languages (IRAL '97), 1997].
The creation of such a phonetic transcription equivalence class requires a method for determining whether two given phonetic transcriptions are derived from the same foreign word, namely, for comparing a similarity between the two phonetic transcriptions.
The above phonetic transcription similarity comparison method is also a basic requirement for approximate search of a database of phonetic transcriptions (words of foreign origin). For example, the similarity comparison method may be useful in searching for firm names or trademarks that are words of foreign origin.
Unfortunately, no method dedicated to similarity comparison between Korean phonetic transcriptions has been developed until now. Instead, an edit distance measure (see: Hall, P. and Dowling, G., “Approximate string matching”, Computing Surveys, Vol. 12, No. 4, pp. 381-402, 1980) or an N-gram metric (see: Zamora, E., Pollock, J., and Zamora, A., “The use of trigram analysis for spelling error detection”, Information Processing and Management, Vol. 17, No. 6, pp. 305-316, 1981) has merely been utilized as an approach to the similarity comparison. Both the edit distance measure and the N-gram metric are character string similarity comparison methods which are independently applicable to words.
The character string similarity comparison method detects whether two given character strings are similar in spelling. Because Korean words are spelled using phonetic symbols, words that are similar in spelling are likely to be pronounced similarly. For this reason, the character string similarity comparison method can be utilized relatively effectively for similarity comparison between Korean phonetic transcriptions.
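As an illustration of the N-gram approach mentioned above, the following Python sketch compares two strings by the overlap of their character n-grams. The scoring formula is not specified in this text; the Dice coefficient, the `#` padding character, the default `n = 2` and the function names are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(word: str, n: int = 2) -> Counter:
    """Count the overlapping character n-grams of `word`.

    The word is padded with '#' (an assumed padding symbol) so that
    its first and last characters also take part in n-grams.
    """
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(s: str, t: str, n: int = 2) -> float:
    """Dice coefficient over n-gram multisets (assumed scoring rule).

    Returns 1.0 for identical strings and 0.0 when no n-gram is shared.
    """
    a, b = char_ngrams(s, n), char_ngrams(t, n)
    shared = sum((a & b).values())          # multiset intersection
    total = sum(a.values()) + sum(b.values())
    return 2.0 * shared / total if total else 1.0


if __name__ == "__main__":
    print(ngram_similarity("digital", "digit"))
```

A score of this kind depends only on spelling, so it shares the limitation of any purely character-based comparison.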
Now, a description will be given of a conventional method for similarity comparison between phonetic transcriptions of foreign words.
Fred J. Damerau has proposed a method which assumes that typing errors result from only four cases: (1) insertion of one character, (2) deletion of one character, (3) substitution of one character with a different one and (4) transposition of two adjacent characters, and which measures the similarity between two given words on the basis of the minimum number of typing errors between the two words (see: Damerau, F., “A technique for computer detection and correction of spelling errors”, Communications of the ACM, 7, pp. 171-176, 1964). This metric is typically called a Damerau-Levenshtein metric or an edit distance measure. The minimum number of typing errors between two words s and t can be calculated on the basis of the following recurrent equation (see: Wagner, R. A., “Order-n correction for regular languages”, Communications of the ACM, vol. 17, No. 5, pp. 265-268, 1974):
Here, the function d is a distance between two characters and can simply be expressed by the following equation:

$$
d(s_i, t_j) = \begin{cases} 0 & \text{if } s_i = t_j \\ 1 & \text{if } s_i \neq t_j \end{cases}
$$
It should be noted that the distance function d may be expressed by a more complex equation according to a desired purpose.
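A minimal Python sketch of such a minimum-edit computation is given below. It assumes the common "optimal string alignment" form of the recurrence, which may differ in detail from the cited one, and it takes the character distance d as a parameter so that a more complex function can be substituted. The names `edit_distance`, `unit_distance` and `allow_transposition` are illustrative only.

```python
from typing import Callable, Sequence


def unit_distance(a: str, b: str) -> int:
    """The 0/1 character distance d: 0 for equal characters, 1 otherwise."""
    return 0 if a == b else 1


def edit_distance(
    s: Sequence[str],
    t: Sequence[str],
    d: Callable[[str, str], float] = unit_distance,
    allow_transposition: bool = True,
) -> float:
    """Minimum number of edits turning s into t (assumed recurrence)."""
    m, n = len(s), len(t)
    # f[i][j] holds the minimum cost of turning s[:i] into t[:j].
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        f[i][0] = i                      # delete all of s[:i]
    for j in range(1, n + 1):
        f[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                f[i - 1][j] + 1,                          # deletion
                f[i][j - 1] + 1,                          # insertion
                f[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # substitution or match
            )
            if (allow_transposition and i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                best = min(best, f[i - 2][j - 2] + 1)     # adjacent transposition
            f[i][j] = best
    return f[m][n]


if __name__ == "__main__":
    print(edit_distance("digital", "digit"))
    print(edit_distance("ab", "ba"), edit_distance("ab", "ba", allow_transposition=False))
```

Passing `allow_transposition=False` restricts the computation to insertion, deletion and substitution.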
In the case where the above edit distance measure is applied to similarity comparison between Korean phonetic transcriptions, it is effective to consider only the insertion, deletion and substitution because the transposition is valid with respect to only the typing error cases. It is further effective to perform the similarity comparison after the removal of an initial phoneme ‘’ in the Korean phonetic transcriptions because it has no phonetic value.
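The preprocessing described above can be sketched in Python as follows. The sketch decomposes precomposed Hangul syllables into character elements using the standard Unicode arithmetic, drops the silent initial consonant, and then compares the element sequences using insertion, deletion and substitution only. It assumes that the silent initial is the consonant ㅇ, which this text does not show explicitly; the function names and the two sample spellings are illustrative and are not taken from the source.

```python
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")
SILENT_INITIAL = "ㅇ"  # assumption: the initial with no phonetic value


def to_elements(word, strip_silent_initial=True):
    """Split Hangul syllables into initial, medial and final elements."""
    elements = []
    for ch in word:
        index = ord(ch) - 0xAC00
        if not 0 <= index < 11172:       # not a precomposed Hangul syllable
            elements.append(ch)
            continue
        initial = INITIALS[index // 588]
        medial = MEDIALS[(index % 588) // 28]
        final = FINALS[index % 28]
        if not (strip_silent_initial and initial == SILENT_INITIAL):
            elements.append(initial)
        elements.append(medial)
        if final:                        # a final ㅇ is voiced, so it is kept
            elements.append(final)
    return elements


def levenshtein(s, t):
    """Edit distance with insertion, deletion and substitution only."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Two illustrative spellings that differ in one vowel element.
    print(levenshtein(to_elements("디지털"), to_elements("디지탈")))
```

Comparing element sequences rather than whole syllable blocks lets a single vowel or consonant difference count as one edit.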
The above edit distance measure or N-gram metric is a word spelling comparison method which is independently applicable to words and can be utilized relatively effectively for similarity comparison between Korean phonetic transcriptions. However, neither the edit distance measure nor the N-gram metric is the best for pronunciation similarity comparison. For example, Korean phonetic transcriptions “” and “” are very similar in spelling, but come from different English technical terms “digital” and “digit”, respectively. For this reason, the conventional word spelling comparison method has difficulty in performing similarity comparison between such Korean phonetic transcriptions.
Accordingly, the phonological structure of a foreign language as the origin should be considered for the effective similarity comparison between Korean phonetic transcriptions. For example, a Korean phonetic transcription “” of an English word “robot” is similar in English-style pronunciation to a Korean phonetic transcription “” with two different character elements, rather than a Korean phonetic transcription “” with one different character element. This results from the fact that a final phoneme /t/ of the English word is usually changed to a Korean phoneme / */ or “”, where the symbol * indicates that / / is a final consonant.
Consequently, the above-mentioned conventional method is effective in performing the word spelling comparison, but it has difficulty in performing the pronunciation similarity comparison. As a result, a document retrieval system may retrieve an undesired document or fail to retrieve a desired one. In other words, a document retrieval operation cannot be performed accurately in the document retrieval system.
Therefore, the present invention has been made in view of the above problem, and it is an object of the present invention to provide a phonetic distance calculation method for similarity comparison between phonetic transcriptions of foreign words, which is capable of defining character element transformation patterns liable to occur between phonetic transcriptions coming from the same foreign language, assigning a demerit mark to each of the character element transformation patterns according to a phonetic distance and calculating a minimum phonetic distance between two given phonetic transcriptions on the basis of a minimum edit distance calculation method used in an edit distance measure, so that a document retrieval operation can accurately be performed in a document retrieval system.
In accordance with the present invention, the above and other objects can be accomplished by the provision of a phonetic distance calculation method for similarity comparison between phonetic transcriptions of foreign words, comprising the first step of defining character element transformation patterns that can occur between phonetic transcriptions derived from the same foreign language; the second step of generating new phonetic transcriptions according to the defined character element transformation patterns and assigning a demerit mark to each of the generated phonetic transcriptions according to a phonetic distance; the third step of calculating a minimum phonetic distance between each of the generated phonetic transcriptions and a given phonetic transcription on the basis of a minimum edit distance calculation method; and the fourth step of determining that the generated phonetic transcription with the smallest of the calculated minimum phonetic distances is most similar to the given phonetic transcription.
Preferably, the above first step may include the step of classifying the character element transformation patterns into three types: substitution of one character element with a different one, insertion or deletion of one character element, and expansion of one character element into two character elements or contraction of two consecutive character elements into one character element; classifying the three types of character element transformation patterns into consonants and vowels; and then classifying the consonants into final and initial consonants.
Further, preferably, the above second step may include the step of assigning the demerit mark to each of the generated phonetic transcriptions according to a minimum amount of transformation operation required for the transformation of a corresponding one of the generated phonetic transcriptions into a different one.
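The steps above can be sketched in Python as a weighted edit distance over character elements. Everything concrete in the sketch is hypothetical: the romanized element names (with `*` marking a final consonant), the three pattern tables and every demerit mark are placeholders, since no concrete patterns or values are given in this passage. The sketch also folds the generation of transformed transcriptions into a dynamic-programming table instead of enumerating them explicitly, which is a design choice of the example rather than a statement of the claimed procedure.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical pattern tables with hypothetical demerit marks.
SUBSTITUTION: Dict[Tuple[str, str], float] = {("eo", "a"): 0.5}
INDEL: Dict[str, float] = {"eu": 0.5}
# One element on the left expands into the two elements on the right;
# the same entry is read in reverse for contraction.
EXPANSION: Dict[Tuple[str, Tuple[str, str]], float] = {("t*", ("t", "eu")): 0.5}
DEFAULT_COST = 1.0  # demerit mark for any transformation not listed above


def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SUBSTITUTION.get((a, b), SUBSTITUTION.get((b, a), DEFAULT_COST))


def indel_cost(a: str) -> float:
    return INDEL.get(a, DEFAULT_COST)


def phonetic_distance(s: Sequence[str], t: Sequence[str]) -> float:
    """Minimum total demerit needed to transform s into t."""
    m, n = len(s), len(t)
    f = [[float("inf")] * (n + 1) for _ in range(m + 1)]
    f[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:                                   # deletion
                f[i][j] = min(f[i][j], f[i - 1][j] + indel_cost(s[i - 1]))
            if j > 0:                                   # insertion
                f[i][j] = min(f[i][j], f[i][j - 1] + indel_cost(t[j - 1]))
            if i > 0 and j > 0:                         # substitution or match
                f[i][j] = min(f[i][j], f[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
            if i > 0 and j > 1:                         # expansion: 1 -> 2
                cost = EXPANSION.get((s[i - 1], (t[j - 2], t[j - 1])))
                if cost is not None:
                    f[i][j] = min(f[i][j], f[i - 1][j - 2] + cost)
            if i > 1 and j > 0:                         # contraction: 2 -> 1
                cost = EXPANSION.get((t[j - 1], (s[i - 2], s[i - 1])))
                if cost is not None:
                    f[i][j] = min(f[i][j], f[i - 2][j - 1] + cost)
    return f[m][n]


def most_similar(candidates: Sequence[List[str]], given: List[str]) -> List[str]:
    """Return the candidate with the smallest phonetic distance to `given`."""
    return min(candidates, key=lambda c: phonetic_distance(c, given))


if __name__ == "__main__":
    given = ["r", "o", "b", "o", "t*"]
    two_elements_differ = ["r", "o", "b", "o", "t", "eu"]
    one_element_differs = ["r", "o", "b", "o", "s*"]
    print(most_similar([one_element_differs, two_elements_differ], given))
```

With these placeholder marks, a candidate that differs by two character elements through a listed expansion pattern scores closer than a candidate that differs by one unlisted substitution, which a unit-cost edit distance cannot express.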