This application claims priority from Korean Patent Application No. 2003-11345, filed on Feb. 24, 2003, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Invention
The present invention relates to speech recognition, and more particularly, to a speech recognition method and system using inter-word phonetic information.
2. Description of the Related Art
A general continuous speech recognition system has a structure as illustrated in FIG. 1. Referring to FIG. 1, a feature extraction unit 11 extracts feature vectors, which represent the input speech data in a form suitable for speech recognition. Using an acoustic model database 13, a pronunciation dictionary database 14, and a language model database 15, which are established in advance through learning processes, a search unit 12 takes the feature vectors and searches for the word sequence most likely to have produced them. In Large Vocabulary Continuous Speech Recognition (LVCSR), the words searched by the search unit 12 are organized in a tree structure. A post-processing unit 16 removes the phonetic representations and tags from the output of the search unit 12, assembles the result into syllables, and finally produces a recognition hypothesis in text form.
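The data flow just described can be sketched in Python; every function name, score, and database entry below is a hypothetical placeholder standing in for the units of FIG. 1, not the actual implementation.

```python
# Hypothetical sketch of the FIG. 1 pipeline: feature extraction (11),
# search over the knowledge sources (13-15), and post-processing (16).

def extract_features(samples, frame_len=4):
    """Feature extraction unit 11: slice raw samples into frame vectors."""
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

def search(frames, acoustic_db, lang_model):
    """Search unit 12: pick the word sequence with the best combined
    acoustic + language-model log score (toy scoring; the frames are
    omitted from this toy score)."""
    best_seq, best_score = None, float("-inf")
    for seq, lm_score in lang_model.items():
        score = lm_score + sum(acoustic_db.get(w, -10.0) for w in seq)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq

def post_process(word_seq):
    """Post-processing unit 16: strip phonetic brackets, emit plain text."""
    return " ".join(w.strip("[]") for w in word_seq)

# Toy knowledge sources with made-up log scores.
acoustic_db = {"[hanguk]": -1.0, "[dehak]": -1.2, "[i]": -0.5}
lang_model = {("[hanguk]", "[dehak]", "[i]"): -2.0, ("[hanguk]", "[i]"): -4.0}

hyp = search(extract_features(list(range(12))), acoustic_db, lang_model)
print(post_process(hyp))  # -> hanguk dehak i
```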
Examples of Korean and English words and their possible pronunciation representations stored in the pronunciation dictionary database 14 are shown in FIGS. 2A and 2B. As shown in FIG. 2A, the word “[dehak]” 21, which means a university, may take any one of the following pronunciation representations: [dehak] 21a, [dehaη] 21b, and [dehag] 21c. The word “[dehaη]” 22, another example, means an opposition, and its pronunciation is represented as [dehaη] 22a. However, it is almost impossible to distinguish the pronunciation representation [dehaη] 21b from the pronunciation representation [dehaη] 22a, since the two representations are identical. Referring to FIG. 2B, the word “seat” 23 may take either the pronunciation representation [sit] 23a or [sip] 23b. However, the pronunciation representation [sip] 23b is substantially indistinguishable from the pronunciation representation [tip] 24a for the word “tip” 24.
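The ambiguity just described can be reproduced with a toy dictionary; the entries mirror FIGS. 2A and 2B, with "N" standing in for the velar nasal η, and the data structure is purely illustrative.

```python
# Toy multiple-pronunciation dictionary mirroring FIGS. 2A/2B
# ("N" stands in for the velar nasal).
pron_dict = {
    "[dehak]": ["dehak", "dehaN", "dehag"],  # 21a, 21b, 21c
    "[dehaN]": ["dehaN"],                    # 22a
    "seat":    ["sit", "sip"],               # 23a, 23b
    "tip":     ["tip"],                      # 24a
}

def words_for_pronunciation(pron, dictionary):
    """Return every word whose variant list contains the given pronunciation."""
    return sorted(w for w, variants in dictionary.items() if pron in variants)

# The variant "dehaN" maps back to two distinct words -- exactly the
# ambiguity between 21b and 22a noted above.
print(words_for_pronunciation("dehaN", pron_dict))  # -> ['[dehaN]', '[dehak]']
```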
An example of a search process performed by the search unit 12 with the pronunciation dictionary database 14 will be described with reference to FIG. 3. To recognize the word sequence “[hanguk dehak i]”, each of the words “hanguk”, “dehak”, and “i” is decomposed into an onset, which is the initial consonant of a syllable, a nucleus, which is the phonetically steady portion, and a coda, which is the final consonant of a syllable. For the word “[hanguk]”, a pronunciation sequence is generated from the possible onsets 31 and codas 33 around the fixed nucleus [angu] 32. Similarly, for the word “[dehak]”, a pronunciation sequence is generated from the possible onsets 34 and codas 36 around the fixed nucleus [eha] 35. The pronunciation representation 37 for the word “[i]” is generated in the same manner. A subsequent search is performed on the generated pronunciation representations using the probability functions Pr([dehak]|[hanguk]) and Pr([i]|[dehak]). There are two combinations between the words “[hanguk]” and “[dehak]”, and three combinations between the words “[dehak]” and “[i]”. The word “[hanguk]” means Korea, and “[i]” serves as an auxiliary word marking the subjective case.
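The boundary combinations and bigram probabilities above can be illustrated as follows; the legal coda/onset pair sets and the probability values are invented placeholders chosen only to match the counts stated in the text, not actual Korean phonology.

```python
import math

# Hypothetical legal coda/onset pairings at each word boundary of FIG. 3,
# chosen only to reproduce the stated counts (2 and 3).
legal_pairs = {
    ("[hanguk]", "[dehak]"): {("k", "d"), ("g", "d")},
    ("[dehak]", "[i]"):      {("k", ""), ("N", ""), ("g", "")},
}

def boundary_count(prev_word, next_word):
    """Number of pronunciation combinations across one word boundary."""
    return len(legal_pairs[(prev_word, next_word)])

# Toy bigram probabilities standing in for Pr([dehak]|[hanguk]) and
# Pr([i]|[dehak]).
bigram = {("[hanguk]", "[dehak]"): 0.4, ("[dehak]", "[i]"): 0.6}
seq = ["[hanguk]", "[dehak]", "[i]"]
log_score = sum(math.log(bigram[(a, b)]) for a, b in zip(seq, seq[1:]))

print(boundary_count("[hanguk]", "[dehak]"),
      boundary_count("[dehak]", "[i]"))  # -> 2 3
```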
To build a Large Vocabulary Continuous Speech Recognition (LVCSR) system, a pronunciation dictionary representing the words of interest must be defined in advance. In general, coarticulation effects frequently occur both between phonemes and between words. When coarticulation effects appear at the boundary between successive words, the words may not be recognized correctly because their acoustic properties vary with the neighboring context. Accordingly, these phenomena must be considered when modeling a pronunciation dictionary for speech recognition.
In particular, various phonological changes appear saliently in spoken Korean depending on the phonemic context. Accordingly, there is a need to provide multiple pronunciation representations for each word based on such phonological changes. In general, intra-word pronunciation representations have substantially constant phonemic contexts, so they can easily be modeled from phonological rules through learning, for example, using triphone models. However, inter-word phonemic contexts vary with the surrounding words, so more delicate modeling is required to reflect the complicated phonological rules involved.
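The contrast between fixed intra-word contexts and variable inter-word contexts can be sketched with a toy triphone generator; the phone sequence used here is a hypothetical stand-in.

```python
# Toy triphone generator: word-internal left/right contexts are fully
# determined by the word itself, while the word-edge contexts depend on
# neighboring words and are marked "?" until those words are known.

def triphones(phones, left_ctx="?", right_ctx="?"):
    padded = [left_ctx] + phones + [right_ctx]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Hypothetical phone sequence for "[dehak]".
print(triphones(["d", "e", "h", "a", "k"]))
# interior triples such as ('d', 'e', 'h') never change, but the first and
# last triples contain '?' -- they vary with the surrounding words
```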
To account for inter-word phonological changes, multiple pronunciation representations for each word, covering all or the most probable inter-word phonemic contexts, may be incorporated into the dictionary. Alternatively, inter-word phonological variations may be modeled by using more mixed Gaussian functions to provide more state outputs in an HMM. However, the former method expands the sizes of the dictionary and the network, while the latter method requires substantial computational processing and time and thus slows recognition. Another method involves selecting the more frequent inter-word phonological changes and applying language-model-based, modified phonemic contexts to a recognition network using a cross-word triphone model. In this method, multiple begin nodes are assigned to each word to account for its various phonemic contexts with the preceding word. As a result, this method reduces sharing efficiency in tree-structured recognition networks and greatly increases the network size. Furthermore, in a method using a tree-structured recognition network in which the leading phonemic contexts of words are applied during recognition, rather than prior to recognition, when more than one phonological rule is applicable in a particular phonological environment, the search cannot be limited to one of them. In addition, this method increases the computational load, because pronunciation rules must be applied frame by frame and the recognition network must be continuously updated during recognition.
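The dictionary-expansion cost of the first method above can be made concrete with a toy count; the context and variant sets are invented solely for illustration.

```python
# Toy illustration of dictionary blow-up when inter-word variants are
# enumerated: each (left context, pronunciation variant) pair becomes a
# separate entry.  Contexts and variants are hypothetical.
base_prons = {"[dehak]": ["dehak", "dehaN", "dehag"]}
left_contexts = ["k", "g", "N", "#"]   # possible preceding codas plus silence

expanded = [(ctx, word, pron)
            for word, prons in base_prons.items()
            for pron in prons
            for ctx in left_contexts]

print(len(expanded))  # 3 variants x 4 contexts = 12 entries for one word
```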