1. Technical Field
The present invention relates generally to speech recognition, and in particular, to a method and apparatus for automatic recognition of words of speech having at least one syllable with tonal content.
2. Description of Related Art
Speech recognition is a technology that converts an acoustic speech signal (human speech) into text. The apparatus that utilizes this technology, usually a computer system with speech recognition software, is called an automatic dictation machine. This technology has found broad applications in speech transcription, voice-activated information systems, and speech command and control systems. The early successful applications of speech recognition technology involved European languages, such as English, German, and Spanish. For such languages, the pitch contours are not phonemic, i.e., different pitch contours do not imply different lexical meanings.
Another category of languages is tone languages, in which each syllable has a tone (pitch contour) associated with it. Tone, by definition, is a property of a syllable. For such languages, the pitch contours are phonemic. This means that syllables having the same sequence of consonant(s) and vowel(s) but different pitch contours represent different morphemes and have entirely different meanings. Examples of tone languages include various Chinese languages (such as Mandarin, Cantonese, Taiwanese or Mǐnnányǔ), Southeast Asian languages (such as Thai and Vietnamese), Japanese, Swedish, and Norwegian. The Chinese languages have the largest total number of speakers out of all languages, with Mandarin being the main dialect. The second most popular tone language, Cantonese, is spoken in Hong Kong, Guǎngdōng province, and by Chinese people outside China.
Because of the vast number of characters in some tone languages, especially Chinese, text input into computers using keyboards is especially difficult. Speech recognition of tone languages is therefore a particularly important alternative which, if realized with reasonable accuracy, speed, and price, would be an invaluable tool for those who speak tone languages and use computers.
The traditional method of automatic speech recognition of tone languages usually includes two steps. In the first step, the consonants and vowels are recognized and syllables are constructed from these consonants and vowels; thus the syllables without tone are recognized. In the second step, the pitch contour of each syllable is examined to identify the tone of the syllable. However, such a two-step process often creates errors and, in addition, is not compatible with speech recognition systems for European languages; thus its application is limited.
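The two-step process described above can be illustrated with a minimal Python sketch. It assumes the toneless syllable has already been produced by a first-stage recognizer, and it classifies tone with a toy rule on the pitch contour. The function names, the 10% change threshold, and the 180 Hz register boundary are illustrative assumptions, not values from any system described here.

```python
from statistics import mean

def classify_tone(pitch_hz, rel_change=0.10, high_register_hz=180.0):
    """Toy step-2 classifier: guess a tone from one syllable's pitch contour.

    The thresholds are hypothetical; a real system would be trained on data.
    """
    third = max(1, len(pitch_hz) // 3)
    start = mean(pitch_hz[:third])    # average pitch near the syllable onset
    end = mean(pitch_hz[-third:])     # average pitch near the syllable offset
    change = (end - start) / start
    if change > rel_change:
        return "rising"
    if change < -rel_change:
        return "falling"
    # Roughly level contour: decide by overall pitch height.
    return "high" if mean(pitch_hz) >= high_register_hz else "low"

def two_step_recognize(toneless_syllable, pitch_hz):
    """Step 1 (assumed done upstream) gives the toneless syllable;
    step 2 attaches a tone decided from the pitch contour alone."""
    return toneless_syllable, classify_tone(pitch_hz)
```

Because the tone decision is made separately from the consonant and vowel decision, an error in either step yields a wrong toned syllable, which is one way to read the limitation noted above.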
In U.S. Pat. No. 5,751,905, entitled “Statistical Acoustic Processing Method and Apparatus for Speech Recognition Using a Toned Phoneme System”, a method was introduced for recognizing tone languages, especially Mandarin. In particular, it disclosed a method in which a syllable was divided into two roughly equal parts, or demisyllables, where the pitch information of the first demisyllable, including the initial consonant and possibly a glide (semivowel), was assumed to be disposable, and the pitch information in the second demisyllable, including the main vowel and the ending, was assumed to be sufficient for determining the tone of the entire undivided syllable. In standard Mandarin, there are 20 different second demisyllables and 5 different tones: high (yinping), rising (yangping), low (shang), falling (qu), and untoned or neutral (qing).
By assigning these tones to each second demisyllable, a total of 114 phonemes with tone (tonemes) could be defined. In the training process, each of the tonemes, or phonemes with different tones, is trained as an independent phoneme, and during the recognition process, the tonemes are recognized as independent phonemes. The tone of a syllable is defined as the tone of the second demisyllable, or the tone of the toneme in that syllable. This method results in a highly accurate Mandarin speech recognition system. The apparatus utilizing the method in U.S. Pat. No. 5,751,905, “VIAVOICE™ Chinese”, was the first continuous Mandarin dictation product developed, and has been the most successful Mandarin dictation product on the market since its debut in 1997.
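As a rough illustration of the toneme idea, the following Python sketch builds toned phoneme labels by pairing second demisyllables with tones, and reads the tone of a syllable from the toneme it contains. The label format (a demisyllable followed by a tone digit) and the sample demisyllables are assumptions made for this example; they are not taken from U.S. Pat. No. 5,751,905.

```python
# Tone numbering is an assumption for illustration; the names follow the text.
TONES = {
    1: "high (yinping)",
    2: "rising (yangping)",
    3: "low (shang)",
    4: "falling (qu)",
    5: "neutral (qing)",
}

def build_tonemes(second_demisyllables, tones=TONES):
    """Pair every second demisyllable with every tone, e.g. 'an' -> 'an1'..'an5'.

    A real inventory would keep only the combinations that occur in the language.
    """
    return [f"{d}{t}" for d in second_demisyllables for t in tones]

def syllable_tone(phonemes):
    """The tone of a syllable is the tone of the toneme it contains."""
    for p in reversed(phonemes):
        if p[-1].isdigit():
            return int(p[-1])
    return None  # no toneme found in this phoneme sequence

tonemes = build_tonemes(["an", "ang"])   # hypothetical two-item subset
```

Each resulting label is then treated as an independent phoneme in both training and recognition, so no separate tone-detection step is needed.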
The method of U.S. Pat. No. 5,751,905 was not as effective in automatic recognition of Cantonese. Cantonese has a significantly greater number of second demisyllables than Mandarin, and has 9 tones (instead of 5 as in Mandarin). Other tone languages, such as Thai and Vietnamese, also have a significantly greater number of second demisyllables than Mandarin. Thus, using the method described above results in a total of almost 300 phonemes that must be defined. Such a large number of phonemes makes training and recognition very difficult. In addition, due to “ertification” (an expression used here to describe the process whereby a syllable's ending is changed by adding “r”), the number of second demisyllables with tone in the Beijing dialect also approaches 300. Accordingly, an efficient and accurate automatic speech recognition technique for recognizing tone languages, in particular, those languages having high numbers of endings and tones, is highly desirable.
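The growth in inventory size can be made concrete with a small Python sketch. The inventory of toned phonemes is bounded above by the number of tone-bearing units multiplied by the number of tones. The unit count used below is a hypothetical placeholder chosen only to show the scaling; it is not a figure from this document, and only the tone counts of 5 and 9 come from the text.

```python
def toned_inventory_size(n_units, n_tones):
    """Upper bound on toned phonemes if every unit may carry every tone."""
    return n_units * n_tones

# HYPOTHETICAL number of tone-bearing units, for illustration only.
N_UNITS = 30

for n_tones in (5, 9):   # 5 tones as in Mandarin, 9 as in Cantonese
    print(n_tones, "tones ->", toned_inventory_size(N_UNITS, n_tones), "toned phonemes")
```

Since the bound is a product, it can be lowered either by reducing the number of tones or by attaching tone to a smaller set of units.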
The present invention is directed to a method and apparatus for efficient automatic recognition of tone languages. Advantageously, the present invention significantly reduces the total number of phonemes that must be defined, thus simplifying the training process and enabling quicker decoding, while at the same time maintaining or, in certain cases, improving accuracy in recognizing speech.
According to an aspect of the present invention, an apparatus for recognition of tone languages is provided, including: means for defining toned vowels as different phonemes, comprising a database comprising prototypes of phonemes including toned vowels; a signal processing unit for generating a vector including a pitch value; and means for recognizing toned vowels by matching said prototypes of phonemes including toned vowels to said vector.
According to another aspect of the present invention, a method for defining toned vowels in words of speech is provided comprising the steps of preparing a training text from said words of speech, transcribing said training text into sequences of phonemes including vowels with tones, converting said training text into an electrical signal, generating spectral features from said electrical signal, extracting pitch values from said electrical signal, combining said spectral features and said pitch values into acoustic feature vectors, and comparing said acoustic feature vectors with said sequences of phonemes including vowels with tone to produce acoustic prototypes for each phoneme.
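The last three steps of this training method can be sketched in Python as follows. The sketch assumes that a per-frame alignment between feature vectors and phoneme labels is already available, and it models each acoustic prototype as a simple mean vector. Both are simplifying assumptions made for illustration; the text does not specify the form of the prototypes, and the function names and toned-vowel labels are hypothetical.

```python
from collections import defaultdict

def combine(spectral_frames, pitch_values):
    """Append each frame's pitch value to its spectral features."""
    return [list(s) + [p] for s, p in zip(spectral_frames, pitch_values)]

def train_prototypes(vectors, frame_labels):
    """Average the feature vectors assigned to each phoneme label.

    Assumes frame_labels[i] is the phoneme (possibly a toned vowel such as
    'a1') aligned to vectors[i]; a mean vector stands in for a real prototype.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, frame_labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [acc + x for acc, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [x / counts[label] for x in total]
            for label, total in sums.items()}

# Made-up numbers: two spectral features per frame plus one pitch value.
vectors = combine([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [100.0, 120.0, 200.0])
prototypes = train_prototypes(vectors, ["a1", "a1", "a4"])
```

Because pitch is part of every feature vector, vowels that differ only in tone end up with different prototypes.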
In yet another aspect of the present invention, a method for identifying toned vowels in words of speech is provided comprising the steps of converting the words of speech into an electrical signal, generating spectral features from said electrical signal, extracting pitch values from said electrical signal, combining said spectral features and said pitch values into acoustic feature vectors, comparing said acoustic feature vectors with prototypes of phonemes in an acoustic prototype database including prototypes of toned vowels to produce labels, and matching said labels to text using a decoder comprising a phonetic vocabulary and a language model database.
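The comparison step of this method, in which acoustic feature vectors are matched against prototypes to produce labels, might look like the following Python sketch. It assumes mean-vector prototypes and Euclidean distance, neither of which is specified in the text, and it omits the decoder stage that uses the phonetic vocabulary and the language model. The labels and numbers are hypothetical.

```python
import math

def label_frames(vectors, prototypes):
    """Give each feature vector the label of its nearest prototype.

    Euclidean distance is an assumption; in practice the pitch dimension
    would likely need scaling so that it does not dominate the distance.
    """
    labels = []
    for vec in vectors:
        nearest = min(prototypes, key=lambda name: math.dist(vec, prototypes[name]))
        labels.append(nearest)
    return labels

# Two hypothetical toned-vowel prototypes that differ only in the pitch value.
example_prototypes = {"a1": [2.0, 3.0, 220.0], "a4": [2.0, 3.0, 150.0]}
example_labels = label_frames([[2.1, 2.9, 215.0], [1.9, 3.1, 155.0]],
                              example_prototypes)
```

In this example the two frames have nearly identical spectral features, so the pitch component alone decides which toned vowel each frame is labeled as.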
These and other aspects, features, and advantages of the present invention will be described or become apparent from the following detailed description of the preferred embodiments, which is to be read in connection with the accompanying drawings.