This invention relates to a method and apparatus for speech recognition, particularly, though not exclusively, to a method and apparatus for recognising speech in a tonal language, such as Mandarin Chinese.
Speech recognition techniques are well known for recognising words spoken in English or other non-tonal languages. These known speech recognition techniques basically perform transformations on segments (frames) of speech, each segment having a plurality of speech samples, into sets of parameters sometimes called “feature vectors”. Each set of parameters is then passed through a set of models, which has been previously trained, to determine the probability that the set of parameters represents a particular known word or part-word, known as a phoneme, the most likely word or phoneme being output as the recognised speech.
However, when these known techniques are applied to tonal languages, they generally fail to deal adequately with the tone-confusable words that can occur. Many Asian languages fall in this category of tonal languages. Unlike English, a tonal language is one in which tones have lexical meanings and have to be considered during recognition. A typical example is Mandarin Chinese. There are more than 10,000 commonly used Chinese characters, each of which is mono-syllabic. All these 10,000 characters are pronounced as just 1345 different syllables, the different meanings of a particular syllable being determined by the listener from the context of the speech. In fact, from the 1345 different syllables, a non-tonal language speaker would probably distinguish just over 400 different sounds, since many of the syllables sound similar and can only be distinguished by using different tones. In other words, if the differences among the syllables due to tone are disregarded, then only 408 base syllables instead of 1345 tonal syllables would be recognised in Mandarin Chinese. However, this would cause substantial confusion since all the tonal syllables having the same base syllable would be recognised as the same syllable. A well-known example is that, in Mandarin, both “MOTHER” and “HORSE” are sounded as “ma” but distinguished by differences in tone.
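By way of illustration only, the transformation of speech segments (frames) into sets of parameters may be sketched as follows. The function names, the frame length, the hop size and the two parameters chosen (log energy and zero-crossing rate) are assumptions made for this sketch and are not taken from any particular known system; practical recognisers generally use richer parameter sets.

```python
import math

def split_into_frames(samples, frame_len, hop):
    """Return overlapping frames, each a list of frame_len speech samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_parameters(frame):
    """Toy parameter set for one frame: [log energy, zero-crossing rate]."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    # The small constant avoids log(0) for a silent frame.
    return [math.log(energy + 1e-12), crossings / (len(frame) - 1)]

# Hypothetical input: 0.1 s of a 200 Hz sine wave sampled at 8 kHz.
RATE = 8000
signal = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(800)]

frames = split_into_frames(signal, frame_len=200, hop=100)
feature_vectors = [frame_parameters(f) for f in frames]
```

Each entry of `feature_vectors` corresponds to one frame and would, in a complete recogniser, be passed to the previously trained models.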
As shown in FIGS. 1A, 1B, 1C, 1D and 1E, in Mandarin Chinese there are four lexical tones: a high and level Tone 1, a rising Tone 2, a falling-rising Tone 3, and a falling Tone 4; and one neutral Tone 5, which is used on some syllables that are a suffix to a word. However, in other tonal languages there may be different numbers of tones, for example seven, as in Cantonese Chinese. It is known that the tones are primarily characterized by their pitch contour patterns. The pitch is equivalent to the fundamental frequency of the audio signal and the pitch contour is equivalent to the frequency contour. Thus, one known tonal language speech recognition system, such as that described in U.S. Pat. No. 5,602,960 (Hsiao-Wuen Hon, et al), uses a syllable recognition system, a tone classifier and a confidence score augmentor. The tone classifier has a pitch estimator to estimate the pitch of the input once and a long-term tone analyser to segment the estimated pitch according to the syllables of each of the N-best theories. The long-term tone analyser performs long term tonal analysis on the segmented, estimated pitch and generates a long-term tonal confidence signal. The confidence score augmentor receives the initial confidence scores and the long-term tonal confidence signals, modifies each initial confidence score according to the corresponding long-term tonal confidence signal, re-ranks the N-best theories according to the augmented confidence scores, and outputs the N-best theories. This system is, however, computational resource intensive and is also language dependent, in that the syllables are recognised first and then classified into the particular tones for which the system has been calibrated or trained. Thus, if the language is to be changed from, for example Mandarin Chinese to Cantonese Chinese, not only does the syllable recogniser need retraining, but the tone classifier also needs to be recalibrated for seven tones instead of only five.
Another known way of recognising syllables in a tonal language is described in U.S. Pat. No. 5,806,031 (Fineberg) in which a tonal sound recogniser computes feature vectors for a number of segments of a sampled tonal sound signal in a feature vector computing device, compares the feature vectors of a first of the segments with the feature vectors of another segment in a cross-correlator to determine a trend of a movement of a tone of the sampled tonal sound signal, and uses the trend as an input to a word recogniser to determine a word or syllable of the sampled tonal sound signal. In this system, the feature vector is computed for all syllables, irrespective of whether they are voiced or unvoiced.
A voiced sound is one generated by the vocal cords opening and closing at a constant rate giving off pulses of air. The distance between the peaks of the pulses is known as the pitch period. An example of a voiced sound is the “i” sound as found in the word “pill”. An unvoiced sound is one generated by a single rush of air which results in turbulent air flow. Unvoiced sounds have no defined pitch. An example of an unvoiced sound is the “p” sound in the word “pill”. A combination of voiced and unvoiced sounds can thus be found in the word “pill”, as the “p” requires the single rush of air and the “ill” requires a series of air pulses.
Although essentially all languages use voiced and unvoiced sounds, in tonal languages the tone occurs only in the voiced segments of the words.
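The distinction between voiced and unvoiced sounds can be illustrated with a simple periodicity test. The following is a minimal sketch only: it assumes that a frame is treated as voiced when its normalised autocorrelation shows a strong peak at a lag within a plausible pitch-period range. The function names, the 60 Hz to 400 Hz search range and the 0.5 threshold are assumptions of the sketch, not values taken from this specification.

```python
import math
import random

def normalised_autocorrelation(frame, lag):
    """Correlation between the frame and itself shifted by lag samples."""
    head, tail = frame[:len(frame) - lag], frame[lag:]
    num = sum(a * b for a, b in zip(head, tail))
    den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
    return num / den if den > 0 else 0.0

def is_voiced(frame, rate, fmin=60.0, fmax=400.0, threshold=0.5):
    """Return True if the frame looks periodic within the pitch range."""
    min_lag = max(1, int(rate / fmax))
    max_lag = min(int(rate / fmin), len(frame) - 1)
    peak = max((normalised_autocorrelation(frame, lag)
                for lag in range(min_lag, max_lag + 1)), default=0.0)
    return peak >= threshold

# Hypothetical test signals: a periodic tone and seeded random noise.
RATE = 8000
tone = [math.sin(2 * math.pi * 150 * n / RATE) for n in range(400)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]

tone_is_voiced = is_voiced(tone, RATE)
noise_is_voiced = is_voiced(noise, RATE)
```

A periodic signal such as `tone` repeats after one pitch period and so correlates strongly with a shifted copy of itself, whereas turbulent, noise-like signals such as `noise` do not.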
The present invention therefore seeks to provide a method and apparatus for speech recognition which overcomes, or at least reduces, the above-mentioned problems of the prior art.
Accordingly, in a first aspect, the invention provides a system for speech recognition comprising an input terminal for receiving a segment of speech, a speech classifier having an input coupled to the input terminal and an output to provide an indication of whether the speech segment comprises voiced or unvoiced speech, a speech feature detector having a first input coupled to the input terminal, a second input coupled to the output of the speech classifier, and an output to provide a speech feature vector having a plurality of feature values indicating features of the speech segment, the speech feature vector including at least a tonal feature value indicating a tonal feature of the speech segment when the speech segment comprises voiced speech, and a speech recogniser having an input coupled to the output of the speech feature detector and an output to provide an indication of which of a predetermined plurality of speech models is a good match to the speech segment.
In a preferred embodiment, the system further comprises an Analog-to-Digital (A/D) converter having an input coupled to the input terminal and an output coupled to the inputs of the speech classifier and the speech feature detector to provide a digitised speech segment.
The output of the speech recogniser preferably provides an indication of which one of the predetermined plurality of speech models is a best match to the speech segment.
Preferably, the system further comprises a memory coupled to the speech recogniser for storing the predetermined plurality of speech models, and a speech model trainer having an input selectively coupled to the output of the speech feature detector and an output coupled to the memory to store in the memory the predetermined plurality of speech models after the predetermined plurality of speech models have been trained using the speech feature vector.
The speech feature detector preferably comprises a non-tonal feature detector having an input coupled to the input of the speech feature detector and an output to provide at least one non-tonal feature value for the speech segment, a tonal feature detector having a first input coupled to the input of the speech feature detector, a second input coupled to the output of the speech classifier and an output to provide at least one tonal feature value for the speech segment when the speech classifier determines that the speech segment comprises voiced speech, and a speech feature vector generator having a first input coupled to the output of the non-tonal feature detector, a second input coupled to the output of the tonal feature detector, and an output coupled to the output of the speech feature detector to provide the speech feature vector.
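Purely as an illustration of how the speech feature vector generator might combine its two inputs, a minimal sketch follows. The function and parameter names are hypothetical, and the use of a fixed fill value in the tonal positions for unvoiced speech is an assumption of the sketch, made only to keep the vector length constant.

```python
def build_feature_vector(segment, voiced, non_tonal_detector, tonal_detector,
                         n_tonal=1, unvoiced_fill=0.0):
    """Concatenate non-tonal and tonal feature values for one speech segment.

    Tonal values are computed only when the classifier reports voiced speech;
    otherwise the tonal positions hold unvoiced_fill (an assumption).
    """
    non_tonal = list(non_tonal_detector(segment))
    if voiced:
        tonal = list(tonal_detector(segment))
    else:
        tonal = [unvoiced_fill] * n_tonal
    return non_tonal + tonal

# Hypothetical stand-in detectors, for illustration only.
def toy_non_tonal(segment):
    return [sum(segment)]

def toy_tonal(segment):
    return [max(segment)]

voiced_vector = build_feature_vector([1, 2, 3], True, toy_non_tonal, toy_tonal)
unvoiced_vector = build_feature_vector([1, 2, 3], False, toy_non_tonal, toy_tonal)
```

In this sketch the classifier output `voiced` acts as the second input to the tonal path, so the tonal detector is not run on unvoiced segments.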
The non-tonal feature detector preferably comprises a non-tonal speech transformation circuit having an input coupled to the input of the non-tonal feature detector and an output to provide a transformed non-tonal signal, and a non-tonal feature generator having an input coupled to the output of the non-tonal speech transformation circuit and an output coupled to the output of the non-tonal feature detector to provide the at least one non-tonal feature value for the speech segment.
The tonal feature detector preferably comprises a tonal speech transformation circuit having first and second inputs coupled to the first and second inputs of the tonal feature detector and an output to provide a transformed tonal signal, and a tonal feature generator having an input coupled to the output of the tonal speech transformation circuit and an output coupled to the output of the tonal feature detector to provide the at least one tonal feature value for the speech segment.
In one preferred embodiment, the tonal speech transformation circuit comprises a pitch extractor having an input coupled to the first input of the tonal speech transformation circuit and an output, and a tone generator having a first input coupled to the output of the pitch extractor and an output coupled to the output of the tonal speech transformation circuit to provide the transformed tonal signal indicative of the tone of the speech segment.
The tone generator preferably has a second input coupled to the second input of the tonal speech transformation circuit.
In a second aspect, the invention provides a method of speech recognition comprising the steps of receiving a segment of speech, classifying the speech segment according to whether the speech segment comprises voiced or unvoiced speech, detecting a plurality of speech features of the segment of speech, generating a speech feature vector having a plurality of feature values indicating the detected plurality of features of the speech segment, wherein the speech feature vector includes at least a tonal feature value indicating a tonal feature of the speech segment when the speech segment comprises voiced speech, and utilising the speech vector to determine which of a predetermined plurality of speech models is a good match to the speech segment.
The method preferably further comprises the step of digitising the segment of speech to provide a digitised speech segment.
Preferably, the step of utilising the speech vector determines which of the predetermined plurality of speech models is a best match to the speech segment.
In a preferred embodiment, the method further comprises the steps of training the predetermined plurality of speech models using the speech feature vector, and storing the predetermined plurality of speech models after the predetermined plurality of speech models have been trained.
Preferably, the step of detecting a plurality of speech features comprises the steps of generating at least one non-tonal feature value for the speech segment, generating at least one tonal feature value for the speech segment when the speech classifier determines that the speech segment comprises voiced speech, and combining the at least one non-tonal feature value and the at least one tonal feature value to provide the speech feature vector.
Preferably, the step of detecting at least one non-tonal feature value comprises the steps of transforming the speech segment using at least a first transformation to provide a transformed non-tonal signal, and generating the at least one non-tonal feature value from the transformed non-tonal signal.
Preferably, the step of detecting at least one tonal feature value comprises the steps of transforming the speech segment using at least a second transformation to provide a transformed tonal signal, and generating the at least one tonal feature value from the transformed tonal signal.
In one preferred embodiment, the step of transforming the speech segment comprises the steps of extracting pitch information from the speech segment, and generating the transformed tonal signal from the extracted pitch information.
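One possible way of extracting pitch information and deriving a tonal signal from it is sketched below, by way of example only. The sketch assumes that pitch is estimated from the autocorrelation peak of each frame and that the tonal signal is the frame-to-frame change in log pitch. Both choices, together with the function names and the 60 Hz to 400 Hz search range, are assumptions of the sketch and are not prescribed by the method.

```python
import math

def estimate_pitch(frame, rate, fmin=60.0, fmax=400.0):
    """Estimate pitch in Hz from the lag of the autocorrelation peak.

    Returns 0.0 when no positive peak is found (for example, silence).
    """
    min_lag = max(1, int(rate / fmax))
    max_lag = min(int(rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag if best_lag else 0.0

def pitch_trend(pitches):
    """Change in log pitch between consecutive frames (positive = rising).

    A 0.0 entry is used where either frame has no pitch estimate.
    """
    return [math.log(b / a) if a > 0 and b > 0 else 0.0
            for a, b in zip(pitches, pitches[1:])]

# Hypothetical input: three frames whose pitch rises from 200 Hz to 250 Hz.
RATE = 8000

def sine_frame(freq, length=400):
    return [math.sin(2 * math.pi * freq * n / RATE) for n in range(length)]

pitches = [estimate_pitch(sine_frame(f), RATE) for f in (200, 250, 250)]
trend = pitch_trend(pitches)
```

Here `trend` is positive where the pitch rises between frames and zero where it stays level, which is the kind of information that distinguishes, for example, a rising tone from a level one.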