This invention relates generally to a method and apparatus for speech recognition by detecting phonemes included in spoken words.
Research and development has recently been active on speech recognition capable of handling a large number of words spoken by many unspecified speakers. Speech recognition based on phoneme recognition is well suited to recognizing a large number of words from unspecified speakers because phoneme recognition is not readily affected by variation among speakers, such as differences in accent. Phoneme recognition is also advantageous because the word dictionary need not have a large capacity, since the speech signal is converted into phoneme strings, a representation that carries less information and corresponds to linguistic units, and because the contents of the word dictionary can be readily produced and altered.
An important consideration in such a method of speech recognition is recognizing phonemes correctly. In particular, accurately segmenting an input speech signal to determine consonant periods, and accurately recognizing the consonants, are difficult technical problems. Various studies have been made hitherto to derive the features or peculiarities of a consonant or a group of consonants. However, only a few conventional techniques address so-called automatic recognition, in which an input speech signal is segmented in order to determine the kind of phoneme.
A typical conventional technique involving segmentation for specifying phonemes is briefly as follows. Speech from a number of speakers is analyzed with a filter bank so as to obtain analysis results for each frame period, such as 10 msec. The resulting spectrum information is used in turn to obtain feature parameters. From these feature parameters, standard patterns are produced and stored in advance for respective groups of 5 vowels and consonants. The speech is then segmented by using the feature parameters which have been obtained. The result of segmentation is compared with the standard patterns to determine or discriminate a phoneme. Finally, the resulting time series of phonemes is compared with the contents of the word dictionary, which are likewise expressed as time series of phonemes, and the word corresponding to the item having the highest degree of similarity is output as the result of recognition.
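The final dictionary-matching step can be illustrated with a minimal sketch. The conventional technique described above does not state how the degree of similarity between phoneme series is computed, so the sketch assumes a similarity derived from edit distance; the dictionary entries, the function names, and the input series are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def recognize_word(phonemes, dictionary):
    """Return (word, similarity) for the dictionary item most similar to phonemes.

    ASSUMPTION: similarity = 1 - distance / longer length. The source only
    says that the item with the highest degree of similarity is output.
    """
    def similarity(entry):
        longest = max(len(phonemes), len(entry)) or 1
        return 1.0 - edit_distance(phonemes, entry) / longest

    word = max(dictionary, key=lambda w: similarity(dictionary[w]))
    return word, similarity(dictionary[word])


# Hypothetical word dictionary expressed as phoneme series.
WORD_DICTIONARY = {
    "sakura": list("sakura"),
    "sakana": list("sakana"),
    "kasa": list("kasa"),
}

# A recognized phoneme series containing one misrecognized consonant (k -> g).
best_word, score = recognize_word(list("sagura"), WORD_DICTIONARY)
print(best_word)
```

Because the dictionary holds only short phoneme series, adding or altering an entry amounts to editing one item, which is the advantage noted for phoneme-based recognition.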
In the above technique, when the full-range power varies with time in such a way that recesses or concave portions, referred to hereinafter as dips, exist, the frame period in which the power level is minimal is denoted n1. The frame periods before and after n1 at which the rate of change of the power level, referred to as the power differential value, takes its negative maximal value and its positive maximal value are denoted n2 and n3, respectively. Letting WD(n) denote the differential value at a given frame period n, the period from n2 to n3 is treated as a consonant period when WD(n2) and WD(n3) satisfy the following relationships:

\[ WD(n2) \leq -\theta_w \]
\[ WD(n3) \geq \theta_w \]

wherein $\theta_w$ is a threshold for the prevention of addition of consonants, where "addition" means erroneous segmentation of a vowel period as a consonant period.
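The dip-based segmentation described above can be sketched as follows. This is only an illustration under stated assumptions: the power differential value WD(n) is approximated by a first difference between adjacent frames, a dip is taken to be a strict local minimum of the power contour, and the function name, the sample contour, and the threshold value are hypothetical.

```python
def find_consonant_periods(power, theta_w):
    """Return (n2, n3) frame-index pairs treated as consonant periods.

    ASSUMPTION: WD(n) = power[n] - power[n - 1] (first difference); the
    source does not specify how the differential value is computed.
    """
    wd = [0.0] + [power[n] - power[n - 1] for n in range(1, len(power))]
    periods = []
    for n1 in range(1, len(power) - 1):
        # n1: frame in which the power level is minimal (bottom of a dip).
        if not (power[n1] < power[n1 - 1] and power[n1] < power[n1 + 1]):
            continue
        # n2: frame with the most negative WD on the falling edge before n1.
        n2 = n1
        k = n1
        while k >= 1 and wd[k] < 0:
            if wd[k] < wd[n2]:
                n2 = k
            k -= 1
        # n3: frame with the most positive WD on the rising edge after n1.
        n3 = n1 + 1
        k = n1 + 1
        while k < len(power) and wd[k] > 0:
            if wd[k] > wd[n3]:
                n3 = k
            k += 1
        # Both edges must be steep enough; shallow dips inside a vowel
        # are rejected so that no consonant is erroneously added.
        if wd[n2] <= -theta_w and wd[n3] >= theta_w:
            periods.append((n2, n3))
    return periods


# Hypothetical power contour: one deep dip and one shallow dip.
power = [10, 10, 9, 5, 2, 1, 3, 7, 10, 10, 9.5, 10, 10]
print(find_consonant_periods(power, theta_w=2.0))  # [(3, 7)]
```

With the threshold set to 2.0, only the deep dip is reported; the shallow dip at frame 10 is treated as part of the vowel, which is the role of the threshold in preventing addition of consonants.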
Feature parameters indicative of the features or peculiarities of phonemes are then obtained for each consonant period and compared with standard patterns of the respective phonemes, which have been provided in advance, so that consonants are classified for each frame period. The results of this classification are then applied to a consonant classification tree, and the consonant is classified when the conditions of the tree are satisfied.
As described above, in the conventional technique the power level of an input speech signal is obtained over its entire frequency range, and each consonant included in each word is segmented by using dips in the varying power level. Consonants are then classified for each frame period, and finally the results of the classification are assigned to the respective portions of a consonant classification tree so as to classify the consonants. The conventional technique thus requires a complex algorithm and is troublesome and time-consuming.
Furthermore, in the prior art the degree of similarity is calculated for each of the frames included in the entire period determined by segmentation, so that the entire period is treated equally on the assumption that the entire consonant period is stationary.
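The equal treatment of frames can be illustrated with a small sketch. It assumes, purely for illustration, that the degree of similarity of a frame to a standard pattern is the negative squared Euclidean distance between feature vectors and that the per-frame values are averaged over the period; the source specifies neither. The pattern names, the feature values, and the function names are hypothetical.

```python
def frame_similarity(frame, pattern):
    """ASSUMED similarity measure: negative squared Euclidean distance."""
    return -sum((f - p) ** 2 for f, p in zip(frame, pattern))


def classify_period(frames, standard_patterns):
    """Average the per-frame similarity over the whole period, all frames
    weighted equally, and return (best phoneme, score per phoneme)."""
    scores = {}
    for name, pattern in standard_patterns.items():
        total = sum(frame_similarity(fr, pattern) for fr in frames)
        scores[name] = total / len(frames)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical two-dimensional standard patterns and a three-frame period.
standard_patterns = {"p": [1.0, 0.0], "m": [0.0, 1.0]}
frames = [[0.9, 0.1], [0.8, 0.3], [0.4, 0.5]]

best, scores = classify_period(frames, standard_patterns)
best_reversed, _ = classify_period(frames[::-1], standard_patterns)
print(best, best_reversed)
```

Reversing the order of the frames leaves the result unchanged, which shows that a score of this kind carries no information about how the parameters vary with time within the period.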
However, unlike vowels, the feature parameters of consonants and semivowels vary with time within the period of the phoneme, and the manner of this variation characterizes each phoneme. The portion in which the features appear, which may be referred to as a feature portion, differs among the kinds of consonants and semivowels. For instance, in voiced and unvoiced plosive sounds, the features used for the determination or discrimination of the phoneme are concentrated around the plosion; in nasal sounds, the features are concentrated at the transient portion leading to the following vowel; and in the "r" sound and semivowels, the features are represented by the variation of parameters throughout the entire period of the phoneme.
Therefore, consonants and semivowels may be determined effectively by extracting the feature portion appropriate to the discrimination of each phoneme and by discriminating phonemes with attention to the variation of parameters along the time base within that feature portion. Conventional techniques involve no such processing.