1. Field of the Invention
The present invention relates to a phonemic recognition system which automatically recognizes speech generated by human beings and can be used in applications such as expressing the results of that recognition in print, and more particularly to a phonemic recognition system in which a unit of recognition is a phoneme that is smaller than a word.
2. Description of the Prior Art
In a conventional arrangement of this type of phonemic recognition system, wide use has been made of a method in which a standard pattern is composed using a word as a unit of recognition and phonemic recognition is performed by matching input patterns with the standard pattern by using dynamic programming (hereinafter referred to as DP).
In this conventional pattern matching system, the unit adopted as a standard pattern is as large as a word, and the system has achieved a high rate of recognition when the number of words is lower than about one hundred. However, when the standard pattern is formed over an interval in which the recognition unit, such as a phoneme or phonemic particle, is shorter than a word, this pattern matching system has not been sufficiently effective. The reasons are as follows. Because the pattern length of such a standard pattern is short, it is difficult to form a standard pattern of a phoneme, which is subject to various deformations due to the preceding and subsequent phonemic environment. As a result, the standard pattern cannot be matched correctly to the input patterns.
This disadvantage of the prior art will be explained in detail. First, an explanation will be made of continuous DP (CDP) matching to be used to recognize a word in which a plurality of phonemic particles are joined continuously.
When using continuous DP, the strength (spectrum) of each frequency domain in a speech input is given by the following equation.
$$\{\, f(t,x) : 1 \leq t < \infty,\ 1 \leq x \leq L \,\} \tag{1}$$
where t indicates an instant (time axis), and the speech input is sampled at each instant t=1, 2, 3, . . . The time interval between adjacent instants, for example between t=1 and t=2, is from 8 to 10 msec. The variable x is a frequency axis parameter number indicating a frequency domain. For example, when the spectrum in the frequency domain of the speech input is obtained by passing the speech input through a 20 channel band-pass filter, the channel number of the band-pass filter, that is, the number indicating the frequency domain (the band number), is used as x. Consequently, f(t,x) indicates the output value of the band-pass filter on band x at instant t.
FIG. 1 is a table showing a spectrum f(t,x) for t=1 to t=4 and band number x=1 to 20. In the case of continuous speech, sampling continues after instant t=4. A standard pattern Z(τ,x) used for continuous DP is expressed by the following equation.
$$\{\, Z(\tau,x) : 1 \leq \tau \leq T,\ 1 \leq x \leq L \,\} \tag{2}$$
The standard pattern Z(τ,x) is the result of sampling in advance, while a speaker utters a single word in isolation, the output of each band of the above-mentioned band-pass filter at each instant from instant 1 to instant T. Here, the time interval between adjacent values of τ is equal to the above-mentioned time interval between two adjacent instants.
FIG. 2 shows an example of a standard pattern for τ=20, L=20.
In this case, the distance between the standard pattern and the input pattern is expressed by the following equation using an absolute value distance.
$$d(t,\tau) = \sum_{x=1}^{L} \left| f(t,x) - Z(\tau,x) \right| \tag{3}$$
Next, using this distance d(t,τ), the distance at each point from τ=1 to τ=T is calculated. Applying the DP method, the following recurrence equation gives the cumulative distance P(t,τ).
[Equation (4): recurrence for P(t,τ); not reproduced in this text]
An initial condition for P(t,τ) is given as
$$P(-1,\tau) = P(0,\tau) = \infty, \quad (1 \leq \tau \leq T)$$
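By way of illustration only, the absolute value distance d(t,τ) between one input frame and each frame of a standard pattern can be sketched in Python as follows. The function names frame_distance and distance_row, and the representation of each spectrum as a list of L band values, are assumptions made for this sketch and do not appear in the original description.

```python
def frame_distance(input_frame, template_frame):
    """Absolute value distance between two spectra with the same number of bands.

    input_frame:    band outputs f(t, x) for one instant t, x = 1..L
    template_frame: band outputs Z(tau, x) for one instant tau, x = 1..L
    """
    if len(input_frame) != len(template_frame):
        raise ValueError("both frames must have the same number of bands L")
    return sum(abs(a - b) for a, b in zip(input_frame, template_frame))


def distance_row(input_frame, template):
    """d(t, tau) for one input frame against every template frame tau = 1..T."""
    return [frame_distance(input_frame, z) for z in template]


if __name__ == "__main__":
    # Hypothetical 3-band example (L = 3, T = 2).
    f_t = [1.0, 2.0, 3.0]
    Z = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
    print(distance_row(f_t, Z))
```

In this sketch each call to distance_row costs T times L absolute differences, which is the quantity that grows with the number and length of the standard patterns.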
Furthermore, the output value D(t) of the continuous DP is determined as
[Equation (5): definition of D(t); not reproduced in this text]
This value D(t) indicates the optimum distance between the input pattern at instant t and the standard pattern.
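As an illustration only, the following Python sketch computes an output value D(t) for one standard pattern. The recurrence for P(t,τ) and the definition of D(t) are not reproduced in this text, so the sketch assumes a three-path recurrence with weights 2, 1 and 3 and a normalization by 3T, a form often associated with continuous DP. This assumed form may differ from the equations of the original description. The function name continuous_dp and its arguments are hypothetical.

```python
import math


def continuous_dp(frames, template):
    """Return a list of D(t), one value per input frame, for one standard pattern.

    frames:   input spectra f(t, .), in time order
    template: T standard-pattern spectra Z(tau, .)

    ASSUMPTION: the recurrence and the 3*T normalization below are one
    commonly used form, not taken from the text.
    """
    T = len(template)
    inf = math.inf

    def dist(f, z):
        # Absolute value distance between two spectra.
        return sum(abs(a - b) for a, b in zip(f, z))

    # Index 0 of every list is unused so that tau is 1-based.
    p_prev1 = [inf] * (T + 1)  # P(t-1, tau); starts as P(0, tau) = infinity
    p_prev2 = [inf] * (T + 1)  # P(t-2, tau); starts as P(-1, tau) = infinity
    d_prev = [inf] * (T + 1)   # d(t-1, tau)
    output = []

    for f in frames:
        d = [inf] + [dist(f, z) for z in template]
        p = [inf] * (T + 1)
        p[1] = 3 * d[1]  # a match may start at any input instant
        for tau in range(2, T + 1):
            a = p_prev2[tau - 1] + 2 * d_prev[tau] + d[tau]
            b = p_prev1[tau - 1] + 3 * d[tau]
            c = inf  # third path is skipped for tau = 2 in this sketch
            if tau >= 3:
                c = p_prev1[tau - 2] + 3 * d[tau - 1] + 3 * d[tau]
            p[tau] = min(a, b, c)
        output.append(p[T] / (3 * T))
        p_prev2, p_prev1, d_prev = p_prev1, p, d
    return output


if __name__ == "__main__":
    # Hypothetical 1-band example: the template 0, 1, 2 is embedded in the input.
    Z = [[0.0], [1.0], [2.0]]
    f = [[5.0], [0.0], [1.0], [2.0], [5.0]]
    print(continuous_dp(f, Z))
```

Under these assumptions, D(t) falls to zero at the input instant where the embedded pattern ends and is infinite while too few input frames have been seen to cover the pattern.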
In continuous speech recognition using continuous DP alone, it is usual to find the local minima of D(t) expressed by equation (5) and to output, as a unit of recognition, the standard pattern giving each minimum together with the corresponding instant. In this case, the unit of recognition forming the standard pattern in a conventional system has been as large as a word. But, as pointed out above, in order to handle large vocabularies, non-specified speakers and continuous speech, the fundamental unit of recognition must be smaller than a word. When an attempt is made to identify a phoneme in an input pattern by applying the above equations to a standard pattern whose unit is smaller than a word, a speech recognition system using conventional DP does not give a high rate of recognition. As explained above, the reason is that as the fundamental unit of recognition becomes smaller, the standard pattern expressing that unit becomes shorter, and a phonemic pattern varies greatly depending on the preceding and subsequent phonemic patterns, so that it is possible neither to determine a standard pattern nor to perform accurate matching. For this reason, there have been doubts about the efficacy of the conventional method of pattern matching for recognizing a small unit such as a phoneme.
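The selection of local minima of D(t) described above can be sketched in Python as follows. This is a minimal illustration: the acceptance threshold, the function names local_minima and spot, and the dictionary of outputs keyed by a pattern label are assumptions introduced here and are not part of the original description.

```python
def local_minima(values):
    """Indices t where values[t] is below its left neighbour and
    not above its right neighbour."""
    return [
        t
        for t in range(1, len(values) - 1)
        if values[t] < values[t - 1] and values[t] <= values[t + 1]
    ]


def spot(outputs_by_label, threshold):
    """Return sorted (t, label, D) tuples for local minima of D(t).

    outputs_by_label: maps a standard-pattern label to its list of D(t)
    threshold:        hypothetical acceptance limit on D(t)
    """
    hits = []
    for label, values in outputs_by_label.items():
        for t in local_minima(values):
            if values[t] <= threshold:
                hits.append((t, label, values[t]))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical D(t) sequences for two standard patterns.
    outputs = {
        "a": [5.0, 2.0, 3.0, 1.0, 4.0],
        "b": [9.0, 8.0, 9.0, 9.0, 9.0],
    }
    print(spot(outputs, threshold=2.5))
```

The threshold is included only so that shallow minima of poorly matching patterns are not reported; without it, every standard pattern would produce an output at each of its local minima.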
However, if a word is used as the unit of recognition, the number of words that must be recognized in practice rises to more than one thousand, so that a very large computer would be needed to perform the calculations of equations (1) to (5), and a memory with a large capacity would be needed to store the data. A further problem is that the time required for these calculations would be very long. Considering these circumstances, and since there are only several tens of types of phonemes, it is clear that a unit of recognition smaller than a word, such as a phoneme or phonemic particle, must be used in structuring a speech recognition system, in particular one for continuous speech.
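The scale of the calculation can be illustrated with a rough estimate in Python. The figures used are assumptions chosen for illustration: one thousand word patterns against thirty phoneme patterns (taking "several tens" as thirty), 20 frames and 20 bands per pattern as in the example of FIG. 2, and a 10 msec sampling interval. The same pattern length is used for both cases purely to isolate the effect of the number of patterns. The function names are hypothetical.

```python
def stored_values(num_patterns, frames_per_pattern, num_bands):
    """Number of spectrum values held in memory for all standard patterns."""
    return num_patterns * frames_per_pattern * num_bands


def distance_terms_per_second(num_patterns, frames_per_pattern, num_bands,
                              frame_period_ms):
    """Absolute-difference terms evaluated per second of input speech.

    Each input frame is compared with every frame of every standard pattern,
    and each comparison sums over all bands.
    """
    per_input_frame = num_patterns * frames_per_pattern * num_bands
    frames_per_second = 1000 / frame_period_ms
    return per_input_frame * frames_per_second


if __name__ == "__main__":
    # Illustrative assumptions, not figures from the original description.
    words = distance_terms_per_second(1000, 20, 20, 10)
    phonemes = distance_terms_per_second(30, 20, 20, 10)
    print(words, phonemes, words / phonemes)
    print(stored_values(1000, 20, 20), stored_values(30, 20, 20))
```

Under these assumptions both the stored data and the distance calculation scale in direct proportion to the number of standard patterns, which is the motivation for a recognition unit with only several tens of types.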