1. Field of the Invention
The present invention relates to a speech recognition system for recognizing an acoustic speech pattern through pattern matching with reference patterns, and more particularly to a speech recognition system which produces acoustic parameters from an input speech signal following utterance detection in each frame period.
2. Description of the Prior Art
Speech sounds uttered by human speakers vary with time to serve as a symbolic representation of information in the form of words and phrases. There have been developed speech recognition apparatus for automatically recognizing acoustic speech signals produced by human beings. While various efforts have heretofore been directed toward improvements in automatic speech recognition, present speech recognizers are far from achieving speech recognition capabilities comparable to those of human beings. Most practical speech recognition systems that are available today operate to effect pattern matching between an input speech pattern and reference speech patterns under certain conditions.
FIG. 1 of the accompanying drawings shows a conventional speech recognition system. The speech recognition system has an acoustic analyzer 2 which receives an input speech signal from a microphone 1. The acoustic analyzer 2 extracts acoustic parameters indicative of features of the input speech pattern, and also detects word or utterance boundaries thereof. Various processes are known for extracting such acoustic parameters from an input acoustic speech signal. According to one process, a plurality of channels each comprising a bandpass filter and a rectifier are provided for respective different passbands, and the bandpass filter bank produces an output signal representative of a time-dependent change in a spectral pattern of the input acoustic speech signal. The utterance boundaries are detected based on the power of the speech signal from the microphone 1 and also the number of zero crossings of the speech signal. The acoustic parameters can be represented by their time sequence Pi(n) where i = 1, 2, . . . , I (I is the number of channels having respective bandpass filters) and n = 1, 2, . . . , N (N is the number of frames used for speech recognition in the utterance boundary that is detected).
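The per-frame analysis described above can be sketched as follows. This is a minimal illustration, not the analyzer of FIG. 1: the bandpass-filter/rectifier channels are approximated by splitting an FFT magnitude spectrum into equal-width bands, and the boundary cues (frame power and zero-crossing count) are computed directly from the samples. The function name and band layout are illustrative assumptions.

```python
import numpy as np

def analyze_frame(frame, num_channels=16):
    """Return band energies P_i for one frame, plus the boundary cues.

    A sketch of a filter-bank analyzer: the FFT magnitude spectrum is
    split into num_channels equal-width bands, standing in for the
    bandpass-filter/rectifier channels of the analyzer.
    """
    spectrum = np.abs(np.fft.rfft(frame))
    bands = np.array_split(spectrum, num_channels)
    p = np.array([band.sum() for band in bands])       # P_i for this frame
    power = float(np.mean(frame ** 2))                 # frame power
    zero_crossings = int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))
    return p, power, zero_crossings
```

Collecting `p` over successive frames yields the time sequence Pi(n); the power and zero-crossing values feed the utterance boundary decision.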
The acoustic parameter time sequence Pi(n) is supplied from the acoustic analyzer 2 to a mode selector 3 which may comprise a switch, for example. When the mode selector 3 is shifted to a contact terminal A, the speech recognition system is in a register mode. In the register mode, the acoustic parameter time sequence Pi(n) is stored as recognition parameters in a reference pattern memory 4, so that the acoustic speech pattern of the talker is stored as a reference speech pattern or "template" in the reference pattern memory 4. Generally, registered reference speech patterns have different frame numbers because of different speaking rates and different word durations.
When the mode selector 3 is shifted to a contact terminal B, the speech recognition system is in a recognition mode. In the recognition mode, the acoustic parameter time sequence of an input speech signal which is uttered by the speaker is supplied from the acoustic analyzer 2 to an input speech pattern memory 5 and temporarily stored therein. The input speech pattern stored in the input speech pattern memory 5 is then supplied to a distance calculator 6, which calculates the magnitudes of differences between the input speech pattern and the reference speech patterns of a plurality of words read out of the reference pattern memory 4. A minimum difference detector 7 connected to the distance calculator 6 then detects the word whose reference speech pattern has a minimum difference with the input speech pattern. The distance calculator 6 and the minimum difference detector 7 make up a pattern matcher 8 for recognizing the input word uttered by the speaker.
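The recognition-mode path through the distance calculator 6 and the minimum difference detector 7 can be sketched as below. For simplicity the sketch assumes the input and reference patterns already have equal shape (channels by frames) and uses a sum of absolute differences; the function name and dictionary layout are illustrative assumptions, not the disclosed apparatus.

```python
import numpy as np

def recognize(input_pattern, reference_patterns):
    """Pick the reference word whose pattern is nearest the input.

    input_pattern: array of shape (I, N).
    reference_patterns: dict mapping word -> array of the same shape.
    The distance here is a simple sum of absolute differences.
    """
    best_word, best_distance = None, float("inf")
    for word, reference in reference_patterns.items():
        distance = float(np.sum(np.abs(input_pattern - reference)))
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance
```

In practice the registered patterns have differing frame counts, which is exactly why the time normalization discussed below is needed before such a frame-by-frame comparison is valid.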
In this manner, an unknown utterance can be recognized by pattern matching between the input speech pattern and some registered reference speech patterns.
It is known that different utterances of the same word or phrase may have their spectral patterns continuing in respective different durations or including speech events whose timing is not the same for the utterances. For example, the Japanese word "HAI", which is equivalent to "yes" in English, may be uttered in some cases as "HAAAI" that continues in a prolonged duration. If the utterance "HAAAI" is applied as an input speech pattern to the speech recognition system, then its distance from a reference speech pattern or template for the word "HAI" is so large that the input speech sound will be recognized as a word different from "HAI".
Prior to the pattern matching, therefore, it is necessary to effect time normalization or dynamic time warping to realign portions of an utterance with corresponding portions of the template. The time normalization is an important process for higher recognition accuracy.
One method for carrying out such time normalization is dynamic programming (DP) matching as disclosed in Japanese laid-open patent publication No. 50-96104, for example. According to the DP matching, rather than using multiple reference speech patterns extending over different time durations, a number of time-normalized reference speech patterns are generated with a warping function, and the distances between the reference patterns and an input utterance pattern are determined, with a minimum distance being detected for recognizing the input utterance.
In the DP matching method, the number of frames of the registered reference speech patterns is not fixed. Furthermore, it is necessary to effect DP matching between all of the registered reference speech patterns and an input utterance pattern. Therefore, the more words there are to be recognized, the more calculations the DP matching method requires.
The DP matching tends to yield a recognition error between partly analogous utterance patterns because the process relies for speech recognition upon the steady regions of utterances where the spectral pattern does not vary with time.
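The DP matching discussed above can be sketched with the standard dynamic-programming recurrence for aligning two sequences of different lengths. This is a generic illustration using scalar features and the common symmetric three-way step, not the exact recurrence of the cited publication:

```python
def dp_matching_distance(a, b):
    """Dynamic-programming (time-warped) distance between two sequences.

    a, b: lists of per-frame feature values (scalars for simplicity).
    Each cell d[i][j] holds the best cumulative distance aligning the
    first i frames of a with the first j frames of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch a
                                 d[i][j - 1],      # stretch b
                                 d[i - 1][j - 1])  # advance both
    return d[n][m]
```

Because the full table must be filled for every registered reference pattern, the computation grows with both vocabulary size and utterance length, which is the drawback noted above.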
There has been proposed a process for time normalization which is free from the above drawbacks, as disclosed in Japanese patent application No. 59-106178, for example.
The proposed time normalization process will briefly be described below. An acoustic parameter time sequence Pi(n) is composed of a sequence of dots in a parameter space. For example, if an input utterance to be recognized is "HAI" and an acoustic analyzer has two bandpass filters, so that

Pi(n) = (P1(n), P2(n)),
then the acoustic parameter time sequence of the input utterance is composed of a sequence of dots in a two-dimensional parameter space as shown in FIG. 2 of the accompanying drawings. The input utterance includes an unsteady region 9 having coarsely distributed dots and a quasi-steady region 10 having closely distributed dots. If an input utterance is fully steady, then its parameters do not vary, and the dots stay together in one spot in the parameter space.
The different time durations of utterances due to different speaking rates are primarily caused by different dot sequence densities of the quasi-steady regions 10 thereof, but not largely affected by the time durations of the unsteady regions 9 thereof.
As shown in FIG. 3 of the accompanying drawings, a path or trajectory 11 is estimated as a continuous curve approximating the overall acoustic parameter time sequence Pi(n). The trajectory 11 remains almost unchanged regardless of different time durations of utterances.
In view of the aforesaid property of the trajectory 11, there has also been proposed a process for time normalization, as disclosed in Japanese patent application No. 59-106177, for example.
According to the other proposed time normalization process, a trajectory 11 is estimated as a continuous curve Pi(s) from a start point Pi(1) of the acoustic parameter time sequence Pi(n) to an end point Pi(N) thereof, and the length L of the curve Pi(s) is determined, as shown in FIG. 4 of the accompanying drawings. The trajectory 11 is then sampled again at intervals Le therealong. Specifically, if the trajectory 11 is to be re-sampled at M points 12 therealong, then it is re-sampled at intervals Le each given by:

Le = L/(M-1).
A parameter time sequence Qi(m) (i=1, 2, . . . , I, m=1, 2, . . . , M) representative of re-sampled points of the trajectory 11 possesses basic information of the trajectory 11, and remains almost unchanged regardless of different utterance time durations. Consequently, the parameter time sequence Qi(m) is time-normalized.
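The re-sampling described above can be sketched as follows. The sketch treats the trajectory as the polyline through the frames, measures arc length with Euclidean inter-frame distances (the source permits Chebyshev as well), and places M points at equal intervals Le = L/(M-1) by linear interpolation. The function name is an illustrative assumption.

```python
import numpy as np

def nat_resample(p, m):
    """Re-sample an acoustic parameter time sequence along its trajectory.

    p: array of shape (N, I) -- N frames, I channels.
    m: number of re-sampling points M.
    Returns Q of shape (M, I): points spaced at equal arc-length
    intervals Le = L / (M - 1) along the polyline through the frames.
    """
    segment = np.linalg.norm(np.diff(p, axis=0), axis=1)  # inter-frame distances
    cumulative = np.concatenate([[0.0], np.cumsum(segment)])
    targets = np.linspace(0.0, cumulative[-1], m)         # 0, Le, 2Le, ..., L
    q = np.empty((m, p.shape[1]))
    for i, channel in enumerate(p.T):                     # interpolate per channel
        q[:, i] = np.interp(targets, cumulative, channel)
    return q
```

A slow utterance adds dots mainly inside the quasi-steady regions, leaving the trajectory itself nearly unchanged, so Q is largely independent of speaking rate.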
The parameter time sequence Qi(m) thus produced is registered as a reference speech pattern. To recognize an unknown input utterance pattern, it too is converted into a parameter time sequence Qi(m). The distance between the input utterance and reference speech patterns is determined based on these parameter time sequences Qi(m). A minimum distance is detected for recognizing the input utterance while any different time durations between utterances are being normalized.
According to the above time normalization process, the number of frames of a parameter time sequence Qi(m) is always M irrespective of different speaking rates and different time durations of words when they are registered, and the parameter time sequence Qi(m) is time-normalized. Therefore, the distance between input utterance and reference speech patterns can be calculated by the simple process for calculating the Chebyshev distance.
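Since both patterns have exactly M frames after normalization, the distance reduces to a direct frame-by-frame comparison with no warping. The sketch below assumes one common convention, summing the per-frame Chebyshev (maximum-coordinate) differences over all M frames; the source does not spell out how the per-frame values are combined, so that choice is an assumption.

```python
import numpy as np

def chebyshev_distance(q_input, q_reference):
    """Distance between two time-normalized patterns of shape (M, I).

    For each of the M frames, take the largest absolute difference over
    the I channels (the Chebyshev distance), then sum over all frames.
    """
    return float(np.sum(np.max(np.abs(q_input - q_reference), axis=1)))
```

Compared with filling a DP table per reference word, this is a single pass over M frames, which is the simplification the passage refers to.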
Since the above time normalization process attaches more importance to unsteady regions of utterances, it is less subject to a recognition error between partly analogous utterance patterns than the DP matching process.
Furthermore, the normalized parameter time sequence Qi(m) does not contain information indicative of different speaking rates. Thus, the time normalization process can easily handle global features of a parameter transition structure in the parameter space, lending itself to various methods effective for speaker-independent speech recognition.
The time normalization process described above is referred to as a NAT (Normalization Along Trajectory) process.
Japanese patent application No. 59-109172, for example, discloses a speech recognition system which carries out the NAT process a plurality of times to prevent a word recognition rate from being lowered due to utterance fluctuations and partial word similarities.
FIG. 5 of the accompanying drawings illustrates the disclosed speech recognition system in block form.
As shown in FIG. 5, an acoustic analyzing unit 2a receives an input speech signal from a microphone 1 and extracts acoustic parameters indicative of features of the input speech pattern at certain time intervals referred to as frame periods.
The feature parameter in each frame period is sent from the acoustic analyzing unit 2a to an utterance boundary detector 2b, which determines start and end points of the input utterance. The acoustic analyzing unit 2a and the utterance boundary detector 2b jointly make up an acoustic analyzer 2 equivalent to the acoustic analyzer 2 shown in FIG. 1.
More specifically, as shown in FIG. 6 of the accompanying drawings, when a speech recognition sequence is started, an utterance boundary of the input utterance is determined within a predetermined interval in one frame period in a step S1. As shown in FIG. 7, the start and end points of the input utterance are determined within an interval t1 which is sufficiently shorter than one frame period of 5.12 msec. Therefore, no processing is carried out in a remaining time interval t2 in the frame period.
Then, a step S2 determines whether the input utterance is finished or not. If not, then the processing goes back to the step S1. If finished, then the processing proceeds to a step S3 in which a first NAT process is carried out by a NAT processor 13 shown in FIG. 5. After the first NAT process, a second NAT process is carried out, if necessary, by the NAT processor 13. The first and second NAT processes will be described later.
After the first and second NAT processes, a pattern matcher 8 matches points of the trajectory 11 of the input utterance at sampling intervals Le with those of a reference speech pattern stored in a reference pattern memory 4 in a step S5. Thereafter, the pattern matcher 8 outputs a recognition result in a step S6, and the speech recognition sequence comes to an end.
Each of the first and second NAT processes will be described below with reference to FIG. 8.
As shown in FIG. 8, the NAT processor 13 first calculates the length L of the trajectory 11 in a step ST1. The length L may be calculated according to the process of calculating the Chebyshev or Euclidean distance.
Thereafter, the NAT processor 13 determines the distance Le between adjacent sampling points on the trajectory 11 in a step ST2. The distance Le may be determined according to the equation Le=L/(M-1) or Le=L/M where M is the number of sampling points.
Then, the NAT processor 13 sets values k, L' to k=1, L'=Le, respectively, in a step ST3, and calculates the distance dk between a kth frame and a (k+1)th frame in a step ST4.
The NAT processor 13 determines whether dk - L' ≥ 0 or not in a step ST5. If not, then the NAT processor 13 sets the values k, L' to k=k+1, L'=L'-dk, respectively, in a step ST6, and the processing returns to the step ST4. If dk - L' ≥ 0 in the step ST5, then the NAT processor 13 determines a re-sampling point between two points composed of the data in the kth and (k+1)th frames in a step ST7.
Thereafter, the NAT processor 13 determines whether the number of re-sampling points is equal to a desired number or not in a step ST8. If equal, then the NAT process is ended. If not equal, then the processing goes to a step ST9 in which the NAT processor 13 sets the value L' to L'=L'+Le, and then goes back to the step ST5. In this manner, the NAT processor 13 determines as many re-sampling points as fit between the two points composed of the data in the kth and (k+1)th frames.
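Steps ST1 through ST9 above can be sketched as follows. The sketch uses Euclidean inter-frame distances and 0-based frame indices, takes Le = L/(M-1) with the start frame as the first of the M points, and adds a small rounding guard at the final segment that is not in the flowchart; these are illustrative assumptions.

```python
import numpy as np

def nat_resample_stepwise(p, m):
    """Step-by-step NAT re-sampling following steps ST1-ST9.

    p: array of shape (N, I). m: desired number of re-sampling points M,
    counting the start frame; Le = L / (M - 1).
    """
    d = np.linalg.norm(np.diff(p, axis=0), axis=1)  # ST1/ST4: inter-frame distances dk
    le = float(d.sum()) / (m - 1)                   # ST2: re-sampling interval Le
    points = [p[0].copy()]                          # start point of the trajectory
    k, l_prime = 0, le                              # ST3
    while len(points) < m:
        if d[k] - l_prime >= 0:                     # ST5: point lies within segment k
            t = l_prime / d[k]
            points.append(p[k] + t * (p[k + 1] - p[k]))  # ST7: interpolate
            l_prime += le                           # ST9: advance to the next point
        else:                                       # ST6: move to the next segment
            l_prime -= d[k]
            k += 1
            if k >= len(d):                         # rounding guard at the end point
                points.append(p[-1].copy())
                break
    return np.array(points)
```

The inner branch mirrors the flowchart: several re-sampling points may be emitted from one segment (ST7/ST9 looping back to ST5) before k advances, and the residual L' carries the leftover arc length across segment boundaries.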
According to the above NAT process, since the re-sampling intervals are calculated from the length L of the trajectory 11, it is necessary that the length L be determined prior to the NAT process. In order to determine the length L, the start and end points of the input utterance must have been detected.
FIG. 9 of the accompanying drawings shows the speech recognition sequence shown in FIG. 6 as it develops with time. The graph of FIG. 9 has a horizontal axis representing time t and a vertical axis representing a speech level. The speech recognition sequence for recognizing an input utterance 15 requires a time interval t3 in which to detect an utterance boundary from a start point 16 to an end point 17 of the input utterance 15, a time interval t4 in which to effect the NAT process, and a time interval t5 in which to effect the pattern matching, until it outputs a recognition result. The speech recognition sequence is time-consuming because the NAT process is allowed to start only after the input utterance 15 is finished.
Furthermore, since the parameter time sequence which has been obtained by acoustically analyzing the input utterance has to be held from the start point 16 to the end point 17 of the input utterance, it is necessary to provide a memory capacity that is large enough to store the necessary parameter time sequence.