1. Field of the Invention
The present invention relates to a speech recognizing apparatus.
2. Description of the Prior Art
With recent progress in the art of speech recognition, not only speech recognizing apparatuses for recognizing a short utterance such as a syllable or a word, but also speech recognizing apparatuses for recognizing a long utterance such as a clause or a sentence (generally referred to as continuous speech recognition) have been developed. For short utterances, high-performance apparatuses have been realized by performing recognition with the use of spectrum information of the speech. In continuous speech recognition, however, the long utterance period often results in considerable deformation of the spectrum, so that high recognizing performance cannot be maintained with the spectrum information alone. Therefore, attempts have been made to improve the performance by additionally utilizing features of the speech other than the spectrum information, which have not hitherto been taken into consideration.
One such attempt is a recognizing method that uses information on speech duration. In continuous speech recognition, since the utterance period is long, recognizing the entire utterance period at one time is inefficient; it is therefore general practice to divide the utterance period into a plurality of recognizing sections convenient for the speech recognition. With this recognizing method, controlling the duration of each such section makes it possible to accomplish high-performance recognition without outputting a recognition result having an unnatural duration.
Hereinafter, the prior art speech recognizing apparatus of the above described type will further be discussed with reference to FIG. 10.
In FIG. 10, reference numeral 1 represents a speech input terminal to which a speech signal is inputted; reference numeral 2 represents an analyzer; reference numeral 3 represents an endpoint detector; reference numeral 4 represents a next syllable predicator; reference numeral 101 represents a matching unit; reference numeral 7 represents a recognition result output terminal; reference numeral 8 represents a standard speech spectrum calculator; reference numeral 102 represents a standard speech duration calculator; reference numeral 10 represents a grammar and rule storage buffer; reference numeral 12 represents a standard speech storage buffer; reference numeral 104 represents a standard speech duration information storage buffer; also included is an input speech storage buffer; and reference numeral 14 represents a switch.
The prior art speech recognizing apparatus of the above described construction operates in the following manner. At the time of standard speech learning, standard speech spoken in units of sentences is divided into syllables, and the speech for each syllable is inputted from the speech input terminal 1. The analyzer 2 then analyzes the spectrum information necessary for recognition. If, for example, the LPC cepstrum method is used for this spectrum information, a set of a predetermined number of LPC cepstrum coefficients is calculated for each frame as a characteristic parameter. This analyzing cycle is repeated until a predetermined number of learning speech data items has been processed. Then, the data analyzed for each syllable are clustered in the standard speech spectrum calculator 8, and the data of interest in each cluster are stored in the standard speech spectrum storage buffer 12. The standard speech duration calculator 102 collects the durations of the learned speech in units of frames, which are subsequently stored in the standard speech duration information storage buffer 104.
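The duration collection performed during learning can be pictured with a minimal sketch. The function name, the syllable labels, and the frame counts below are hypothetical; the text states only that durations are collected in units of frames and that their minimum and maximum values are later used to limit the matching period.

```python
from collections import defaultdict


def collect_duration_bounds(training_segments):
    """Return {syllable: (min_frames, max_frames)}.

    training_segments is an iterable of (syllable, n_frames) pairs, one pair
    per syllable occurrence in the learning speech (hypothetical layout).
    """
    durations = defaultdict(list)
    for syllable, n_frames in training_segments:
        durations[syllable].append(n_frames)
    return {s: (min(d), max(d)) for s, d in durations.items()}


# Hypothetical learning data: (syllable label, duration in frames).
segments = [("ka", 12), ("ka", 15), ("to", 9), ("ka", 10), ("to", 11)]
bounds = collect_duration_bounds(segments)
print(bounds)  # {'ka': (10, 15), 'to': (9, 11)}
```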
During speech recognition, the speech signal is inputted through the speech input terminal 1 and the analyzer 2 analyzes the spectrum information for each frame. The method of this analysis is similar to that used during the learning. Then, the endpoint detector 3 detects a speech period using the zero-order LPC cepstrum coefficient in the analyzer (the zero-order coefficient is indicative of speech power information). A speech period is a period that satisfies the following two conditions.
(1) The speech power (zero-order coefficient value) is greater than a predetermined value.
(2) Frames satisfying the condition (1) above continue in succession for a number of frames greater than a predetermined value.
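Conditions (1) and (2) can be illustrated with a short sketch. The function name, the threshold values, and the sample power sequence are assumptions made for illustration only; the text does not give concrete values.

```python
def detect_speech_periods(power, power_threshold, min_frames):
    """Return (start, end) frame index pairs, end exclusive.

    A run of frames is reported when every frame has power above
    power_threshold (condition 1) and the run contains more than
    min_frames frames (condition 2).
    """
    periods = []
    start = None
    for i, p in enumerate(power):
        if p > power_threshold:              # condition (1)
            if start is None:
                start = i
        else:
            if start is not None and i - start > min_frames:  # condition (2)
                periods.append((start, i))
            start = None
    # Close a run that lasts until the final frame.
    if start is not None and len(power) - start > min_frames:
        periods.append((start, len(power)))
    return periods


# Hypothetical zero-order coefficient values, one per frame.
power = [0.1, 0.2, 2.0, 2.5, 3.0, 2.2, 0.3, 2.1, 0.2, 0.1]
print(detect_speech_periods(power, power_threshold=1.0, min_frames=3))  # [(2, 6)]
```

In this example the single loud frame at index 7 satisfies condition (1) but not condition (2), so it is not reported as a speech period.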
Thereafter, the next syllable predicator 4 selects, for each frame, the syllable to be recognized next with the use of the grammar and rules. By way of example, where the grammar and rules to be used are a context free grammar, the grammar and rule storage buffer 10 stores a dictionary of all words to be recognized and a tree structure of the connections among the words, an example of which is shown in FIG. 11. When the recognizing process is carried out along the time axis, a syllable that can follow the syllable candidate recognized in the preceding frame is employed as the next syllable candidate.
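As a rough illustration of selecting the next syllable candidate, the sketch below stores a small dictionary as a tree of syllables and looks up which syllables may follow those already recognized. This is a simplification: it covers only the syllables within a word, not the connections among words shown in FIG. 11, and the words and function names are hypothetical.

```python
def build_syllable_tree(words):
    """Build a nested-dict tree in which each path from the root spells a word."""
    root = {}
    for syllables in words:
        node = root
        for s in syllables:
            node = node.setdefault(s, {})
    return root


def next_syllable_candidates(tree, recognized):
    """Return the syllables that may follow the recognized syllable sequence."""
    node = tree
    for s in recognized:
        if s not in node:
            return []          # the sequence is not in the dictionary
        node = node[s]
    return sorted(node)


# Hypothetical dictionary, each word given as a sequence of syllables.
words = [("sa", "ku", "ra"), ("sa", "ka", "na"), ("ya", "ma")]
tree = build_syllable_tree(words)
print(next_syllable_candidates(tree, []))      # ['sa', 'ya']
print(next_syllable_candidates(tree, ["sa"]))  # ['ka', 'ku']
```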
The matching unit 101 performs matching between the standard speech of the syllable candidate selected as described previously and the input speech. The matching determines the frame $j$ and the syllable $n$ which give the minimum distance $D(i)$ on the left-hand side of the following equation (1), with the matching period limited to between the minimum and maximum values of the duration of each syllable collected during the learning process. The m highest-ranked syllable candidates, that is, those giving the smallest distances D in the equation (1), are stored as a result of recognition in the recognition result storage buffer 12 together with their distances D. The stored result is used when the next succeeding syllable candidate is to be predicted.

$$D(i) = \min_{j,\,n}\left[\,D(j) + D_n(j+1:i)\,\right] \tag{1}$$

wherein $D(i)$ represents the distance between the standard speech syllable sequence and the input speech up to the i-th frame, and $D_n(j+1:i)$ represents the distance between the syllable $n$ of the standard speech and the input speech from the (j+1)-th frame to the i-th frame. It is to be noted that the difference $(i-j)$ is greater than the minimum value of the duration of the syllable $n$ and smaller than the maximum value of the duration of the syllable $n$.
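The recursion of equation (1) with the duration limits can be sketched as a small dynamic program. Several simplifications are assumed here and are not taken from the text: each frame is a single number rather than a set of LPC cepstrum coefficients, each syllable has one hypothetical template value, the segment distance is a sum of absolute differences, and no grammar is applied.

```python
import math


def segment_distance(frames, j, i, template):
    """Distance between one syllable template and input frames j+1..i (1-based)."""
    return sum(abs(x - template) for x in frames[j:i])


def match(frames, templates, bounds):
    """Evaluate D(i) = min over j, n of D(j) + Dn(j+1:i) for every frame i.

    templates: {syllable: template value}        (hypothetical representation)
    bounds:    {syllable: (min_frames, max_frames)}
    Returns (D at the last frame, [(syllable, first_frame, last_frame), ...]).
    """
    total = len(frames)
    dist = [math.inf] * (total + 1)
    back = [None] * (total + 1)
    dist[0] = 0.0
    for i in range(1, total + 1):
        for n, template in templates.items():
            d_min, d_max = bounds[n]
            # Strict limits as stated in the text: d_min < i - j < d_max.
            for j in range(max(0, i - d_max + 1), i - d_min):
                if dist[j] == math.inf:
                    continue
                cost = dist[j] + segment_distance(frames, j, i, template)
                if cost < dist[i]:
                    dist[i] = cost
                    back[i] = (j, n)
    # Trace back the syllable sequence that produced the final distance.
    path = []
    i = total
    while i > 0 and back[i] is not None:
        j, n = back[i]
        path.append((n, j + 1, i))
        i = j
    path.reverse()
    return dist[total], path


# Hypothetical input: three frames near template "a", four near template "b".
frames = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 5.1]
templates = {"a": 1.0, "b": 5.0}
bounds = {"a": (1, 5), "b": (2, 6)}
best_distance, best_path = match(frames, templates, bounds)
print(best_path)  # [('a', 1, 3), ('b', 4, 7)]
```

Because the duration limits are absolute frame counts, any segment whose length falls outside them is never considered, regardless of how well its spectrum matches.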
Thereafter, the process of predicting and matching the next succeeding syllable candidate is carried out from the start to the end of the speech period, and the sequence of syllable candidates which assumes the maximum value of the score S is outputted from the recognition result output terminal 7. The switch 14 operates to output the characteristic parameter to the standard speech spectrum calculator 8 during the learning process and to the endpoint detector 3 during the recognition process.
However, it has been found that the prior art speech recognizing apparatus has many problems. Specifically, since the duration of each syllable is controlled by its absolute value, an erroneous duration tends to be set if the speed of speech differs between the input speech and the standard speech. Accommodating all possible speeds of speech would require duration control for every such speed, which reduces the processing efficiency and makes the learned speech data bulky.
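The speaking-rate problem can be seen in a small worked example. The bounds and durations below are hypothetical figures chosen for illustration; the check uses the same strict limits on the absolute duration as the matching described above.

```python
def within_absolute_bounds(duration, bounds):
    """True when the duration in frames lies strictly inside the learned limits."""
    d_min, d_max = bounds
    return d_min < duration < d_max


# Hypothetical limits learned from standard speech at a normal speaking rate.
bounds = (10, 15)
normal_duration = 12                        # same syllable, normal rate
fast_duration = round(normal_duration * 0.5)  # same syllable, spoken twice as fast

print(within_absolute_bounds(normal_duration, bounds))  # True
print(within_absolute_bounds(fast_duration, bounds))    # False
```

The fast utterance is the same syllable, yet its duration of 6 frames falls outside the learned limits, so the correct candidate would be excluded from matching.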
Also, with the above described construction, since the control of the duration is closed within each syllable, there is a problem in that, even when the difference in duration between neighboring syllables is unrealistically large, a recognition candidate tends to be established as long as the score is large.