1. Field of the Invention
The present invention relates to a speech recognition method and an apparatus therefor, and more particularly, to a speech recognition method and apparatus for recognizing speech, such as words, uttered continuously by an unspecified speaker.
2. Description of the Prior Art
Among the various known techniques for recognizing the speech of unspecified speakers, the most commonly used recognition system will be described below.
FIG. 15 shows the configuration of a recognition system which handles large unspecified vocabularies. Speech input from a speech input unit 1 is sent to a speech analysis unit 2, where feature parameters of the input speech, such as a filter bank output including the power term of the speech or an LPC cepstrum, are obtained. Compression of the parameters (dimension compression by the K-L transform in the case of the filter bank output) is also conducted in the speech analysis unit 2. Since the analysis is conducted in units of frames, the compressed feature parameters of each frame are hereinafter referred to as a feature vector.
Next, phoneme boundaries in the continuously uttered speech are determined by a phoneme boundary detecting unit 3. Subsequently, a phoneme discriminating unit 4 determines the phonemes by a statistical technique. A reference phoneme pattern storing unit 5 stores reference phoneme patterns created from a large number of phoneme samples. A word discriminating unit 6 outputs a final recognition result from a word dictionary 7, either by using the output of the phoneme discriminating unit 4 directly or by modifying the candidate phonemes by means of a modification regulating unit 8. The recognition result is displayed by a recognition result display unit 9.
Generally, the phoneme boundary detecting unit 3 uses discriminant functions or the like for discrimination. The phoneme discriminating unit 4 also conducts discrimination using such functions. Each of these components outputs the candidates which satisfy a predetermined threshold, so that a plurality of phoneme candidates are output for each phoneme boundary. The word discriminating unit 6 therefore narrows the candidates down to a final word using the top-down information stored in the word dictionary 7 and the modification regulating unit 8.
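The threshold-based candidate output described above can be sketched as follows. This is a minimal illustration only, not the prior art system's actual implementation: the function name select_candidates, the score values, and the two threshold values are all hypothetical, and the scores merely stand in for the output of whatever functions the discriminating units use.

```python
def select_candidates(scores, threshold):
    """Return the labels whose score meets the threshold, best score first.

    `scores` maps a candidate label (e.g. a phoneme) to a score.
    Both the mapping and the threshold are hypothetical illustration values.
    """
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


# Hypothetical scores for a single segment between two phoneme boundaries.
segment_scores = {"a": 0.62, "o": 0.55, "e": 0.31, "u": 0.12}

# A strict threshold keeps few candidates but may drop the correct one.
strict = select_candidates(segment_scores, 0.6)

# A loose threshold is more likely to keep the correct one,
# but hands more candidates to the word discriminating stage.
loose = select_candidates(segment_scores, 0.3)

print(strict)
print(loose)
```

With these made-up values, lowering the threshold from 0.6 to 0.3 raises the number of candidates passed downstream from one to three, which is the trade-off the later stages must resolve using top-down information.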
However, since the aforementioned conventional recognition system basically has a bottom-up structure, an error generated at a certain point in the recognition process readily and adversely affects the subsequent processes. For example, when a phoneme boundary is erroneously determined in the phoneme boundary detecting unit 3, the operation of the phoneme discriminating unit 4 and the word discriminating unit 6 may be greatly affected. That is, the errors of the individual processes accumulate, and the final speech recognition rate is limited to approximately the product of the recognition rates of the individual processes. It is therefore difficult to attain a high recognition rate.
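The way errors compound through a bottom-up structure can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that each process succeeds independently of the others and that a later process cannot recover from an error made by an earlier one; under those assumptions the overall recognition rate is the product of the per-process success rates. The three rates used are hypothetical and are not taken from any measured system.

```python
def pipeline_accuracy(stage_accuracies):
    """Multiply per-stage success rates to get the end-to-end success rate.

    Assumes independent stages and no recovery from upstream errors
    (an illustrative simplification, not a property stated in the text).
    """
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Hypothetical success rates for boundary detection,
# phoneme discrimination, and word discrimination.
stages = [0.95, 0.90, 0.95]
overall = pipeline_accuracy(stages)

print(f"overall recognition rate: {overall:.3f}")
```

Under these assumed figures, three processes that are each at least 90 percent reliable yield an overall rate of only about 81 percent, and every additional process lowers the figure further.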
Furthermore, in a recognition apparatus designed for the recognition of unspecified speakers, it is very difficult to set the threshold value used for the determination made in each process. Setting a threshold value which ensures that the correct result is contained in the candidates increases the number of candidates in each process, and hence makes accurate narrowing of the plurality of candidate words very difficult. Moreover, when the recognition apparatus is used in an actual environment, a large amount of non-stationary noise is generated, which lowers the recognition rate even for a recognition apparatus designed to handle a small number of words.