1. Field of the Invention
The present invention relates to a speech recognition apparatus and method capable of obtaining a high recognition capability using a small number of learning patterns.
2. Description of the Related Art
An information I/O operation using speech is very natural to man and, as a man-machine interface, is superior to one provided by a keyboard or the like. For this reason, various speech I/O systems have been studied. Almost all available speech recognition apparatuses use pattern matching for recognizing word speech, and are arranged as shown in FIG. 1.
As shown in FIG. 1, acoustic analyzing section 1 converts an utterance into an electrical signal using a microphone arranged therein, and acoustically analyzes the resulting electrical signal using BPF (band pass filtering) analysis or LPC (Linear Prediction Coding) analysis. Beginning frame/ending frame (end) detector 2 detects a word speech interval of the analyzed signal. Standard pattern dictionary 3 prestores standard patterns of words to be recognized. Pattern matching section 4 computes a similarity or distance between the analyzed signal and the prestored standard patterns of the dictionary 3 (e.g., Euclidean distance, a sum of absolute values in DP (Dynamic Programming) matching, and the like). Determining section 5 determines the recognition results using similarity or distance values computed in pattern matching section 4. For example, section 5 selects a category name of a prestored standard pattern having the highest similarity to the analyzed signal as a recognition result of the input speech.
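The matching and determining stages of FIG. 1 can be illustrated with a minimal sketch. It assumes each word is already reduced to a fixed-length feature vector and that Euclidean distance is the matching measure; the function names, the dictionary layout, and the toy values are hypothetical and are not taken from the apparatus described.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(input_pattern, dictionary):
    """Return the category name whose standard pattern is closest to the input.

    `dictionary` maps a category name to one standard pattern
    (a hypothetical layout chosen for this sketch).
    """
    return min(
        dictionary,
        key=lambda name: euclidean_distance(input_pattern, dictionary[name]),
    )


# Toy three-dimensional patterns, invented for illustration only.
standard_patterns = {"yes": [1.0, 0.0, 0.5], "no": [0.0, 1.0, 0.5]}
result = recognize([0.9, 0.1, 0.4], standard_patterns)
```

Selecting the smallest distance plays the same role as selecting the highest similarity in determining section 5.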
However, in speech recognition based on the pattern matching method, words can be spoken at a rate different from the rate at which prestored standard patterns are provided. As a result, a displacement (pattern deformation) between an input speech pattern and a prestored standard pattern along a time reference frame or time axis used to correlate the two patterns poses a problem. The conventional system overcomes the displacement along the time axis by linear expansion/compression or nonlinear expansion/compression such as Dynamic Programming.
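The nonlinear expansion/compression mentioned above can be sketched as a basic dynamic-programming alignment. The sketch assumes one scalar feature per frame and an absolute-difference local cost; real systems use feature vectors per frame and often add slope constraints, which are omitted here. The function name is hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated absolute-difference cost along the best nonlinear time alignment."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] holds the best accumulated cost of aligning
    # the first i frames of seq_a with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            # A frame may be matched after stretching either sequence
            # (vertical or horizontal step) or after advancing both (diagonal step).
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# The same contour spoken slowly and quickly (toy values).
slow = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
fast = [0.0, 1.0, 2.0]
alignment_cost = dtw_distance(slow, fast)
```

Because the alignment may repeat frames, the slow and fast versions of the same contour align with zero cost even though their lengths differ.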
Furthermore, a subspace method has been proposed. Such a subspace method is described, for example, by M. Jalanko and T. Kohonen in IEEE ICASSP '78 (International Conference on Acoustics, Speech, and Signal Processing), pp. 561-564, April 1978. In the subspace method, an orthogonalized dictionary is created based on learning patterns acquired in advance, and speech recognition is performed using the created orthogonalized dictionary. FIG. 2 shows the arrangement for performing the subspace method. Acoustic analyzing section 1 and end detector 2 have the same arrangements and functions as the corresponding elements shown in FIG. 1. Sampling point extracting section 6 extracts a predetermined number of sampling data points obtained by equally dividing a word speech interval of the analyzed signal detected by end detector 2, and obtains a standard learning pattern represented by the number of feature vectors × the number of sampling points. A predetermined number of such standard learning patterns are acquired in sets of categories (word recognition, syllable) to be recognized, and are stored in pattern storage section 7. Gram Schmidt orthogonalizing section 8 (hereinafter referred to as "GS orthogonalizing section") creates orthogonalized dictionary 9 using the predetermined number (three or more) of standard learning patterns stored in storage section 7 as described below.
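The sampling point extraction can be sketched as follows. The sketch assumes that the sampling points include both ends of the word interval and that the nearest analysis frame is taken at each point; the text does not say whether interpolation is used, so both choices are assumptions, as is the function name.

```python
def extract_sampling_points(frames, num_points):
    """Pick `num_points` frames at equal spacing across the word interval.

    `frames` is a list of per-frame feature vectors for the detected interval.
    The selected frames are concatenated into one flat learning pattern of
    length (features per frame) x (number of sampling points).
    """
    if num_points < 2:
        raise ValueError("need at least two sampling points")
    if not frames:
        raise ValueError("empty word interval")
    last = len(frames) - 1
    pattern = []
    for k in range(num_points):
        # Nearest frame to the k-th equally spaced position (assumption).
        index = round(k * last / (num_points - 1))
        pattern.extend(frames[index])
    return pattern


# Nine toy frames with two features each, invented for illustration.
frames = [[float(t), 10.0 * t] for t in range(9)]
learning_pattern = extract_sampling_points(frames, 3)
```

Whatever the duration of the utterance, the resulting pattern has a fixed length, which is what allows the patterns to be treated as vectors in the orthogonalization that follows.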
Assume that an mth learning pattern of a given category is defined as $a_m$, and that learning patterns generated three times are used. The subspace method prepares the orthogonalized dictionary using the operations listed below.
(i) With first learning data $a_1$ defined as dictionary data $b_1$ of a first axis, the following relation is registered in orthogonalized dictionary 9:
$$b_1 = a_1. \tag{1}$$
(ii) The following computation is performed based on second learning data $a_2$ using a GS orthogonalizing equation:
$$b_2 = a_2 - \frac{(a_2 \cdot b_1)\, b_1}{\|b_1\|^2}. \tag{2}$$
When $\|b_2\|$ is larger than a predetermined value, $b_2$ is registered in orthogonalized dictionary 9 as dictionary data of a second axis. In equation (2), $(\cdot)$ indicates an inner product, and $\|\ \|$ indicates the norm.
(iii) The following computation is performed based on third learning data $a_3$:
$$b_3 = a_3 - \frac{(a_3 \cdot b_1)\, b_1}{\|b_1\|^2} - \frac{(a_3 \cdot b_2)\, b_2}{\|b_2\|^2}. \tag{3}$$
When $\|b_3\|$ is larger than a predetermined value, $b_3$ is registered in orthogonalized dictionary 9 as dictionary data of a third axis. However, if the dictionary data of the second axis has not yet been obtained, the computation of equation (2) is performed.
Operations (i) through (iii) are performed for each category to prepare dictionary data for the orthogonalized dictionary.
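Operations (i) through (iii) can be sketched for one category as a classical Gram-Schmidt loop. The function names and the threshold `min_norm` are hypothetical; the text only refers to "a predetermined value". As a small departure, the sketch also applies the threshold to the first pattern to avoid dividing by zero.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(dot(u, u))


def gs_dictionary(learning_patterns, min_norm=1e-6):
    """Build orthogonal dictionary axes for one category.

    Each learning pattern has its components along the already registered
    axes removed; the residual is registered as a new axis only when its
    norm exceeds `min_norm` (hypothetical threshold).
    """
    axes = []
    for a in learning_patterns:
        b = list(a)
        for axis in axes:
            coeff = dot(a, axis) / dot(axis, axis)
            b = [bi - coeff * xi for bi, xi in zip(b, axis)]
        if norm(b) > min_norm:
            axes.append(b)
    return axes


# Three toy learning patterns for a single category.
patterns = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
axes = gs_dictionary(patterns)
```

If the second pattern yields no axis, the third pattern has only its component along the first axis removed, which corresponds to falling back to the two-pattern computation.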
Similarity computing section 10 computes the following equation between each of the dictionary data of the orthogonalized dictionary 9 created as described above and each input speech pattern X: ##EQU2##
As a result, a similarity with each orthogonalized dictionary data $b_{i,r}$ of category $i$ is computed. Input speech pattern X is recognized in accordance with the computed similarity. Note that orthogonalized dictionary data $b_{i,r}$ of category $i$ are normalized in advance. $K_i$ indicates the number of axes of the orthogonalized dictionary data.
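The similarity computation can be sketched as follows, assuming the form commonly used with the subspace method: the sum over the $K_i$ axes of the squared inner product between X and each unit-norm axis, divided by the squared norm of X. This form is an assumption of the sketch and is not confirmed by the text; the function names and the toy dictionary are likewise hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def similarity(x, axes):
    """Fraction of the energy of `x` that lies in the span of `axes`.

    `axes` must be orthogonal and normalized in advance (unit norm).
    The normalization by the energy of `x` is an assumption of this sketch.
    """
    x_energy = dot(x, x)
    if x_energy == 0.0:
        raise ValueError("input pattern has zero energy")
    return sum(dot(x, b) ** 2 for b in axes) / x_energy


def recognize(x, dictionary):
    """Return the category whose orthogonalized dictionary gives the highest similarity."""
    return max(dictionary, key=lambda name: similarity(x, dictionary[name]))


# Toy orthonormal dictionaries for two categories, invented for illustration.
dictionary = {
    "A": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "B": [[0.0, 0.0, 1.0]],
}
best = recognize([3.0, 4.0, 0.0], dictionary)
```

Under this assumed form the similarity lies between 0 and 1, and reaches 1 when the input lies entirely within the subspace spanned by a category's axes.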
However, in the method using the GS orthogonalization, a deviation borne by each orthogonal axis is not clear. More specifically, in orthogonalization, a variety of orthogonal axes can be considered, and a pattern deviation changes depending on the orthogonal axis considered. For this reason, standard patterns represented by dictionary data $\{b_{i,1}, b_{i,2}, b_{i,3}\}$ of category $i$ of the orthogonalized dictionary computed as described above do not always accurately represent the original standard learning patterns of category $i$.
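One way to see that a variety of orthogonal axes can be considered is to note that Gram-Schmidt axes depend on the order in which the learning patterns are presented. The sketch below is a toy illustration of that point only; it assumes linearly independent patterns and omits the norm threshold, and the function name is hypothetical.

```python
def gram_schmidt(patterns):
    """Classical Gram-Schmidt without normalization or thresholding.

    Assumes the patterns are linearly independent, so no axis is zero.
    """
    axes = []
    for a in patterns:
        b = list(a)
        for axis in axes:
            coeff = sum(x * y for x, y in zip(a, axis)) / sum(y * y for y in axis)
            b = [bi - coeff * yi for bi, yi in zip(b, axis)]
        axes.append(b)
    return axes


# Two toy learning patterns presented in both possible orders.
p1, p2 = [1.0, 0.0], [1.0, 1.0]
axes_forward = gram_schmidt([p1, p2])
axes_reversed = gram_schmidt([p2, p1])
```

Both results span the same plane, yet the individual axes differ: the first axis is always the first pattern presented, so the variation assigned to each axis is a consequence of the ordering rather than of the data alone.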