1. Field of the Invention
The present invention relates to a speech recognition apparatus and method which can be commonly used by a plurality of members or persons belonging to a specific group and which exhibit high recognition performance with respect to speech inputs from the respective members of the group by using a relatively small amount of learning pattern data extracted from speech inputs from the members.
2. Description of the Related Art
A system for inputting/outputting data by using speech is more natural for persons than other systems and hence is superior thereto as a man/machine interface. For this reason, such a system has been studied in a variety of ways. Most of the currently available speech recognition apparatuses are apparatuses for recognizing a verbal speech input, and are arranged as shown in FIG. 1.
An acoustic analyzing section 1 converts an utterance into an electrical signal by using an incorporated microphone, and performs acoustic analysis on the basis of BPF (band pass filtering) analysis or LPC (linear prediction coding) analysis. A beginning frame/ending frame or word boundary detector 2 detects an interval of the spoken word. A standard pattern dictionary 3 prestores standard pattern data of words to be recognized. A pattern matching section 4 calculates a similarity, a distance (e.g., a Euclidean distance or a sum of absolute values (city block distance) in DP (dynamic programming) matching), or the like between acoustic analysis data (feature data: speech pattern) of the input speech in the interval of the spoken word and the standard pattern data of the respective words prestored in the standard pattern dictionary 3. A determining section 5 evaluates the calculation result obtained by the pattern matching section 4. As a result, for example, the category name of a standard pattern having the highest similarity value is obtained as a recognition result with respect to the input speech.
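The matching and determination steps described above can be sketched as follows. This is only an illustrative sketch, not the arrangement of FIG. 1 itself: it assumes that every pattern has already been reduced to a feature vector of the same length, and the names `euclidean`, `city_block`, `recognize`, and `standard_patterns` are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def city_block(x, y):
    """Sum of absolute differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def recognize(pattern, dictionary, distance=euclidean):
    """Return the category name whose standard pattern is closest to `pattern`."""
    return min(dictionary, key=lambda name: distance(pattern, dictionary[name]))

# Hypothetical two-word dictionary of standard patterns (made-up values).
standard_patterns = {"yes": [1.0, 0.0, 2.0], "no": [0.0, 3.0, 1.0]}
result = recognize([0.9, 0.2, 1.8], standard_patterns)
```

When a distance is used, the smallest value decides the category; when a similarity is used, the largest value decides it.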
In speech recognition using such a pattern matching method, however, a problem persists in a difference (pattern deformation) in the time base direction between input speech pattern data and standard pattern data stored in advance. For this reason, in a conventional apparatus, the difference in the time base direction is reduced by linear time warping or nonlinear time warping represented by a dynamic programming method.
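Nonlinear time warping by dynamic programming can be illustrated with the following minimal sketch. It assumes a city block distance between frames and the simplest symmetric step pattern with no slope constraint; conventional apparatuses may use other local distances and path constraints. The function name `dtw_distance` and the sample sequences are hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated city block distance along the best nonlinear time alignment."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: best accumulated distance aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum(abs(p - q) for p, q in zip(seq_a[i - 1], seq_b[j - 1]))
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]

# The same made-up contour spoken slowly and quickly (one feature per frame).
slow = [[0.0], [0.0], [1.0], [1.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
warped = dtw_distance(slow, fast)
```

Because the alignment may repeat frames of the shorter sequence, the two utterances above match with zero accumulated distance even though their lengths differ.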
In addition, a subspace method has been proposed. In this method, an orthogonalized dictionary is prepared from learning pattern data acquired in advance, and speech recognition is performed by using this dictionary. FIG. 2 shows an arrangement of the subspace method. Referring to FIG. 2, an acoustic analyzing section 1 and a beginning frame/ending frame or word boundary detector 2 have the same arrangement and function as those of the corresponding components shown in FIG. 1. A sampling point extracting section 6 extracts a predetermined number of sampling points obtained by equally dividing a speech interval detected by the detector 2, and obtains standard pattern data represented by (number of feature vectors) × (number of sampling points). Such standard pattern data are acquired in a predetermined unit for each category (word, syllable, phoneme, and the like) and are stored in a pattern storage section 7. A Gram-Schmidt orthogonalization section (to be referred to as a GS orthogonalization section hereinafter) 8 prepares an orthogonalized dictionary 9 by using the predetermined units (three or more) of standard pattern data acquired in the pattern storage section 7.
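The sampling point extraction can be sketched as follows. This is an assumption-laden illustration: it assumes that the detected speech interval is given as a list of per-frame feature vectors, that the first and last frames are both included among the sampling points, and that the selected frames are simply concatenated into one pattern vector. The name `extract_sampling_points` and the sample data are hypothetical.

```python
def extract_sampling_points(frames, num_points):
    """Pick `num_points` equally spaced frames and flatten them into one vector."""
    if num_points < 2 or not frames:
        raise ValueError("need at least one frame and two sampling points")
    last = len(frames) - 1
    pattern = []
    for k in range(num_points):
        # Nearest frame index to the k-th equal division of the interval.
        idx = int(k * last / (num_points - 1) + 0.5)
        pattern.extend(frames[idx])
    return pattern

# Nine frames with two features each (made-up values), reduced to 3 points.
frames = [[float(t), 10.0 * t] for t in range(9)]
pattern = extract_sampling_points(frames, 3)
```

The resulting pattern always has (number of features per frame) × (number of sampling points) elements, regardless of how long the utterance was.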
Assume that the mth learning pattern data for each category are data $a_m$, and that three sets of learning pattern data are used.
(i) The first learning pattern data $a_1$ is set as dictionary data $b_1$ of the first axis, and
$$b_1 = a_1 / \|a_1\| \qquad (1)$$
is stored in the orthogonalized dictionary 9, where $\|\cdot\|$ denotes a norm.
(ii) By using the second learning pattern data $a_2$,
$$b_2 = a_2 - (a_2^T \cdot b_1)\, b_1 \qquad (2)$$
is calculated according to a Gram-Schmidt orthogonalization equation. If the norm $\|b_2\|$ of the data $b_2$ is larger than a predetermined value, $b_2$ is normalized using the value $\|b_2\|$ and is stored in the dictionary 9 as dictionary data $b_2$ of the second axis. Note that $(\cdot)$ represents an inner product, and $T$ represents transposition.
(iii) From the third learning pattern data $a_3$,
$$b_3 = a_3 - \sum_{j=1}^{2} (a_3^T \cdot b_j)\, b_j \qquad (3)$$
is calculated. If the norm $\|b_3\|$ is larger than a predetermined value, $b_3$ is normalized using the value $\|b_3\|$ and is stored in the dictionary 9 as dictionary data $b_3$ of the third axis. Note that if the dictionary data $b_2$ of the second axis has not been obtained, equation (2) is calculated with $a_2$ replaced by $a_3$.
The processing from the item (i) to the item (iii) is repeatedly executed for each category to prepare an orthogonalized dictionary.
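The processing of items (i) to (iii) for one category can be sketched as follows. This is a minimal illustration rather than the GS orthogonalization section 8 itself: the threshold `min_norm` stands in for the unspecified predetermined value, the norm check is applied to every axis including the first (an assumption), and the names `dot` and `build_axes` are hypothetical.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def build_axes(learning_patterns, min_norm=1e-6):
    """Orthonormal dictionary axes for one category via Gram-Schmidt."""
    axes = []
    for a in learning_patterns:
        b = list(a)
        # Remove the components of `a` along every axis obtained so far.
        for axis in axes:
            coeff = dot(a, axis)
            b = [bi - coeff * ui for bi, ui in zip(b, axis)]
        norm = math.sqrt(dot(b, b))
        # Keep the residual only if it is large enough to add a new axis.
        if norm > min_norm:
            axes.append([bi / norm for bi in b])
    return axes

# Three made-up learning patterns for a single category.
axes = build_axes([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
```

If a learning pattern adds nothing beyond the axes already stored, its residual norm stays below the threshold and no axis is stored for it, so the next pattern is orthogonalized only against the axes that do exist.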
A similarity computing section 10 computes the following equation by using the orthogonalized dictionary 9 prepared in the above-described manner and input speech pattern data X: ##EQU2##
As a result, a similarity between the input speech pattern data X and the orthogonalized dictionary data $b_{i,r}$ for a category i is obtained. Note that the orthogonalized dictionary data for the category i are normalized in advance. In equation (4), $K_i$ represents the number of dictionaries (axes) for the category i.
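Equation (4) is not reproduced in this text, so the following sketch only assumes a form commonly used in subspace methods: the sum over the axes of a category of the squared inner product between X and each normalized axis, divided by the squared norm of X. The names `similarity` and `classify` and the sample dictionary are hypothetical.

```python
def similarity(x, axes):
    """Assumed similarity: fraction of the energy of x lying in the category subspace."""
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0
    projected = 0.0
    for axis in axes:  # one term per dictionary axis of the category
        inner = sum(a * b for a, b in zip(x, axis))
        projected += inner * inner
    return projected / energy

def classify(x, dictionary):
    """Return the category whose axes give the highest similarity to x."""
    return max(dictionary, key=lambda name: similarity(x, dictionary[name]))

# Made-up orthonormal axes for two categories.
dictionary = {"a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], "b": [[0.0, 0.0, 1.0]]}
best = classify([3.0, 4.0, 0.0], dictionary)
```

Under this assumed form the value lies between 0 and 1 when the axes are orthonormal, and it does not depend on the loudness (overall scale) of the input pattern.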
By using this GS orthogonalization, the recognition performance can be greatly improved.
In an apparatus of this type, however, an orthogonalized dictionary is prepared for only a specific speaker. For this reason, every time another speaker attempts to use the speech recognition apparatus, the orthogonalized dictionary must be updated. Therefore, preparation of an orthogonalized dictionary by acquiring a large number of learning patterns from a large number of speakers is considered. However, the preparation of such a dictionary is very complicated, and hence it is difficult to obtain a dictionary having high recognition performance.
In the above-described speech recognition based on the subspace method using an orthogonalized dictionary, a problem persists in how to efficiently prepare an orthogonalized dictionary having high performance by using learning pattern data acquired from a plurality of speakers. In addition, a problem persists in how to efficiently acquire, from a plurality of speakers, the learning pattern data required for the preparation of such an orthogonalized dictionary.