1. Field of the Invention
This invention relates to speech recognition using a neural network.
2. Description of the Related Art
Speech recognition apparatuses using neural networks are currently known. This type of speech recognition apparatus learns beforehand a speech data sequence to be recognized (a recognition object speech data sequence). When an input speech data sequence matches the recognition object speech data sequence, the speech recognition apparatus outputs a speech recognition signal.
Such a conventional speech recognition apparatus is required to learn the speech data sequence to be recognized in a simple manner and to recognize actual input speech data sequences with a high degree of precision. Even when the identical recognition object speech data sequence is inputted successively, the apparatus is required to recognize accurately how many such speech data sequences were successively inputted.
However, the conventional speech recognition apparatus does not completely satisfy the foregoing requirements.
The methods in practical use in the conventional speech recognition apparatus are chiefly grouped into two, i.e. the DP matching method and the hidden Markov model (HMM) method. These methods are described in detail in, for example, the book "Speech recognition By Stochastic Model" by Seiichi Nakagawa.
In short, the DP matching method assumes a correspondence between the beginning and terminating ends of the input data and those of the standard data, and transforms the contents between them by the use of various time normalizing functions. The distance between the pattern transformed to have the smallest difference and the standard pattern is taken as the lost points for that standard pattern. From a plurality of standard patterns, the standard pattern having the minimum number of lost points is selected as the result of matching.
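The DP matching described above can be illustrated by the following minimal sketch. It is not taken from the cited book; it assumes one-dimensional feature values, the absolute difference as the local distance, and the simplest set of allowed time warps. The function names dtw_distance and best_match and the two standard patterns are hypothetical.

```python
def dtw_distance(input_seq, template):
    """Accumulated distance after the best time alignment of two 1-D sequences."""
    n, m = len(input_seq), len(template)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning input_seq[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # the start ends of both sequences are assumed to correspond
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(input_seq[i - 1] - template[j - 1])
            # Allowed warps: stretch the input, stretch the template, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]  # the terminal ends of both sequences are assumed to correspond


def best_match(input_seq, templates):
    """Return the name of the standard pattern with the fewest lost points."""
    return min(templates, key=lambda name: dtw_distance(input_seq, templates[name]))


# Hypothetical standard patterns and a time-stretched input.
templates = {"rise": [0, 1, 2, 3], "fall": [3, 2, 1, 0]}
stretched_rise = [0, 0, 1, 1, 2, 3, 3]
print(best_match(stretched_rise, templates))
```

Note that cost[0][0] and cost[n][m] fix the start and terminal ends of both sequences, which is the assumption discussed in this section.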
On the other hand, the HMM method performs speech recognition through a stochastic process. An HMM model corresponding to a standard pattern in the DP matching method is established. One HMM model comprises a plurality of states and a plurality of transitions. An existence probability is given to each of the states, while a transition probability and an output probability are given to each of the transitions. Thus, the probability with which a certain HMM model generates a given time series pattern can be calculated.
However, both the DP matching method and the HMM method require that the start end and the terminal end of the speech data sequence be defined, both during learning and during speech recognition.
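The probability calculation mentioned above can be sketched with the forward algorithm. The sketch below makes several assumptions: the existence probability is treated as the probability of starting in each state, output probabilities are attached to transitions as the text describes, the output symbols are discrete, and the whole observation sequence is taken to run from a known start end to a known terminal end. The function name sequence_probability and the two-state model are hypothetical.

```python
def sequence_probability(initial, transitions, observations):
    """Probability that the model generates `observations` (forward algorithm).

    initial[i]        : probability of starting in state i
    transitions[i][j] : (probability of the transition i -> j,
                         {symbol: probability of emitting it on that transition})
    """
    n_states = len(initial)
    alpha = list(initial)  # alpha[i]: probability of the prefix seen so far, ending in state i
    for symbol in observations:
        new_alpha = [0.0] * n_states
        for i in range(n_states):
            for j, (p_trans, p_out) in transitions[i].items():
                new_alpha[j] += alpha[i] * p_trans * p_out.get(symbol, 0.0)
        alpha = new_alpha
    return sum(alpha)


# Hypothetical two-state left-to-right model over the symbols "a" and "b".
initial = [1.0, 0.0]
transitions = [
    {0: (0.5, {"a": 0.8, "b": 0.2}), 1: (0.5, {"a": 0.3, "b": 0.7})},
    {1: (1.0, {"a": 0.1, "b": 0.9})},
]
print(sequence_probability(initial, transitions, ["a", "b"]))
```

In recognition, one such model would be prepared per category and the model giving the highest probability would be selected.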
To perform a speech recognition process that appears not to depend on the start end and the terminal end, it is necessary to find the start end and the terminal end by trial and error, which takes a very long time. For example, assume that data belonging to a certain category is to be detected from a pattern of length N. In this case, the number of possible start end positions is on the order of N, and the number of possible terminal end positions is also on the order of N. Namely, the number of combinations of start and terminal ends is on the order of N^2. The recognition process must therefore be repeated over all of these very many combinations to find the start end and terminal end that give the best result, which takes a very long time.
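The growth in the number of candidates can be checked with a short sketch. It assumes frame indices 0 to N-1 and that the terminal end may not precede the start end, which gives N(N+1)/2 candidate segments, a count on the order of the square of N. The function name candidate_segments is hypothetical.

```python
def candidate_segments(n):
    """Enumerate every (start, end) pair with 0 <= start <= end < n."""
    return [(start, end) for start in range(n) for end in range(start, n)]


# A matching method that needs fixed ends would have to be run once per candidate.
for n in (10, 100, 1000):
    print(n, len(candidate_segments(n)))
```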
Beyond the quantitative problem of the number of combinations of start and terminal ends, the conventional art has a more fundamental problem in assuming that the start and terminal ends exist at all. Namely, the start and terminal ends are definite only under the condition that the input data contains a single piece of data of a particular category. In practice, however, this condition is scarcely ever realized. Where the input data contains consecutive data of different categories, their borders are indefinite. Furthermore, in time series information such as speech, no definite borders exist between data; consecutive data of two categories changes from one to the other via a transition region in which the information overlaps.
Therefore, from the viewpoint of accuracy, there is a very significant problem in creating standard data from data for which the start and terminal ends are assumed, and in learning the parameters of the HMM method from such data.
In order to solve these problems, various ideas tailored to particular problems have had to be devised; without them, good results cannot be obtained. No such idea of general applicability is known.
As another conventional art, the MLP method, which uses multilayer perceptrons and the back propagation learning algorithm, is known. This method is disclosed in, for example, the book "Speech, Auditory Perception and Neural Network Model" (Ohm Co., Ltd.) written by S. Nakagawa, K. Shikano and Y. Tohkura.
The MLP method is basically a method of recognizing static data. In order to recognize time series data, the temporal structure of the input data must be reflected in the structure of the neural network. The most common approach is to input the data of a certain time range as a single piece of input data, so that temporal information is handled in the same way as static information. Because of the structure of the MLP, this time range must be fixed.
However, the length of actual time series data varies greatly, both between categories and within the same category. For example, regarding phonemes in speech, the average length of long phonemes such as vowels differs by a factor of more than ten from that of short phonemes such as plosives. Even for the same phoneme, the actual length in speech fluctuates by a factor of more than two. Consequently, assuming that the input range of data is set to an average length, if a short phoneme is to be discriminated, the input data contains much speech data other than the recognition object data. If a long phoneme is to be discriminated, the input data contains only part of the recognition object data. Either case lowers the recognition ability. Even if a different length is set for every phoneme, the length of each phoneme itself still varies, so the problem is not solved.
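The mismatch between a fixed input range and phonemes of different lengths can be illustrated by the following sketch. It assumes a hypothetical frame-by-frame labeling of an utterance and a fixed window of 10 frames; the labels, the lengths, and the function name window_coverage are illustrative only and are not measured data.

```python
def window_coverage(labels, target, center, width):
    """Describe a fixed-width input window placed around frame `center`.

    Returns (fraction of the window occupied by `target` frames,
             fraction of all `target` frames that fall inside the window).
    Assumes `target` occurs at least once in `labels`.
    """
    start = max(0, center - width // 2)
    end = min(len(labels), start + width)
    in_window = sum(1 for lab in labels[start:end] if lab == target)
    total = sum(1 for lab in labels if lab == target)
    return in_window / (end - start), in_window / total


# Hypothetical utterance: 10 frames of "s", a 2-frame plosive "t",
# a 20-frame vowel "a", then 10 more frames of "s".
labels = ["s"] * 10 + ["t"] * 2 + ["a"] * 20 + ["s"] * 10

print(window_coverage(labels, "t", center=11, width=10))  # short phoneme
print(window_coverage(labels, "a", center=22, width=10))  # long phoneme
```

With these assumed lengths, the window around the short phoneme consists mostly of other speech data, while the window around the long phoneme contains only half of it.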
In the MLP method, since the start and terminal ends of the data input range must be defined, it is difficult to perform accurate speech recognition in actual recognition operation, in which the input data length fluctuates.
In addition, if the input data contains a plurality of detection object data, for example data A, it is impossible to determine definitely how many data A exist in the input data. This problem is particularly serious when the speech recognition apparatus is used where an identical recognition object category is inputted continuously.
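One way to see the counting difficulty is the following sketch. It assumes a hypothetical detector that outputs 1 for every frame judged to belong to data A and counts occurrences as runs of consecutive positive frames. This is an illustration only and is not the method of any apparatus described here; the function name count_detections and the frame sequences are hypothetical.

```python
def count_detections(frame_outputs):
    """Count runs of consecutive positive frames in a per-frame detector output."""
    count = 0
    previous = False
    for current in frame_outputs:
        if current and not previous:
            count += 1  # a new run of positive frames begins here
        previous = bool(current)
    return count


# Two occurrences of data A separated by other data: two runs are found.
separated = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
# Two occurrences of data A inputted back to back: the runs merge into one.
contiguous = [0, 1, 1, 1, 1, 1, 1, 0, 0, 0]

print(count_detections(separated), count_detections(contiguous))
```

Under this assumption, two consecutive occurrences of data A cannot be distinguished from a single longer occurrence.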