1. Field of the Invention
The present invention relates to a generating system using hidden Markov models for analyzing time series signals such as voice signals, and more particularly, to a system providing non-linear predictors comprised of neural networks.
2. Description of the Related Art
Conventionally, various systems for analyzing time series signals such as voice signals have been developed.
As an example of such an analyzing system for time series signals, in FIG. 8, there is shown an apparatus for speech recognition using a hidden Markov model (hereinafter, referred to as HMM).
In the system shown in FIG. 8, a speech analyzing section 101 transforms input voice signals into a series of feature vectors using a known method such as a filter bank, Fourier transformation, or LPC (Linear Predictive Coding) analysis. Each feature vector is formed for every predetermined time period (hereinafter referred to as a "frame"), for instance, 10 msec. Accordingly, the input voice signals are transformed into a series x of feature vectors x_1 to x_T, wherein T is the number of frames. A section 102, denoted a "code book," has stored therein representative vector labels.
A vector quantizing section 103 replaces respective feature vectors of the vector series x with the representative vector labels estimated to be nearest thereto, respectively, with reference to the code book 102.
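The nearest-label replacement performed by the vector quantizing section can be sketched as follows; this is a hypothetical illustration using Euclidean distance, with an arbitrary two-entry code book:

```python
import numpy as np

def quantize(features, code_book):
    """Replace each feature vector with the label (row index) of the
    nearest code-book vector, using Euclidean distance."""
    # distances[t, k] = ||features[t] - code_book[k]||
    distances = np.linalg.norm(features[:, None, :] - code_book[None, :, :],
                               axis=2)
    return distances.argmin(axis=1)   # label series o_1 ... o_T

code_book = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = quantize(np.array([[0.1, -0.1], [0.9, 1.2]]), code_book)
# labels -> array([0, 1])
```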
The series of labels thus obtained is sent to a probability calculation section 106. This section 106 calculates a generation probability of the label series of an unknown input speech using HMMs stored in an HMM memory section 105.
These HMMs are formed in advance by an HMM forming section 104. In order to form an HMM, an architecture of the HMM, such as the number of states and the transitions allowed between individual pairs of states, is first determined. Thereafter, a plurality of label series obtained by pronouncing a word many times are learned, and the generation probabilities of the respective labels generated according to the architecture of the HMM are estimated so that the generation probabilities of the respective label series become as high as possible.
The generation probabilities calculated by the section 106 are compared with each other in a comparison and judging section 107, which identifies the word whose HMM provides the maximum generation probability among the HMMs corresponding to the respective words.
The speech recognition using HMMs is effected in the following manner.
Assuming that the label series obtained from an unknown input is O = o_1, o_2, . . . , o_T, and that an arbitrary state series of length T generated for a word v by the model λ^v is S = s_1, s_2, . . . , s_T, the probability that the label series O is generated from the model λ^v is given by:

[exact solution]

    P(O | λ^v) = Σ_S P(O, S | λ^v)    (1)

[approximate solution]

    P(O | λ^v) ≈ max_S P(O, S | λ^v)    (2)

or, in a logarithmic form,

    log P(O | λ^v) ≈ max_S log P(O, S | λ^v)    (3)

wherein P(x, y | λ^v) is the simultaneous (joint) probability of x and y in the model λ^v.
Accordingly, the result of recognition is obtained using one of the equations (1) to (3), for instance equation (1), as follows:

    v̂ = argmax_v P(O | λ^v)    (4)

P(O, S | λ^v), needed in equation (1), is calculated as described below.
Assuming that a generation probability b_i(o) of a label o and a transition probability a_ij from one state q_i to another state q_j (i, j being integers from 1 to I) are given for every state q_i of the model λ, the generation probability of the label series O = o_1, o_2, . . . , o_T along the state series S = s_1, s_2, . . . , s_T in the model λ is defined as follows:

    P(O, S | λ) = a_{s_0 s_1} · ∏_{t=1}^{T} b_{s_t}(o_t) a_{s_t s_{t+1}}    (5)

wherein a_{s_0 s_1} is the initial probability of the state s_1 and s_{T+1} = q_f is a final state in which no labels are generated.
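For illustration only, equations (5) and (1) can be sketched in Python/NumPy; the function names and the tiny two-state model below are hypothetical, and the brute-force enumeration of state series is practical only for very short label series:

```python
import numpy as np
from itertools import product

def joint_prob(O, S, a_init, A, B, a_final):
    """Equation (5): probability of label series O along state series S.

    a_init[i]  -- initial probability a_{s0 s1} of starting in state i
    A[i, j]    -- transition probability a_ij
    B[i, o]    -- generation probability b_i(o) of label o in state i
    a_final[i] -- probability of moving from state i to the final state q_f
    """
    p = a_init[S[0]] * B[S[0], O[0]]
    for t in range(1, len(O)):
        p *= A[S[t - 1], S[t]] * B[S[t], O[t]]
    return p * a_final[S[-1]]

def exact_prob(O, a_init, A, B, a_final, n_states):
    """Equation (1): sum of equation (5) over every state series S."""
    return sum(joint_prob(O, S, a_init, A, B, a_final)
               for S in product(range(n_states), repeat=len(O)))

# Tiny two-state example with label series O = (o_1, o_2) = (0, 1).
a_init  = np.array([1.0, 0.0])
A       = np.array([[0.6, 0.3], [0.0, 0.7]])   # remaining mass of each row
a_final = np.array([0.1, 0.3])                 # goes to the final state q_f
B       = np.array([[0.9, 0.1], [0.2, 0.8]])
p = exact_prob([0, 1], a_init, A, B, a_final, n_states=2)
```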
Although individual input feature vectors x are transformed into labels in the above example, a method has also been proposed in which a probability density function of each feature vector x in each state is given without using labels. In this case, the probability density b_i(x) of the feature vector x is used instead of b_i(o), and the above equations (1), (2) and (3) are rewritten as follows:
[exact solution]

    P(x | λ^v) = Σ_S P(x, S | λ^v)    (6)

[approximate solution]

    P(x | λ^v) ≈ max_S P(x, S | λ^v)    (7)

or, in a logarithmic form,

    log P(x | λ^v) ≈ max_S log P(x, S | λ^v)    (8)
The final recognition result v̂ of the input voice signals is given in both methods by the following equation, provided that the models λ^v (v = 1 to V) have been prepared:

    v̂ = argmax_{v = 1, . . . , V} P(X | λ^v)    (9)
in which X is a series of labels or a series of feature vectors in accordance with the method employed.
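The approximate solution (3) and the final decision (9) can be sketched together; the dynamic-programming recursion below (a standard Viterbi-style computation, not taken from this disclosure) avoids enumerating every state series, and the two one-word models are hypothetical:

```python
import numpy as np

def log_viterbi(O, log_a_init, log_A, log_B, log_a_final):
    """Approximate solution (3): max over state series S of
    log P(O, S | λ), computed by dynamic programming."""
    delta = log_a_init + log_B[:, O[0]]          # best partial score per state
    for o in O[1:]:
        delta = (delta[:, None] + log_A).max(axis=0) + log_B[:, o]
    return (delta + log_a_final).max()

def recognize(O, models):
    """Equation (9): choose the word whose model scores highest."""
    return int(np.argmax([log_viterbi(O, *m) for m in models]))

def as_logs(a_init, A, B, a_final):
    return tuple(np.log(np.asarray(x)) for x in (a_init, A, B, a_final))

# Two hypothetical word models: model 0 favors label 0, model 1 favors label 1.
model0 = as_logs([0.9, 0.1], [[0.8, 0.1], [0.1, 0.8]],
                 [[0.9, 0.1], [0.8, 0.2]], [0.1, 0.1])
model1 = as_logs([0.9, 0.1], [[0.8, 0.1], [0.1, 0.8]],
                 [[0.1, 0.9], [0.2, 0.8]], [0.1, 0.1])
word = recognize([0, 0, 0], [model0, model1])   # -> 0
```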
A typical conventional HMM used for speech recognition is shown in FIG. 9, in which q_i, a_ij and b_i(x) indicate an i-th state, a transition probability from the i-th state to the j-th state, and a probability density of the label or of the feature vector x, respectively.
In this model, the state q_i is considered to correspond to a segment i of the speech represented by the HMM. Accordingly, the probability density b_i(x) of x being observed in the state q_i is considered to be the probability density of x being generated in the segment i, and the self-transition probability a_ii is considered to be the probability that x_{t+1} at time t+1 is again included in the segment i when x_t at time t is included therein. On this view, the following two points may be identified as drawbacks of the conventional HMM.
(1) Dynamic features, i.e. features in the time variation of the feature vectors, are not suitably represented, since the parameters defining the function b_i(x) are assumed to be time-invariant; for instance, where the distribution of x is assumed to be a normal distribution, b_i(x) is specified only by a mean vector and a covariance matrix.
(2) Although the length τ of the segment i is considered to be subject to a probability distribution, the transition probabilities a_ii and a_ij are assumed in the conventional HMM to be constant regardless of the duration of stay in the state q_i. As a result, the length of the segment i follows an exponential (geometric) distribution, which cannot properly represent the actual durations.
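The exponential duration distribution noted in point (2) follows directly from a constant self-transition probability, as the following illustrative sketch shows; the function name is hypothetical:

```python
import numpy as np

def hmm_duration_pmf(a_ii, max_d):
    """Duration distribution implied by a constant self-transition
    probability a_ii: staying exactly d frames in state q_i has
    probability a_ii**(d-1) * (1 - a_ii), a geometric (discrete
    exponential) distribution whose mode is always at d = 1."""
    d = np.arange(1, max_d + 1)
    return a_ii ** (d - 1) * (1.0 - a_ii)

pmf = hmm_duration_pmf(0.8, 50)
# The probability decreases monotonically from d = 1, whereas measured
# segment durations typically peak at some d > 1 -- the mismatch above.
```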
FIG. 10 shows, at (c), the result of analyzing the voice signal shown at (a) using the HMM shown at (b).
As is apparent from comparison of (c) with (a), the resultant vectors exhibit unnatural jumps between adjacent states.
In order to solve the second problem, methods have been proposed in which a Poisson distribution and/or a Γ (gamma) distribution is used as a probability density function d_i(τ) for the duration τ of stay in the state q_i.
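As an illustrative sketch of such a duration model (the shifted parameterization and names below are hypothetical, not taken from the proposals cited), a Poisson-based d_i(τ) can place its mode at a duration greater than one frame, unlike the geometric distribution implied by a constant a_ii:

```python
import math

def poisson_duration_pmf(lam, max_d):
    """Poisson-based duration probability d_i(tau), shifted so that
    tau = 1, 2, ... (a state is occupied for at least one frame)."""
    return [math.exp(-lam) * lam ** (tau - 1) / math.factorial(tau - 1)
            for tau in range(1, max_d + 1)]

pmf = poisson_duration_pmf(4.0, 30)
# The mode falls near tau = 4 to 5 frames rather than at tau = 1.
```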
However, these methods fail to completely solve the problems of the conventional HMM method.
Meanwhile, it has been reported that neural network models are very effective for pattern recognition and that feed-forward neural networks exhibit excellent performance on static patterns. However, it has not been possible to apply neural networks to non-static signals, such as voice signals, which involve non-linear expansion and contraction of the time axis.