The invention relates to a speech recognition method and a system for carrying out the method.
A number of prior art speech recognition systems are known. Most commercial approaches use a hidden Markov model (HMM). In this model, short intervals of speech are processed using a probabilistic model of the likelihood of any given word or sub-word producing a given output. The short intervals of speech may overlap, and may be parameterised by spectral parameters, for example the output of a filter bank or a discrete Fourier transform, or even the parameters from a linear predictive coding analysis of the input speech. The best match of the input speech to the model is then determined. The values of probability used in the model are generated in a training phase. This approach is standard in the art and will not be further described.
Many commercial packages use this approach together with a linguistic engine that uses information about the language spoken to cut down the likely possibilities. This approach has led to several packages achieving hit rates of about 97%. There is, however, a need to increase this figure.
An approach known as time encoded speech (TES), or TESPAR, has been described in GB 2020517, GB 2084433, GB 2162024, GB 2162025, GB 2187586, GB 2179183, WO 92/15089, WO 97/31368, WO 97/45831 and WO 98/08188, which are hereby incorporated by reference in their entirety. In this approach, speech is coded into a small number of symbols. Speech recognition systems using speech encoded in this way have been proposed, inter alia, in WO 97/45831 and GB 2187586. However, the approach does not appear to have been widely implemented; it is believed that high recognition rates have not been achieved with the system.
According to the invention there is provided a speech recognition method including
inputting speech to be recognised,
encoding the input speech using time encoding,
using a hidden Markov model to determine scores indicating how the input speech matches some or all of a plurality of speech elements,
determining which, if any, speech element best corresponds to the input speech using the time encoded speech and the Markov scores, and
outputting the speech element, if any, so determined.

The time encoding may include the steps of
identifying the intervals between the occurrences of the input parameter crossing a given value, and quantising the lengths of the intervals,
identifying the number of complex zeroes of the input parameter, up to a predetermined rank, in the said intervals, and
recording the quantised lengths of the intervals and a measure of the said number of complex zeroes up to a predetermined rank as a representation of the variation of the input parameter.

The invention also provides a speech recognition system comprising
a speech capture system for inputting speech to be recognised,
a hidden Markov speech recognition system for determining scores indicating how the input speech matches some or all of a plurality of speech elements,
a time encoded speech system for encoding the input speech, and
a decision system for determining which, if any, speech element best corresponds to the input speech using the time encoded speech and the Markov scores.
The speech waveform may be characterised by fluctuations in pressure about a mean value, which will be considered the "zero" value for the purposes of time encoding, described below. The input function is therefore a single-valued function that oscillates about a zero value with a finite range of frequencies. Such a band-limited function is ideally suited to TESPAR analysis.
Once the input device has recorded the speech waveform, some form of pre-processing is usually needed. This may include filtering the signal to remove frequencies outside the bandwidth covered by speech. For frequency analysis using the HMM method, the signal is then divided into short time segments (say 10 ms).
TESPAR can be used with a signal divided into segments of any length. The signal can therefore be divided into short time segments in a similar manner to that used in the HMM. Alternatively, the signal can be divided into separate words, phrases or even sentences. TESPAR can also be used directly to divide up the signal according to some criterion, for example to find the end points of an utterance. One way to achieve this is to take short time segments and encode each segment into an S matrix. Summing the matrix elements for each time segment gives a vector of numbers indicating how much sound is present in each segment. This vector can then be used to find the transitions between sound and silence and hence the end points of the utterance.
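The end-point search just described may be sketched as follows. This is only an illustration under stated assumptions: each segment's S matrix is taken to be a flat list of symbol counts, and the function name `find_endpoints` and the fixed `threshold` are hypothetical choices that do not come from the TESPAR references.

```python
def find_endpoints(segment_s_matrices, threshold):
    """Return (first, last) indices of the segments judged to contain sound.

    Each segment is represented by its S matrix, here a flat list of counts.
    A segment counts as sound when the sum of its elements exceeds the
    (hypothetical) threshold. Returns None if no segment qualifies.
    """
    # One activity value per segment: the sum of its S-matrix elements.
    activity = [sum(s) for s in segment_s_matrices]
    active = [i for i, a in enumerate(activity) if a > threshold]
    if not active:
        return None
    return active[0], active[-1]


# Made-up S matrices for six consecutive short segments.
segments = [[0, 0, 0], [0, 1, 0], [3, 4, 2], [5, 2, 6], [1, 0, 0], [0, 0, 0]]
endpoints = find_endpoints(segments, threshold=2)
```

In practice the threshold would probably need tuning to the noise level of the input device; a fixed value is used here only to keep the sketch short.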
There are many ways in which the speech signal may be time encoded. An example of the time encoding procedure is now described. The first step is to divide the signal to be encoded into sections at the points where the signal crosses the zero line. These sections are referred to as epochs. Each epoch is then categorised according to its duration, the number of complex zeros that occur in its duration and the maximum amplitude of the signal. The epochs in the list are then assigned to particular groups and the resulting distribution of epochs in the different groups is used to characterise the encoded signal. In a simple case this could mean assigning each epoch to a group determined by its shape, duration and magnitude. The simple one-dimensional histogram of the number of epochs in each group is then used to characterise the signal.
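The first step of the example procedure, dividing the signal into epochs and measuring each one, might look like the minimal sketch below. It assumes a sampled signal held as a list of numbers and treats a sample of exactly zero as positive; the function names are illustrative only.

```python
def split_into_epochs(samples):
    """Split a sampled signal into epochs.

    An epoch is taken to be a run of consecutive samples with the same sign,
    so a new epoch starts wherever the signal crosses the zero line.
    Assumption: a sample equal to zero is treated as positive.
    """
    epochs = []
    current = []
    for x in samples:
        # A sign change relative to the current epoch marks a zero crossing.
        if current and (x >= 0) != (current[0] >= 0):
            epochs.append(current)
            current = []
        current.append(x)
    if current:
        epochs.append(current)
    return epochs


def describe_epoch(epoch):
    """Return (duration in samples, peak absolute amplitude) for one epoch."""
    return len(epoch), max(abs(x) for x in epoch)


signal = [1, 3, 2, -1, -4, -2, -1, 2, 5]
epochs = split_into_epochs(signal)
descriptions = [describe_epoch(e) for e in epochs]
```

The shape of each epoch, that is the number of complex zeros it contains, would be measured in a further step before the epochs are assigned to groups.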
The Hidden Markov Model (HMM) may take short segments of the input signal and Fourier transform them. The resulting spectrum may then be used to assign the time segment to a particular sub-phone. The sequence of these sounds may then be fed into the model and a probability output for each word considered. Thus a ranking of words is produced that specifies which word was most likely to have given rise to the observed speech waveform.
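By way of illustration only, the ranking of words by probability can be sketched with the standard forward algorithm for a discrete HMM. The two word models below, their probabilities and the two-symbol sub-phone alphabet are invented for the example and are not taken from any real system.

```python
def forward_probability(observations, start, trans, emit):
    """Probability of an observation sequence under a discrete HMM.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability of state s emitting symbol o
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state word models over sub-phone symbols 0 and 1.
transitions = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": ([1.0, 0.0], transitions, [[0.9, 0.1], [0.2, 0.8]]),
    "no": ([1.0, 0.0], transitions, [[0.1, 0.9], [0.8, 0.2]]),
}

observed = [0, 1]  # sub-phone sequence assigned to the input speech
scores = {word: forward_probability(observed, *m) for word, m in models.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A practical system would normally work with log probabilities to avoid numerical underflow on long sequences; plain probabilities are used here for readability.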
One possible method of enhancing the recognition process is to use the time encoded signal to provide additional input parameters for the HMM. One such possibility is for the time encoded signal to be used to determine the identity of the speaker so that the HMM parameters may be modified accordingly.
Both the HMM and the TESPAR system produce probabilities for matches between the input speech and the speech elements in the system's vocabulary. TESPAR is, in addition, well suited to distinguishing between a predetermined selection of sounds. Thus, if one model narrows the number of likely words corresponding to the input speech down to a number of possibilities, the other model will probably be able to select the most likely from the shortlist. In this way the overall accuracy of the speech recognition system can be enhanced by including information from the time domain, in the form of TESPAR encoding, as well as information from the frequency domain.
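One possible form of such a decision step is sketched below, assuming the HMM supplies the shortlist and the TESPAR scores make the final selection. The function name `recognise`, the shortlist size and all scores are hypothetical; the roles of the two models could equally be reversed.

```python
def recognise(hmm_scores, tespar_scores, shortlist_size=3):
    """Pick a speech element using an HMM shortlist and TESPAR scores.

    hmm_scores and tespar_scores map each speech element to a score, where
    higher means a better match. The HMM narrows the vocabulary to a
    shortlist and the TESPAR scores choose within it.
    Returns None if there are no candidates.
    """
    shortlist = sorted(hmm_scores, key=hmm_scores.get, reverse=True)[:shortlist_size]
    if not shortlist:
        return None
    # Elements missing from tespar_scores are treated as scoring zero.
    return max(shortlist, key=lambda word: tespar_scores.get(word, 0.0))


# Made-up scores: the HMM cannot separate "bat" from "pat", TESPAR can.
hmm_scores = {"bat": 0.40, "pat": 0.38, "cat": 0.15, "dog": 0.07}
tespar_scores = {"bat": 0.2, "pat": 0.7, "cat": 0.1, "dog": 0.9}
chosen = recognise(hmm_scores, tespar_scores, shortlist_size=2)
```

Here "dog" has the highest TESPAR score but is never considered, because it did not reach the HMM shortlist.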
Various methods exist for deriving scores for different speech elements using the TESPAR method. For example, correlation scores can be found between the matrix generated from the input signal and the archetype matrix for each speech element. More commonly a neural net can be trained, using known examples, to differentiate between different speech elements.
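The correlation approach may be sketched as follows. The sketch assumes that each matrix has been flattened to a list of counts and uses the Pearson correlation coefficient as the score, which is one reasonable choice rather than a measure prescribed by the TESPAR references; the archetype values are invented.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length lists of counts."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant matrix carries no shape information
    return cov / math.sqrt(var_a * var_b)


def best_match(input_matrix, archetypes):
    """Return (best element, all scores) for an input matrix."""
    scores = {name: correlation(input_matrix, arch) for name, arch in archetypes.items()}
    return max(scores, key=scores.get), scores


# Hypothetical flattened archetype matrices for two speech elements.
archetypes = {"one": [5, 1, 0, 2], "two": [0, 4, 6, 1]}
best, scores = best_match([4, 2, 0, 2], archetypes)
```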
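In the time encoding steps set out above, the crossings of a given value define the intervals, and the complex zeroes of the input parameter are counted in each interval up to a predetermined rank.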
A predetermined rank of 1 has been found to give good results. In this case the method records the number of first rank zeroes, i.e. positive minima or negative maxima. This information may provide sufficient detail for useful characterisation without requiring excessive calculation.
The method thus parameterises the shape of the input parameter function. If the parameter rises smoothly to a maximum and then falls smoothly to the next zero, there will be no positive minima so said number will be zero.
If the function has an "M" shape, rising to a maximum, falling to a minimum and then rising to another maximum before passing through zero, then there will be one positive minimum so the said number will be one.
Thus, the number parameterises the number of oscillations of the input parameter between zeroes, i.e. in each epoch.
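The counting of first rank zeroes within a single epoch can be illustrated with a short sketch. It assumes the epoch is given as a list of samples all of one sign and counts strict local minima of the magnitude, so flat plateaus are ignored; the function name is illustrative.

```python
def count_first_rank_zeros(epoch):
    """Count positive minima (or negative maxima) inside one epoch.

    Working on magnitudes treats both signs alike: a positive minimum and a
    negative maximum are both interior local minima of the absolute value.
    Assumption: only strict minima are counted.
    """
    mags = [abs(x) for x in epoch]
    count = 0
    for i in range(1, len(mags) - 1):
        if mags[i] < mags[i - 1] and mags[i] < mags[i + 1]:
            count += 1
    return count


smooth_arch = [1, 3, 5, 3, 1]          # rises to one maximum, then falls
m_shape = [1, 4, 2, 5, 1]              # maximum, minimum, maximum
negative_m = [-1, -4, -2, -5, -1]      # the same shape below the zero line

counts = [count_first_rank_zeros(e) for e in (smooth_arch, m_shape, negative_m)]
```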
The reason that the positive minima or negative maxima are known as complex zeroes of a function is that they correspond to zeroes of the function for complex number inputs to the function. The first rank zeroes occur at real values being the real values of the complex numbers for which the function has a value zero.
The coding method may be a TESPAR method.
The method may further comprise the step of generating a code number taking one of a set of predetermined values representing the duration of the interval and number of maxima and minima for at least some of the said intervals.
The code numbers may be further parameterised. In one approach an S matrix may be calculated. The S matrix records the number of instances of each code number in the recorded variation of the input parameter. Alternatively or additionally an A matrix recording the number of instances of a first code number following a second code number with a predetermined lag may be calculated. A further alternative is to calculate a DZ matrix recording the number of instances of amplitude, length of interval and number of maxima and minima increasing, decreasing or staying the same in the next epoch.
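As a minimal sketch of the S and A matrices, assume the code numbers are integers from 0 up to a hypothetical alphabet size `n_symbols`. The indexing convention of the A matrix, row for the earlier code number and column for the one that follows it, is a choice made for this illustration.

```python
def s_matrix(codes, n_symbols):
    """Count the instances of each code number in the sequence."""
    counts = [0] * n_symbols
    for c in codes:
        counts[c] += 1
    return counts


def a_matrix(codes, n_symbols, lag=1):
    """Count pairs of code numbers separated by a given lag (lag >= 1).

    counts[earlier][later] is the number of times the code number `later`
    follows the code number `earlier` at the given lag.
    """
    counts = [[0] * n_symbols for _ in range(n_symbols)]
    for earlier, later in zip(codes, codes[lag:]):
        counts[earlier][later] += 1
    return counts


codes = [0, 1, 1, 2, 0, 1]   # made-up sequence of code numbers
s = s_matrix(codes, n_symbols=3)
a = a_matrix(codes, n_symbols=3, lag=1)
```

The S matrix discards the order of the code numbers, whereas the A matrix retains some of it, which is why the two may be used together.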
The S, A and/or DZ matrices may be stored or evaluated.
The invention is based on the realisation that the time encoded speech engine can greatly improve the performance of existing systems. This is because its speech coding is essentially orthogonal to the parameters used in conventional speech processing. Moreover, the time encoded speech coding can be efficiently implemented, so the method can be performed with little computing power. Since the version of time encoded speech processing known as TESPAR can describe a word with only 26 symbols, adding it to existing speech processing systems can improve their performance at little processing cost.
Preferably, the step of determining the best corresponding speech element also uses linguistic analysis of previously output speech elements.
The time-encoded method is preferably a system that encodes the speech based on intervals between zero-crossings and the number of maxima and minima in each interval. Further preferably, a reduced number of characters is selected to encode the intervals and the number of maxima and minima. The encoding method may be the TESPAR method described in the above-mentioned published patents.
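A second aspect of the invention provides a speech recognition system comprising the speech capture system, the hidden Markov speech recognition system, the time encoded speech system and the decision system set out above.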
In a third aspect, the invention provides a computer program recorded on a data carrier and operable to control a computer system having a processing unit, a speech input and an output system, the program being operable to control the computer system to carry out the speech recognition method set out above.