The present invention relates to a system and methods for employment in speech recognition.
It has been a long-desired goal to provide a machine which recognises human speech and can act upon it, either to perform particular control functions or to transform the speech into written text.
In recent years considerable progress has been made towards this goal, firstly by the provision of systems which recognise individual words, and secondly by the provision of systems which recognise strings of words. This second set of systems often operates by assessing the likelihood of a received word being adjacent to other detected words, based upon both the likelihood of the word and the grammatical rules and vocabulary of the language being recognised. Whilst some systems are now available which do this to a considerable degree of accuracy, all such systems are computationally expensive, requiring a great deal of processing power and high speed processing circuitry to perform the recognition task at sufficient speed, particularly in relation to the assessment of the received speech's probability of correspondence to known stored alternatives.
One such known speech recognition system, as part of its statistical assessment of received speech, uses Hidden Markov Models (HMMs) and the evaluation of continuous probability distributions to calculate the likelihood of a particular frame of speech corresponding to a particular output state. Whilst such an evaluation system is effective, it can account for up to 75% of the computation performed by the whole recognition system.
An alternative system uses a discrete probability distribution (rather than the usual continuous one) for each possible output state, because with a discrete distribution a simple table look-up is all that is needed to determine the likelihood of each output state corresponding to the input speech. There is, however, a considerable reduction in accuracy compared to the employment of continuous probability distributions.
This simplified system has itself been improved by the employment of a semi-continuous system or tied mixture system, in which each possible output state is given a probability based upon a weighted sum of a set of Gaussian components, rather than one of a small set of discrete values. This improves accuracy, but is still not on a par with continuous distribution systems.
In such systems of the prior art, evaluation of the likelihood of the various output states corresponding to the speech vector is achieved by evaluating the likelihood of each mixture component and then summing these likelihoods for the respective output state. Repeating this for all possible output states determines the likelihood of each output state, but is computationally very expensive.
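The exhaustive evaluation described above can be sketched as follows. This is a minimal illustration only, assuming diagonal-covariance Gaussian mixture components; the function names and the (weight, mean, variance) layout are hypothetical and are not taken from any particular prior art system.

```python
import math

def gaussian_pdf(o, mean, var):
    """Density of vector o under a diagonal-covariance Gaussian (assumed form)."""
    log_p = 0.0
    for o_d, mu_d, var_d in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * var_d) + (o_d - mu_d) ** 2 / var_d)
    return math.exp(log_p)

def state_likelihood(o, components):
    """Sum weight * density over every (weight, mean, var) mixture component."""
    return sum(c * gaussian_pdf(o, mean, var) for c, mean, var in components)

def all_state_likelihoods(o, states):
    """Evaluate every component of every state: the expensive exhaustive baseline."""
    return [state_likelihood(o, components) for components in states]
```

The cost grows with the number of states multiplied by the number of mixture components per state, for every frame of speech, which is the expense the remainder of the description seeks to reduce.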
The present invention is directed towards systems using continuous probability distributions and their methods and seeks to overcome some of the problems associated with them, such as their need for high processing speed and large amounts of processing capability.
According to a first aspect of the present invention there is provided a method of processing speech, the method comprising:
receiving the speech and determining therefrom an input speech vector (or) representing a sample of the speech to be processed; and,
determining the likelihoods of a number of possible output states (j) corresponding to the input speech vector (or), wherein each output state (j) is represented by a number of state mixture components (m), each state mixture component being a probability distribution function approximated by a weighted sum of a number of predetermined generic probability distribution components (x), the approximation including the step of determining a weighting parameter (wjmx) for each generic probability distribution component (x) for each state mixture component (m),
the method of determining the output state (j) likelihoods comprising the steps of:
1) generating a correspondence probability signal representing a correspondence probability (Prx), wherein the correspondence probability (Prx) is the probability provided by each respective generic probability distribution component (x) based on the input speech vector (or);
2) generating a threshold signal, representing a threshold value Tmix;
3) selecting a number of output states (Nj);
4) determining, for each state mixture component (m) of each selected output state (j), whether a weighted probability (gjmr) given by the scalar product of the weighting parameters (wjmx) and the respective correspondence probabilities (Prx), exceeds the threshold value Tmix; and,
5) generating a set of output signals representing state likelihoods (bj) for each selected output state (j) by evaluating the likelihoods of the state mixture components (m) of the respective selected output state (j) which have a weighted probability (gjmr) exceeding the threshold Tmix.
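Steps 1 to 5 of the first aspect can be sketched as follows. This is an illustrative sketch only: the names are hypothetical, the correspondence probabilities Prx are assumed to have been computed already, and the state likelihood is assumed to be the sum of the accurately evaluated components, supplied here by a caller-provided `exact_likelihood(j, m)` function.

```python
def weighted_probability(weights, pr):
    """g_jm: scalar product of the weights w_jmx with the probabilities Pr_x."""
    return sum(w * p for w, p in zip(weights, pr))

def select_components(state_weights, pr, t_mix):
    """For each selected state j, keep the component indices m with g_jm > Tmix.

    state_weights maps j -> list (over m) of weight lists (over x).
    """
    selected = {}
    for j, per_component in state_weights.items():
        selected[j] = [
            m for m, weights in enumerate(per_component)
            if weighted_probability(weights, pr) > t_mix
        ]
    return selected

def state_likelihoods(selected, exact_likelihood):
    """b_j: accurate evaluation restricted to the components that passed the test."""
    return {j: sum(exact_likelihood(j, m) for m in ms) for j, ms in selected.items()}
```

Only the components that pass the cheap scalar product test incur the cost of the accurate likelihood calculation.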
According to a second aspect of the invention, there is provided a method of processing speech, the method comprising:
receiving the speech and determining therefrom an input speech vector (or) representing a sample of the speech to be processed; and,
determining the likelihoods of a number of possible output states (j) corresponding to the input speech vector (or), wherein each output state (j) is represented by a number of state mixture components (m), each state mixture component being a probability distribution function approximated by a weighted sum of a number of predetermined generic probability distribution components (x), the approximation including the step of determining a weighting parameter (wjmx) for each generic probability distribution component (x) for each state mixture component (m),
the method of determining the output state (j) likelihoods involving determining whether a weighted probability (gjmr) exceeds a threshold value Tmix by determining whether a scalar product of the form:
$$S = \sum_{i=1}^{K} A_i \times B_i$$
exceeds the threshold T, where K is a predetermined integer, the determination comprising the steps of:
1) receiving a signal representing the value Ai, where Ai represents one of the weighting parameters (wjmx);
2) receiving a signal representing the value Bi, where Bi represents the correspondence probability (Prx) generated from the respective generic probability distribution component (x);
3) generating first, second and third signals representing the values log(Ai), log(Bi) and log(T), respectively,
4) comparing the first, second and third signals and generating an output signal indicating that S > T if:
log(Ai) > P × log(T) AND log(Bi) > Q × log(T)
where: 0 < P ≤ 1 and 0 < Q ≤ 1
5) if no output signal has been generated, repeat steps 1 to 4 for subsequent values of i.
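The term-by-term test of the second aspect can be sketched as follows. This is a hedged illustration with hypothetical names; the defaults P = Q = 0.5 are an assumption chosen for the example. For non-negative terms, 0 < T < 1 and P + Q ≤ 1, a positive result guarantees S > T, since a single term already exceeds T. A negative result is inconclusive, because several small terms may still sum to more than T.

```python
import math

def exceeds_threshold_fast(a, b, t, p=0.5, q=0.5):
    """Return True as soon as one term passes both log comparisons."""
    log_t = math.log(t)
    for a_i, b_i in zip(a, b):
        if a_i <= 0.0 or b_i <= 0.0:
            continue  # log undefined; a zero term cannot pass the test
        if math.log(a_i) > p * log_t and math.log(b_i) > q * log_t:
            return True  # output signal generated: S > T
    return False  # no term passed; the scalar product was never formed
```

No multiplication of Ai by Bi is performed; each term costs two comparisons against values that depend only on T.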
According to a third aspect of the invention, there is provided a method of processing speech, the method comprising:
receiving the speech and determining therefrom an input speech vector (or) representing a sample of the speech to be processed; and,
determining the likelihoods of a number of possible output states (j) corresponding to the input speech vector (or), wherein each output state (j) is represented by a number of state mixture components (m), each state mixture component being a probability distribution function approximated by a weighted sum of a number of predetermined generic probability distribution components (x), the approximation including the step of determining a weighting parameter (wjmx) for each generic component (x) for each state mixture component (m),
wherein the method of determining the output state (j) likelihoods comprises determining a classification (Cjx) of each of the possible output states (j) for each generic component (x), the classification representing the likelihood (Lxm) of each output state (j) representing the input speech vector (or), the method of determining the classification comprising the steps of:
1) generating at least one threshold signal representing at least one threshold value Tgood;
2) selecting one of the predetermined generic components (x);
3) selecting one of the number of output states (j);
4) generating a likelihood signal representing the likelihood (Lxm) of the output state (j) being the output state representing the input speech vector (or) assuming that the selected generic probability distribution component (x) provides the highest unweighted probability for the input speech vector (or) of any of the generic probability distribution components;
5) comparing the threshold signal to the likelihood signal;
6) generating and storing a first or second classification signal representing the respective classification (Cjx) of the output state (j) in accordance with the result of the comparison of the threshold signal with the likelihood signal; and,
7) repeating steps 2 to 6 for all generic components (x) and all possible output states (j).
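The classification steps of the third aspect can be sketched as follows. The sketch assumes the likelihoods are already available as a hypothetical table `likelihood[x][j]`; how those values are obtained is not shown here. The two-valued GOOD/POOR labels and all names are illustrative assumptions.

```python
GOOD, POOR = "good", "poor"

def build_classification_table(likelihood, t_good):
    """Steps 2 to 7: compare each stored likelihood with Tgood and store C_jx."""
    return [
        [GOOD if value > t_good else POOR for value in row]
        for row in likelihood
    ]

def classify_states(table, pr):
    """Look up the row for the generic component with the highest Pr_x."""
    best_x = max(range(len(pr)), key=pr.__getitem__)
    return table[best_x]
```

Because the table depends only on the model and the threshold, it can be built once in advance, leaving a single look-up per frame of speech.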
The present invention processes speech by generating a speech vector representing a sample of the speech to be processed, and then determining which of a number of possible output states most closely represents the speech vector. The grammar and dictionary together specify the possible sequences of states. The likelihood of the input speech matching different state sequences together with the known probability of different word sequences can be combined to find the word sequence which best matches the input speech.
The comparison of the speech vector with each of the possible output states is a computationally expensive task. In the invention, the cost of the calculation can be reduced by simplifying the calculation required for each of the possible output states.
Accordingly, the invention uses a broad state classification which can be determined using a predetermined look up table. This indicates the approximate likelihood of each output state depending on which of the generic probability distribution components provides the highest unweighted probability for the input speech vector.
This classification can then be used to control the accuracy with which a state is evaluated. Thus, for example, when the state is very unlikely, a simple approximation, such as the use of a constant value, is acceptable. If the state is somewhat unlikely, more accuracy is required, and just one of the many mixture components comprising the state probability distribution can be evaluated and used to approximate the actual state likelihood. Finally, evaluation of the more likely states uses a simplified (but approximate) mechanism for determining which of the state's many mixture components need to be evaluated to maintain the accuracy of the final state likelihood value.
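The three levels of accuracy described above might be dispatched as in the following sketch. The class labels, the constant `floor` value and the choice of component 0 as the single representative component are all assumptions made for illustration.

```python
VERY_UNLIKELY, SOMEWHAT_UNLIKELY, LIKELY = range(3)

def evaluate_state(classification, evaluate_component, needed,
                   floor=1e-10, representative=0):
    """Spend computation on a state in proportion to its classification.

    evaluate_component(m) returns the accurate likelihood of component m;
    needed lists the components chosen by the approximate selection step.
    """
    if classification == VERY_UNLIKELY:
        return floor  # constant value, no component evaluated
    if classification == SOMEWHAT_UNLIKELY:
        return evaluate_component(representative)  # a single component
    return sum(evaluate_component(m) for m in needed)  # selected components only
```

Passing the component evaluation as a function means that no accurate likelihood is computed unless the classification calls for it.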
The invention also uses a small number of generic distributions together with state mixture component specific weights to approximate the actual state distribution. Using this technique the approximation for each state mixture component is evaluated by comparing the scalar product of the state mixture component specific weights with the unweighted likelihood provided by each of the generic probability distribution functions with a fixed threshold. This procedure is repeated for each component of each state being considered and only for those mixture components for which the product exceeds the threshold does the accurate likelihood need to be calculated.
The invention also allows fast determination of whether a scalar product exceeds a threshold. By using single bit approximations many terms in the scalar product can be combined into a single computer word and evaluated in one operation.
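One way to read the single bit approximation is sketched below: each weight and each probability is reduced to one bit recording whether it exceeds a cutoff, the bits are packed into an integer word, and a single bitwise AND tests all terms at once. The cutoffs T^P and T^Q, and all function names, are assumptions for illustration rather than a definitive implementation.

```python
def pack_bits(values, cutoff):
    """Set bit i of the word when values[i] exceeds the cutoff."""
    word = 0
    for i, value in enumerate(values):
        if value > cutoff:
            word |= 1 << i
    return word

def exceeds_threshold_bitwise(weight_bits, prob_bits):
    """True when some term has both its weight bit and its probability bit set."""
    return (weight_bits & prob_bits) != 0

# Usage with T = 0.25 and P = Q = 0.5, so both cutoffs are 0.25 ** 0.5 = 0.5.
t, p, q = 0.25, 0.5, 0.5
weight_bits = pack_bits([0.9, 0.1, 0.6], t ** p)  # can be precomputed per component
prob_bits = pack_bits([0.1, 0.8, 0.7], t ** q)    # computed once per speech frame
result = exceeds_threshold_bitwise(weight_bits, prob_bits)
```

The weight words depend only on the model and can be stored in advance, so the per-frame work for each state mixture component reduces to one AND and one comparison with zero.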
Whilst all the aspects of the present invention may be employed separately, it is also possible to use any combination of the aspects in order to maximise the computational efficiency of the procedure.