Improved pattern matching has been achieved by using stochastic models of words instead of simple templates, assuming that speech can be approximated by hidden Markov processes (see "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition" by Levinson, Rabiner and Sondhi, Bell System Technical Journal, Vol. 62, No. 4, April 1983, pages 1035 to 1074).
Briefly, incoming sounds are frequency-analyzed by, for example, a bank of filters, and the resulting signal levels in each filter are smoothed to provide estimates of the short-term power spectrum (called frames), typically every 10 ms. After further processing, these signals are used together with a number of probability density functions (p.d.f.s) to give the probabilities that the incoming signal producing the channel outputs corresponds to a state in a Markov model, which is a finite state machine representing a word to be recognized. Each Markov model comprises a number of states, and there is in general one p.d.f. for each channel relating to each state, the p.d.f.s being obtained by previously training a recognizer using examples of words to be recognized. In operation, the recognizer employs the Markov models to calculate the word most likely to have been uttered by considering the likelihoods that a current sound arises from each of the states and the probabilities of transition from one state to another within a Markov model. The Viterbi algorithm may be used in finding the most likely word on this basis.
In practice the negative of the logarithm of a likelihood is used and in this specification is referred to for brevity as "distance", by analogy with Dynamic Time Warping (DTW) recognizers. The state p.d.f.s are often assumed to be multivariate normal distributions with diagonal covariance matrices, and so are characterised by a mean, m, and a standard deviation, s, for each filter-bank channel. This is a crude approximation to the speech signal that is currently in widespread use. The theory presented in this specification is equally applicable to stochastic models with p.d.f.s that are not multivariate normal distributions with diagonal covariance matrices.
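The use of distances in a Viterbi search can be illustrated with a minimal sketch. It assumes a left-to-right model that starts in its first state and ends in its last state; the function name `viterbi_distance` and the layout of its two arguments are hypothetical and are not taken from this specification. Because distances are negative log likelihoods, products of probabilities become sums and the most likely path is the one with the smallest accumulated distance.

```python
import math


def viterbi_distance(frame_dists, trans_dists):
    """Smallest accumulated distance through one word model.

    frame_dists[t][j]: distance (-ln likelihood) of frame t given state j.
    trans_dists[i][j]: distance (-ln probability) of moving from state i
                       to state j; math.inf marks a forbidden transition.
    Assumption: the path starts in state 0 and ends in the last state.
    """
    n_states = len(trans_dists)
    best = [math.inf] * n_states
    best[0] = frame_dists[0][0]
    for t in range(1, len(frame_dists)):
        new_best = [math.inf] * n_states
        for j in range(n_states):
            # Cheapest way of arriving in state j from any predecessor.
            arrive = min(best[i] + trans_dists[i][j] for i in range(n_states))
            new_best[j] = arrive + frame_dists[t][j]
        best = new_best
    return best[-1]


# Two-state example: state 0 may stay or advance (probability 0.5 each),
# state 1 may only stay.
half = -math.log(0.5)
transitions = [[half, half], [math.inf, 0.0]]
frames = [[1.0, 5.0], [1.0, 5.0], [5.0, 1.0]]
total = viterbi_distance(frames, transitions)
```

A recognizer of this kind would evaluate every word model in the vocabulary in the same way and select the word whose model gives the smallest total distance.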
In this specification the word "input" means the input to a speech recognizer during operational use, and "cell" means the level in a particular filter-bank channel or equivalent in a particular frame, in either input or training. Filter-bank analysis is usually preferable for the present invention because methods of acoustic analysis that do not keep the different parts of the spectrum separate (e.g. Linear Predictive Coding or cepstrum methods) are not so amenable to noise compensation. These other methods of acoustic analysis mix together noisy parts of the signal spectrum with components caused mainly by speech, and it is not then possible to identify which parts of the spectrum are contaminated by noise.
The background noise signal needs to be estimated as it varies with time. This can be done by using the microphone signal when no speech is present. It can also be done using a separate microphone which only gives the noise signal.
It has to be accepted that in conditions of high noise it is not possible in principle to distinguish between words that differ only in low-level regions of the spectrum, where they are seriously contaminated by noise. A technique is required which makes full use of any speech information in the high-level parts of the spectrum that can act as true evidence for word identity, but ignores any information that is too corrupted by noise to be useful.
When the speech in the training phase is completely uncontaminated by noise and the input cell, f, is above the input noise level, then in the case of a multivariate normal distribution with diagonal covariance matrix, the p.d.f. for each channel has the form:

$$N(f,m,s) = \frac{1}{s\sqrt{2\pi}} \exp\left(-\frac{(f-m)^2}{2s^2}\right)$$

The distance is therefore:

$$-\ln N(f,m,s) = \ln\left(s\sqrt{2\pi}\right) + \frac{(f-m)^2}{2s^2}$$
However, the situation is very different when the input cell is known to be noisy. Its actual value is unlikely to be sensibly related to the underlying signal, and may in fact even be quite low because of chance cancellation of signal by the noise. It is therefore necessary to use a different method to derive a distance measure for noisy input cells.
According to a first aspect of the present invention there is provided apparatus for use in sound recognition comprising
means for deriving a plurality of input signals during recognition which are each representative of signal levels in respective regions in the frequency spectrum,
means for storing a plurality of groups of p.d.f. values representing probability density functions indicating the likelihoods that input signals arise from states in finite state machine models of groups of sounds to be recognized,
means for estimating the input noise level, and
means for recognizing sounds from the input signals, the stored p.d.f. values and the models, employing respective distance measures, each derived from one input signal and one p.d.f. as represented by one group of said values, each distance measure representing a likelihood of obtaining a region signal level from one p.d.f., when the input signal is above a predetermined level related to the noise level in the corresponding spectrum region, and representing the cumulative likelihood of obtaining from the said p.d.f. a region signal level below the said predetermined level, when the input signal is below, or equal to, the predetermined level.
The groups of sounds are usually words where the apparatus is for speech recognition and the spectral regions are usually channels.
An advantage of the first aspect of the invention is that the input signals are used in obtaining the distance measures when they are reliable; that is, when they are above the predetermined level, which is usually near to, or at, the noise level. The predetermined level is used, instead of the input signals, when the input signals are unreliable. This is because the input signals are near to or below the noise level, so there is no reliable information about the level of the underlying speech signal, except that it was not above the noise level. Using the cumulative distribution of the p.d.f. over all levels up to the noise level therefore gives a more reliable comparison between states than using the probability derived from the p.d.f. at the noise level.
The means for recognizing sounds may comprise means for deriving masked input signals during recognition by representing any channel of an input signal below noise level with a masking level which is representative of the noise level in that channel.
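A minimal sketch of such masking follows, assuming a frame is held as a list of channel levels and the noise estimate as a list of the same length. The name `mask_frame` and the returned list of flags are hypothetical conveniences, not part of this specification.

```python
def mask_frame(frame, noise_levels):
    """Replace every channel at or below its noise level by that noise level.

    Returns the masked frame and, for each channel, a flag recording
    whether the channel was masked.
    """
    masked = []
    was_masked = []
    for level, noise in zip(frame, noise_levels):
        below = level <= noise
        masked.append(noise if below else level)
        was_masked.append(below)
    return masked, was_masked


masked, flags = mask_frame([10.0, 3.0, 7.0], [5.0, 5.0, 7.0])
```

The flags allow a later stage to choose between the likelihood and the cumulative likelihood for each channel.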
The means for estimating the input noise level may comprise a separate microphone recording the noise signal alone or means for differentiating between noise only and noise plus speech on a single microphone.
Several different distributions may be found useful in calculating likelihoods and cumulative likelihoods, but the normal distribution is usually used in speech recognition. Assuming normal distributions, each likelihood measure is preferably derived from -ln[N(f,m,s)] when the noise level is below the input signal and from -ln[erf((A-m)/s)] when the noise level is above the input signal, where A is the noise level in the spectrum region corresponding to the input signal, erf is the known cumulative distribution function

$$\mathrm{erf}(x) = \int_{-\infty}^{x} N(y,0,1)\,dy$$

and N(x,0,1) corresponds to a normally distributed p.d.f. with independent variable x, mean equal to zero and variance equal to one.
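The two cases can be sketched as follows. The sketch makes three assumptions. The function called erf in this specification is the cumulative distribution of the unit normal, which differs from the error function `math.erf` of most numerical libraries. The quantity s is a standard deviation. The small floor `_TINY` that guards against taking the logarithm of zero is a hypothetical addition. The names `normal_cdf`, `cell_distance` and `frame_distance` are likewise illustrative only.

```python
import math

_TINY = 1e-300  # hypothetical floor to avoid log(0) for extreme arguments


def normal_cdf(x):
    """Cumulative distribution of N(x, 0, 1), written 'erf' in the text."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def cell_distance(f, m, s, noise_level):
    """Distance for one cell given a state with mean m and std. deviation s."""
    if f > noise_level:
        # Reliable cell: -ln N(f, m, s).
        return math.log(s * math.sqrt(2.0 * math.pi)) + (f - m) ** 2 / (2.0 * s * s)
    # Noisy cell: -ln of the probability that the level lies at or below A.
    p_below = normal_cdf((noise_level - m) / s)
    return -math.log(max(p_below, _TINY))


def frame_distance(frame, means, sds, noise_levels):
    """Sum of cell distances over channels (diagonal covariance assumed)."""
    return sum(
        cell_distance(f, m, s, a)
        for f, m, s, a in zip(frame, means, sds, noise_levels)
    )
```

For a noisy cell, a state whose mean lies well below the noise level receives a distance near zero, while a state whose mean lies well above the noise level is penalized heavily, which is the comparison between states described above.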
The invention also includes methods corresponding to the first aspect of the invention.
Another problem arises in deriving groups of values representing p.d.f.s where, in training, the sample utterances are somewhat contaminated by noise. This is particularly important in environments where the voice quality changes because of the noise, or where noise and voice quality are inseparable consequences of the environment. Examples are stress-induced voice changes in aircraft, particularly in an emergency, and shouting in high noise levels. Any solution to this problem should also give useful improvements in less severe noisy environments.
If a large proportion of the measurements used to derive any one p.d.f. are corrupted by noise, there is no prospect of making reliable estimates of the underlying speech distribution. It is, however, important for any channel that such evidence as there is to suggest that different states have different underlying distributions should be taken into account in estimating the state parameters.
Therefore, according to a second aspect of the present invention there is provided a method of training a sound recognition system comprising
deriving a plurality of groups of input signals from repetitions of nominally the same sound, each group being representative of signal levels in respective regions in the frequency spectrum, and
deriving a plurality of groups of p.d.f. values representing probability density functions indicating the likelihoods that input signals arise from states in finite state machine models for a vocabulary of groups of sounds to be recognized,
the p.d.f. values being derived only from input signals above the noise levels in corresponding spectrum regions, and the derivation being so carried out that the groups of values represent substantially whole probability functions although obtained from input signals above noise levels only.
Preferably the noise level used in each region of the frequency spectrum is the highest found in deriving the input signals for that region for all training repetitions of all sounds in the vocabulary.
If a normal distribution is assumed for each p.d.f. and the p.d.f.s are assumed to be uncorrelated, and each group of values comprises the true mean m and the true variance $s^2$, then m and s may be estimated from ##EQU4## where B is the noise level, M is the mean of samples above the noise level, F is the proportion of input signals below the noise level, erf is as defined above, and Q(F) = N($\mathrm{erf}^{-1}(F)$,0,1).
In practice Q(F) and $\mathrm{erf}^{-1}(F)$ can be found by look-up in a table of pre-computed values.
If more than half of the training cell measurements used for a state are identified as noisy, it is implied that the underlying mean is in fact below the noise level. It is then unwise to use only the tail of a distribution in an attempt to estimate the true mean and variance.
According to another feature of the invention, therefore, a constant mean and a constant variance are substituted for the said values in any said group representing a p.d.f. derived in training in which the proportion of input signals which are noisy exceeds a predetermined value, greater than 0.5, and typically equal to 0.8.
Where the proportion is below the predetermined value, equations 1 and 2 may be used. Preferably, however, for a range of proportions between, for example, 0.5 and 0.8, a smooth transition for mean and variance values is arranged without discontinuities by replacing B in equations 1 and 2 with a function dependent on B and F derived from a look-up table, and by appropriately modifying the tables for $\mathrm{erf}^{-1}(F)$ and Q(F) in this range of F values.
It is also preferable to add a standard minimum variance to all computed variances to overcome the danger with all stochastic models of limited training leading to the computed variances being too low by chance. If the standard minimum variance is chosen to be very large, the variances for all states are in effect made equal and the distance measure reduces to the squared Euclidean distance measure that is widely used in DTW matching. If it is made too small, there is the danger of attaching too much significance to unreliable statistics that arise through inadequate training. It is therefore desirable to set the standard minimum variance by experiment to optimise the performance under any given practical conditions.
An advantage can be obtained if, in deriving the said groups of values, the standard minimum variance for a particular p.d.f. is scaled by a function of the number of input signal samples used to derive said group of values of that p.d.f., since a variance derived from a large number of samples is more likely to represent the true speech variability than an equal variance derived from only a few samples.
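The addition of a scaled standard minimum variance can be sketched as follows. This specification does not state which function of the sample count is used, so the default scaling of 1/n below is purely an assumption chosen for illustration, as are the names `floored_variance` and `scale`.

```python
def floored_variance(computed_variance, n_samples, standard_min_variance, scale=None):
    """Add a standard minimum variance, scaled by the number of samples.

    scale: function of the sample count. The default 1/n is a hypothetical
    choice that shrinks the added term as more training samples are seen.
    """
    if n_samples < 1:
        raise ValueError("at least one sample is required")
    if scale is None:
        scale = lambda n: 1.0 / n  # assumption, not from the specification
    return computed_variance + standard_min_variance * scale(n_samples)


few = floored_variance(4.0, 2, 2.0)      # variance from only two samples
many = floored_variance(4.0, 200, 2.0)   # same variance from 200 samples
```

With any decreasing scaling function, two equal computed variances are treated differently: the one derived from few samples is enlarged more than the one derived from many.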
The invention also includes apparatus corresponding to the methods of the second aspect of the invention.