1. Field of the Invention
This invention relates to a recognition system of the kind which undertakes recognition of data by associating data vectors with predetermined models, and a method of recognition which involves associating such vectors and models; it is particularly relevant to speech and pattern recognition where distortion occurs prior to the recognition process.
2. Discussion of Prior Art
A speech recognition system is a good example of a recognition system in which the data or signal of interest undergoes some form of distortion prior to being available for recognition. In telephone applications in particular, a speech recognition system's performance is often severely degraded by changes to the speech signal due to the position of the telephone handset or by the characteristics of the handset, telephone line and exchange. One particular problem concerns changes in the speech level caused by position of the handset. More sophisticated examination of the problem shows that changes to the frequency balance are also significant. Compensation for changes to average signal level is often made by using some form of automatic gain control (AGC). Unfortunately it may be difficult to provide effective AGC; for example, in two-wire telephone system configurations there are often substantial differences between the intensity levels of the speech of the persons engaged in the telephone conversation. In four-wire configurations there may be significant reverse channel echo which is difficult to deal with. It arises from contamination of the speech of one party to the conversation with that of the other.
One approach to the problem of dealing with distortion is to train a speech recognition system using training data collected using a large variety of handsets and speaker positions. This approach suffers from two problems. First, in the world-wide telephone network there is a very large number of possible microphone types and speaker positions; in consequence the amount of training data required is far too large to be practical and the system is unable to optimise its performance on unknown microphones. Secondly, during recognition, only a small fraction of the training data is used effectively.
One approach to improving recognition performance is to apply some form of compensation to deal with distortion. Current speech recognition systems convert the input signal from a waveform in the time domain into successive vectors in the frequency domain during a process sometimes known as "filterbank analysis". It is possible to apply some form of compensation to these vectors. There are a number of methods which may be used to determine the appropriate compensation. One such method is disclosed by Sadaoki Furui, "Cepstral Analysis Technique for Automatic Speaker Verification", IEEE Trans Acoustics, Speech and Signal Processing, 29(2):254-272, April 1981. It involves averaging the output of the filterbank analyser for the entire conversation to obtain the long term spectral characteristics of the signal and applying a compensation for the distortions during a second pass over the data. The compensated data is then passed to the speech recognition device. There are two main problems with this approach. First, since a single correction is applied for the entire conversation it is poorly suited to conversations in which the distortion varies rapidly. This may happen in conversations from cellular, cordless or radio telephones. Secondly, since it is necessary to process the entire conversation to obtain the appropriate correction before recognition commences, it is unsuitable for real time applications.
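The two-pass, whole-conversation correction described above can be sketched as follows. This is a minimal illustration only, not the procedure of the cited reference: it assumes the filterbank output is held as logarithmic band energies, so that a fixed channel distortion appears as an additive offset, and the function names are hypothetical.

```python
def long_term_mean(frames):
    """First pass: average every frequency band over the whole recording."""
    count = len(frames)
    return [sum(band) / count for band in zip(*frames)]


def compensate_whole_recording(frames):
    """Second pass: subtract the long-term average from every frame."""
    mean = long_term_mean(frames)
    return [[x - m for x, m in zip(frame, mean)] for frame in frames]


# Two log-energy frames with two bands each (made-up values).
clean = [[1.0, 2.0], [3.0, 6.0]]
# The same frames after a fixed, hypothetical channel offset per band.
distorted = [[1.5, 1.0], [3.5, 5.0]]

normalised_clean = compensate_whole_recording(clean)
normalised_distorted = compensate_whole_recording(distorted)
```

Both recordings normalise to the same frames, which shows why a constant distortion is removed. The sketch also makes both drawbacks visible: the mean cannot be computed until every frame has been seen, and a single mean cannot follow a distortion that changes during the recording.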
A preferable approach is to use a technique sometimes known as spectral shape adaptation (SSA). A recognition system using this technique provides information on the expected spectral characteristics of the signal to be recognised at each time instant, and this is compared to the equivalent actually present in that signal to provide a difference term. The difference term is then averaged over a number of successive signals (time averaging) to provide a correction term. A system of this kind has been described by Yunxin Zhao, "Iterative Self-Learning Speaker and Channel Adaptation under Various Initial Conditions", Proc IEEE ICASSP [11] pages 712-715. Here data is processed on a sentence by sentence basis. An input signal undergoes filterbank analysis to create successive vectors each indicating the variation in signal energy over a number of frequency bands. The vectors are processed by matching to speech model states. The parameters of the model state to which a vector has been matched are used to predict a value for that vector which would be expected according to the model. The difference between the vector and the predicted value is computed and time averaged with difference values obtained for earlier vectors from the sentence to determine the average distortion suffered by each sentence. The SSA parameters determined for one sentence are then used to process the next sentence.
Zhao's approach unfortunately does not work in more sophisticated speech recognition systems, for the following reason. In these systems, data vectors (expressed in frequency space) obtained from filterbank analysis are transformed from the frequency domain to some abstract feature space. When correctly applied this transformation improves recognition accuracy, because it reduces unwanted contributions to the speech signal in the form of information which is characteristic of the speaker while preserving features which are characteristic of the words spoken. The model states are represented in the same feature space to which the vectors are transformed. It is normal practice to discard higher order terms in the transformation from frequency space to feature space to improve recognition accuracy as mentioned above, which means there is a reduction in dimensionality; ie feature space vectors have fewer dimensions or vector elements than frequency space vectors. Information is therefore lost in the transformation from frequency space to feature space, and the model parameters no longer contain sufficient information to provide a unique estimate of the expected value in frequency space. Consequently, compensation in the frequency domain cannot be implemented as described in the Zhao reference mentioned above.
It is an object of the invention to provide a recognition system with distortion compensation.
The present invention provides a recognition system for associating multi-dimensional data vectors with predetermined models of relatively lower dimensionality, and including:
a) compensating means for compensating for distortion in data vectors,
b) transforming means for applying a transformation to data vectors after distortion compensation to reduce their dimensionality to that of the models,
c) matching means for associating each transformed data vector with an appropriate model,
d) inverting means for obtaining a data vector estimation from the associated model by inversion of the said transformation, and
e) deriving means for deriving a compensation from the data vector estimation and the data vector to which it corresponds for use in distortion compensation by the compensating means.
The invention has the advantage that it provides distortion compensation on the basis of model matching despite reduction in dimensionality. It has been discovered in accordance with the invention that it is possible to provide a data vector estimation for use in compensation despite loss of information prior to matching.
In a preferred embodiment, the inverting means is arranged to implement a pseudo-inverse of the said transformation and to provide an increase in model dimensionality to that of a data vector by including information in a manner such that operation of the transforming means upon the data vector estimation to reduce its dimensionality would result in loss of such information. This embodiment provides the advantage of relative ease of estimation, ie it has been found that a pseudo-inverse model transformation provides an acceptable estimation when information is included for this purpose in such a manner that it is removed in subsequent operation of the transforming means.
The transforming means may be arranged to apply to data vectors a transformation represented by the function A( ) and the inverting means may be arranged to implement a pseudo-inverse transformation represented by the function A⁻( ), the functions A( ) and A⁻( ) satisfying the relationship: A(A⁻(A(q)))=A(q) where q is some arbitrary vector.
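The pseudo-inverse relationship can be checked numerically with a small sketch. It assumes, purely for illustration, that the transformation is a truncated orthonormal cosine transform, for which the transpose of the retained rows acts as a pseudo-inverse. The vector sizes, the function names `transform` and `pseudo_inverse`, and the test values are all hypothetical.

```python
import math


def dct_matrix(n_in, n_out):
    """First n_out rows of the orthonormal DCT-II matrix for n_in inputs."""
    rows = []
    for k in range(n_out):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_in)
        rows.append([scale * math.cos(math.pi * (n + 0.5) * k / n_in)
                     for n in range(n_in)])
    return rows


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


N_BANDS, N_FEATURES = 8, 3            # illustrative sizes only
FULL = dct_matrix(N_BANDS, N_BANDS)
RETAINED = FULL[:N_FEATURES]          # rows kept by the transformation
DISCARDED = FULL[N_FEATURES:]         # rows dropped to reduce dimensionality


def transform(q):
    """The dimensionality-reducing transformation (8 elements to 3)."""
    return mat_vec(RETAINED, q)


def pseudo_inverse(p, extra=None):
    """Map a 3-element vector back to 8 elements.

    'extra' optionally adds components along the discarded rows; the
    transformation removes them again, so the identity still holds.
    """
    estimate = mat_vec(transpose(RETAINED), p)
    if extra is not None:
        filler = mat_vec(transpose(DISCARDED), extra)
        estimate = [e + f for e, f in zip(estimate, filler)]
    return estimate


q = [3.0, 1.5, -0.5, 2.0, 0.0, 4.0, 1.0, -2.0]
p = transform(q)
round_trip = transform(pseudo_inverse(p))
round_trip_extra = transform(pseudo_inverse(p, extra=[1.0, -2.0, 0.5, 0.0, 3.0]))
```

Here `round_trip` and `round_trip_extra` both agree with `p` to within rounding error, although `pseudo_inverse(p)` does not reproduce `q` itself: the information discarded by the transformation is not recovered, only replaced by something the transformation ignores.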
The deriving means may be arranged to derive a compensation from the data vector estimation and the data vector and preceding estimations and vectors of like kind. It may incorporate an infinite impulse response filter with an exponential time window implementing low pass filtering.
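One simple filter of this kind is a first-order recursive (one-pole) low pass filter, whose impulse response decays exponentially. The sketch below is an assumption about form only: the name `smooth_compensation` and the smoothing constant `alpha` are hypothetical and are not taken from the specification.

```python
def smooth_compensation(previous, difference, alpha=0.1):
    """One step of a first-order IIR low pass filter, applied per element.

    previous:   compensation vector carried over from earlier data vectors
    difference: difference vector derived from the current data vector
    alpha:      hypothetical smoothing constant, 0 < alpha <= 1; smaller
                values give a longer exponential time window
    """
    return [(1.0 - alpha) * c + alpha * d
            for c, d in zip(previous, difference)]


# Feeding a constant difference drives the compensation towards it.
compensation = [0.0, 0.0]
for _ in range(200):
    compensation = smooth_compensation(compensation, [1.0, -2.0])
```

Because each update needs only the previous compensation and the current difference, the filter runs frame by frame and does not require the whole conversation to be available.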
In a preferred embodiment, the system of the invention is arranged for speech recognition and each data vector has elements representing speech signal energy in a respective frequency interval. The deriving means may be arranged to produce compensation vectors for use in distortion compensation, and the compensating means may be arranged to add logarithms of data vector elements to logarithms of respective compensation vector elements. The transforming means is preferably arranged to apply a matrix transformation and the matching means to implement hidden Markov model matching; the inverting means may be arranged to produce data vector estimations from model states associated with transformed data vectors and having gaussian distributions. The matching means may employ model states which are mixtures of gaussian distributions and the inverting means may be arranged to produce data vector estimations therefrom.
The compensating means may alternatively provide for matrix multiplication to compensate for shifts in frequency space. The deriving means may be a Kalman filter.
The matching means may be arranged to implement segmental hidden Markov model matching.
The data vectors may at least partially comprise image information derived from a speaker's lips, and the compensating means may provide compensation for at least one of illumination level, direction and geometrical distortions of the picture.
The transforming means is preferably arranged to apply a cosine transformation in which some coefficients are discarded to reduce data vector dimensionality.
A system of the invention for speech recognition in the presence of distortion preferably includes inverting means and deriving means arranged to provide compensation for at least one of:
a) varying speech signal level,
b) change in microphone position,
c) change in microphone type,
d) change in speech signal line characteristics,
e) background noise level,
f) frequency shifts,
g) speaker illumination level,
h) illumination direction, and
i) geometrical distortion of a speaker's features.
The invention may alternatively provide compensation for distortions to signals other than speech. It may provide compensation for illumination level or view angle in a recognition system in which information consists partly or wholly of image information from a video camera pointing for example at a person's face.
The deriving means may incorporate an infinite impulse response filter or a Kalman filter for combining contributions from a plurality of data vector estimations to derive a compensation for distortion in data vectors.
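As an illustration of the Kalman filter alternative, the sketch below applies a standard scalar Kalman update to a single compensation element. It assumes a random-walk model for the distortion and hypothetical variance values; none of these choices is taken from the specification.

```python
def kalman_update(estimate, variance, observed_difference,
                  process_var=1e-3, observation_var=1.0):
    """One scalar Kalman step for a single compensation element.

    Assumes (hypothetically) that the true distortion follows a random
    walk with variance process_var per frame, and that each observed
    difference equals it plus noise of variance observation_var.
    """
    predicted_var = variance + process_var           # predict
    gain = predicted_var / (predicted_var + observation_var)
    new_estimate = estimate + gain * (observed_difference - estimate)
    new_variance = (1.0 - gain) * predicted_var      # update
    return new_estimate, new_variance


# With equal prior and observation variances the gain is one half.
first_estimate, first_variance = kalman_update(
    0.0, 1.0, 2.0, process_var=0.0, observation_var=1.0)

# Repeated observations of the same difference pull the estimate to it.
estimate, variance = 0.0, 1.0
for _ in range(500):
    estimate, variance = kalman_update(estimate, variance, 2.0)
```

Unlike a filter with a fixed smoothing constant, the gain here starts large and shrinks as the variance falls, so early data vectors move the compensation quickly and later ones refine it.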
In a preferred embodiment of the invention, the matching means is arranged to indicate which of a plurality of model states and model classes are associated with each transformed data vector, the deriving means is arranged to derive a respective compensation for each data vector, and the compensating means is arranged to apply compensation selectively in accordance with model class indicated by the matching means. The matching means may be arranged to implement partial traceback and to indicate matched model states which may at some later time become revised; in combination with the inverting means and the deriving means, it may provide correction for compensations produced on the basis of such matches.
In a further aspect, the present invention provides a method of associating predetermined multi-dimensional models with data vectors of higher dimensionality than the models, and including the steps of:
a) compensating for distortion in data vectors,
b) applying a transformation to data vectors after distortion compensation to reduce their dimensionality to that of the models,
c) associating each transformed data vector with a respective model,
d) inverting the said transformation to obtain a data vector estimation from the associated model, and
e) deriving a compensation from the data vector estimation and the data vector to which it corresponds and using the compensation to compensate data vectors for distortion.
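Steps (a) to (e) can be sketched as a single loop. This is a simplified illustration under stated assumptions, not the claimed implementation: data vectors are taken to be logarithmic band energies so that compensation is additive, the transformation is a truncated orthonormal cosine transform whose transpose serves as pseudo-inverse, matching is a nearest-mean search standing in for model matching, the difference is taken against the uncompensated data vector, and smoothing uses a hypothetical constant `alpha`. All names and values are invented for the example.

```python
import math


def dct_rows(n_in, n_out):
    """First n_out rows of the orthonormal DCT-II matrix for n_in inputs."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n_in)
             * math.cos(math.pi * (n + 0.5) * k / n_in)
             for n in range(n_in)]
            for k in range(n_out)]


def transform(rows, vec):
    """Step (b): reduce dimensionality."""
    return [sum(a * x for a, x in zip(row, vec)) for row in rows]


def invert(rows, feat):
    """Step (d): pseudo-inverse, here the transpose of the retained rows."""
    return [sum(row[n] * f for row, f in zip(rows, feat))
            for n in range(len(rows[0]))]


def match(models, feat):
    """Step (c): nearest model mean (a stand-in for model matching)."""
    return min(models,
               key=lambda m: sum((a - b) ** 2 for a, b in zip(m, feat)))


def recognise(frames, models, rows, alpha=0.2):
    comp = [0.0] * len(rows[0])
    matched = []
    for frame in frames:
        compensated = [x + c for x, c in zip(frame, comp)]   # step (a)
        feat = transform(rows, compensated)                  # step (b)
        model = match(models, feat)                          # step (c)
        estimate = invert(rows, model)                       # step (d)
        # Step (e): difference against the uncompensated vector, smoothed.
        diff = [e - x for e, x in zip(estimate, frame)]
        comp = [(1.0 - alpha) * c + alpha * d for c, d in zip(comp, diff)]
        matched.append(model)
    return matched, comp


ROWS = dct_rows(4, 2)                       # 4 bands reduced to 2 features
MODELS = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
OFFSET = [1.0, 1.0, 1.0, 1.0]               # hypothetical constant distortion
clean = [invert(ROWS, MODELS[i % 3]) for i in range(60)]
distorted = [[x + o for x, o in zip(frame, OFFSET)] for frame in clean]
matched, comp = recognise(distorted, MODELS, ROWS)
```

In this toy case the compensation converges towards the negative of `OFFSET`, so later frames are matched as if undistorted. Real data would not lie exactly on the model means, which is why the differences are averaged over many data vectors rather than applied directly.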
Inverting in step (d) is preferably implemented by means of a pseudo-inverse of the said transformation to provide an increase in model dimensionality to that of a data vector by including information in a manner such that application of the transformation to the data vector estimation to reduce its dimensionality results in loss of such information.
In an alternative aspect, in which transforming means and inverting means are not essential, the present invention provides a recognition system for associating data vectors with predetermined models, and including:
a) compensating means for compensating for distortion in data vectors corresponding to a plurality of different types of data, the compensating means being arranged to apply compensations associated with respective data types to each data vector to produce a plurality of compensated data vectors,
b) matching means arranged to associate compensated data vectors and models and to indicate for each data vector an appropriate model and class of model corresponding to a respective data type, and
c) deriving means for deriving a compensation from the model indicated by the matching means and the data vector with which it is associated for use by the compensating means in distortion compensation for a respective data type associated with the model class.
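The alternative aspect can be illustrated with a sketch that keeps one compensation per model class and updates only the compensation of the class that was matched. The class names `speech` and `noise`, the nearest-mean matching, the additive compensation and the constant `alpha` are all assumptions made for the example.

```python
def process_frame(frame, models, comps, alpha=0.2):
    """Match one data vector and update the matched class's compensation.

    models: {class name: list of model mean vectors}
    comps:  {class name: compensation vector}, updated in place
    """
    best = None
    for cls, means in models.items():
        # (a) one compensated copy of the data vector per data type
        compensated = [x + c for x, c in zip(frame, comps[cls])]
        # (b) compare that copy with the models of the same class
        for mean in means:
            dist = sum((m - v) ** 2 for m, v in zip(mean, compensated))
            if best is None or dist < best[0]:
                best = (dist, cls, mean)
    _, cls, mean = best
    # (c) derive the compensation for the matched class only
    diff = [m - x for m, x in zip(mean, frame)]
    comps[cls] = [(1.0 - alpha) * c + alpha * d
                  for c, d in zip(comps[cls], diff)]
    return cls, mean


models = {"speech": [[5.0, 5.0]], "noise": [[0.0, 0.0]]}
comps = {"speech": [0.0, 0.0], "noise": [0.0, 0.0]}
matched_class, matched_mean = process_frame([5.5, 5.5], models, comps)
```

After this call only the `speech` compensation has moved, so a distortion estimate learned from one type of data does not contaminate the compensation applied to another.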