This invention relates to a speech analysis system for processing speech which is subject to different forms of distortion. It is particularly (although not exclusively) relevant to recognition of words, languages or speakers in two way telephone conversations.
The problem to which the invention is addressed may be illustrated in one aspect by automatic speech recognition technology as used in telephone systems. Here the system's performance is often severely degraded by changes in a speech signal due to the position of the telephone handset or by the characteristics of the handset, telephone line and exchange. Attempts may be made to compensate for the problem by using some form of automatic gain control (AGC). Unfortunately this may be difficult to implement. For example, in two way telephone conversations in which the apparatus is connected using a two wire configuration, there are often substantial differences between the intensity levels of the speech signals of the persons speaking to one another. Using more sophisticated technology it is possible to intercept a call at a local exchange and to obtain separate signals from each telephone instrument. While this offers some improvement it does not address the difficult problem of reverse channel echo, which arises from contamination of the speech of one party to the conversation with that of the other.
The problem is not limited to differences in speech level. Many speech recognition systems attempt to adapt in some manner to the characteristics of the individual speaker or microphone. If speaker characteristics change frequently, compensation becomes very difficult.
Various methods are known for improving recognition performance by compensating for distortion or speaker characteristics. Current speech recognition systems convert the input signal from a waveform in the time domain into successive vectors in the frequency domain during a process sometimes known as “filterbank analysis”. These vectors are then matched to models of the speech signal. In some systems the vectors undergo a transformation prior to matching to speech models. It is possible to counteract signal distortion and speaker effects by applying some form of compensation to the vectors before transformation and matching. There are a number of known methods for determining the appropriate compensation. One such method is disclosed by Sadaoki Furui, “Cepstral Analysis Technique for Automatic Speaker Verification”, IEEE Trans Acoustics, Speech and Signal Processing, 29(2):254-272, April 1981. It involves averaging data obtained by filterbank analysis over an entire conversation to obtain the long term spectral characteristics of a signal and applying a compensation for distortions during a second pass over the data. The compensated data is then passed to a speech recognition device for matching to speech models.
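By way of illustration only, the two-pass long term averaging approach may be sketched as follows. The sketch assumes that each data vector is expressed in a logarithmic or cepstral domain, so that a fixed channel distortion appears as an additive offset, and that compensation consists of subtracting the conversation-wide mean. The function names are hypothetical and are not taken from the cited reference.

```python
def long_term_mean(frames):
    """First pass: average every vector element over the whole conversation."""
    count = len(frames)
    dims = len(frames[0])
    return [sum(frame[d] for frame in frames) / count for d in range(dims)]


def two_pass_compensate(frames):
    """Second pass: subtract the long term mean from every data vector."""
    mean = long_term_mean(frames)
    return [[x - m for x, m in zip(frame, mean)] for frame in frames]


# Two illustrative data vectors with two elements each (made-up values).
frames = [[1.0, 2.0], [3.0, 6.0]]
compensated = two_pass_compensate(frames)
```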
There are two main problems with this approach when applied to multi-speaker speech signals or single speaker speech signals where the form of distortion changes. First, since a single correction is applied for the entire conversation it is poorly suited to conversations in which the speaker characteristics change frequently. This may happen during telephone conversations or other dialogues. Secondly, it is necessary to process the entire conversation to obtain the appropriate correction before recognition commences, which makes it unsuitable for real time applications.
A preferable approach is to use a technique sometimes known as spectral shape adaptation (SSA). A recognition system using this technique provides information on the expected spectral characteristics of the signal to be recognised at each time instant, and this is compared to the equivalent actually present in that signal to provide a difference term. The difference term is then averaged over a number of successive signals (time averaging) to provide a correction term.
A system of this kind has been described by Yunxin Zhao, “Iterative Self-Learning Speaker and Channel Adaptation under Various Initial Conditions”, Proc IEEE ICASSP [11] pages 712-715. Here data is processed on a sentence by sentence basis. An input signal undergoes filterbank analysis to create successive vectors each indicating the variation in signal energy over a number of frequency bands. The vectors are processed by matching to speech model states. The parameters of the model state to which a vector has been matched are used to predict a value for that vector which would be expected according to the model. The difference between the vector and the predicted value is computed and time averaged with difference values obtained for earlier vectors from the sentence to determine the average distortion suffered by each sentence. The SSA parameters determined for one sentence are then used to process the next sentence.
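The sentence by sentence procedure may be sketched as follows, purely as an illustration. Several simplifications are assumed that are not stated in the cited paper: each model state is represented only by a mean vector, matching is reduced to choosing the nearest mean, the predicted value is that mean, and the correction is applied by subtraction. All names are hypothetical.

```python
def nearest_state(vector, state_means):
    """Stand-in for model matching: pick the state mean closest to the vector."""
    return min(
        state_means,
        key=lambda mean: sum((v - m) ** 2 for v, m in zip(vector, mean)),
    )


def sentence_bias(sentence, state_means, prior_bias):
    """Average (vector - predicted value) over one sentence.

    prior_bias is the correction obtained from the previous sentence; it is
    removed before matching, and the new average is returned for use on the
    next sentence.
    """
    dims = len(prior_bias)
    total = [0.0] * dims
    for vector in sentence:
        corrected = [v - b for v, b in zip(vector, prior_bias)]
        predicted = nearest_state(corrected, state_means)
        for d in range(dims):
            total[d] += vector[d] - predicted[d]
    return [t / len(sentence) for t in total]


state_means = [[0.0, 0.0], [10.0, 10.0]]
first_sentence = [[1.0, 1.0], [11.0, 11.0]]
second_sentence = [[1.5, 1.5], [11.5, 11.5]]

bias_1 = sentence_bias(first_sentence, state_means, [0.0, 0.0])
bias_2 = sentence_bias(second_sentence, state_means, bias_1)
```

In this sketch a single bias is carried from one sentence to the next, whoever spoke it.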
Zhao's approach is unfortunately not appropriate where there are two or more speakers or forms of distortion. SSA parameters derived from the speech of one speaker, or from speech subject to a particular form of distortion, can then be applied to the speech of a different speaker or to speech subject to a different form of distortion.
It is an object of the invention to provide a speech analysis system arranged to counteract multiple forms of distortion.
The present invention provides a speech analysis system for processing speech which has undergone distortion, and including compensating means for modifying data vectors obtained from speech to compensate for distortion, matching means for matching modified data vectors to models, and deriving means for deriving distortion compensation from data vectors for use by the compensating means; characterised in that:
a) the compensating means is arranged to compensate for a plurality of forms of distortion by modifying each data vector with a plurality of compensations to provide a respective set of modified data vectors compensated for respective forms of distortion,
b) the matching means is arranged to indicate the modified data vector in each set exhibiting the greatest matching probability and the form of distortion for which it has been compensated, and
c) the deriving means is arranged to derive compensation on the basis of the modified data vector in each set exhibiting greatest matching probability for use by the compensating means in compensating for the form of distortion for which that modified data vector was compensated.
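Features (a) to (c) may be sketched for a single data vector as follows. This is an illustration under stated assumptions rather than a definitive implementation: compensation is taken to be additive, matching probability is approximated by a unit-variance Gaussian log score against a model mean, and the winning compensation is nudged by a fixed fraction `rate` of the residual between the model mean and the modified vector. The names `process_frame`, `rate` and the labels "near" and "far" are hypothetical.

```python
def log_score(vector, mean):
    """Unit-variance Gaussian log likelihood, up to a constant (assumption)."""
    return -0.5 * sum((v - m) ** 2 for v, m in zip(vector, mean))


def process_frame(vector, compensations, model_means, rate=0.1):
    """Apply every compensation, keep the best match, update only the winner.

    compensations maps a distortion label to its current compensation vector
    and is updated in place. Returns (winning label, index of matched model).
    """
    best = None
    for form, comp in compensations.items():
        # (a) one modified data vector per form of distortion
        modified = [v + c for v, c in zip(vector, comp)]
        for index, mean in enumerate(model_means):
            score = log_score(modified, mean)
            if best is None or score > best[0]:
                best = (score, form, index, modified)

    # (b) the best-matching modified vector identifies the form of distortion
    _, form, index, modified = best

    # (c) derive a new compensation for that form of distortion only
    mean = model_means[index]
    compensations[form] = [
        c + rate * (m - x)
        for c, m, x in zip(compensations[form], mean, modified)
    ]
    return form, index


compensations = {"near": [0.0], "far": [5.0]}
model_means = [[10.0]]
winner = process_frame([5.2], compensations, model_means)
```

Here the vector `[5.2]` matches best after the "far" compensation, so only that compensation is adjusted; the "near" compensation is left untouched.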
The invention provides the advantage that compensation differentiates between forms of distortion so that the likelihood of correct speech analysis is improved.
The invention may be arranged to analyse speech from a plurality of speech sources each associated with a respective form of distortion, and wherein:
a) the compensating means is arranged to provide modified data vectors in each set compensated for distortion associated with respective speech sources,
b) the matching means is arranged to implement models divided into classes associated with speech and non-speech, and to indicate the model class associated with the modified data vector in each set exhibiting the greatest matching probability, and
c) the deriving means is arranged to derive a compensation from modified data vectors associated with speech class models.
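The restriction of derivation to speech class models may be illustrated by the following sketch. It assumes two speech sources labelled "A" and "B", one model per class represented by a mean vector, and a negative squared distance as a stand-in for matching probability. None of these names or simplifications come from the specification itself.

```python
SPEECH, NON_SPEECH = "speech", "non-speech"


def best_match(modified_by_source, models):
    """Return (source, model class, model mean) for the highest-scoring pairing."""
    best = None
    for source, vector in modified_by_source.items():
        for model_class, mean in models:
            score = -sum((v - m) ** 2 for v, m in zip(vector, mean))
            if best is None or score > best[0]:
                best = (score, source, model_class, mean)
    return best[1], best[2], best[3]


def contribution(modified_by_source, models):
    """Return (source, residual), or (source, None) if the match is non-speech."""
    source, model_class, mean = best_match(modified_by_source, models)
    if model_class != SPEECH:
        return source, None  # no compensation is derived from non-speech
    vector = modified_by_source[source]
    return source, [m - v for m, v in zip(mean, vector)]


models = [(SPEECH, [10.0]), (NON_SPEECH, [0.0])]
talking = contribution({"A": [9.5], "B": [4.0]}, models)
silence = contribution({"A": [0.2], "B": [3.0]}, models)
```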
The system of the invention may be arranged to update non-speech models within the matching means. The matching means may be arranged to identify the modified data vector in each set exhibiting the greatest matching probability taking into account earlier matching and speech recognition constraints, in order to assess matching probability over a sequence of data vectors.
The deriving means may be arranged to derive a compensation by averaging over a contribution from the modified data vector in each set exhibiting the greatest matching probability and the model with which it is matched and preceding contributions of like kind. Averaging may be carried out by infinite impulse response filtering means.
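One simple form of infinite impulse response averaging is a first-order recursive filter, sketched below. The filter order, the coefficient `alpha` and the class name are assumptions made for illustration; the text above requires only that averaging be performed by infinite impulse response filtering.

```python
class IIRAverager:
    """First-order recursive average: y[n] = alpha * y[n-1] + (1 - alpha) * x[n]."""

    def __init__(self, dims, alpha=0.9):
        if not 0.0 <= alpha < 1.0:
            raise ValueError("alpha must lie in [0, 1)")
        self.alpha = alpha
        self.state = [0.0] * dims

    def update(self, contribution):
        """Blend a new contribution with all preceding ones and return the average."""
        self.state = [
            self.alpha * s + (1.0 - self.alpha) * c
            for s, c in zip(self.state, contribution)
        ]
        return list(self.state)


averager = IIRAverager(dims=1, alpha=0.5)
first = averager.update([2.0])
second = averager.update([2.0])
```

Each earlier contribution is weighted by a further factor of `alpha` at every update, so older contributions decay geometrically without any history of past vectors being stored.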
The matching means may be arranged to implement hidden Markov model matching based on speech models with states having matching probability distributions and associated estimation values for vectors matching therewith; the estimation values may be mean values of respective probability distributions; the deriving means may be arranged to employ estimation values to derive compensation. Each model may have one or more states.
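Hidden Markov model matching over a sequence of vectors may be illustrated with a minimal Viterbi search. The sketch assumes one-dimensional observations, one Gaussian distribution per state, and made-up transition probabilities; the mean of each matched state serves as the estimation value. It is not intended to represent any particular recogniser.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi(observations, means, variances, log_trans, log_start):
    """Return the most probable state sequence for the observations."""
    n_states = len(means)
    scores = [
        log_start[s] + gaussian_log_pdf(observations[0], means[s], variances[s])
        for s in range(n_states)
    ]
    back = []
    for x in observations[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s given the scores so far.
            prev = max(range(n_states), key=lambda p: scores[p] + log_trans[p][s])
            pointers.append(prev)
            new_scores.append(
                scores[prev]
                + log_trans[prev][s]
                + gaussian_log_pdf(x, means[s], variances[s])
            )
        scores = new_scores
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


means = [0.0, 5.0]
variances = [1.0, 1.0]
half = math.log(0.5)
path = viterbi([0.1, 0.2, 4.9, 5.1], means, variances,
               [[half, half], [half, half]], [half, half])
estimation_values = [means[s] for s in path]
```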
The deriving means and the compensating means may be arranged in combination to avoid implementing compensation not associated with a speech source. The matching means may employ models in different classes associated with respective types of acoustic data source, such as speech and noise sources, and may indicate that compensation is not to be derived in response to matching to a noise source. It may be arranged to adapt speech models to increase conformity with data vectors.
In one embodiment, the system of the invention includes means for generating data vectors having elements representing logarithmically expressed averages over respective frequency intervals, and wherein:
a) the compensating means is arranged to provide a set of modified data vectors by adding to each data vector a set of compensation vectors associated with respective forms of distortion, and
b) the deriving means is arranged to derive an updated compensation vector from a first contribution from the modified data vector in each set exhibiting the greatest matching probability and an estimation vector from the model with which it is matched, together with earlier like contributions associated with the same model class.
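The reason additive compensation suits logarithmically expressed data vectors may be seen from the following sketch: a uniform gain change in the power spectrum becomes a constant offset in every vector element, which adding a compensation vector can remove. The band edges, power values, distortion labels and function names are illustrative assumptions.

```python
import math


def log_band_averages(power_spectrum, band_edges):
    """Log of the average power within each (start, stop) frequency interval."""
    vector = []
    for lo, hi in band_edges:
        band = power_spectrum[lo:hi]
        vector.append(math.log(sum(band) / len(band)))
    return vector


def modified_set(vector, compensation_vectors):
    """Add each compensation vector to the data vector, one result per label."""
    return {
        form: [v + c for v, c in zip(vector, comp)]
        for form, comp in compensation_vectors.items()
    }


power = [1.0, 1.0, 4.0, 4.0]
bands = [(0, 2), (2, 4)]

clean = log_band_averages(power, bands)
attenuated = log_band_averages([0.5 * p for p in power], bands)

# Halving the power shifts every element by the same amount, log(2).
offset = [c - a for c, a in zip(clean, attenuated)]

restored = modified_set(
    attenuated,
    {"quiet line": offset, "no distortion": [0.0, 0.0]},
)
```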
The invention may include a respective channel for transfer of each modified data vector to the matching means.
In another aspect, the invention provides a method for analysing speech which has undergone distortion including the steps of:
a) modifying speech data vectors to compensate for distortion,
b) matching modified data vectors to models, and
c) deriving and applying distortion compensation, characterised in that:
i) step (a) comprises applying a plurality of compensations to each data vector to provide a respective set of modified data vectors compensated for respective forms of distortion,
ii) step (b) comprises identifying the modified data vector in each set exhibiting the greatest matching probability and the form of distortion for which it was compensated, and
iii) step (c) includes deriving a compensation from the modified data vector in each set exhibiting the greatest matching probability for use in compensating for the form of distortion for which that vector was compensated.
The system of the invention may be employed for speech recognition, or alternatively for other analysis purposes such as language identification, recognition or assessment of a speaker's age, gender or other attributes. It may be used to detect which of a variety of speakers is talking at a given instant.