This invention relates to the field of speech recognition and more particularly to a method and apparatus for normalizing channel specific feature elements in a signal derived from a spoken utterance. In general, the method aims to compensate for changes induced in the feature elements of the signal by the transmission of the signal through a certain communication channel.
In a typical speech recognition application, the user inputs a spoken utterance into an input device such as a microphone or telephone set. If valid speech is detected, the speech recognition layer is invoked in an attempt to recognize the unknown utterance. In a commonly used approach, the input speech signal is first pre-processed to derive a sequence of speech feature elements characterizing the input speech in terms of certain parameters. The sequence of speech feature elements is then processed by a recognition engine to derive an entry from a speech recognition dictionary that most likely matches the input speech signal. Typically, the entries in the speech recognition dictionary are made up of symbols, each symbol being associated to a speech model.
Prior to the use of a speech recognition system, the entries in the speech recognition dictionary as well as the speech models are trained to establish a reference memory and a reference speech model set. For speaker-independent systems, training is performed by collecting samples from a large pool of users. Typically, for a speaker-independent system, a single speech model set is used for all speakers, while in a speaker-specific system, each user is assigned a respective speech model set. Speaker-specific systems are trained by collecting speech samples from the end user. For example, a voice dictation system where a user speaks and the device translates the user's words into text will most likely be trained by the end user (speaker-specific) since this manner of training can achieve a higher recognition accuracy. In the event that someone other than the original user wants to use the same device, that device can be retrained or an additional set of speech models can be trained and stored for the new user. As the number of users becomes large, storing a separate speaker-specific speech model set for each user becomes prohibitive in terms of memory requirements. Therefore, as the number of users becomes large, speech recognition systems tend to be speaker-independent.
In addition to interacting with different users, it is common for a speech recognition system to receive the signal containing the spoken utterance on which the speech recognition process is to be performed over different communication channels. In a specific example, a speech recognition system operating in a telephone network may process speech signals originating from a wireline communication channel or a wireless communication channel, among others. Generally, such speech recognition systems use a common speech model set across the different communication channels. However, the variability between communication channels results in variability in the acoustic characteristics of the speech signals. Consequently, the recognition performance of a speech recognition system is adversely affected since the speech models in the common speech model set do not reflect the acoustic properties of the speech signal, in particular the changes induced in the signal by the channel used for transporting the signal. Since different channels can be used to transport the signal toward the speech recognition apparatus, and each channel induces different changes in the signal, it is difficult to adapt the speech models so as to accurately compensate for the channel specific distortions introduced in the feature elements.
A commonly used technique to overcome this problem is exhaustive modeling. In exhaustive modeling, each communication channel that the speech recognition system is adapted to support is associated to a channel specific speech model set. For each channel, a plurality of speech samples are collected from the end-users in order to train a channel specific speech model set.
A deficiency in exhaustive modeling is that it requires a large amount of training data for each communication channel. This represents an important commissioning expense as the number of environmental and channel conditions increases.
Another common approach to improving the performance of speech recognition systems is adaptation: adjusting either speech models or features in a manner appropriate to the current channel and environment. A typical adaptation technique is model adaptation. Generally, model adaptation starts with reference speech models derived from one or more spoken utterances over a reference communication channel (say, a wireline communication channel) and then, based on a small amount of speech from a new communication channel (say, a wireless communication channel), new channel-specific models are iteratively generated. For a more detailed explanation of model adaptation, the reader is invited to consult R. Schwartz and F. Kubala, Hidden Markov Models and Speaker Adaptation, Speech Recognition and Understanding: Recent Advances, Eds: P. Laface and R. De Mori, Springer-Verlag, 1992; L. Neumeyer, A. Sankar and V. Digalakis, A Comparative Study of Speaker Adaptation Techniques, Proc. of EuroSpeech '95, pp. 1127-1130, 1995; J.-L. Gauvain, G.-H. Lee, Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, IEEE Trans. on Speech and Audio Processing, Vol. 2, April 1994, pp. 291-298; and C. J. Leggetter, P. C. Woodland, Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models, Computer Speech and Language, Vol. 9, 1995, pp. 171-185. The content of these documents is hereby incorporated by reference.
A deficiency in the above-described methods is that, in order to obtain channel-specific models providing reasonable performance, they require a relatively large amount of data, which may not be readily available.
Consequently, there is a need in the industry for providing a method and apparatus to compensate for channel specific changes induced in the feature elements of a signal derived from a spoken utterance, the signal being intended for processing by a speech recognition apparatus.
In accordance with a broad aspect, the invention provides an apparatus for normalizing speech feature elements in a signal derived from a spoken utterance. The apparatus has an input for receiving the speech feature elements which are transmitted over a certain channel. The certain channel is a path or link over which data passes between two devices and is characterized by a channel type that belongs to a group of N possible channel types. Non-limiting examples of possible channel types include a hand-held channel (a path or link established by using a hand-held telephone set), a wireless channel (a path or link established by using a wireless telephone set) and a hands-free channel (a path or link established by using a hands-free telephone set) among a number of other possible channel types.
The apparatus includes a processing unit coupled to the input. The processing unit alters or skews the speech feature elements to simulate a transmission over a reference channel that is other than the channel over which the transmission actually takes place.
The signal output by the apparatus is suitable for processing by a speech recognition apparatus.
One of the benefits of this invention is an increase in speech recognition accuracy, obtained by reducing the variability that the particular transmission channel introduces in the speech signal on which the recognition is performed.
The reference channel can correspond to a real channel (for example, the hands-free channel) or it can be a virtual channel. Such a virtual channel does not physically exist. It is artificially defined by certain transmission characteristics that are arbitrarily chosen.
In a specific non-limiting example of implementation, the processing unit classifies a speech feature element that contains channel specific distortions in a class of speech feature elements selected from a set of possible classes of speech feature elements. This classification is done at least in part on the basis of acoustic characteristics of the speech feature element. The processing unit then derives transformation data from a set of possible transformation data on the basis of the class to which the feature element has been assigned. The processing unit processes the feature element containing the channel specific distortion on the basis of the transformation data to generate a normalized speech feature element.
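By way of illustration only, the classify, select and transform sequence described above may be sketched as follows. The sketch assumes that classification is performed by nearest-centroid matching and that the transformation data is a single additive offset per class; neither assumption is stated in this specification. The names CENTROIDS, OFFSETS, classify and normalize, as well as all numeric values, are hypothetical.

```python
from math import dist

# Hypothetical class centroids in feature space (illustrative values only).
CENTROIDS = {
    "class_a": [0.0, 0.0],
    "class_b": [4.0, 4.0],
}

# Hypothetical transformation data: one additive offset per class.
# An additive offset is an assumption made for this sketch.
OFFSETS = {
    "class_a": [0.5, -0.5],
    "class_b": [-1.0, 1.0],
}


def classify(feature):
    """Assign the feature vector to the class with the nearest centroid."""
    return min(CENTROIDS, key=lambda name: dist(feature, CENTROIDS[name]))


def normalize(feature):
    """Classify the feature vector, then apply the offset of its class."""
    offset = OFFSETS[classify(feature)]
    return [x + o for x, o in zip(feature, offset)]
```

In this sketch, a feature vector lying near the second centroid would be assigned to "class_b" and shifted by that class's offset; other classification criteria or transformation forms could be substituted without changing the overall sequence.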
In accordance with another broad aspect, the invention provides a method for normalizing speech feature elements in a signal derived from a spoken utterance. The method comprises receiving the speech feature elements transmitted over a certain channel, the certain channel being characterized by a channel type that belongs to a group of N possible channel types. The method includes altering the speech feature elements at least in part on the basis of the channel type to generate normalized speech feature elements. The normalized speech feature elements simulate a transmission of the speech feature elements over a reference channel that is other than the channel over which the actual transmission takes place.
In accordance with another broad aspect, the invention provides a computer readable medium comprising a program element suitable for execution by a computing apparatus for implementing the above-described method.
In accordance with another broad aspect, the invention further provides a computer readable storage medium holding a data structure that stores a plurality of transformation sets. Each transformation set is associated to an identifier that uniquely distinguishes it from the other transformation sets. Each transformation set includes at least one transformation data element that allows a speech feature element to be normalized.
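By way of illustration only, one possible in-memory layout for such a data structure is sketched below. It assumes that the identifier is a text label such as a channel type name and that each transformation data element is an offset vector keyed by class; the names TransformationSet and TransformationStore are hypothetical and do not appear in this specification.

```python
from dataclasses import dataclass, field


@dataclass
class TransformationSet:
    """One transformation set, uniquely identified by `identifier`."""
    identifier: str
    # Transformation data elements; here one offset vector per class,
    # which is an assumption made for illustration.
    elements: dict = field(default_factory=dict)


class TransformationStore:
    """Holds transformation sets and rejects duplicate identifiers."""

    def __init__(self):
        self._sets = {}

    def add(self, tset):
        """Store a transformation set, enforcing identifier uniqueness."""
        if tset.identifier in self._sets:
            raise ValueError(f"duplicate identifier: {tset.identifier}")
        self._sets[tset.identifier] = tset

    def get(self, identifier):
        """Return the transformation set stored under `identifier`."""
        return self._sets[identifier]
```

Rejecting a duplicate identifier is one simple way to keep the transformation sets uniquely distinguishable from one another.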
In accordance with another broad aspect, the invention provides an apparatus for generating a transformation set associated to a given communication channel. The apparatus has a stereo database, a processing unit and an output. The stereo database includes a plurality of data element pairs, each data element pair being derived from a spoken utterance. Each data element pair includes a first data constituent derived from the spoken utterance conveyed over a reference communication channel and a second data constituent derived from the spoken utterance conveyed over the given communication channel.
The processing unit is coupled to the stereo database. The processing unit groups the data element pairs to generate a set of classes. A certain data element pair is associated to a certain class at least in part on the basis of acoustic characteristics of the certain data element pair. The processing unit then processes the set of classes generated to derive a transformation set, the transformation set including a plurality of transformation data elements.
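By way of illustration only, the derivation of a transformation set from a stereo database may be sketched as follows. The sketch assumes that each data element pair is assigned to a class by nearest-centroid matching on its second data constituent (the one conveyed over the given communication channel), and that each transformation data element is the mean difference between the first and second data constituents of the pairs in a class. Both assumptions, and the names nearest and build_transformation_set, are hypothetical.

```python
from collections import defaultdict
from math import dist


def nearest(centroids, vector):
    """Return the label of the centroid closest to `vector`."""
    return min(centroids, key=lambda label: dist(vector, centroids[label]))


def build_transformation_set(stereo_pairs, centroids):
    """Group (reference, channel) pairs by class and average their differences."""
    grouped = defaultdict(list)
    for reference, channel in stereo_pairs:
        # Assumption: the class is chosen from the channel-side constituent.
        grouped[nearest(centroids, channel)].append((reference, channel))

    transformation_set = {}
    for label, pairs in grouped.items():
        dims = len(pairs[0][0])
        # Mean of (reference - channel) in each dimension of the class.
        transformation_set[label] = [
            sum(ref[d] - chan[d] for ref, chan in pairs) / len(pairs)
            for d in range(dims)
        ]
    return transformation_set
```

Under these assumptions, adding the offset of a class to a feature element received over the given communication channel moves it, on average, toward the corresponding feature element of the reference communication channel.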
For the purpose of this specification, the expression “speech feature element” is used to designate an element that alone or in combination with other such elements characterizes or defines acoustically speech sound information. Examples of speech feature elements include feature vectors. The feature vectors may be comprised of spectral parameters, audio signal segments, band energies and cepstral parameters, among others.
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.