1. Field of the Invention
The present invention relates to the field of automatic speech recognition. More particularly, the present invention relates to a method of adapting a neural network of an automatic speech recognition device, a corresponding adapted neural network and a corresponding automatic speech recognition device.
2. Description of the Related Art
An automatic speech recognition device is an apparatus which is able to recognise voice signals such as words or sentences uttered in a predefined language.
An automatic speech recognition device may be employed for instance in devices for converting voice signals into written text or for detecting a keyword allowing a user to access a service. Further, an automatic speech recognition device may be employed in telephone systems supporting particular services, such as providing a user with the telephone number of a given telephone subscriber.
In order to recognise a voice signal, an automatic speech recognition device performs a number of steps, which are briefly described hereinafter.
The automatic speech recognition device receives the voice signal to be recognised through a phonic channel. Examples of phonic channels are a channel of a fixed telephone network, of a mobile telephone network, or the microphone of a computer.
The voice signal is first converted into a digital signal. The digital signal is periodically sampled with a certain sampling period, typically of a few milliseconds. Each sample is commonly termed a “frame”. Subsequently, each frame is associated with a set of spectral parameters describing the voice spectrum of the frame.
Then, such a set of spectral parameters is sent to a pattern matching block. For each phoneme of the language for which the automatic speech recognition device is intended, the pattern matching block calculates the probability that the frame associated with the set of spectral parameters corresponds to that phoneme.
As is known, a phoneme is the smallest portion of a voice signal such that replacing a first phoneme with a second phoneme in a voice signal in a certain language may yield two different signifiers of the language.
A voice signal comprises a sequence of phonemes and transitions between successive phonemes.
For simplicity, in the following description and in the claims, the term “phoneme” will comprise both phonemes as defined above and transitions between successive phonemes.
Thus, generally speaking, the pattern matching block calculates a high probability for the phoneme corresponding to an input frame, a low probability for phonemes with voice spectrum similar to the voice spectrum of the input frame, and a zero probability for phonemes with a voice spectrum different from the voice spectrum of the input frame.
However, frames corresponding to the same phoneme may be associated with different sets of spectral parameters. This is because the voice spectrum of a phoneme depends on several factors, such as the characteristics of the phonic channel, of the speaker and of the noise affecting the voice signal.
Phoneme probabilities associated with successive frames are employed, together with other language data (such as, for instance, vocabulary, grammar rules, and/or syntax rules), to reconstruct words or sentences corresponding to the sequence of frames.
As already mentioned, the step of calculating phoneme probabilities of an input frame is performed by a pattern matching block. For instance, the pattern matching block may be implemented through a neural network.
A neural network is a network comprising at least one computation unit, which is called “neuron”.
A neuron is a computation unit adapted to compute an output value as a function of a plurality of input values (also called “pattern”). A neuron receives the plurality of input values through a corresponding plurality of input connections. Each input connection is associated with a respective weight. Each input value is first multiplied by the respective weight. Then, the neuron sums all the weighted input values. It may also add a bias, i.e.:
$$a = \sum_{i} w_i x_i + b, \qquad [1]$$
wherein $a$ is the weighted linear combination of the input values, $w_i$ is the i-th input connection weight, $x_i$ is the i-th input value and $b$ is the bias. In the following, for simplicity, it will be assumed that the bias is zero.
Subsequently, the neuron transforms the linear sum in [1] according to an activation function g(.). The activation function may be of different types. For instance, it may be either a Heaviside function (threshold function), or a sigmoid function. A common sigmoid function is defined by the following formula:
$$g(a) = \frac{1}{1 + \exp(-a)}. \qquad [2]$$
This type of sigmoid function is an increasing, [0;1]-limited function; thus, it is adapted to represent a probability function.
The activation function may also be a linear function, e.g. g(a)=k*a, where k is a constant; in this case, the neuron is termed “linear neuron”.
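The neuron computation described above can be sketched as follows. This is a minimal illustration only, not a definitive implementation: the function names are hypothetical, the threshold function is assumed to return 1 at a = 0, and the sigmoid of formula [2] is evaluated in a numerically stable form.

```python
import math


def weighted_sum(weights, inputs, bias=0.0):
    """Formula [1]: a = sum_i w_i * x_i + b (bias defaults to zero, as in the text)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def heaviside(a):
    """Threshold activation; returning 1 at a == 0 is an assumed convention."""
    return 1.0 if a >= 0.0 else 0.0


def sigmoid(a):
    """Formula [2]: g(a) = 1 / (1 + exp(-a)), written to avoid overflow for large |a|."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def linear(a, k=1.0):
    """Linear activation g(a) = k * a, giving a 'linear neuron'."""
    return k * a


def neuron_output(weights, inputs, activation=sigmoid, bias=0.0):
    """Output of one neuron: the activation applied to the weighted sum."""
    return activation(weighted_sum(weights, inputs, bias))
```

For example, `neuron_output([0.5, -0.5], [1.0, 1.0])` evaluates the sigmoid at a = 0, whereas passing `activation=heaviside` or `activation=linear` selects one of the other activation types mentioned above.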
Typically, a neural network employed in an automatic speech recognition device is a multi-layer neural network.
A multi-layer neural network comprises a plurality of neurons, which are grouped in two or more cascaded stages. Typically, neurons of a same stage have the same activation function.
A multi-layer neural network typically comprises an input stage, comprising a buffer for storing an input pattern. In the speech recognition field, such an input pattern comprises a set of spectral parameters of an input frame, and sets of spectral parameters of a few frames preceding and following the input frame. In total, a pattern typically comprises sets of spectral parameters of seven or nine consecutive frames.
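The construction of an input pattern from consecutive frames can be sketched as follows. This is an illustrative sketch under stated assumptions: `build_pattern` is a hypothetical helper, each frame is represented as a list of spectral parameters, and frames missing at the beginning or end of the signal are replaced by repeating the first or last frame (the text does not specify how boundaries are handled).

```python
def build_pattern(frames, t, context=3):
    """Concatenate the spectral parameters of frame t and of `context` frames
    on each side, i.e. 2 * context + 1 frames in total (7 for context=3,
    9 for context=4)."""
    if not 0 <= t < len(frames):
        raise IndexError("frame index out of range")
    last = len(frames) - 1
    pattern = []
    for j in range(t - context, t + context + 1):
        j = min(max(j, 0), last)  # assumed boundary rule: repeat the edge frame
        pattern.extend(frames[j])
    return pattern
```

With, say, 2 spectral parameters per frame and `context=3`, the resulting pattern holds 14 values, which is the number of input connections each intermediate stage neuron would then need.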
The input stage is typically connected to an intermediate (or “hidden”) stage, comprising a plurality of neurons. Each input connection of each intermediate stage neuron is adapted to receive from the input stage a respective spectral parameter. Each intermediate stage neuron computes a respective output value according to formulas [1] and [2].
The intermediate stage is typically connected to an output stage, also comprising a plurality of neurons. Each output stage neuron has a number of input connections which is equal to the number of intermediate stage neurons. Each input connection of each output stage neuron is connected to a respective intermediate stage neuron. Each output stage neuron computes a respective output value as a function of the intermediate stage output values.
In the speech recognition field, each output stage neuron is associated with a respective phoneme. Thus, the number of output stage neurons is equal to the number of phonemes. The output value computed by each output stage neuron is the probability that the frame associated with the input pattern corresponds to the phoneme associated with the output stage neuron.
For simplicity, a multi-layer network with a single intermediate stage has been described above. However, a multi-layer network may comprise a higher number of cascaded intermediate stages (typically two or three) between the input stage and the output stage.
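The cascade of stages described above can be sketched as a forward computation. This is a minimal sketch, not the network of any particular recognition device: the function names are hypothetical, the bias is zero as assumed in the text, intermediate stages use the sigmoid of formula [2], and the output stage is assumed to use a softmax so that the phoneme probabilities sum to one (the text only states that each output value is a probability).

```python
import math


def sigmoid(a):
    """Formula [2], evaluated in a numerically stable form."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def stage_sums(weights, inputs):
    """Formula [1] for every neuron of a stage; `weights` has one row per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(values):
    """Assumed output normalisation: positive values that sum to one."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pattern, hidden_stages, output_weights):
    """Propagate an input pattern through the intermediate stages and the
    output stage, returning one probability per phoneme."""
    activations = list(pattern)
    for weights in hidden_stages:
        activations = [sigmoid(a) for a in stage_sums(weights, activations)]
    return softmax(stage_sums(output_weights, activations))
```

Here `hidden_stages` is a list with one weight matrix per intermediate stage, so a network with two or three intermediate stages only requires a longer list.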
For a neural network to acquire the ability to compute the phoneme probabilities for each input frame, the neural network must be trained.
Training is typically performed through a training set, i.e. a set of sentences that, once uttered, comprise all the phonemes of the language. Such sentences are usually uttered by different speakers, so that the network is trained in recognising voice signals uttered with different voice tones, accents, or the like. Moreover, different phonic channels are usually employed, such as different fixed or mobile telephones, or the like. Finally, the sentences are uttered in different environments (car, street, train, or the like), so that the neural network is trained in recognising voice signals affected by different types of noise.
Therefore, training a network through such a training set results in a “generalist” neural network, i.e. a neural network whose performance, expressed as a word (or phoneme) recognition percentage, is substantially homogeneous and independent of the speaker, the phonic channel, the environment, or the like.
However, in some cases, an “adapted” neural network may be desirable, i.e. a neural network whose performance is improved when recognising a predefined set of voice signals. For instance, a neural network may be:
speaker-adapted: performance is improved when voice signals are uttered by a certain speaker;
channel-adapted: performance is improved when voice signals are carried through a certain phonic channel;
vocabulary-adapted: performance is improved when voice signals comprise a predefined set of words; or
application-adapted: performance is improved when voice signals have application-dependent features (type of noise and type of speaker, type of channel and type of vocabulary, etc.).
In the following description and claims, the expression “adaptation set” will refer to a predetermined set of voice signals for which a neural network is adapted. An adaptation set comprises voice signals with common features, such as voice signals uttered by a certain speaker, as well as voice signals comprising a certain set of words, as well as voice signals affected by a certain noise type, or the like.
In the art, methods for adapting a neural network are known, i.e. methods for improving the performance of a given generalist neural network for a given adaptation set.
For instance, J. Neto et al. “Speaker-adaptation for hybrid HMM-ANN continuous speech recognition system”, Proc. of Eurospeech 1995 presents and evaluates some techniques for speaker-adaptation of a hybrid HMM-artificial neural network (ANN) continuous speech recognition system. For instance, the LIN technique employs a trainable Linear Input Network (LIN) to map the speaker-dependent input vectors (typically PLP cepstral coefficients) to an SI (speaker-independent) system. This mapping is trained by minimising the error at the output of the connectionist system while keeping all the other parameters fixed. A further adaptation technique presented in this paper is the Retrained Speaker-Independent (RSI) adaptation, wherein, starting from an SI system, the full connectionist component is adapted to the new speaker. Further, this paper presents the Parallel Hidden Network (PHN), wherein additional, trainable hidden units are placed in the connectionist system; these extra units connect to inputs and outputs just like ordinary hidden units. During speaker adaptation, weights connecting to/from these units are adapted while keeping all other parameters fixed. Finally, this paper presents a GAMMA approach, wherein the speaker-dependent input vectors are mapped to the SI system (as in the LIN technique) using a gamma filter.
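The principle of the LIN technique, i.e. training a linear input mapping while the speaker-independent network stays fixed, can be illustrated with the following sketch. It is not the implementation of the cited paper. All names are hypothetical, the speaker-independent network is reduced to a single stage of sigmoid neurons with zero bias, the LIN is assumed to start as an identity mapping, and the error is assumed to be a squared error minimised by plain gradient descent.

```python
import math


def sigmoid(a):
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def identity(n):
    """Assumed LIN initialisation: the input vectors pass through unchanged."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def si_forward(si_weights, z):
    """Frozen speaker-independent network (assumed here: one sigmoid stage)."""
    return [sigmoid(a) for a in matvec(si_weights, z)]


def loss(lin, si_weights, x, target):
    """Assumed squared error at the output of the combined LIN + SI system."""
    y = si_forward(si_weights, matvec(lin, x))
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, target))


def lin_step(lin, si_weights, x, target, rate=0.5):
    """One gradient-descent step that updates only the LIN weights;
    si_weights is read but never modified."""
    z = matvec(lin, x)
    y = si_forward(si_weights, z)
    # Error signal at the SI outputs (derivative of the squared error through the sigmoid).
    delta = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
    # Back-propagate through the frozen SI weights to the LIN outputs.
    grad_z = [sum(si_weights[k][j] * delta[k] for k in range(len(delta)))
              for j in range(len(z))]
    return [[lin[j][i] - rate * grad_z[j] * x[i] for i in range(len(x))]
            for j in range(len(z))]


# Toy data, chosen only for illustration.
si_weights = [[0.5, -0.3], [-0.2, 0.4]]
x, target = [1.0, 0.5], [1.0, 0.0]
lin = identity(2)
for _ in range(50):
    lin = lin_step(lin, si_weights, x, target)
```

Because only `lin` is reassigned in the loop, the speaker-independent weights remain exactly as they were before adaptation, which is the property that distinguishes this scheme from retraining the full network.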
J. Neto et al. “An incremental speaker-adaptation technique for hybrid HMM-MLP recognizer”, Proc. of Intl. Conf. on Spoken Language Processing (ICSLP) 1996, Philadelphia, 1289-1292, describes a speaker-adaptation technique applied to a hybrid HMM-MLP system, based on an architecture that employs a trainable LIN to map the speaker-specific input feature vectors to the SI system.
S. Waterhouse et al. “Smoothed local adaptation of connectionist systems”, Proc. of Intl. Conf. on Spoken Language Processing (ICSLP) 1996, Philadelphia, describes a technique by which the transform may be locally linear over different regions of the input space. The local linear transforms are combined by an additional network using a non-linear transform.
V. Abrash, “Mixture input transformations for adaptation of hybrid connectionist speech recognizers”, Eurospeech 1997, Rhodes (Greece), describes an algorithm to train mixtures of transformation networks (MTN) in the hybrid connectionist recognition framework. This approach is based on the idea of partitioning the acoustic feature space into R regions and training an input transformation for each region.