1. Field of the Invention
This invention relates to a system of speech recognition. More specifically, this invention relates to a logic unit and a system of speech recognition with the aid of a neural network. This invention also relates, however, to a new neural network for applications other than speech recognition.
2. Discussion of the Background
Methods of performing speech recognition are of crucial importance in particular for the development of new telecommunications services. The qualities required of a speech recognition system are in particular the following:
Precision: systems that correctly recognize less than a very high percentage of words, for example less than 85 percent, have few practical applications.
Insensitivity to noise: the systems must allow satisfactory recognition even in a noisy environment, for example when the communications are transmitted through a mobile telephone network.
Large vocabulary: many applications need to recognize a large number of different words, for example more than 5000.
Independence of speaker: many applications require satisfactory recognition regardless of who the speaker is, including speakers unknown to the system.
The known systems of speech recognition generally carry out two distinct tasks. A first task consists in converting the voice into a digital signal and extracting a sequence of vectors of voice parameters from this digital signal. Different systems are known for executing this task; they generally convert each frame of voice, of 10 milliseconds for example, into a vector (“feature vector”) containing a group of parameters that best describe this frame in the time and frequency domains.
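By way of illustration only, the first task might be sketched as follows. The 8000 Hz sampling rate, the non-overlapping frames, and the two parameters chosen (log energy and zero-crossing rate) are assumptions made for this sketch; the text above does not specify which parameters a feature vector contains.

```python
import math

def frame_signal(samples, sample_rate=8000, frame_ms=10):
    """Split a list of samples into non-overlapping frames of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def feature_vector(frame):
    """Return [log energy, zero-crossing rate] for one frame (illustrative parameters)."""
    energy = sum(s * s for s in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / max(len(frame) - 1, 1)
    return [log_energy, zcr]

def extract_features(samples, sample_rate=8000, frame_ms=10):
    """Convert a digitized signal into a sequence of feature vectors, one per frame."""
    return [feature_vector(f) for f in frame_signal(samples, sample_rate, frame_ms)]
```

Under these assumptions, one second of signal yields 100 feature vectors, one per 10 millisecond frame.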
The second task consists in classifying the sequence of vectors received by means of a classifier and establishing to which class, among the classes defined during a learning phase of the system, they correspond with the greatest probability. The classes correspond, for example, to phonological elements such as phonemes, words or sentences. The problem for classifiers is thus to determine, for each input speech vector, the probability of belonging to each defined class.
The speech recognition systems most widely used at the present time use a classifier based on hidden Markov models (HMM), illustrated in FIG. 1a. This statistical method describes the voice through a sequence of Markov states 81, 82, 83. The different states are connected by links 91-96 indicating the probabilities of transition from one state to another. Each state emits a voice vector, with a given probability distribution. A sequence of states, defined a priori, represents a predefined phonological unit, for example a phoneme or a triphone. A description of this method is given, for example, by Steve Young in an article entitled “A Review of Large-Vocabulary Continuous-Speech Recognition,” published in September 1996 in the IEEE Signal Processing Magazine. In spite of a very poor modelling of time relations between successive speech vectors, this method currently offers the best rates of recognition.
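As a hedged illustration of how such a model scores a sequence, the sketch below computes the likelihood of an observation sequence with the standard forward algorithm. It is a simplification: observations are discrete symbols rather than continuous voice vectors, and the three-state left-to-right model and all probabilities are invented for the example.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Probability that the HMM emits `observations` (forward algorithm).

    initial[s]       : probability of starting in state s
    transition[p][s] : probability of moving from state p to state s
    emission[s][o]   : probability that state s emits symbol o
    """
    n_states = len(initial)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * emission[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)

# Hypothetical three-state left-to-right model over two discrete symbols.
initial = [1.0, 0.0, 0.0]
transition = [[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]]
emission = [[0.9, 0.1],
            [0.2, 0.8],
            [0.5, 0.5]]
likelihood = forward_likelihood(initial, transition, emission, [0, 0, 1, 1])
```

In a recognizer, one such model would be evaluated per phonological unit and the unit with the highest likelihood retained.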
Other systems of classification, which have achieved a certain success, use networks of artificial neurons, such as illustrated in FIG. 1b, in particular time delay neural networks (TDNN) or recurrent neural networks (RNN). Examples of such systems are described in particular by J. Ghosh et al. in “Classification of Spatiotemporal Patterns with Applications to Recognition of Sonar Sequences” in Neural Representation of Temporal Patterns, pages 227 to 249, edited by E. Covey et al., Plenum Press, New York, 1995. All these systems use a delay line comprising registers 25 for the input speech vectors 2 as well as delay elements 26 in their architecture. Computing elements 11 (neurons), interconnected (by means of synapses) with the registers 25 and organized in a hierarchical manner, allow particular phonological elements to be identified. These systems also make it possible to model the time relation between past information and current information, and to correct certain weaknesses of HMMs, without, however, succeeding in replacing them completely.
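The delay line arrangement described above can be sketched, under stated assumptions, as a single computing element reading a bank of registers. The class name, the sigmoid activation, and the weight layout are hypothetical choices made for this sketch and are not taken from the cited systems.

```python
from collections import deque
import math

class DelayLineNeuron:
    """One computing element connected to a tapped delay line of input vectors."""

    def __init__(self, weights, bias=0.0):
        # weights[k][j] weights component j of the vector delayed by k frames
        self.weights = weights
        self.bias = bias
        dim = len(weights[0])
        # One register per tap, initially holding zero vectors
        self.registers = deque([[0.0] * dim for _ in weights], maxlen=len(weights))

    def step(self, vector):
        """Shift a new speech vector into the delay line and return the activation."""
        self.registers.appendleft(list(vector))  # the oldest vector drops out
        total = self.bias + sum(
            w * x
            for w_row, reg in zip(self.weights, self.registers)
            for w, x in zip(w_row, reg)
        )
        return 1.0 / (1.0 + math.exp(-total))  # sigmoid activation (assumption)
```

Because the neuron sees the current vector together with its delayed copies, its output depends on the time relation between successive vectors, which is what the delay line is intended to capture.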
A more recent approach consists in combining the HMMs with neural networks in hybrid speech recognition systems. Such systems are described, for example, by H. Bourlard et al. in “Connectionist Speech Recognition: A Hybrid Approach,” 1994, Kluwer Academic Publishers (NL). These systems have the advantage of a better modelling of context and of phonemes than the HMMs. The price to pay for these systems, however, is either a long training time, due to the error back propagation (EBP) algorithm used, or a limited number of weighting coefficients available for modelling the speech signals.
Another approach is disclosed in U.S. Pat. No. 5,220,640 of Jun. 15, 1993. This document describes a neural network architecture in which the input signal is scaled differently by a “time-scaling network.” The output signals indicate how the input signals, once changed in scale, correspond to learned patterns.
These different systems generally model each word as a sequence of phones, and are optimized to identify each phone in a speech signal as precisely as possible. A correct identification of each phone ensures in principle a perfect recognition of words or of sentences, insofar as these words and these sentences are correctly modelled. In practice, all these systems have the drawback of a lack of robustness in noisy conditions or of results of variable quality, as indicated in particular by S. Greenberg in “On the origins of speech intelligibility in the real world,” ESCA-NATO Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, 17-18 April 1997, Pont-à-Mousson, France, and by Steve Young in the article indicated further above.
One object of the present invention is thus to propose a system and a method of speech recognition that avoids the drawbacks of prior art systems and methods. More specifically, an object of the present invention is to propose a classifier and a method of classifying speech vectors, improved over prior art classifiers and classification methods.
Another object of the present invention is to improve the performance of a classifier without adding substantially to the complexity, in particular without adding substantially to the number of computing elements.
According to the invention, these various objects are achieved thanks to the features of the independent claims, preferred variants being indicated moreover in the dependent claims.
The invention begins with the observation that speech is more than a linear succession of phones of equal importance for recognition. Experience has shown that even experienced listeners struggle to identify more than 60% of phones presented in isolation; only the context permits the human brain to comprehend sentences and to identify, a posteriori, each phone.
The invention puts this discovery to use by suggesting, for the recognition of speech, integration of features of speech segments much longer than those used in the prior art, for example features of several syllables, of a whole word, of several words or even of an entire sentence.
To avoid adding to the complexity of the system and the number of computing elements, a hierarchical architecture is proposed, with a system of several tiers. Each tier comprises at least one spatiotemporal neural network (STNN). The rate of signals input in the different tiers of the system is variable, in such a manner that the rate of speech vectors input in the lower tier is adapted, for example, to the recognition of isolated phones or other brief phonological elements, whereas the rate of signals applied to the upper tiers permits, for example, recognition of longer phonological elements such as syllables, triphones, words or even entire sentences. Decimators are provided in at least one tier to reduce the rate of signals applied to the upper tiers. Conversely, interpolators are provided to increase the rate of target signals given to the system during the learning phase.
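The rate changes between tiers can be illustrated with a minimal sketch. The block-averaging decimator and the sample-repeating interpolator below are assumptions chosen for simplicity; the text above does not specify which filters the decimators and interpolators use.

```python
def decimate(vectors, factor):
    """Reduce the rate by `factor`: average each block of `factor` consecutive vectors."""
    out = []
    for start in range(0, len(vectors) - factor + 1, factor):
        block = vectors[start:start + factor]
        out.append([sum(column) / factor for column in zip(*block)])
    return out

def interpolate(vectors, factor):
    """Increase the rate by `factor`: repeat each vector `factor` times."""
    return [list(v) for v in vectors for _ in range(factor)]

# One second of speech vectors at a 10 ms frame rate feeds the lower tier;
# each decimation stage lowers the rate seen by the next tier up.
lower_tier_input = [[float(i)] for i in range(100)]
middle_tier_input = decimate(lower_tier_input, 2)   # 50 vectors per second
upper_tier_input = decimate(middle_tier_input, 2)   # 25 vectors per second
```

In this sketch, a target signal defined at the upper-tier rate could be brought back to the lower-tier rate during learning by applying `interpolate` with the product of the decimation factors.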
The invention also proposes an architecture for multirate neural networks, using decimators and interpolators in their architecture. The invention makes it possible to achieve, with a limited number of computing elements (neurons) and synapses, a neural network whose output is dependent upon a large number of speech vectors and/or whose learning capacity is increased.
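A small calculation may help show why decimation lets the output depend on many speech vectors without adding neurons. The sketch assumes each tier reads a delay line with a fixed number of taps and that each decimator simply keeps one vector in every `factor`; the tap counts and factors are hypothetical.

```python
def receptive_field(taps_per_tier, decimation_factors):
    """Number of input speech vectors that influence one output of the top tier.

    taps_per_tier[i]      : delay line length (in vectors) of tier i
    decimation_factors[i] : rate reduction applied between tier i and tier i + 1
    """
    span = 1     # input vectors covered so far
    stride = 1   # spacing, in input vectors, between successive taps of the tier
    for tier, taps in enumerate(taps_per_tier):
        span += (taps - 1) * stride
        if tier < len(decimation_factors):
            stride *= decimation_factors[tier]
    return span

with_decimation = receptive_field([3, 3, 3], [2, 2])
without_decimation = receptive_field([3, 3, 3], [1, 1])
```

With the same three taps per tier, the example covers 15 input vectors with decimation by 2 between tiers, against 7 without decimation.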
The invention moreover permits weighting of the importance of different segments of speech (frames), and classifying each speech vector with respect to a large number of prior vectors.