1. Field of the Invention
The present invention relates to a method for the encoding of speech at very low bit rates and to an associated system. It can be applied especially to systems of speech encoding/decoding by the indexing of variably sized units.
The speech encoding method implemented at low bit rates, for example at a bit rate of about 2400 bits/s, is generally that of the vocoder using a wholly parametrical model of speech signals. The parameters used relate to voicing which describes the periodic or random character of the signal, the fundamental frequency or “pitch” of the voiced sounds, the temporal evolution of the energy values as well as the spectral envelope of the signal generally modelled by an LPC (linear predictive coding) filter.
These different parameters are estimated periodically on the speech signal, typically every 10 to 30 ms. They are prepared in an analysis device and are generally transmitted remotely towards a synthesizing device that reproduces the speech signal from the quantified value of the parameters of the model.
2. Description of the Prior Art
Hitherto, the lowest standardized bit rate for a speech encoder using this technique has been 800 bits/s. This encoder, standardized in 1994, is described in the NATO STANAG 4479 standard and in an article by B. Mouy, P. De La Noue and G. Goudezeune, “NATO STANAG 4479: A Standard for an 800 bps Vocoder and Channel Coding in HF-ECCM system”, IEEE Int. Conf. on ASSP, Detroit, pp. 480–483, May 1995. It relies on an LPC 10 type technique of frame-by-frame (22.5 ms) analysis and makes maximum use of the temporal redundancy of the speech signal by grouping the frames in sets of three before encoding the parameters.
Although it is intelligible, the speech reproduced by these encoding techniques is of fairly poor quality and is not acceptable once the bit rate goes below 600 bits/s.
One way to reduce the bit rate is to use phonetic type segmental vocoders with variable-time segments that combine the principles of speech recognition and synthesis.
The encoding method essentially uses a system of automatic recognition of speech in continuous flows. This system segments and “labels” the speech signal according to a number of variably-sized speech units. These phonetic units are encoded by indexing in a small dictionary. The decoding relies on the principle of speech synthesis by concatenation on the basis of the index of the phonetic units and on the basis of the prosody. The term “prosody” encompasses mainly the following parameters: the energy of the signal, the pitch, a piece of voicing information and, as the case may be, the temporal rhythm.
However, the development of phonetic encoders requires substantial knowledge of phonetics and linguistics as well as a phase of phonetic transcription of a learning database that is costly and may be a source of error. Furthermore, phonetic encoders have difficulty in adapting to a new language or a new speaker.
Another technique described for example in the thesis by J. Cernocky, “Speech Processing Using Automatically Derived Segmental Units: Applications to Very Low Rate Coding and Speaker Verification”, University of Paris XI Orsay, December 1998, gets around the problems related to the phonetic transcription of the learning database by determining the speech units automatically and independently of language.
The working of this type of decoder can be subdivided chiefly into two steps: a learning step and an encoding/decoding step described in FIG. 1.
During the learning step (FIG. 1), an automatic procedure, for example after a parametrical analysis 1 and a segmentation step 2, determines a set of 64 classes of acoustic units designated “AU”. With each of these classes of acoustic units, there is associated a statistical model 3, which is a model of the Markov (or HMM, namely Hidden Markov Model) type, as well as a small number of units representing a class known as “representatives” 4. In the present system, the representatives are simply the eight longest units belonging to one and the same acoustic class. They may also be determined as being the N most representative units of the acoustic unit. During the encoding of a speech signal after a step of parametrical analysis 5 used to obtain especially the spectral parameters, the energy values, the pitch, a recognition procedure (6, 7) using a Viterbi algorithm determines the succession of acoustic units of the speech signal and identifies the “best representative” to be used for the speech synthesis. This choice is done for example by using a spectral distance criterion such as the DTW (dynamic time warping) algorithm.
The number of the acoustic class, the index of this representative unit, the length of the segment, the contents of the DTW and the prosody information derived from the parametrical analysis are transmitted to the decoder. The speech synthesis is done by concatenation of the best representatives, possibly by using an LPC type parametrical synthesizer.
To concatenate the representatives during the speech decoding, one method used is, for example, a method of parametrical speech analysis/synthesis. This parametrical method enables especially modifications of prosody such as temporal evolution, the fundamental frequency or pitch as compared with a simple concatenation of waveforms.
The parametrical speech model used by the method of analysis/synthesis may be a voiced/non-voiced binary excitation of the LPC 10 type as described in the document by T. Tremain, “The Government Standard Linear Predictive Coding Algorithm: LPC-10”, published in the journal Speech Technology, Vol. 1, No. 2, pp. 40–49.
This technique encodes the spectral envelope of the signal in 185 bits/s approximately for a monospeaker system, for an average of about 21 segments per second.
Hereinafter in the description, the following terms have the following meanings:                the term “representative” corresponds to one of the segments of the learning base which has been judged to be representative of one of the classes of acoustic units,        the expression “recognized segment” corresponds to a speech segment that has been identified as belonging to one of the acoustic classes, by the encoder,        the expression “best representative” designates the representative determined at the encoding that best represents the recognized segment.        