1. Field of the Invention
The invention relates to a method for the selection of synthesis units.
It relates, for example, to a method for the selection and encoding of synthesis units for a speech encoder working at very low bit rates, for example at less than 600 bits/sec.
2. Description of the Prior Art
Techniques for the indexing of natural speech units have recently enabled the development of particularly efficient text-to-speech synthesis systems. These techniques are now being studied in the context of speech encoding at very low bit rates, in conjunction with algorithms taken from the field of voice recognition (Refs. [1-5]). The main idea is to identify, in the speech signal to be encoded, a segmentation into elementary units that is almost optimal. These units may be obtained from a phonetic transcription, which has the drawback of having to be corrected manually for an optimum result, or obtained automatically according to criteria of spectral stability. On the basis of this type of segmentation, a search is made, for each of the segments, for the nearest synthesis unit in a dictionary that is obtained during a preliminary learning phase and contains reference synthesis units.
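The dictionary search described above can be illustrated by the following minimal sketch. It is given by way of example only and rests on assumptions that are not part of the prior art described: each segment and each dictionary unit is represented as a sequence of spectral vectors already normalized to the same number of frames, and the distance is a mean Euclidean distance. The names trajectory_distance and nearest_unit are hypothetical.

```python
from math import sqrt


def trajectory_distance(a, b):
    """Mean Euclidean distance between two equal-length sequences of
    spectral vectors (assumed distance measure, for illustration only)."""
    if not a or len(a) != len(b):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = 0.0
    for va, vb in zip(a, b):
        total += sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total / len(a)


def nearest_unit(segment, dictionary):
    """Return (index, distance) of the dictionary unit closest to the segment."""
    best_index, best_dist = -1, float("inf")
    for i, unit in enumerate(dictionary):
        d = trajectory_distance(segment, unit)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist


# Hypothetical two-unit dictionary, two frames per unit, two coefficients per frame.
dictionary = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[2.0, 2.0], [3.0, 3.0]],
]
segment = [[1.9, 2.1], [3.0, 2.8]]
index, dist = nearest_unit(segment, dictionary)
```

In this sketch only the index of the selected unit would need to be transmitted, since the decoder is assumed to hold the same dictionary.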
The encoding scheme used consists in modeling the acoustic space of the speaker (or speakers) by hidden Markov models (HMMs). These models, which may be speaker-dependent or speaker-independent, are obtained in a preliminary learning phase using algorithms identical to those implemented in speech recognition systems. The essential difference is that the models are learned on vectors grouped into classes automatically, rather than in a supervised manner on the basis of a phonetic transcription. The learning procedure then consists in automatically obtaining the segmentation of the learning signals (for example by using the method known as temporal decomposition) and grouping the segments obtained into a finite number of classes corresponding to the number of HMMs to be built. The number of models is directly related to the resolution sought to represent the acoustic space of the speaker or speakers. Once obtained, these models are used to segment the signal to be encoded by means of a Viterbi algorithm. The segmentation associates each segment with its class index and its length. Since this information is not sufficient to model the spectral information, a spectral path is selected, for each of the classes, from among several units known as synthesis units. These units are extracted from the learning base during its segmentation using the HMMs. The context can be taken into account, for example by using several sub-classes that reflect the transitions from one class to another. A first index indicates the class to which the segment considered belongs; a second index specifies the sub-class to which it belongs, defined as the class index of the previous segment. The sub-class index therefore does not have to be transmitted, but the class index must be memorized for the next segment.
The sub-classes thus defined make it possible to take account of the different transitions towards the class associated with the considered segment. Information on prosody, namely the value of the pitch and energy parameters and their evolution over time, is added to the spectral information.
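The sub-class mechanism can be illustrated by the following sketch, given by way of example only. It shows why the sub-class index need not be transmitted: the encoder and the decoder can each derive it from the class index memorized for the previous segment. The function name add_subclasses and the convention used for the first segment (initial_previous) are assumptions made for illustration.

```python
def add_subclasses(class_indices, initial_previous=0):
    """Pair each class index with its sub-class index, taken to be the
    class index of the previous segment.

    initial_previous is a hypothetical convention for the first segment,
    which has no predecessor.
    """
    previous = initial_previous
    pairs = []
    for class_index in class_indices:
        pairs.append((class_index, previous))
        previous = class_index  # memorized for the next segment
    return pairs


# Only the class indices are transmitted; the decoder rebuilds the pairs.
transmitted = [3, 1, 1, 4]
decoded_pairs = add_subclasses(transmitted)
# decoded_pairs == [(3, 0), (1, 3), (1, 1), (4, 1)]
```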
In order to obtain an encoder working at very low bit rates, it is necessary to optimize the allocation of the bits, and hence of the bit rate, between the parameters associated with the spectral envelope and the information on prosody. The classic method consists first in selecting the unit that is nearest from a spectral viewpoint and then, once that unit is selected, in encoding the prosody information independently of the selected unit.
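The bit allocation problem can be made concrete with the following arithmetic sketch, given by way of example only. Every numerical value in it (segment rate, number of classes, number of segment lengths, number of units per sub-class, prosody bits per segment) is a hypothetical figure chosen for illustration and is not taken from any particular encoder. The function names bits_for and bit_rate are likewise hypothetical.

```python
def bits_for(n_values):
    """Number of bits of a fixed-length code able to index n_values items."""
    if n_values < 1:
        raise ValueError("n_values must be at least 1")
    return (n_values - 1).bit_length()


def bit_rate(segments_per_second, n_classes, n_lengths,
             n_units_per_subclass, prosody_bits):
    """Bit rate in bits/sec when each segment carries a class index, a
    length, a synthesis unit index and independently encoded prosody."""
    spectral_bits = (bits_for(n_classes)
                     + bits_for(n_lengths)
                     + bits_for(n_units_per_subclass))
    return segments_per_second * (spectral_bits + prosody_bits)


# Hypothetical figures: 20 segments/sec, 64 classes, 32 possible lengths,
# 8 synthesis units per sub-class, 12 prosody bits per segment.
rate = bit_rate(20, 64, 32, 8, 12)
# (6 + 5 + 3 + 12) bits per segment * 20 segments/sec = 520 bits/sec
```

With these assumed figures, the prosody takes 12 of the 26 bits per segment, which illustrates why the split between spectral and prosody information matters for a target below 600 bits/sec.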