It must be emphasised that both the state of the art represented in the following, and the present invention relate to the entire field of the synthesis of acoustical data by means of the concatenation of individual audio segments which are obtained in any manner. However, for the sake of simplifying the discussion of the state of the art as well as the description of the present invention, the following explanations refer specifically to synthesised voice data by means of the concatenation of individual voice segments.
In recent years, the data-based approach has prevailed over the rule-based approach in the field of speech synthesis, and can be found in various methods and systems for speech synthesis. Although the rule-based approach in principle enables a better speech synthesis, its implementation requires the entire knowledge needed for speech generation to be stated explicitly, i.e. the speech to be synthesised must be formally modelled. Because the known speech models are simplifications of the speech to be synthesised, the voice quality of the speech generated in this manner is not sufficient.
For this reason, data-based speech synthesis is carried out to an increasing extent, wherein corresponding segments are selected from a database containing individual voice segments and linked (concatenated) to each other. In this context, the voice quality depends primarily on the number and type of the available voice segments, because only speech which is reproduced by voice segments in the database can be synthesised. In order to minimise the number of voice segments to be provided and nevertheless still generate high-quality synthesised speech, various methods are known which carry out a linking (concatenation) of the voice segments according to complex rules.
When using such methods or corresponding devices, an inventory, i.e. a database comprising the voice audio segments, can be employed which is complete and manageable. An inventory is complete if it is capable of generating any sound sequence of the speech to be synthesised, and it is manageable if the number and type of the data of the inventory can be processed in a desired manner with the technical means available. Furthermore, such a method must ensure that the concatenation of the individual inventory elements generates synthesised speech which differs as little as possible from naturally spoken speech. To this end, synthesised speech must be fluent and comprise the same articulatory effects as natural speech. In this context, the so-called co-articulatory effects, i.e. the mutual influence of phones, are of particular importance. For this reason, the inventory elements should be of such a nature that they take into account the co-articulation of individual successive phones. In addition, a method for the concatenation of the inventory elements should link the elements, even beyond word and phrase boundaries, taking into account the co-articulation of individual successive phones as well as the higher-order co-articulation of several successive phones.
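The notion of completeness can be illustrated on the purely symbolic level. The following is a minimal sketch, not part of any cited method: it assumes that inventory elements are labelled by tuples of phone symbols and checks whether one given phone sequence can be written as a concatenation of such elements. The function name `can_synthesise` and the toy inventory are hypothetical, and co-articulation is deliberately ignored.

```python
def can_synthesise(target, inventory):
    """Return True if `target` (a sequence of phone labels) can be written
    as a concatenation of inventory elements (tuples of phone labels).

    Assumption: only the symbolic labels are compared; acoustic quality
    and co-articulation are not considered in this sketch.
    """
    n = len(target)
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix needs no inventory element
    for i in range(n):
        if not reachable[i]:
            continue
        for element in inventory:
            j = i + len(element)
            if j <= n and tuple(target[i:j]) == tuple(element):
                reachable[j] = True  # prefix of length j can be covered
    return reachable[n]


# Hypothetical toy inventory: two single phones and one polyphone.
inventory = {("h",), ("a",), ("l", "o")}
print(can_synthesise(["h", "a", "l", "o"], inventory))  # True
print(can_synthesise(["h", "o"], inventory))            # False
```

An inventory would be complete in the above sense only if this check succeeded for every sound sequence of the speech to be synthesised; the sketch tests a single sequence.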
Before presenting the state of the art, a few terms from the field of speech synthesis, which are necessary for a better understanding, will be explained in the following:
A phone is a class of any sound events (noises, sounds, tones, etc.). The sound events are classified in accordance with a classification scheme into phone classes. A sound event belongs to a phone if the values of the sound event are within the range of values defined for the phone with respect to the parameters (e.g. spectrum, tone level, volume, chest or head voice, co-articulation, resonance cavities, emotion, etc.) used for the classification.
The classification scheme for phones depends on the type of application. For vocal sounds (=phones), the IPA classification is generally used. However, the definition of the term phone as used herein is not limited to this, but any other parameters can be used. If, for example, in addition to the IPA classification, the tone level or the emotional expression are included as parameters in the classification, two ‘a’ phones with different tone level or different emotional expression become different phones in the sense of the definition. Phones can, however, also be the tones of a musical instrument, e.g. a violin, in the different tone levels and the different modes of playing (up-bow and down-bow, detaché, spiccato, marcato, pizzicato, col legno, etc.). Phones can be the barking of dogs or the squealing of a car door.
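The range-based classification described above can be sketched as follows. This is an illustration under stated assumptions only: the parameter name `tone_level_hz`, the numeric ranges, and the class labels are hypothetical and are not taken from the IPA classification or from any cited document.

```python
def classify(event, phone_classes):
    """Return the name of the first phone class whose parameter ranges
    all contain the corresponding values of `event`, or None.

    `event` maps parameter names to measured values; `phone_classes`
    maps a class name to {parameter: (lower, upper)} ranges.
    """
    for name, ranges in phone_classes.items():
        if all(
            param in event and lower <= event[param] <= upper
            for param, (lower, upper) in ranges.items()
        ):
            return name
    return None


# Hypothetical scheme: including the tone level as a parameter turns
# two 'a' sounds of different tone level into two different phones.
phone_classes = {
    "a_low": {"tone_level_hz": (80, 160)},
    "a_high": {"tone_level_hz": (161, 320)},
}
print(classify({"tone_level_hz": 120}, phone_classes))  # a_low
print(classify({"tone_level_hz": 250}, phone_classes))  # a_high
```

Adding or removing parameters in `phone_classes` changes which sound events count as the same phone, which is the application dependence mentioned above.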
Phones can be reproduced by audio segments which contain corresponding acoustical data.
In the description of the invention following the definitions, the term vocal sound can invariably be replaced by the term phone in the sense of the previous definition, and the term phoneme can be replaced by the term phonetic character. (This also applies the other way round, because phones are vocal sounds classified according to the IPA classification).
A static phone has bands which are similar to previous or subsequent bands of the static phone. The similarity need not necessarily be an exact correspondence as in the periods of a sinusoidal tone, but is analogous to the similarity as it prevails between the bands of the static phones defined in the following.
A dynamic phone has no bands which are similar to previous or subsequent bands of the dynamic phone, such as, e.g., the sound event of an explosion or a dynamic vocal sound.
A phone is a vocal sound which is generated by the organs of speech (a vocal sound). The phones are classified into static and dynamic phones.
The static phones include vowels, diphthongs, nasals, laterals, vibrants, and fricatives.
The dynamic phones include plosives, affricates, glottal stops, and click sounds.
A phoneme is the formal description of a phone, with the formal description usually being effected by phonetic characters.
Co-articulation refers to the phenomenon that a sound, and thus also a phone, is influenced by upstream or downstream sounds or phones, respectively, with the co-articulation occurring not only between immediately neighbouring sounds/phones but also across a sequence of several sounds/phones (for example in rounding the lips).
A sound or phone, respectively, can therefore be classified into three bands (see also FIG. 1b):
The initial co-articulation band comprises the band from the start of a sound/phone to the end of the co-articulation due to an upstream sound/phone.
The solo articulation band is the band of the sound/phone which is not influenced by an upstream or downstream sound or an upstream or downstream phone, respectively.
The end co-articulation band comprises the band from the start of the co-articulation due to a downstream sound/phone to the end of the sound/phone.
The co-articulation band comprises an end co-articulation band and the neighbouring initial co-articulation band of the neighbouring sound/phone.
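The band definitions above can be expressed as a small data structure. The following is a minimal sketch which assumes that each phone is described by four boundary times on a common time axis (here in milliseconds); the class and attribute names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhoneBands:
    """Boundary times of one phone (assumed unit: milliseconds)."""
    start: int        # start of the phone
    initial_end: int  # end of co-articulation due to the upstream phone
    end_start: int    # start of co-articulation due to the downstream phone
    end: int          # end of the phone

    def __post_init__(self):
        if not (self.start <= self.initial_end <= self.end_start <= self.end):
            raise ValueError("band boundaries must be in ascending order")

    @property
    def initial_coarticulation(self):
        return (self.start, self.initial_end)

    @property
    def solo_articulation(self):
        return (self.initial_end, self.end_start)

    @property
    def end_coarticulation(self):
        return (self.end_start, self.end)


def coarticulation_band(upstream, downstream):
    """End co-articulation band of the upstream phone together with the
    neighbouring initial co-articulation band of the downstream phone."""
    return (upstream.end_start, downstream.initial_end)


first = PhoneBands(0, 20, 80, 100)
second = PhoneBands(100, 130, 170, 200)
print(first.solo_articulation)             # (20, 80)
print(coarticulation_band(first, second))  # (80, 130)
```

The sketch assumes that the two phones are directly adjacent, so that the end of the first coincides with the start of the second.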
A polyphone is a sequence of phones.
The elements of an inventory are audio segments stored in a coded form which reproduce sounds, portions of sounds, sequences of sounds, or portions of sequences of sounds, or phones, portions of phones, polyphones, or portions of polyphones, respectively. For a better understanding of the potential structure of an audio segment/inventory element, reference is made to FIG. 2a which shows a conventional audio segment, and FIGS. 2b–2l which show inventive audio segments. In addition, it should be mentioned that audio segments can be formed from smaller or larger audio segments which are included in the inventory or a database. Furthermore, audio segments can also be provided in a transformed form (e.g. in a Fourier-transformed form) in the inventory or the database. Audio segments for the present invention can also come from a prior synthesis step (which is not part of the method). Audio segments include at least a part of an initial co-articulation band, a solo articulation band, and/or an end co-articulation band. In lieu of audio segments, it is also possible to use bands of audio segments.
The term concatenation implies the joining of two audio segments.
The concatenation instance is the point of time at which two audio segments are joined.
The concatenation can be effected in various ways, e.g. with a cross fade or a hard fade (see also FIGS. 3a–3e):
In a cross fade, a downstream band of a first audio segment band and an upstream band of a second audio segment band are processed by means of suitable transfer functions, and subsequently these two bands are overlappingly added in such a manner that at most the shorter of the two bands with respect to time is completely overlapped by the longer one.
In a hard fade, a later band of a first audio segment and an earlier band of a second audio segment are processed by means of suitable transfer functions, with the two audio segments being joined to one another in such a manner that the later band of the first audio segment and the earlier band of the second audio segment do not overlap.
The co-articulation band is primarily noticeable in that a concatenation therein is associated with discontinuities (e.g. spectral skips).
In addition, it should be noted that, strictly speaking, a hard fade is a boundary case of a cross fade, in which the overlap of a later band of a first audio segment and an earlier band of a second audio segment has a length of zero. This makes it possible to replace a cross fade with a hard fade in certain applications, e.g. extremely time-critical ones. Such an approach must be considered carefully, however, because it results in considerable quality losses in the concatenation of audio segments which actually are to be concatenated by a cross fade.
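The relationship between the two kinds of concatenation can be sketched on plain lists of samples. The following is a minimal sketch under stated assumptions: the transfer functions are taken to be linear fade-out and fade-in weights, which is only one possible choice, and in the hard-fade case the transfer functions are omitted entirely, so that the segments are simply joined. The function name `concatenate` is hypothetical.

```python
def concatenate(first, second, overlap):
    """Join two sample sequences.

    The last `overlap` samples of `first` and the first `overlap`
    samples of `second` are weighted (assumed: linear weights) and
    added. With overlap == 0 the result is a hard fade in the
    boundary-case sense: the two segments do not overlap.
    """
    first, second = list(first), list(second)
    if overlap < 0 or overlap > min(len(first), len(second)):
        raise ValueError("overlap must fit into both segments")
    if overlap == 0:
        return first + second  # hard fade: no overlapping band

    cut = len(first) - overlap
    mixed = []
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)  # fade-in weight, strictly in (0, 1)
        mixed.append((1.0 - w) * first[cut + k] + w * second[k])
    return first[:cut] + mixed + second[overlap:]


a = [1.0, 1.0, 1.0, 1.0]
b = [0.0, 0.0, 0.0, 0.0]
print(len(concatenate(a, b, 0)))  # 8 samples, nothing overlaps
print(len(concatenate(a, b, 2)))  # 6 samples, two are overlap-added
```

Because the overlap is a single parameter, the hard fade falls out of the same routine as the cross fade, which mirrors the boundary-case remark above.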
The term prosody refers to changes in the voice frequency and the voice rhythm which occur in spoken words or phrases, respectively. The consideration of such prosodic information is necessary in the speech synthesis in order to generate a natural word or phrase melody, respectively.
From WO 95/30193 a method and a device are known for the conversion of text to audible voice signals utilising a neural network. For this purpose, the text to be converted to speech is converted to a sequence of phonemes by means of a converter unit, with information on the syntactic boundaries of the text and the stress of the individual components of the text being additionally generated. This information, together with the phonemes, is transferred to a device which determines the duration of the pronunciation of the individual phonemes in a rule-based manner. A processor generates a suitable input for the neural network from each individual phoneme in connection with the corresponding syntactic and time-related information, with said input for the neural network also comprising the corresponding prosodic information for the entire phoneme sequence. From the available audio segments the neural network then selects only those segments which best reproduce the input phonemes and links said audio segments accordingly. In this linking operation the individual audio segments are matched with respect to their duration, total amplitude, and frequency to upstream and downstream audio segments, taking into account the prosodic information of the speech to be synthesised, and are connected with each other successively in time. A modification of individual bands of the audio segments is not described therein.
For the generation of the audio segments which are required for this method, the neural network first has to be trained by dividing naturally spoken speech into phones or phone sequences and assigning to these phones or phone sequences corresponding phonemes or phoneme sequences in the form of audio segments. Because this method provides for a modification of individual audio segments only, but not for a modification of individual bands of an audio segment, the neural network must be trained with as many different phones or phone sequences as possible in order to convert any text to synthesised speech with a natural sound. Depending on the application, this may prove to require very high expenditures. On the other hand, an insufficient training process of the neural network may have a negative influence on the quality of the speech to be synthesised. Moreover, it is not possible with the method described therein to determine the concatenation instance of the individual audio segments depending on upstream or downstream audio segments, in order to perform a co-articulation-specific concatenation.
U.S. Pat. No. 5,524,172 describes a device for the generation of synthesised speech, which utilises the so-called diphone method. Here, a text which is to be converted to synthesised speech is divided into phoneme sequences, with corresponding prosodic information being assigned to each phoneme sequence. From a database which contains audio segments in the form of diphones, for each phoneme of the sequence two diphones reproducing the phoneme are selected and concatenated taking into account the corresponding prosodic information. In the concatenation the two diphones are each weighted by means of a suitable filter, and the duration and tone level of both diphones are modified in such a manner that upon the linking of the diphones a synthesised phone sequence is generated whose duration and tone level correspond to the duration and tone level of the desired phoneme sequence. In the concatenation the individual diphones are added in such a manner that a later band of a first diphone and an earlier band of a second diphone overlap, with the instance of concatenation being generally in the area of stationary bands of the individual diphones (see FIG. 2a). Because a variation of the instance of concatenation taking into account the co-articulation of successive audio segments (diphones) is not intended, the quality (naturalness and audibility) of speech synthesised in such a manner can be negatively influenced.
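The selection of two diphones per phoneme can be illustrated generically. The following sketch is not the implementation of the cited patent; it merely assumes that a diphone is labelled by a pair of neighbouring phoneme symbols and that the sequence is padded with a hypothetical silence symbol `_` at both ends. Filtering, duration and tone-level modification are not modelled.

```python
def diphone_pairs(phonemes, silence="_"):
    """For each phoneme, return the two diphone labels covering it.

    Assumption: a diphone is named by the pair (left phoneme, right
    phoneme); `silence` is a hypothetical padding symbol.
    """
    padded = [silence] + list(phonemes) + [silence]
    # One diphone label per pair of neighbouring symbols.
    units = list(zip(padded, padded[1:]))
    # Phoneme i lies between units[i] and units[i + 1].
    return [(p, units[i], units[i + 1]) for i, p in enumerate(phonemes)]


for phoneme, left, right in diphone_pairs(["h", "a"]):
    print(phoneme, left, right)
# h ('_', 'h') ('h', 'a')
# a ('h', 'a') ('a', '_')
```

In this labelling each diphone is shared by two neighbouring phonemes, so a sequence of n phonemes requires n + 1 diphones from the database.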
A further development of the previously discussed method can be found in EP-0,813,184 A1. In this case, too, a text to be converted to synthesised speech is divided into individual phonemes or phoneme sequences, and corresponding audio segments are selected from a database and concatenated. In order to achieve an improvement of the synthesised speech, two approaches have been realised with this method, which differ from the state of the art discussed so far. With the use of a smoothing filter which accounts for the lower-frequency harmonic frequency components of an upstream and a downstream audio segment, the transition from the upstream audio segment to the downstream audio segment is to be optimised, in that a later band of the upstream audio segment and an earlier band of the downstream audio segment are tuned to each other in the frequency range. In addition, the database provides audio segments which are slightly different from one another but are suited for synthesising one and the same phoneme. In this manner, the natural variation of the speech is to be mimicked in order to achieve a higher quality of the synthesised speech. Both the use of the smoothing filter and the selection from a plurality of various audio segments for the realisation of a phoneme require a high computing power of the system components used in the implementation of this method. Moreover, the volume of the database increases due to the increased number of the provided audio segments. Furthermore, this method, too, does not provide for a co-articulation-dependent choice of the concatenation instance of individual audio segments, which may reduce the quality of the synthesised speech.
DE 693 18 209 T2 deals with formant synthesis. According to this document two multi-voice phones are connected with each other using an interpolation mechanism which is applied to a last phoneme of an upstream phone and to a first phoneme of a downstream phone, with the two phonemes of the two phones being identical and being superposed to one phoneme in the connected phones. Upon the superposition, each of the curves describing the two phonemes is weighted with a weighting function. The weighting function is applied to a band of each phoneme, which begins immediately after the start of the phoneme and ends immediately before the end of the phoneme. Thus, in the concatenation of phones described therein, the bands of the phonemes which form the transition between phones correspond essentially to the respective entire phonemes. This means that the portions of the phonemes used for concatenation invariably comprise all three bands, i.e. the respective initial co-articulation band, solo articulation band, and end co-articulation band. Consequently, this document teaches an approach for smoothing the transitions between two phones.
Moreover, according to this document the instance of the concatenation of two phones is established in such a manner that the last phoneme in the upstream phone and the first phoneme in the downstream phone completely overlap.
In principle, DE 689 15 353 T2 aims at improving the tone quality, in that an approach is specified for designing the transition between two neighbouring sampling values. This is of particular relevance in the case of low sampling rates.
In the speech synthesis described in this document, waveforms are used which reproduce the phones to be concatenated. With waveforms for upstream phones, a corresponding final sampling value and an associated zero crossing point are established, while with waveforms for downstream phones, a corresponding first upper sampling value and an associated zero crossing point are established. Depending on these established sampling values and the associated zero crossing points, phones are connected with each other in at most four different ways. The number of connection types is reduced to two if the waveforms are generated by utilising the Nyquist theorem. DE 689 15 353 T2 describes that the used band of waveforms extends between the last sampling value of the upstream waveform and the first sampling value of the downstream waveform. A variation of the duration of the used bands as a function of the waveforms to be concatenated, as is the case with the invention, is not disclosed in this document.
In summary, it can be said that the state of the art allows any phoneme sequence to be synthesised, but that the phoneme sequences synthesised in this manner do not possess an authentic voice quality. A synthesised phoneme sequence has an authentic voice quality if it cannot be distinguished by a listener from the same phoneme sequence spoken by a real speaker.
Methods are also known which use an inventory which comprises complete words and/or phrases in authentic voice quality as inventory elements. For the speech synthesis, these elements are brought into a desired order, with the possibilities of various voice sequences being limited to a high degree by the volume of such an inventory. The synthesis of any phoneme sequences is not possible with these methods.