Speech synthesis devices are widely used in various fields. In particular, these devices can be used in automated inquiry and service systems, e.g. for providing information, reservation, notification, etc.; in call center and ordering systems; in voice commentary systems; in auxiliary and adaptive systems for blind and visually impaired persons, as well as for other categories of persons with disabilities; in developing voice portals; in education; in TV projects and advertisement projects, e.g. to produce presentations; in document preparation systems and editorial publication systems; in electronic phone secretaries; in multimedia and entertainment projects and in other fields.
The most widespread approach to speech synthesis is the compilation approach, which provides the highest degree of similarity of synthesized speech to natural speech. According to compilation methods, synthesized speech based on user-defined text is produced by connecting units of pre-recorded natural speech of different lengths.
Historically, the first electronic synthesis systems were systems synthesizing speech from phonemes. Herein, the term “phoneme” refers to the smallest segmental unit of a language which has no individual lexical or grammatical meaning. Said systems did not require large database capacity because the number of phonemes in any given language does not usually exceed several dozen. For example, according to various phonological schools, the Russian language contains from 39 to 43 phonemes. However, due to the variety of phoneme combinations, coarticulation boundary effects at phoneme junctions should be taken into account when synthesizing speech from phonemes. In order to account for such effects, a wide variety of coarticulation rules were used, but even in that case the speech produced by such systems was of low quality compared with natural speech.
Further studies carried out to solve the problems of coarticulation led to the development of systems synthesizing speech from larger units. In particular, various diphonic synthesis systems were developed. Herein, the term “diphone” refers to a section of speech between centers of adjacent phonemes. This approach required larger databases of 1500-2000 units. The clear advantage of diphonic synthesis compared with phonemic synthesis is the fact that a diphone contains all information defining the transition between two adjacent phonemes. However, a significant number of connection points (one for each diphone) led to the necessity of using complex smoothing algorithms to synthesize speech of acceptable quality. Furthermore, due to the fact that only one variation of each diphone was usually stored in the database, synthesized speech did not provide prosodic variability, and thus it was necessary to use sound duration and sound pitch control techniques to provide intonation tones.
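By way of illustration only, the relationship between a phoneme sequence and the diphones needed to render it can be sketched as follows. The function name `to_diphones`, the `_` silence marker, and the hyphenated unit labels are assumptions made for this sketch and are not taken from any of the cited systems.

```python
def to_diphones(phonemes, boundary="_"):
    """Return the diphone labels covering a phoneme sequence.

    Each diphone spans from the center of one phoneme to the center of the
    next, so the transition between the two lies inside the unit. The
    utterance is padded with a silence marker (an assumption of this sketch)
    so that the first and last phonemes are also covered.
    """
    padded = [boundary, *phonemes, boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["k", "a", "t"]))  # ['_-k', 'k-a', 'a-t', 't-_']
```

An utterance of n phonemes thus maps to n + 1 diphones under this padding convention, which shows why the number of connection points grows with the length of the utterance.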
Another approach to taking coarticulation effects into account is to use syllables as units for speech synthesis. The advantage of this solution is that most coarticulation effects occur within syllables rather than at their ends. Thanks to this, syllable-by-syllable synthesis systems provide better quality of synthesized speech compared with the aforementioned systems. However, due to the large number of syllables in a language, syllable-by-syllable synthesis requires a substantial increase in database capacity. In order to decrease the amount of stored data, half-syllabic synthesis (i.e. synthesis based on half-syllables produced by dividing syllables along their core) was used. However, this inevitably led to more complicated connection of speech units in synthesis.
All aforementioned systems synthesized uniform speech with no intonation variability, because they had only one or just a few candidates for each synthesized speech sound due to limited database capacity and computational capability. In order to give synthesized speech an emotional overtone, various techniques of changing the duration and pitch of speech sounds were used; however, the quality of such speech was insufficient. In addition, the relatively short length of the units of natural speech used for synthesis resulted in a large number of connection points and, therefore, in the necessity to use various smoothing and/or coarticulation techniques, which, on the one hand, made synthesis systems more complicated and, on the other hand, did not allow the use of database elements without processing, making the synthesized speech sound less natural.
As computational devices grew in memory capacity and processing capability, it became possible to use larger databases containing continuous and non-uniform speech samples, and thus to use longer and more diverse speech units, which provided increased quality of synthesized speech due to fewer connection points and the intonational richness of the units used.
In WO 0126091, a method for producing a viable speech rendition of text is disclosed. According to this method, the text to be processed is split into words which are then compared with a list of words previously saved in a database as audio files. If a corresponding audio file is found for each word in the text, the speech is synthesized as a sequence of audio files including all words of the text. If, however, a corresponding audio file is not found for some words, such words are split into diphones and the desired word is produced by concatenating corresponding diphones which are also previously saved in the database. The advantage of said method is the use of relatively large speech units (i.e. words) for speech synthesis, thus decreasing the number of connection points and making synthesized speech smoother. On the other hand, using a combination of corresponding diphones instead of words makes it possible to limit the database to sufficiently common words only, thus limiting the database capacity. However, said approach does not provide synthesized speech comparable with natural speech in terms of quality. That is due to the fact that the database usually contains only one neutral pronunciation sample for each word, while, in natural speech, a word can sound different depending on its position within a sentence and on intonation. This problem is partially solved by recording into the database additional pronunciation variations of words corresponding to their terminal position within a sentence. However, this method is largely incapable of synthesizing non-uniform speech with intonation overtones.
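The word lookup with diphone fallback described above can be sketched roughly as follows. This is a minimal illustration and not the implementation of WO 0126091: the names `plan_units`, `word_db` and `diphone_db` are hypothetical, the audio files are represented by file-name strings, and letters stand in for phonemes, which a real system would obtain from a pronunciation lexicon.

```python
def plan_units(text, word_db, diphone_db, boundary="_"):
    """Return the ordered list of stored units needed to speak `text`.

    Whole-word recordings are preferred. A word missing from `word_db`
    is assembled from diphones instead. Letters are used as stand-ins for
    phonemes (an assumption of this sketch).
    """
    plan = []
    for word in text.lower().split():
        if word in word_db:
            plan.append(("word", word_db[word]))
            continue
        symbols = [boundary, *word, boundary]
        for left, right in zip(symbols, symbols[1:]):
            # Raises KeyError if the diphone inventory is incomplete.
            plan.append(("diphone", diphone_db[left + right]))
    return plan


word_db = {"hello": "hello.wav"}
diphone_db = {"_h": "_h.wav", "hi": "hi.wav", "i_": "i_.wav"}
print(plan_units("hello hi", word_db, diphone_db))
```

In this sketch the word found in the database contributes a single unit, whereas the missing word contributes three diphone units and therefore more connection points.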
In recent years, developers of speech synthesis methods from user-defined text and corresponding synthesis devices have been focused on making synthesized speech more natural by providing it with prosodic flexibility and intonation overtones.
In U.S. Pat. No. 6,665,641, variations of a speech synthesizer are disclosed, the synthesizer comprising, for example, a speech database including speech waveforms; a speech waveform selector in communication with said database; and a speech waveform concatenator in communication with said database. Said selector searches for speech waveforms in the database based on certain criteria. Such criteria may be, for example, similarity in linguistic and prosodic attributes, wherein candidate sound waveforms are of a pitch within the range defined as a function of high-level linguistic features. Then said concatenator concatenates the selected speech waveforms to obtain an output speech signal. This speech synthesizer provides speech based on previously recorded speech units while reproducing various prosodic attributes; however, the speech synthesizer does not take into account that the physical parameters of a speech waveform depend on the intonation of the initial text and its parts, which does not allow precise reproduction of the intonation of the speech.
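The selector and concatenator arrangement described above can be illustrated with the following simplified sketch. It is not the implementation of U.S. Pat. No. 6,665,641: the `Unit` structure, the two linguistic features (`stressed`, `phrase_final`) and all pitch values in `pitch_range` are assumptions chosen only to show a pitch range being derived from high-level features and used to filter candidates.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    phone: str        # label of the recorded speech sound
    pitch_hz: float   # mean pitch of the recording
    samples: list     # waveform samples (short lists here for brevity)


def pitch_range(stressed, phrase_final):
    """Map linguistic features to an allowed pitch range (hypothetical values)."""
    low, high = (110.0, 160.0) if stressed else (90.0, 130.0)
    if phrase_final:
        low, high = low - 20.0, high - 20.0
    return low, high


def select(database, phone, stressed, phrase_final):
    """Pick the matching unit whose pitch lies closest to the range center."""
    low, high = pitch_range(stressed, phrase_final)
    candidates = [u for u in database
                  if u.phone == phone and low <= u.pitch_hz <= high]
    if not candidates:
        return None
    center = (low + high) / 2
    return min(candidates, key=lambda u: abs(u.pitch_hz - center))


def concatenate(units):
    """Join the selected waveforms end to end, without smoothing."""
    output = []
    for unit in units:
        output.extend(unit.samples)
    return output


database = [Unit("a", 100.0, [1, 2]), Unit("a", 140.0, [3, 4]), Unit("t", 95.0, [5])]
chosen = [select(database, "a", True, False), select(database, "t", False, True)]
print(concatenate(chosen))  # [3, 4, 5]
```

In this sketch the pitch range depends only on the features of the individual sound, not on the intonation of the portion of text that contains it, which corresponds to the limitation noted above.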
In WO 2008147649, a method for synthesizing speech is disclosed. The method uses speech microsegments as speech units for synthesis. According to said method, an input text sequence is processed to obtain acoustic parameters. Then a number of candidate speech microsegment sets are selected from a speech database in accordance with the obtained acoustic parameters, and a preferred sequence of speech microsegments for the obtained acoustic parameters is determined. Speech is synthesized from these speech microsegments. The duration of said microsegments can be no more than 20 ms, i.e. several times shorter than, for example, the duration of a diphone. This allows more frequent acoustic variations in the synthesized speech compared with phonemic and diphonic synthesis, thus making the speech more natural. Several methods of obtaining the acoustic parameters based on processing the input text are disclosed in the application; however, the application fails to disclose any mechanism of direct association between said parameters and intonation and, consequently, does not provide synthesized speech with desired intonation overtones.
The closest prior art to the claimed invention is U.S. Pat. No. 7,502,739, disclosing a speech synthesis apparatus for synthesizing speech from a text and using a method of speech synthesis, comprising:
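One common way of determining a preferred sequence from per-position candidate sets is a dynamic-programming search that balances closeness to the target parameters against smoothness at the joins. The sketch below illustrates that general technique only; the source does not state that WO 2008147649 uses it. Each microsegment is reduced to a single pitch value, and the name `best_sequence` and the `join_weight` parameter are hypothetical.

```python
def best_sequence(targets, candidates, join_weight=1.0):
    """Choose one candidate per position, minimizing total cost.

    targets:    one target acoustic value per position
    candidates: one non-empty list of candidate values per position
    Cost = |candidate - target| at each position
         + join_weight * |difference between neighboring choices|.
    """
    costs = [abs(c - targets[0]) for c in candidates[0]]
    back = []  # back[i][j]: best predecessor index for candidate j at position i + 1
    for target, current, previous in zip(targets[1:], candidates[1:], candidates):
        new_costs, pointers = [], []
        for c in current:
            k = min(range(len(previous)),
                    key=lambda i: costs[i] + join_weight * abs(previous[i] - c))
            new_costs.append(costs[k] + join_weight * abs(previous[k] - c)
                             + abs(c - target))
            pointers.append(k)
        costs = new_costs
        back.append(pointers)

    # Trace the cheapest path back from the last position.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


targets = [100, 110, 120]
candidates = [[90, 101], [130, 108], [119, 150]]
print(best_sequence(targets, candidates))  # [101, 108, 119]
```

Because the targets in this sketch are plain acoustic values, nothing in the search itself ties the selected sequence to the intonation of the text.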
specifying at least one portion of a text;
determining the intonation of each portion;
associating target speech sounds with each portion;
determining physical parameters of the target speech sounds;
finding speech sounds most similar to the target speech sounds in terms of the physical parameters in the database;
synthesizing speech as a sequence of the found speech sounds.
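The sequence of steps listed above can be sketched in simplified form as follows. This is an illustrative toy and not the apparatus of U.S. Pat. No. 7,502,739: portions are taken to be sentences, intonation is inferred from the final punctuation mark, letters stand in for target speech sounds, and a single pitch value stands in for the physical parameters. The names `synthesize`, `INTONATION` and `TARGET_PITCH` and all numeric values are assumptions.

```python
import re

# Hypothetical intonation types and target pitch values (Hz).
INTONATION = {".": "statement", "?": "question", "!": "exclamation"}
TARGET_PITCH = {"statement": 100.0, "question": 140.0, "exclamation": 160.0}


def synthesize(text, database):
    """Return the identifiers of the database units chosen for `text`.

    database: list of (sound, pitch_hz, unit_id) tuples.
    """
    result = []
    # Specify portions of the text: sentences ending in '.', '?' or '!'.
    for body, mark in re.findall(r"([^.?!]+)([.?!])", text):
        # Determine the intonation of the portion.
        intonation = INTONATION[mark]
        # Associate target speech sounds with the portion (letters as stand-ins).
        target_sounds = [ch for ch in body.lower() if ch.isalpha()]
        # Determine the physical parameters of the target sounds.
        target_pitch = TARGET_PITCH[intonation]
        for sound in target_sounds:
            # Find the most similar stored sound in terms of those parameters.
            matches = [unit for unit in database if unit[0] == sound]
            if matches:
                best = min(matches, key=lambda unit: abs(unit[1] - target_pitch))
                # Build the output as a sequence of the found sounds.
                result.append(best[2])
    return result


database = [("h", 100.0, "h_low"), ("h", 150.0, "h_high"),
            ("i", 95.0, "i_low"), ("i", 145.0, "i_high")]
print(synthesize("Hi. Hi?", database))  # ['h_low', 'i_low', 'h_high', 'i_high']
```

The same two sounds are rendered by different stored units depending on the intonation of the portion in which they occur.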
According to this method, intonation models are additionally determined, intonation patterns corresponding to said models are found in an intonation pattern database, and the found patterns are concatenated to produce an intonation pattern of the whole text. Then speech is synthesized based on said intonation pattern of the whole text.
The method of U.S. Pat. No. 7,502,739 allows a wide variability of intonation and speech overtones depending on the completeness of the intonation pattern database. However, according to said method, the intonation of synthesized speech is a result of processing speech units by an intonation pattern and further concatenating the speech units to produce speech corresponding to the input text, which may impair the natural sound of the synthesized speech.
Therefore, despite the development of a plurality of methods, devices and systems for compilation speech synthesis from user-defined text that use different solutions to reproduce prosodic and intonation peculiarities, the problem of speech synthesis with improved intonation reproduction remains relevant.