In text-to-speech synthesis, a technology for audibly outputting text data, generating a natural intonation close to that of human speech has been a great challenge.
A method of controlling intonation that has been widely used heretofore employs a generation model in which an intonation pattern is formed by superposition of an accent component and a phrase component, as typified by the Fujisaki Model. This model can be associated with the physical phenomena of speech, and it can flexibly express intensities and positions of accents, a retrieval of a speech tone and the like.
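The superposition described above can be illustrated with a minimal sketch of the commonly published form of the Fujisaki Model, in which the logarithm of F0 is a baseline plus phrase and accent components. The function names, the command tuple layouts, and the default constants (alpha, beta, gamma) are assumptions chosen for illustration and are not taken from this document.

```python
import math


def phrase_response(t, alpha=3.0):
    """Phrase-control impulse response: alpha^2 * t * exp(-alpha * t) for t >= 0."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Accent-control step response, capped at the ceiling gamma, for t >= 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrase_commands, accent_commands,
                alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 in Hz at time t (seconds).

    phrase_commands: list of (onset_time, magnitude)          -- hypothetical layout
    accent_commands: list of (onset_time, offset_time, amplitude) -- hypothetical layout
    """
    log_f0 = math.log(base_hz)
    # Phrase component: one impulse response per phrase command.
    for onset, magnitude in phrase_commands:
        log_f0 += magnitude * phrase_response(t - onset, alpha)
    # Accent component: difference of step responses at accent onset and offset.
    for onset, offset, amplitude in accent_commands:
        log_f0 += amplitude * (accent_response(t - onset, beta, gamma)
                               - accent_response(t - offset, beta, gamma))
    return math.exp(log_f0)
```

For example, `fujisaki_f0(0.4, 120.0, [(0.0, 0.5)], [(0.2, 0.6, 0.4)])` evaluates a contour with one phrase command and one accent command. Even this small case needs several timing and magnitude parameters, which is the kind of control burden discussed in the following paragraph.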
However, associating this type of model with the linguistic information of speech has been complicated and difficult. Accordingly, it has been difficult to precisely control the parameters actually used in speech synthesis, such as those governing accents, the magnitude of a speech tone component, and the temporal arrangement thereof. Consequently, in many cases the parameters have been simplified excessively, and only fundamental prosodic characteristics have been expressed. This has made it difficult to control speaker characteristics and speech styles in conventional speech synthesis. For this reason, in recent years, a technique using a database (corpus base) established based on actual speech phenomena has been proposed in order to generate a more natural prosody.
As background art of this type, there are, for example, a technology disclosed in the gazette of Japanese Patent Laid-Open No. 2000-250570 and a technology disclosed in the gazette of Japanese Patent Laid-Open No. Hei 10 (1998)-116089. In the technologies described in these gazettes, an appropriate F0 pattern is selected from among patterns of fundamental frequencies (F0) of intonations in actual speech, which are accumulated in a database. The selected F0 pattern is applied to text that is a target of the speech synthesis (hereinafter referred to as target text) to determine an intonation pattern, and the speech synthesis is performed. Thus, speech synthesis with a better prosody is realized as compared with the above-described generation model of an intonation pattern by superposition of an accent component and a phrase component.
Any of such speech synthesis technologies using the F0 patterns determines or estimates a category which defines a prosody based on language information of the target text (e.g., parts of speech, accent positions, accent phrases and the like). An F0 pattern belonging to that prosodic category is then retrieved from the database, and this F0 pattern is applied to the target text to determine the intonation pattern.
Moreover, when a plurality of F0 patterns belong to a predetermined prosodic category, one representative F0 pattern is selected by an appropriate method, such as averaging the F0 patterns and adopting the sample closest to the mean value thereof (modeling), and is applied to the target text.
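The category-based lookup and representative selection described above can be sketched as follows. This is only an illustration under stated assumptions: each F0 pattern is taken to be a list of F0 values already resampled to a common length, closeness to the mean is measured by squared Euclidean distance, and the function names and category key are hypothetical, since the cited gazettes' actual data layout is not given here.

```python
def mean_contour(patterns):
    """Point-by-point mean of equal-length F0 contours."""
    length = len(patterns[0])
    return [sum(p[i] for p in patterns) / len(patterns) for i in range(length)]


def select_representative(patterns):
    """Return the stored contour closest to the mean contour.

    Assumption: closeness is squared Euclidean distance between contours.
    """
    mean = mean_contour(patterns)

    def squared_distance(pattern):
        return sum((a - b) ** 2 for a, b in zip(pattern, mean))

    return min(patterns, key=squared_distance)


def pattern_for_category(database, category):
    """Look up a prosodic category and return one representative F0 pattern.

    Returns None when the category is absent, i.e. when the target text
    cannot be classified into any category held in the database.
    """
    patterns = database.get(category)
    if not patterns:
        return None
    return select_representative(patterns)


# Hypothetical category key: (part of speech, accent position, mora count).
database = {
    ("noun", 1, 3): [
        [100.0, 120.0, 110.0],
        [102.0, 128.0, 108.0],
        [110.0, 140.0, 100.0],
    ],
}
representative = pattern_for_category(database, ("noun", 1, 3))
```

In this sketch every target text that falls into the same category receives the same contour, and a text whose category is not in the database receives none.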
However, as described above, the conventional speech synthesis technology using the F0 patterns directly associates the language information and the F0 patterns with each other in accordance with the prosodic category to determine the intonation pattern of the target text. Therefore, the conventional speech synthesis technology has had limitations: the quality of a synthesized speech depends on the determination of the prosodic category for the target text, and on whether an appropriate F0 pattern can be applied to target text that cannot be classified into any of the prosodic categories of the F0 patterns in the database.
Furthermore, the language information of the target text, that is, information concerning the positions of accents and morae and concerning whether or not there are pauses (silence sections) before and after an utterance, has a great effect on the determination of the prosodic category to which the target text belongs. Hence, F0 patterns have been wasted: an F0 pattern cannot be applied when these pieces of language information differ, even if its pattern shape is highly similar to that of the intonation in actual speech.
Moreover, the conventional speech synthesis technology described above performs the averaging and modeling of the pattern shape itself while putting importance on ease of treating the F0 pattern as data, and accordingly, has had limitations in expressing the F0 variations in the database.
Specifically, a speech to be synthesized is undesirably homogenized into a standard intonation such as that of a recital, and it has been difficult to flexibly synthesize a speech having dynamic characteristics (e.g., voices in an emotional speech, or a speech in dubbing that characterizes a specific character).
Incidentally, while text-to-speech synthesis is a technology aimed at synthesizing speech for an arbitrary sentence, many of the fields in which synthesized speech is actually applied can be served by relatively limited vocabularies and sentence patterns. Typical examples of such fields are response speeches in a Computer Telephony Integration system or a car navigation system, and responses in a speech dialogue function of a robot.
In the application of the speech synthesis technology to these fields, actual speech (recorded speech) is also frequently preferred over synthesized speech because of a strong demand for the speech to be natural. Actual speech data can be prepared in advance for determined vocabularies and sentence patterns. However, the role of synthesized speech is extremely large in view of the ease of dealing with the synthesis of unregistered words, with additions and changes to the vocabularies and sentence patterns, and the like, and further, with extension to an arbitrary sentence.
From the above background, a method for enhancing the naturalness of the synthesized speech by use of recorded speech has been studied for tasks in which comparatively limited vocabularies are used. Examples of technology for mixing recorded speech and synthesized speech are disclosed in the following documents 1 to 3.
Document 1: A. W. Black et al., “Limited Domain Synthesis,” Proc. of ICSLP 2000.
Document 2: R. E. Donovan et al., “Phrase Splicing and Variable Substitution Using the IBM Trainable Speech Synthesis System,” Proc. of ICASSP 2000.
Document 3: Katae et al., “Specific Text-to-speech System Using Sentence-prosody Database,” Proc. of the Acoustical Society of Japan, 2-4-6, Mar. 1996.
In the conventional technology disclosed in Document 1 or 2, the intonation of the recorded speech is basically utilized as it is. Hence, it is necessary to record in advance each phrase to be used as recorded speech in the context in which it will actually be used. Meanwhile, the conventional technology disclosed in Document 3 extracts in advance, from actual speech, the parameters of a model for generating the F0 pattern, and applies the extracted parameters to the synthesis of a specific sentence having variable slots. Hence, it is possible to generate intonations for different phrases as well, provided that the sentences containing the phrases are in the same format, but there remains the limitation that the technology can deal with only the specific sentence.
Here, consider inserting a phrase of synthesized speech between phrases of recorded speech, or connecting it before or after a phrase of recorded speech. Considering the various speech behaviors in actual individual utterances, such as fluctuations, degrees of emphasis and emotion, and differences in the intention of the utterances, it cannot be said that an intonation of each synthesized phrase with a fixed value is always adapted to the individual environment of the recorded phrase.
However, in the conventional technologies disclosed in the foregoing Documents 1 to 3, these speech behaviors in the actual speeches are not considered, which results in great limitations to the intonation generation in the speech synthesis.
In this connection, it is an object of the present invention to realize a speech synthesis system which is capable of providing highly natural speech and is capable of reproducing speech characteristics of a speaker flexibly and accurately in generation of an intonation pattern of speech synthesis.
Moreover, it is another object of the present invention to make effective use, in the speech synthesis, of the F0 patterns of intonations in actual speech accumulated in a database (corpus base), by narrowing down the F0 patterns without depending on a prosodic category.
Furthermore, it is still another object of the present invention to mix intonations of a recorded speech and synthesized speech to join the two smoothly.