1. Field of the Invention
The present invention relates to a speech synthesis apparatus that synthesizes a given speech by rules, and in particular to a speech synthesis apparatus with improved control of the pitch contour of synthesized speech in a text-to-speech conversion technique that outputs, as speech, a mixed sentence including Chinese characters (called Kanji) and Japanese syllabary (Kana) used in daily reading and writing.
2. Description of the Related Art
According to the text-to-speech conversion technique, Kanji and Kana characters used in daily reading and writing are input, converted into speech, and output. This technique has no limitation on the vocabulary to be output. Thus, the text-to-speech conversion technique is expected to be applied to various technical fields as an alternative technique to recording-reproducing speech synthesis.
When Kanji and Kana characters (hereinafter, referred to as a text) are input to a conventional speech synthesis apparatus, a text analysis module included therein generates a string of phonetic and prosodic symbols (hereinafter, referred to as an intermediate language) from the character information. The intermediate language describes how to read the input sentence, accents, intonation and the like as a character string. A prosody generation module then determines synthesizing parameters from the intermediate language generated by the text analysis module. The synthesizing parameters include a pattern of a phoneme, a duration of the phoneme and a fundamental frequency (pitch of voice, hereinafter simply referred to as pitch) and the like. The determined synthesizing parameters are output to a speech generation module. The speech generation module generates a synthesized waveform by referring to the synthesizing parameters generated in the prosody generation module and a voice segment dictionary in which phonemes are accumulated, and then outputs synthetic sound through a speaker.
Next, a conventional process conducted by the prosody generation module is described in detail. The conventional prosody generation module includes an intermediate language analysis module, a phrase command determination module, an accent command determination module, a phoneme duration calculation module, a phoneme power determination module and a pitch contour generation module.
The intermediate language input to the prosody generation module is a string of phonetic characters with the position of an accent, the position of a pause or the like. From this string, parameters required for generating a waveform (hereinafter, referred to as waveform-generating parameters), such as time-variant change of the pitch (hereinafter, referred to as a pitch contour), the duration of each phoneme (hereinafter, referred to as the phoneme duration), and power of speech are determined. The intermediate language input is subjected to analysis of the character string in the intermediate language analysis module. In the analysis, word-boundaries are determined based on a symbol indicating a word's end in the intermediate language, and a mora position of an accent nucleus is obtained based on an accent symbol.
The accent nucleus is a position at which the accent falls. A word having an accent nucleus positioned at the first mora is referred to as a word of accent type one, while a word having an accent nucleus positioned at the n-th mora is referred to as a word of accent type n. Such words are referred to as accented words. On the other hand, a word having no accent nucleus (for example, "shin-bun" and "pasokon", which mean a newspaper and a personal computer in Japanese, respectively) is referred to as a word of accent type zero or an unaccented word.
The phrase command determination module and the accent command determination module determine parameters for response functions described later, based on a phrase symbol, an accent symbol and the like in the intermediate language. In addition, if a user sets intonation (the magnitude of the intonation), the magnitude of the phrase command and that of the accent command are modified in accordance with the user's setting.
The phoneme duration calculation module determines the duration of each phoneme from the phonetic character string and sends the calculation result to the speech generation module. The phoneme duration is calculated using rules or a statistical analysis such as Quantification theory (type one), depending on the type of an adjacent phoneme. Quantification theory (type one) is a kind of factor analysis, and it can formulate the relationship between categorical and numerical values. In addition, in the case where the user sets a speech rate, the phoneme duration calculation module adjusts the phoneme duration in accordance with the speech rate. Normally, the phoneme duration becomes longer when the speech rate is made slower, while the phoneme duration becomes shorter when the speech rate is made faster.
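As a rough illustration of the duration calculation described above, the following Python sketch predicts a phoneme duration as a sum of category scores, which is the additive form a Quantification theory (type one) model takes at prediction time, and then adjusts it for the speech rate. The score tables, the category names, the function name `phoneme_duration_ms` and the choice to divide by the speech rate are all assumptions made for illustration; none of them is taken from the conventional apparatus.

```python
# All numbers below are made-up illustration values, not measured data.
BASE_MS = {"a": 90.0, "k": 60.0, "s": 80.0}                      # score per phoneme
PREV_TYPE_MS = {"vowel": 5.0, "consonant": -5.0, "pause": 10.0}  # preceding phoneme type
NEXT_TYPE_MS = {"vowel": 0.0, "consonant": -10.0, "pause": 15.0} # following phoneme type


def phoneme_duration_ms(phoneme, prev_type, next_type, speech_rate=1.0):
    """Return a duration in milliseconds as a sum of category scores.

    speech_rate > 1.0 means faster speech (shorter phonemes);
    speech_rate < 1.0 means slower speech (longer phonemes).
    Dividing by the rate is an assumed scaling rule.
    """
    if speech_rate <= 0:
        raise ValueError("speech_rate must be positive")
    duration = BASE_MS[phoneme] + PREV_TYPE_MS[prev_type] + NEXT_TYPE_MS[next_type]
    return duration / speech_rate


normal = phoneme_duration_ms("a", "consonant", "pause")
fast = phoneme_duration_ms("a", "consonant", "pause", speech_rate=2.0)
```

In a real system the scores would be estimated from recorded speech by least squares over dummy-coded categories; the sketch only shows how such scores would be applied.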
The phoneme power determination module calculates the value of the amplitude of the waveform in order to send the calculated value to the speech generation module. The phoneme power is a power transition in a period corresponding to a rising portion of the phoneme in which the amplitude gradually increases, in a period corresponding to a steady state, and in a period corresponding to a falling portion of the phoneme in which the amplitude gradually decreases, and is calculated based on coefficient values in the form of a table.
These waveform generating parameters are sent to the speech generation module. Then, the synthesized waveform is generated.
Next, a procedure for generating a pitch contour in the pitch contour generation module is described.
FIG. 14 is a diagram explaining the generation procedure of the pitch contour and illustrates a model of a pitch control mechanism.
In order to sufficiently represent differences of intonation between various sentences, it is necessary to clarify the relationship between pitch and time in a syllable. The "pitch control mechanism model" described by a critically damped second-order linear system is used as a model that can clearly describe the pitch contour in the syllable and can define the time-variant structure of the syllable. The pitch control mechanism model described in the present specification is the model explained below.
The pitch control mechanism model is a model that is considered to generate a fundamental frequency providing information about the voice pitch. The frequency of vibration of vocal cords, that is, the fundamental frequency, is controlled by an impulse command generated at every change of phrase, and a stepwise command generated at every rising and falling of an accent. Because of delay characteristics of physiological mechanisms, the impulse command of the phrase results in a curve (phrase component) gradually descending from the front of a sentence to the end of the sentence (see the waveform indicated with a broken line in FIG. 14), while the stepwise command of the accent results in a curve (accent component) with local ups and downs (indicated by a waveform with a solid line in FIG. 14). Each of these two components is modeled as a response of the critically damped second-order linear system to the corresponding command. The pattern of the time-variant change of the logarithmic fundamental frequency is expressed as a sum of these two components.
The logarithmic fundamental frequency ln F0(t) (t: time) is formulated as shown by Expression (1).

\[
\ln F_0(t) = \ln F_{\min} + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{0i}) + \sum_{j=1}^{J} A_{aj} \left\{ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \right\} \tag{1}
\]
In Expression (1), Fmin is the lowest frequency (hereinafter, referred to as a base pitch), I is the number of phrase commands in the sentence, Api is the magnitude of the i-th phrase command in the sentence, T0i is a start time of the i-th phrase command in the sentence, J is the number of accent commands in the sentence, Aaj is the magnitude of the j-th accent command in the sentence, and T1j and T2j are a start time and an end time of the j-th accent command, respectively. Gpi(t) and Gaj(t) are an impulse response function of the phrase control mechanism and a step response function of the accent control mechanism given by Expressions (2) and (3), respectively.
\[
G_{pi}(t) = \alpha_i^{2}\, t \exp(-\alpha_i t) \tag{2}
\]

\[
G_{aj}(t) = \min\left[\, 1 - (1 + \beta_j t)\exp(-\beta_j t),\ \theta \,\right] \tag{3}
\]
Expressions (2) and (3) give the response functions for t ≥ 0; for t < 0, Gpi(t) = Gaj(t) = 0. In addition, min[x, y] in Expression (3) means the smaller of x and y. This corresponds to the fact that in actual speech, the accent component reaches its upper limit within a finite time period. In the above, αi is a natural angular frequency of the phrase control mechanism for the i-th phrase command, and is set to 3.0, for example. βj is a natural angular frequency of the accent control mechanism for the j-th accent command, and is set to 20.0, for example. θ is the upper limit of the accent component and is selected to be 0.9, for example.
The fundamental frequency and the pitch controlling parameters (Api, Aaj, T0i, T1j, T2j, αi, βj and Fmin) are defined as follows. [Hz] is used as a unit for F0(t) and Fmin; [sec] is used for T0i, T1j and T2j; and [rad/sec] is used for αi and βj. For Api and Aaj, values obtained when the units for the fundamental frequency and the pitch controlling parameters are defined as mentioned above are used.
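The pitch control mechanism model of Expressions (1) to (3) can be sketched in Python as follows. The response functions use the example values given above (3.0, 20.0 and 0.9) as defaults. The function names, the tuple layout of the command lists, and the command magnitudes and times in the usage lines are assumptions chosen only for illustration.

```python
import math


def gp(t, alpha=3.0):
    """Phrase control impulse response, Expression (2); zero for t < 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def ga(t, beta=20.0, theta=0.9):
    """Accent control step response, Expression (3); zero for t < 0."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta)


def ln_f0(t, fmin, phrases, accents):
    """ln F0(t) according to Expression (1).

    fmin    : base pitch Fmin in Hz
    phrases : list of (Ap, T0) tuples, times in seconds
    accents : list of (Aa, T1, T2) tuples, times in seconds
    """
    value = math.log(fmin)
    value += sum(ap * gp(t - t0) for ap, t0 in phrases)
    value += sum(aa * (ga(t - t1) - ga(t - t2)) for aa, t1, t2 in accents)
    return value


# Hypothetical commands: one phrase command and one accent command.
phrases = [(0.5, 0.0)]
accents = [(0.4, 0.1, 0.5)]
f0_at_300ms = math.exp(ln_f0(0.3, 80.0, phrases, accents))  # pitch in Hz at t = 0.3 s
```

Evaluating `ln_f0` on a grid of times and exponentiating yields a pitch contour in Hz; before any command starts, the contour sits at the base pitch Fmin.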
In accordance with the generation procedure described above, the prosody generation module determines the pitch controlling parameters from the intermediate language. For example, the creation time T0i of the phrase command is set at a position where punctuation in the intermediate language exists; the start time T1j of the accent command is set at a position immediately after a word-boundary symbol; and the end time T2j of the accent command is set at a position where the accent symbol exists or, in a case where the word in question is an unaccented (flat) word having no accent symbol, at a position immediately before a symbol indicating a boundary between the word in question and the next word.
Api and Aaj, which indicate the magnitudes of the phrase command and the accent command, respectively, are normally obtained by text analysis as quantized values, each having any of three levels. Thus, Api and Aaj are defined depending on the types of the phrase symbol and the accent symbol in the intermediate language. In some recent cases, the magnitudes of the phrase command and the accent command are not determined by rules, but are determined using a statistical analysis such as Quantification theory (type one). In a case where a user sets the intonation, the determined values Api and Aaj are modified.
Normally, the intonation can be set to any of 3 to 5 levels, and the modification is performed by multiplying Api and Aaj by a constant value previously assigned to each level. In a case where the intonation is not set, the modification is not performed.
The base pitch Fmin expresses the lowest pitch of the synthesized speech and is used for controlling the voice pitch. Normally, Fmin is quantized into any of 5 to 10 levels and is stored in the form of a table. Fmin is increased when high-pitch voice is preferred, or is decreased when low-pitch voice is preferred, depending on the user's preference. Therefore, Fmin is modified only when the user sets the value. The modifying process is performed in the pitch contour generation module.
The conventional pitch contour generating method mentioned above has a serious problem in that the average pitch fluctuates to a large degree depending on the word-structure of the input text to be synthesized. The problem is explained below.
FIGS. 15A and 15B are diagrams illustrating a comparison of pitch contours having different accent types. When the pitch contours shown in FIGS. 15A and 15B are compared to each other, the average pitch in a text including successive unaccented words (FIG. 15A) is clearly different from that in a text including successive accented words (FIG. 15B). When a person recognizes the voice pitch, it is considered that the person relies on the average pitch, not on the base pitch. In many cases, the text-to-speech conversion technique is used not for the speech synthesis of a single sentence, but for the speech synthesis of a composite sentence. Therefore, the conventional method has a problem in that the speech is hard to hear because the voice pitch rises or falls in some sentences.
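The dependence of the average pitch on accent type can be reproduced numerically with the response functions of Expressions (2) and (3). The Python sketch below averages the phrase and accent components (that is, ln F0(t) minus ln Fmin) over two utterances that differ only in where each accent command ends. The word length of 0.6 s, the command magnitudes, the 0.15 s accent length, the sampling step and the function name `mean_components` are all assumed values for illustration and are not read from FIGS. 15A and 15B.

```python
import math


def gp(t, alpha=3.0):
    """Phrase impulse response, Expression (2)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def ga(t, beta=20.0, theta=0.9):
    """Accent step response, Expression (3)."""
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta) if t >= 0 else 0.0


def mean_components(phrases, accents, duration, step=0.005):
    """Time average of the phrase and accent components over [0, duration)."""
    n = int(round(duration / step))
    total = 0.0
    for k in range(n):
        t = k * step
        total += sum(ap * gp(t - t0) for ap, t0 in phrases)
        total += sum(aa * (ga(t - t1) - ga(t - t2)) for aa, t1, t2 in accents)
    return total / n


phrases = [(0.5, 0.0)]  # one phrase command at the sentence start (assumed)
# Three words of 0.6 s each. Unaccented: each accent command lasts to the word boundary.
unaccented = [(0.3, 0.0, 0.6), (0.3, 0.6, 1.2), (0.3, 1.2, 1.8)]
# Accent type one: each accent command ends shortly after the first mora (0.15 s assumed).
accented = [(0.3, 0.0, 0.15), (0.3, 0.6, 0.75), (0.3, 1.2, 1.35)]

avg_unaccented = mean_components(phrases, unaccented, 1.8)
avg_accented = mean_components(phrases, accented, 1.8)
# Ratio of the two average pitches in Hz when the same Fmin is used for both.
pitch_ratio = math.exp(avg_unaccented - avg_accented)
```

With the same base pitch Fmin, the two averages differ under these assumed commands, so the perceived voice pitch would shift between the two sentences.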
Moreover, the user's setting of the intonation is realized by multiplying the magnitudes of the phrase command and the accent command obtained by a predetermined procedure by a certain constant value. Therefore, in a case where the intonation is increased, it is likely that the voice pitch becomes in part extremely high in a certain sentence. Such synthesized speech is hard to hear and has a bias in tones. When such synthesized speech is heard, the part of the speech with a degraded quality is likely to remain in the ears.
It is an object of the present invention to provide a speech synthesis apparatus that can produce synthesized speech that is easy to hear, with fluctuation of the average pitch between sentences suppressed.
It is another object of the present invention to provide a speech synthesis apparatus that can prevent the voice pitch from being extremely high and can produce synthesized speech that is easy to hear.
According to an aspect of the present invention, a speech synthesis apparatus includes: a text analyzer operable to generate a phonetic and prosodic symbol string from character information of an input text; a word dictionary storing a reading and an accent of a word; a voice segment dictionary storing a phoneme that is a basic unit of speech; a parameter generator operable to generate synthesizing parameters including at least a phoneme, a duration of the phoneme and a fundamental frequency for the phonetic and prosodic symbol string, the parameter generator including a calculating means operable to obtain a sum of phrase components and a sum of accent components and to calculate an average pitch from the sum of the phrase components and the sum of the accent components, and a determining means operable to determine a base pitch from the average pitch; and a waveform generator operable to generate a synthesized waveform by making waveform-overlapping referring to the synthesizing parameters generated by the parameter generator and the voice segment dictionary.
In one embodiment of the present invention, the calculating means calculates an average value of the sum of the phrase commands and the sum of the accent commands as the average pitch. This calculation is undertaken based on creation times and magnitudes of the respective phrase commands, start times, end times and magnitudes of the respective accent commands. The determining means determines the base pitch in such a manner that a value obtained by adding the average value and the base pitch becomes constant.
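One possible reading of this embodiment is sketched below in Python. It assumes that the addition is performed in the logarithmic domain of Expression (1), so that ln Fmin plus the average of the phrase and accent components is held at a fixed target. The function name `base_pitch`, the target of 120 Hz and the two average values are assumptions for illustration; the average value itself is taken as an input and could be computed from the command times and magnitudes.

```python
import math


def base_pitch(avg_components, target_mean_hz):
    """Return Fmin in Hz such that ln Fmin + avg_components = ln(target_mean_hz).

    avg_components : average of phrase + accent components for one sentence
    target_mean_hz : desired constant average pitch (assumed user or system setting)
    """
    return math.exp(math.log(target_mean_hz) - avg_components)


# Hypothetical averages for two sentences with different word structure.
fmin_sentence_a = base_pitch(0.40, 120.0)
fmin_sentence_b = base_pitch(0.25, 120.0)
```

A sentence with a larger average component receives a lower base pitch, so the sum stays constant from sentence to sentence.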
According to another aspect of the present invention, a speech synthesis apparatus includes: a text analyzer operable to generate a phonetic and prosodic symbol string from character information of an input text; a word dictionary storing a reading and an accent of a word; a voice segment dictionary storing a phoneme that is a basic unit of speech; a parameter generator operable to generate synthesizing parameters including at least a phoneme, a duration of the phoneme and a fundamental frequency for the phonetic and prosodic symbol string, the parameter generator including a calculating means operable to overlap a phrase component and an accent component, obtain an approximation of a pitch contour from the overlapped phrase and accent components and calculate at least a maximum value of the approximation of the pitch contour, and a modifying means operable to modify a value of the phrase component and a value of the accent component by using at least the maximum value; and a waveform generator operable to generate a synthesized waveform by making waveform-overlapping referring to the synthesizing parameters generated by the parameter generator and the voice segment dictionary.
In one embodiment of the present invention, the calculating means calculates a maximum value and a minimum value of the pitch contour from a creation time and a magnitude of the phrase command and a start time, an end time and a magnitude of the accent command. The modifying means modifies the magnitude of the phrase component and the magnitude of the accent component in such a manner that the difference between the maximum value and the minimum value is made substantially the same as the intonation value set by a user.
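Because Expression (1) is linear in the command magnitudes, multiplying every Api and Aaj by the same positive factor k multiplies the difference between the maximum and the minimum of ln F0(t) by k as well. The Python sketch below uses that property to match a target range. It assumes that the intonation value set by the user is expressed as a range of ln F0; the function name `rescale_commands`, the tuple layout and the sample values are hypothetical.

```python
def rescale_commands(phrases, accents, contour, target_range):
    """Scale all command magnitudes so the contour range matches target_range.

    phrases      : list of (Ap, T0) tuples
    accents      : list of (Aa, T1, T2) tuples
    contour      : samples of ln F0(t) computed from the unscaled commands
    target_range : desired max - min of ln F0 (assumed meaning of the user setting)
    """
    current = max(contour) - min(contour)
    if current <= 0.0:
        return list(phrases), list(accents)  # flat contour: nothing to scale
    k = target_range / current
    scaled_phrases = [(ap * k, t0) for ap, t0 in phrases]
    scaled_accents = [(aa * k, t1, t2) for aa, t1, t2 in accents]
    return scaled_phrases, scaled_accents


# Hypothetical commands and contour samples; the contour range here is 1.0.
phrases = [(0.5, 0.0)]
accents = [(0.4, 0.1, 0.5)]
contour = [4.0, 4.5, 5.0, 4.25]
new_phrases, new_accents = rescale_commands(phrases, accents, contour, 0.5)
```

Unlike multiplication by a fixed constant per intonation level, the factor here depends on the contour of each sentence, which keeps the peak pitch from growing without bound.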