1. Field of the Invention
The present invention relates to text-to-speech conversion technologies for outputting a speech for a text that is composed of Japanese Kanji and Kana characters and, particularly, to a prosody control in high-speed reading.
2. Description of the Related Art
A text-to-speech conversion system, which receives a text composed of Japanese Kanji and Kana characters and converts it to speech for output, has an unlimited output vocabulary and is expected to replace the record/playback speech synthesis technology in a variety of application fields.
FIG. 15 shows a typical text-to-speech conversion system. When a text of sentences composed of Japanese Kanji and Kana characters (hereinafter “text”) is inputted, a text analysis module 101 generates a phoneme and prosody character string or sequence from the character information. The “phoneme and prosody character string or sequence” herein used means a sequence of characters representing the reading of an input sentence and the prosodic information such as accent and intonation (hereinafter “intermediate language”). A word dictionary 104 is a pronunciation dictionary in which the reading, accent, etc. of each word are registered. The text analysis module 101 performs a linguistic process, such as morphemic analysis and syntax analysis, by referring to the pronunciation dictionary to generate an intermediate language.
Based on the intermediate language generated by the text analysis module 101, a prosody generation module 102 determines a composite or synthesis parameter composed of a voice segment (kind of a sound), a sound quality conversion coefficient (tone of a sound), a phoneme duration (length of a sound), a phoneme power (intensity of a sound), and a fundamental frequency (height of a sound, hereinafter “pitch”) and transmits it to a speech generation module 103.
The “voice segments” herein used mean units of voice connected to produce a composite or synthetic waveform (speech) and vary with the kind of sound. Generally, the voice segment is composed of a string of phonemes such as CV, VV, VCV, or CVC wherein C and V represent a consonant and a vowel, respectively.
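The C/V notation above can be illustrated with a minimal sketch. It assumes phonemes arrive as a list of romanized symbols and treats every symbol other than a, i, u, e, or o as a consonant; the function name `cv_pattern` is hypothetical and not part of the described system.

```python
# Hypothetical helper: label a phoneme sequence with the C/V notation
# used for voice segment units (CV, VV, VCV, CVC).
VOWELS = {"a", "i", "u", "e", "o"}

def cv_pattern(phonemes):
    """Return a string such as 'CV' or 'VCV' for a list of phoneme symbols.

    Assumption: any symbol that is not one of the five vowels is a consonant.
    """
    return "".join("V" if p in VOWELS else "C" for p in phonemes)
```

For example, `cv_pattern(["a", "k", "a"])` labels the sequence as a VCV unit.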
Based on the respective parameters generated by the prosody generation module 102, the speech generation module 103 generates a composite or synthetic waveform (speech) by referring to a voice segment dictionary 105 that is composed of a read-only memory (ROM), etc., in which voice segments are stored, and outputs the synthetic speech through a speaker. The synthetic speech can be made by, for example, putting a pitch mark (as a reference point) on the voice waveform and, upon synthesis, superimposing it by shifting the position of the pitch mark according to the synthesis pitch cycle. The foregoing is a brief description of the text-to-speech conversion process.
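The superimposition step described above can be sketched as a simple overlap-add. This is an illustrative sketch only: it assumes each voice segment is a short list of samples centered on its pitch mark, applies a Hann window (an assumption, since the text does not name a window), and places successive copies `period` samples apart. The names `hann` and `overlap_add` are hypothetical.

```python
import math

def hann(n):
    """Return an n-point Hann window (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]

def overlap_add(segment, period, count):
    """Superimpose `count` windowed copies of `segment`.

    Each copy is shifted by `period` samples relative to the previous one,
    so `period` plays the role of the synthesis pitch cycle.
    """
    grain = [s * w for s, w in zip(segment, hann(len(segment)))]
    out = [0.0] * (period * (count - 1) + len(segment))
    for k in range(count):
        start = k * period  # offset of the k-th copy in the output
        for n, value in enumerate(grain):
            out[start + n] += value
    return out
```

A smaller `period` packs the copies closer together and raises the synthesized pitch; a larger `period` lowers it.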
FIG. 16 shows the conventional prosody generation module 102. The intermediate language inputted to the prosody generation module 102 is a phoneme character sequence containing prosodic information such as an accent position and a pause position. Based on this information, the module 102 determines parameters for generating waveforms (hereinafter “synthesis parameter”) such as temporal changes of the pitch (hereinafter “pitch contour”), the voice power, the phoneme duration, and the voice segment addresses stored in a voice segment dictionary. In addition, the user may input a control parameter for designating at least one utterance property such as an utterance speed, pitch, intonation, intensity, speaker, and sound quality.
An intermediate language analysis unit 201 analyzes the character sequence of the input intermediate language to determine a word boundary from the breath group and word end symbols placed in the intermediate language and the mora (syllable) position of an accent nucleus from the accent symbol. The “breath group” means a unit of utterance made in a breath. The “accent nucleus” means the position at which the accent falls. A word with the accent nucleus at the first mora is called an “accent type one word”, a word with the accent nucleus at the n-th mora is called an “accent type n word”, and such words are generally called “accent type uneven words”. Conversely, a word with no accent nucleus, such as “shinbun” or “pasocon”, is called an “accent type 0” or “accent type flat” word. The information about such prosody is transmitted to a pitch contour determination unit 202, a phoneme duration determination unit 203, a phoneme power determination unit 204, a voice segment determination unit 205, and a sound quality coefficient determination unit 206, respectively.
The pitch contour determination unit 202 calculates pitch frequency changes in an accent or phrase unit from the prosody information in the intermediate language. A pitch control mechanism model specified by critically damped second-order linear systems, known as the “Fujisaki model”, has been used. According to the pitch control mechanism model, the fundamental frequency, which determines the pitch, is generated as follows. The frequency of glottal oscillation, or fundamental frequency, is controlled by an impulse command issued every time a phrase is switched and a step command issued whenever the accent goes up or down. The response to the impulse command becomes a gently falling curve from the head to the tail of a sentence (phrase component) because of a delay in the physiological mechanism. The response to the step command becomes a locally very uneven curve (accent component). These components are modeled as responses of the critically damped second-order linear systems. The logarithmic fundamental frequency changes are expressed as the sum of these components (hereinafter “intonation component”).
FIG. 17 shows the pitch control mechanism model. The log-fundamental frequency, lnFo(t), wherein t is the time, is formulated as follows.
$$\ln F_o(t) = \ln F_{\min} + \sum_{i=1}^{I} A_{pi}\, G_{pi}(t - T_{oi}) + \sum_{j=1}^{J} A_{aj} \left\{ G_{aj}(t - T_{1j}) - G_{aj}(t - T_{2j}) \right\} \qquad (1)$$

wherein Fmin is the minimum frequency (hereinafter “base pitch”), I is the number of phrase commands in the sentence, Api is the amplitude of the i-th phrase command, Toi is the start time of the i-th phrase command, J is the number of accent commands in the sentence, Aaj is the amplitude of the j-th accent command, and T1j and T2j are the start and end times of the j-th accent command, respectively. Gpi(t) and Gaj(t) are the impulse response function of the phrase control mechanism and the step response function of the accent control mechanism, respectively, and are given by the following equations.

$$G_{pi}(t) = \alpha_i^{2}\, t \exp(-\alpha_i t) \qquad (2)$$

$$G_{aj}(t) = \min\left[\, 1 - (1 + \beta_j t)\exp(-\beta_j t),\ \theta \,\right] \qquad (3)$$

The above equations are the response functions at t≧0. If t<0, then Gpi(t)=Gaj(t)=0.
In Equation (3), the symbol min[x, y] means that the smaller of x and y is taken, which corresponds to the fact that the accent component of a voice reaches the upper limit in a finite time. αi is the natural angular frequency of the phrase control mechanism for the i-th phrase command and, for example, set at 3.0. βj is the natural angular frequency of the accent control mechanism for the j-th accent command and, for example, set at 20.0. θ is the upper limit of the accent component and, for example, set at 0.9.
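Equations (1) through (3) can be evaluated directly. The following is a minimal sketch, assuming the example values given above (α = 3.0, β = 20.0, θ = 0.9) apply to every command and taking both response functions as zero for t < 0, as in the standard formulation of the model. The function names and the tuple layout of the command lists are hypothetical.

```python
import math

ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (rad/sec)
BETA = 20.0   # natural angular frequency of the accent control mechanism (rad/sec)
THETA = 0.9   # upper limit of the accent component

def g_phrase(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism, Equation (2)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def g_accent(t, beta=BETA, theta=THETA):
    """Step response of the accent control mechanism, Equation (3)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), theta)

def ln_f0(t, f_min, phrase_cmds, accent_cmds):
    """Log fundamental frequency at time t (sec), Equation (1).

    phrase_cmds: list of (A_p, T_0) pairs.
    accent_cmds: list of (A_a, T_1, T_2) triples.
    """
    value = math.log(f_min)
    for a_p, t0 in phrase_cmds:
        value += a_p * g_phrase(t - t0)
    for a_a, t1, t2 in accent_cmds:
        value += a_a * (g_accent(t - t1) - g_accent(t - t2))
    return value
```

Calling `math.exp(ln_f0(t, ...))` over a grid of times, for instance one value per synthesis frame, yields a pitch contour in Hz.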
The units of the fundamental frequency and pitch control parameters, Api, Aaj, Toi, T1j, T2j, αi, βj, and Fmin, are defined as follows. The unit of Fo(t) and Fmin is Hz, the unit of Toi, T1j, and T2j is sec, and the unit of αi and βj is rad/sec. The unit of Api and Aaj is derived from the above units of the fundamental frequency and pitch control parameters.
The pitch contour determination unit 202 determines the pitch control parameters from the intermediate language. For example, the start time of a phrase command, Toi, is set at the position of a punctuation mark in the intermediate language, the start time of an accent command, T1j, is set immediately after the word boundary symbol, and the end time of the accent command, T2j, is set at either the position of the accent symbol or, for an accent type flat word with no accent symbol, immediately before the word boundary symbol. The amplitudes of phrase and accent commands, Api and Aaj, are determined in most cases by statistical analysis such as Quantification theory (type one), which is well known, so its description is omitted.
FIG. 18 shows the pitch contour generation process. The analysis result generated by the intermediate language analysis unit 201 is sent to a control factor setting section 501, where control factors required to predict the amplitudes of phrase and accent components are set. The information necessary for phrase component prediction, such as the number of moras in the phrase, the position within the sentence, and the accent type of the leading word, is sent to a phrase component estimation section 503. The information necessary for accent component prediction, such as the accent type of the accented phrase, the number of moras, the part of speech, and the position in the phrase, is sent to an accent component estimation section 502. The prediction of respective component values uses a prediction table 506 that has been trained by using statistical analysis, such as Quantification theory (type one), based on the natural utterance data.
The predicted results are sent to a pitch contour correction section 504, in which the estimated values Api and Aaj are corrected when the user designates the intonation. This control function is used to emphasize or suppress a word in the sentence. Usually, the intonation is controlled at three to five levels by multiplying the estimated values by a constant predetermined for each level. Where there is no intonation designation, no correction is made.
After both the phrase and accent component values are corrected, they are sent to a base pitch addition section 505 to generate a sequence of data according to Equation (1). Based on the user's pitch designation, the data for the designated level is retrieved as the base pitch from a base pitch table 507 and added. The logarithmic base pitch, lnFmin, represents the minimum pitch of a synthetic voice and is used to control the pitch of the voice. Usually, lnFmin is quantized at five to 10 levels and stored in the table. It is raised when the user desires an overall high-pitched voice and, conversely, lowered when a low-pitched voice is desired.
The base pitch table 507 is divided into two sections: one for men's voices and the other for women's voices. Based on the user's speaker designation, the section from which the base pitch is retrieved is selected. Usually, men's voices are quantized at pitch levels between 3.0 and 4.0 while women's voices are quantized at pitch levels between 4.0 and 5.0.
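The table lookup can be sketched as follows. This is an illustration under stated assumptions: the levels are read as values of lnFmin, each section holds five evenly spaced levels (the text gives only the ranges, not the spacing or count per speaker), and the names `build_levels`, `BASE_PITCH_TABLE`, `base_pitch`, and `base_pitch_hz` are hypothetical.

```python
import math

def build_levels(lo, hi, n):
    """Quantize the range [lo, hi] into n evenly spaced levels (assumed spacing)."""
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]

# Two sections, one per speaker type, holding logarithmic base pitch values.
BASE_PITCH_TABLE = {
    "male": build_levels(3.0, 4.0, 5),
    "female": build_levels(4.0, 5.0, 5),
}

def base_pitch(speaker, level):
    """Return the logarithmic base pitch for the designated speaker and level."""
    return BASE_PITCH_TABLE[speaker][level]

def base_pitch_hz(speaker, level):
    """Return the same base pitch converted to Hz."""
    return math.exp(base_pitch(speaker, level))
```

The value returned by `base_pitch` corresponds to the lnFmin term that the base pitch addition section adds in Equation (1).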
The phoneme duration control will be described. The phoneme duration determination unit 203 determines the phoneme length and the pause length from the phoneme character string and the prosodic symbols. The “pause length” means the length between phrases or sentences. The phoneme length covers the length of the consonant and/or vowel that constitute a syllable and the length of the silent closure interval that occurs immediately before a plosive phoneme such as p, t, or k. The phoneme duration and pause length are collectively called the “duration length”. The phoneme duration is determined by statistical analysis, such as Quantification theory (type one), based on the kind of phonemes adjacent to the target phoneme or the syllable position in the word or breath group. The pause length is determined by statistical analysis, such as Quantification theory (type one), based on the number of moras in adjacent phrases. Where the user designates the utterance speed, the phoneme duration is adjusted accordingly. Usually, the utterance speed is controlled at five to 10 levels by multiplying the duration by a constant predetermined for each level. When slow utterance is desired, the phoneme duration is lengthened; for a high utterance speed, it is shortened. The phoneme duration control is the subject matter of this application and will be described later.
The phoneme power determination unit 204 calculates the waveform amplitudes of individual phonemes from a phoneme character string. The waveform amplitudes are determined empirically from the kind of phoneme, such as a, i, u, e, or o, and the syllable position in the breath group. The power transition within the syllable is also determined, from the rising period in which the amplitude gradually increases, through the stationary-state period, to the falling period in which the amplitude decreases. The power control is made by using a coefficient table. When the user designates the intensity, the amplitude is adjusted accordingly. The intensity is usually controlled at 10 levels by multiplying the amplitude by a constant predetermined for each level.
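The rising, stationary, and falling periods can be pictured as a trapezoidal amplitude envelope. The sketch below assumes linear ramps and a frame-based representation; the text does not specify the shape of the transitions, and the function name `power_envelope` and its parameters are hypothetical.

```python
def power_envelope(n_frames, rise, fall, peak):
    """Return one amplitude coefficient per frame for a syllable.

    The amplitude ramps up linearly over `rise` frames, stays at `peak`
    during the stationary-state period, and ramps down over `fall` frames.
    Linear ramps are an assumption made for illustration.
    """
    if rise + fall > n_frames:
        raise ValueError("rise and fall periods exceed the syllable length")
    envelope = []
    for k in range(n_frames):
        if k < rise:
            envelope.append(peak * (k + 1) / (rise + 1))
        elif k >= n_frames - fall:
            envelope.append(peak * (n_frames - k) / (fall + 1))
        else:
            envelope.append(peak)
    return envelope
```

A user intensity designation would then amount to scaling `peak` by the constant assigned to the designated level.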
The voice segment determination unit 205 determines the addresses, within the voice segment dictionary 105, of the voice segments required to express a phoneme character string. The voice segment dictionary 105 contains voice segments of a plurality of speakers, including both men and women, and the address of a voice segment is determined according to the user's speaker designation. The voice segment data in the dictionary 105 is composed of various units corresponding to the adjacent phoneme environment, such as CV or VCV, so that the optimum synthesis unit is selected from the phoneme character string of an input text.
The sound quality determination unit 206 determines the conversion parameter when the user makes a sound quality conversion designation. The “sound quality conversion” means signal processing applied to the voice segment data stored in the dictionary 105 so that it can be treated as the voice segment data of another speaker. Generally, it is achieved by linearly expanding or compressing the voice segment data. The expansion is performed by oversampling the voice segment data, resulting in a deep voice. Conversely, the compression is performed by downsampling the voice segment data, resulting in a thin voice. The sound quality conversion is usually controlled at five to 10 levels, each of which is assigned a re-sampling rate.
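Linear expansion and compression of voice segment data can be sketched with linear interpolation. This is only an illustration: the text does not say which interpolation method is used, and the function name `resample_linear` and the `factor` parameter (standing in for the re-sampling rate of a level) are hypothetical.

```python
def resample_linear(samples, factor):
    """Resample `samples` to about len(samples) * factor points.

    factor > 1 expands the segment (a deeper voice when played back at the
    original sampling rate); factor < 1 compresses it (a thinner voice).
    Linear interpolation is an assumption made for illustration.
    """
    n_in = len(samples)
    n_out = max(2, round(n_in * factor))
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)   # fractional index into the input
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, n_in - 1)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out
```

For instance, `resample_linear(segment, 1.2)` stretches a segment by roughly 20 percent, which lowers its spectral envelope when the result is played at the unchanged output sampling rate.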
The pitch contour, phoneme power, phoneme duration, voice segment address, and expansion/compression parameters are sent to the synthesis parameter generation unit 207 to provide a synthesis parameter. The synthesis parameter is used to generate a waveform in a frame unit of 8 ms, for example, and sent to the waveform (speech) generation module 103.
FIG. 19 shows the speech generation process. A voice segment decoder 301 loads voice segment data from the voice segment dictionary 105 with a voice segment address of the synthesis parameter as a reference pointer and, if necessary, processes the signal. If the dictionary 105, which contains voice segment data for voice synthesis, has been compressed, a decoding process is applied to the data. The decoded voice segment data is multiplied by an amplitude coefficient in an amplitude controller 302 for power control. The expansion/compression of a voice segment is performed in a voice segment processor 303 for voice conversion. When a deep voice is desired, the voice segment is expanded and, when a thin voice is desired, the voice segment is compressed. In a superimposition controller 304, superimposition of the segment data is controlled according to information such as the pitch contour and phoneme duration to generate a synthetic waveform. The superimposed data is written sequentially into a digital/analog (D/A) ring buffer 305 and transferred to a D/A converter at the output sampling cycle for output from a speaker.
FIG. 20 shows the phoneme duration determination process. The intermediate language analysis unit 201 feeds the analysis result into a control factor setting section 601, where the control factors required to predict the duration length of each phoneme or word are set. The prediction uses pieces of information such as the phoneme, the kind of adjacent phonemes, the number of moras in the phrase, and the position in the sentence, which are sent to a duration estimation section 602. The prediction of each duration value uses a duration prediction table 604 that has been trained by statistical analysis, such as Quantification theory (type one), based on natural utterance data. The predicted result is sent to a duration correcting section 603, which corrects the predicted value when the user designates the utterance speed. The utterance speed is controlled at five to 10 levels by multiplying the duration by a constant predetermined for each level. When a low utterance speed is desired, the phoneme duration is increased and, when a high utterance speed is desired, the phoneme duration is decreased. Suppose that there are five utterance speed levels and that Level 0 to Level 4 may be designated. A constant Tn for Level n is set as follows:
T0=2.0, T1=1.5, T2=1.0, T3=0.75, and T4=0.5
Among the predicted phoneme durations, the vowel and pause lengths are multiplied by the constant Tn for the level n that is designated by the user. For Level 0, they are multiplied by 2.0 so that the generated waveform is lengthened and the utterance speed is lowered. For Level 4, they are multiplied by 0.5 so that the generated waveform is shortened and the utterance speed is raised. In the above example, Level 2 is the normal utterance speed (default).
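The duration correction for a designated level can be sketched as below, using the five constants given above. The representation of a predicted result as a list of (kind, duration) pairs and the name `apply_speed` are assumptions made for illustration.

```python
# Constants for Level 0 to Level 4, as given in the text.
SPEED_CONSTANTS = [2.0, 1.5, 1.0, 0.75, 0.5]

def apply_speed(units, level):
    """Scale predicted durations for the designated utterance speed level.

    units: list of (kind, duration_ms) pairs, where kind is one of
    "vowel", "consonant", "closure", or "pause" (hypothetical labels).
    Only vowel and pause lengths are multiplied by the level constant.
    """
    t_n = SPEED_CONSTANTS[level]
    scaled = []
    for kind, duration in units:
        if kind in ("vowel", "pause"):
            duration = duration * t_n
        scaled.append((kind, duration))
    return scaled
```

With Level 4, a 100 ms vowel becomes 50 ms while a 40 ms consonant is left unchanged, so the overall speed-up is less than a factor of two for the utterance as a whole.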
FIG. 21 shows synthetic waveforms to which the utterance speed control has been applied. The utterance speed control of the phoneme duration is applied only to the vowel. The length of the closure interval or of a consonant is considered almost constant regardless of the utterance speed. In Graph (a), at a high utterance speed, only the vowel length is multiplied by 0.5, and superimposed voice segments are thinned out to make the waveform. Conversely, in Graph (c), at a low utterance speed, only the vowel length is multiplied by 1.5, and superimposed voice segments are repeated to make the waveform. The pause length is multiplied by the constant for the designated level, so that the lower the utterance speed, the longer the pause, and the higher the utterance speed, the shorter the pause.
Let us consider the case of a high utterance speed, which corresponds to Level 4 in the above example. In the text-to-speech conversion system, reading at the maximum utterance speed is called the “Fast Reading Function (FRF)”. A text contains portions that are important to the user and portions that are not so important, so the not-so important portions are read at a high utterance speed and the important portions are read at the normal utterance speed. Most of the latest models have such an FRF button. When this button is held down, the utterance speed is set at the maximum level for synthesizing speech at the highest utterance speed and, when the button is released, the utterance speed is returned to the previous level.
The above technology, however, has the following disadvantages.
(A) When FRF is turned on, merely the phoneme duration is decreased. In other words, the length of the generated waveform is reduced, so that an additional load is applied to the speech generation module. In the speech generation module, the speech data generated by waveform superimposition is written sequentially into the D/A ring buffer. Consequently, if the waveform is short, the time available for waveform generation also becomes short. When the waveform data length is halved, the processing time must be halved as well. However, halving the phoneme duration does not necessarily halve the amount of calculation, so the “voice interruption” phenomenon, in which the synthetic voice stops before completion, can occur when the waveform generation cannot keep up with the transfer to the D/A converter.
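The timing argument in (A) can be made concrete with a small simulation. It is a simplified sketch under stated assumptions: generating one phoneme costs a fixed overhead plus a per-sample cost (the fixed part models calculation that does not shrink with the duration), the buffer capacity is ignored, and playback starts as soon as the first phoneme is ready. All names and numbers are hypothetical; times are in microseconds.

```python
def count_interruptions(phoneme_samples, fixed_cost, per_sample_cost, sample_period):
    """Count how often playback stalls because waveform data is not ready.

    phoneme_samples: number of output samples for each phoneme, in order.
    fixed_cost: generation overhead per phoneme (assumed, microseconds).
    per_sample_cost: generation cost per output sample (microseconds).
    sample_period: D/A output sampling cycle (microseconds).
    """
    gen_done = 0      # time at which generation of the current phoneme ends
    play_pos = None   # time at which the D/A converter needs the next phoneme
    stalls = 0
    for n in phoneme_samples:
        gen_done += fixed_cost + per_sample_cost * n
        if play_pos is None:
            play_pos = gen_done          # playback starts with the first phoneme
        elif gen_done > play_pos:
            stalls += 1                  # data arrived late: voice interruption
            play_pos = gen_done
        play_pos += n * sample_period    # time taken to play this phoneme
    return stalls
```

With an 8 kHz output (125 µs per sample), a 40 ms fixed overhead, and 50 µs per sample, phonemes of 800 samples are generated in 80 ms and played in 100 ms, so no stall occurs. Halving them to 400 samples gives 60 ms of generation for 50 ms of playback, and every phoneme after the first arrives late.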
(B) Also, the pitch contour is compressed linearly. That is, the intonation changes at shorter cycles, and the synthetic voice is so unnatural that it is hard to understand. FRF is used to read the text fast rather than to skip it, so a synthetic voice with very uneven intonation is not suitable. The intonation of speech synthesized with FRF changes so violently that the speech is difficult to understand.
(C) In addition, the pause between sentences is compressed with the same rate as the rate for the phoneme duration so that the boundary between sentences becomes too vague to distinguish. Synthetic speeches are outputted rapidly one after another so that the speeches synthesized with FRF are not suitable for understanding the text contents.
(D) Moreover, the utterance speed becomes high over the entire text, so that it is difficult to time the release of FRF. Ordinarily, FRF is used to read the not-so important portion at high speed and to synthesize speech at the normal speed for the important portion of a text. By the time the user releases the FRF button, a considerable part of the desired portion has already been read. This makes it necessary to reset the reading section before starting speech synthesis at the normal utterance speed. In order to turn FRF on or off, the user must make a great effort to sort out the necessary portion from the unnecessary one by listening to the unclear speech.
Accordingly, it is an object of the invention to provide a method of controlling the fast reading function (FRF) in a text-to-speech conversion system capable of solving the above problems (A) through (D).
In order to solve the problem (A), according to an aspect of the invention, when the utterance speed is designated at the maximum speed or FRF is turned on, the phoneme duration and the pitch contour are determined in the phoneme duration and pitch contour determination units, respectively, of the prosody generation module by replacing the duration prediction table obtained by statistical analysis with a duration rule table that has been found from experience, and a sound quality conversion coefficient that keeps the sound quality unchanged is selected in the sound quality determination unit.
In order to solve the problem (B), according to another aspect of the invention, when the utterance speed is designated at the maximum speed, neither the calculation of the accent and phrase components nor a change of the base pitch is made.
In order to solve the problem (C), according to still another aspect of the invention, when the utterance speed is designated at the maximum speed, a signal sound is inserted between sentences.
In order to solve the problem (D), according to yet another aspect of the invention, when the utterance speed is designated at the maximum speed, at least the leading word of a sentence is read at the normal utterance speed.
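The last two aspects can be pictured with a rough sketch. It is not the claimed implementation: it assumes each sentence is given as a list of (word, duration) pairs, scales whole-word durations rather than only vowels and pauses, reuses the related-art constants 1.0 and 0.5 for the normal and maximum speeds, and uses a hypothetical 100 ms marker for the signal sound. All names are hypothetical.

```python
NORMAL = 1.0        # constant for the normal utterance speed (Level 2)
FAST = 0.5          # constant for the maximum utterance speed (Level 4)
SIGNAL_MS = 100.0   # hypothetical length of the signal sound

def fast_read_plan(sentences, signal="<beep>"):
    """Build a playback plan for fast reading.

    sentences: list of sentences, each a list of (word, duration_ms) pairs.
    Returns a flat list of (item, duration_ms) pairs in which a signal sound
    separates sentences and the leading word of each sentence keeps its
    normal-speed duration.
    """
    plan = []
    for index, words in enumerate(sentences):
        if index > 0:
            plan.append((signal, SIGNAL_MS))   # marks the sentence boundary
        for position, (word, duration) in enumerate(words):
            scale = NORMAL if position == 0 else FAST
            plan.append((word, duration * scale))
    return plan
```

In this sketch the unscaled leading word gives the listener a cue for deciding whether to release FRF, and the inserted marker keeps sentence boundaries audible even when pauses are shortened.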