The present invention relates to text-to-speech conversion technology, more particularly to a method of intonation control in synthesized speech.
Text-to-speech conversion is a technology that converts ordinary text, of the type that people read every day, to spoken words, and outputs a speech signal. Because of its unlimited output vocabulary, this technology has potential uses in many fields, as a replacement for pre-recorded speech synthesis.
A typical speech synthesis system of the text-to-speech type has the structure shown in FIG. 1. The input is a machine-readable form of ordinary text. A text analyzer 101 analyzes the input text and generates a sequence of phonetic and prosodic symbols that use predefined character strings (referred to below as an intermediate language) to indicate pronunciation, accent, intonation, and other information. Incidentally, the illustrated system processes Japanese text, and the accent referred to herein is a pitch accent.
To generate the intermediate-language representation, the text analyzer 101 carries out linguistic processing such as morphemic analysis and semantic analysis, referring to a word dictionary 104 that gives the pronunciation, accent, and other information about each word. The resulting intermediate-language representation is processed by a parameter generator 102 to determine various synthesis parameters. These parameters include patterns of speech elements (sound types), phonation times (sound durations), phonation power (intensity of sound), fundamental frequency (voice pitch), and the like. The synthesis parameters are sent to a waveform generator 103, which generates synthesized speech waveforms by referring to a speech-element dictionary 105. The speech-element dictionary 105 is, for example, a read-only memory (ROM) storing speech elements and other information. The stored speech elements are the basic units of speech from which waveforms are synthesized. There are many types of speech elements, corresponding to different sounds, for example. The synthesized waveforms are reproduced through a loudspeaker and heard as synthesized speech.
The internal structure of the parameter generator 102 is shown in FIG. 2. The input intermediate language representation comprises phonetic character sequences accompanied by prosodic information such as accent position, positions of pauses, and so on. The parameters determined from this information include the time variations in pitch (referred to below as the pitch pattern), phonation power, the phonation time of each phoneme, the addresses of speech elements stored in the speech-element dictionary, and other parameters (referred to below as synthesis parameters) needed for synthesizing speech waveforms.
In the parameter generator 102, an intermediate language analyzer (ILA) 201 analyzes the input intermediate language, identifies word boundaries from word-delimiting symbols and breath-group symbols, and analyzes the accent symbols to find the moraic position of the accent nucleus of each word. A breath group is a unit of text that is spoken in one breath. A mora, in Japanese, is a short syllable or part of a long syllable. A voiced mora includes one vowel phoneme or the nasal /n/ phoneme. The accent nucleus, in Japanese, is the position where the pitch drops sharply. A word with an accent nucleus in the first mora is said to have a type-one accent. A word with an accent nucleus in the n-th mora is said to have a type-n accent (n being an integer greater than one), and these words are said to have a rising-and-falling accent. Words with no accent nucleus are said to have a type-zero accent or a flat accent; examples include the Japanese words 'shimbun' (newspaper) and 'pasokon' (personal computer).
A pitch pattern generator 202 calculates the pitch frequency of each voiced mora from the prosodic information in the intermediate language. In conventional Japanese text-to-speech conversion, pitch patterns are controlled by estimating the pitch frequency at the center of the vowel (or nasal /n/) in the mora, and using linear interpolation or spline interpolation between these positions; this technique is referred to as point-pitch modeling. Central vowel pitches are estimated by well-known statistical techniques such as Chikio Hayashi's first quantification method. Control factors include, for example, the accent type of the word to which the vowel belongs, the position of the mora relative to the start of the word, the position of the mora within the breath group, and the phonemic type of the mora. The collection of estimated vowel-centered pitches will be referred to below as the point pitch pattern, while the entire pattern generated by interpolation will be referred to simply as the pitch pattern. The pitch pattern is calculated on the basis of the phonation time of each phoneme as determined by a phonation time generator 203, described below. If the user has specified a desired intonation level or a desired voice pitch, corresponding processing is carried out. Voice pitch is typically specifiable on about five to ten levels, for each of which a predetermined constant is added to the calculated pitch values. Intonation is typically specifiable on three to five levels, for each of which the calculated pitch values are partly multiplied by a predetermined constant. These control features are provided to enable specific words in a sentence to be emphasized or de-emphasized. Further information will be given later, as these are the features with which the present invention is concerned.
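By way of illustration only, the linear-interpolation form of point-pitch modeling might be sketched as follows. The function name `interpolate_pitch`, the use of milliseconds for time, and the choice to hold the pitch flat before the first and after the last point pitch are assumptions made for this sketch; they are not taken from the system described above.

```python
from bisect import bisect_right


def interpolate_pitch(point_times, point_pitches, t):
    """Return the pitch (Hz) at time t by linear interpolation between
    point pitches located at the vowel centers.

    point_times must be strictly increasing. Holding the pitch flat
    outside the first and last points is an assumption of this sketch.
    """
    if not point_times or len(point_times) != len(point_pitches):
        raise ValueError("need equally long, non-empty time and pitch lists")
    if t <= point_times[0]:
        return point_pitches[0]
    if t >= point_times[-1]:
        return point_pitches[-1]
    i = bisect_right(point_times, t)  # index of the first point later than t
    t0, t1 = point_times[i - 1], point_times[i]
    p0, p1 = point_pitches[i - 1], point_pitches[i]
    return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
```

Evaluating such a function once per waveform frame would yield the pitch pattern from the point pitch pattern; spline interpolation would replace only the final line.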
The phonation time generator 203 determines the length of each phoneme from the phonetic character sequences and prosodic symbols. Common methods of determining the phonation time include statistical techniques such as the above-mentioned quantification method, using the preceding and following phoneme types, or moraic position within the word or breath group. If the user has specified a desired speech speed, the phonation times are expanded or contracted accordingly. Speech speed can typically be specified on about five to ten levels; the calculated phonation times are multiplied by a predetermined constant for each level. Specifically, the phonation times are lengthened to slow down the speech, and shortened to speed up the speech.
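The speed control just described amounts to a table lookup followed by a multiplication, which might be sketched as below. The function name `scale_phonation_times` and the table `level_factors` are hypothetical; the actual constants assigned to each level are not given in this description.

```python
def scale_phonation_times(durations_ms, speed_level, level_factors):
    """Multiply every phonation time by the constant for the chosen level.

    level_factors is a hypothetical table mapping a speed level to its
    predetermined constant: above 1.0 lengthens the phonation times
    (slower speech), below 1.0 shortens them (faster speech).
    """
    factor = level_factors[speed_level]
    return [d * factor for d in durations_ms]
```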
A phonation power generator 204 calculates the amplitude of the waveform of each phoneme from the phonetic character sequences. The waveform amplitude values are determined empirically from factors such as the phoneme type (/a, e, i, o, u/, for example) and moraic position in the breath group. The phonation power generator 204 also determines the power transitions within each mora: the initial interval in which the amplitude value gradually increases, the steady-state interval that follows, and the final interval in which the amplitude value gradually decreases. Tables of numerical values are usually used to carry out this power control. If the user has specified a desired voice volume level, the amplitude values are increased or decreased accordingly. Voice volume can typically be specified on about ten levels. The amplitude values are multiplied by a predetermined constant for each level.
A speech element selector 205 determines the addresses in the speech-element dictionary 105 of the speech elements needed for expressing the phonetic character sequences. The speech elements stored in the speech-element dictionary 105 include elements derived from several types of voices, normally including at least one male voice and at least one female voice. The user specifies a desired voice type, and the speech element addresses are determined accordingly.
The pitch pattern, phonation powers, phonation times, and speech element addresses determined as described above are supplied to a synthesis parameter generator (SPG) 206, which generates the synthesis parameters. The synthesis parameters describe waveform frames with a typical length of about eight milliseconds (8 ms). The synthesis parameters are sent to the waveform generator 103.
The conventional techniques for controlling the intonation of a pitch pattern will now be described in more detail, with reference to the functional block diagram of the pitch pattern generator 202 shown in FIG. 3.
The intermediate language analyzer 201 supplies phonetic symbol sequences and prosodic symbols to a pitch estimator 301, which estimates the central vowel pitch of each voiced mora. The pitch is estimated by statistical methods, such as Hayashi's first quantification method, on the basis of natural speech data, using a pre-trained prediction table 302. The point pitch pattern determined by the pitch estimator 301 is passed to a switching unit 303. If the user has not designated an intonation level, the switching unit 303 passes the point pitch pattern directly to a pitch-pattern interpolator 307. If the user has designated an intonation level, the point pitch pattern is passed to a minimum pitch finder 304. The minimum pitch finder 304 processes each word by finding the minimum central vowel pitch or point pitch in the word. An accent component calculator 305 calculates the difference between each point pitch and the minimum pitch (this difference is the accent component). A pitch modifier 306 then multiplies the accent component values by a coefficient determined according to the intonation level designated by the user, thereby modifying the point pitch pattern, and the modified pattern is supplied to the pitch-pattern interpolator 307. The pitch-pattern interpolator 307 carries out linear interpolation or spline interpolation, using the supplied point pitch pattern and the phonation times calculated by the phonation time generator 203, and sends the results to the synthesis parameter generator 206. If the user has specified a desired voice pitch, a corresponding constant is added to or subtracted from the point pitch values determined by the pitch estimator 301, although this is not indicated in the drawing.
Conventional pitch-pattern intonation control is illustrated in FIG. 4. The vertical axis represents pitch frequency in hertz (Hz); the horizontal axis represents time, with boundaries between phonemes indicated by vertical dashed lines. The illustrated example is for an utterance of the Japanese phrase 'onsei shori' (meaning 'speech processing'). The black dots joined by thick lines are the point pitch pattern estimated by statistical techniques. Also indicated are modified point pitch patterns in which the user has specified intonation levels of x1.5 (white squares) and x0.5 (white dots). The prior art begins by searching for the minimum estimated pitch, which occurs in the vowel /i/ in the final mora 'ri'. This estimated pitch will be denoted 'min' below. Next, taking the /n/ phoneme for example, its pitch (A) relative to the minimum pitch is calculated. The pitch values (B) for x0.5 intonation and (C) for x1.5 intonation are then calculated from A as follows, an asterisk being used to indicate multiplication.

$$B = (A * 0.5) + \mathrm{min} \tag{1}$$

$$C = (A * 1.5) + \mathrm{min} \tag{2}$$
The other point pitches are modified in the same way, working from the first mora to the last, to carry out intonation control.
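The conventional procedure described above might be sketched as follows, purely for illustration. The function name `conventional_intonation` is hypothetical, and the pitch values in the usage note are invented for the example; they are not read from FIG. 4.

```python
def conventional_intonation(point_pitches, coefficient):
    """Scale each accent component (point pitch minus the word's minimum
    point pitch) by the intonation coefficient, then add the minimum back."""
    lowest = min(point_pitches)
    return [(p - lowest) * coefficient + lowest for p in point_pitches]
```

For a hypothetical word with point pitches of 100, 140, 120, and 80 Hz, a coefficient of 1.5 gives 110, 170, 140, and 80 Hz, and a coefficient of 0.5 gives 90, 110, 100, and 80 Hz. Only the minimum pitch is left where it was.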
One problem with the prior art of intonation control as described above is that, although the purpose is only to control intonation, the control process also raises or lowers the voice pitch. A comparison of the three pitch patterns in FIG. 4 makes it clear that the average pitch of the spoken phrase is raised in the x1.5 intonation pattern, and lowered in the x0.5 intonation pattern. When intonation control is designated only for selected words in a sentence, these words will be uttered at a higher or lower pitch than other words in the same sentence, destroying the balance of the synthesized speech in an extremely annoying manner.
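The shift in average pitch can be verified with a short derivation. The symbols are introduced here only for this purpose: p_i denotes the i-th of N point pitches in a word, m their minimum, k the intonation coefficient, and a bar denotes the average over the word.

```latex
\begin{align*}
p_i' &= m + k\,(p_i - m), \qquad m = \min_i p_i \\
\bar{p}' &= \frac{1}{N}\sum_{i=1}^{N} p_i' = m + k\,(\bar{p} - m) \\
\bar{p}' - \bar{p} &= (k - 1)\,(\bar{p} - m)
\end{align*}
```

Since the average is never below the minimum, the average pitch rises whenever k exceeds one and falls whenever k is less than one, unless all point pitches in the word are equal.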
Similarly, if a strong intonation level is specified for an entire sentence, or an entire text, this simultaneously raises the voice pitch, and if a weak intonation level is specified, the voice pitch is lowered. Consequently, the synthesized speech does not have the desired voice pitch.
A further problem is illustrated in FIG. 5, which shows point pitch patterns for each accent type in a word with five morae. Pitch frequency is indicated on the vertical axis; moraic position is indicated on the horizontal axis, the first mora being numbered zero (0). Reference characters from 401 to 405 designate accent types one to five, respectively. The type-five accent pattern 405, which lacks an accent nucleus, may also be treated as a type-zero accent pattern. More generally, in a word with n morae and a type-n or type-zero accent, the pitch does not fall steeply at any point. We shall focus here on a word with a type-zero accent. A basic feature of the type-zero accent is that the first mora is low in pitch and the second mora is high, but if the second mora represents a dependent sound, there is a strong tendency for the first mora and second mora to be pronounced together with a comparatively flat intonation, as if they were a single mora, forcing the pitch of the first mora to be relatively high. In Japanese, this occurs when the second mora is a dependent vowel, the second part of a long vowel, or the nasal /n/ phoneme.
The prior art operates on the difference between each point pitch and the minimum pitch. When a word with a type-zero accent has one of the properties described above, the minimum pitch is the pitch of the first mora, which is pulled up by the second mora, so that the entire word is in a sustained high-pitch state and the accent is not accurately delineated. Adequate intonation control of such words is not achieved in the prior art. A user seeking to emphasize or de-emphasize these words by intonation control finds his or her efforts frustrated; hardly any perceptible intonation change can be produced.
Yet another problem is that the final pitch of the last word in a sentence tends to be much lower than the other pitches in the same sentence. When intonation control is carried out on this last word, since its minimum pitch occurs in the last mora, the differences between other pitches and this minimum pitch are extremely large. Accordingly, if the intonation level is raised, the pitch tends to become extremely high near the beginning of the word, causing the word to be uttered with an unnatural squeak.
A further problem is that the speech-element dictionary is normally created from speech data derived from meaningless words spoken in a monotone. This approach yields excellent clarity when the pitch of the synthesized speech is close to the monotone pitch, but as the pitch of the synthesized speech departs from that pitch, the synthesized words become increasingly distorted. Conventional intonation control makes the same type of modifications regardless of the general pitch level of the word being modified. If the general pitch level is high to begin with, and the intonation level is increased, the high pitches become still higher, leading to objectionable distortion and unnatural synthesized speech.
A first object of the present invention is to control the intonation of the last word in a sentence without producing extremely high pitches near the beginning of this last word.
A second object is to enable accurate intonation control to be carried out on all words, regardless of their accent type.
A third object is to carry out intonation control while maintaining a substantially invariant average pitch.
A fourth object is to carry out intonation control while staying close enough to a natural speaking pitch to avoid excessive distortion of synthesized speech sounds.
The invention provides a method of controlling the intonation of synthesized speech according to a designated intonation level, and text-to-speech conversion apparatus employing the invented method.
According to a first aspect of the invention, the method includes the following steps:
obtaining an original point pitch pattern of a word to be synthesized;
constructing a pitch slope line from the first point pitch to the last point pitch in the original point pitch pattern;
modifying each intermediate point pitch in the original point pitch pattern by finding a temporally matching point on the pitch slope line and adjusting the distance of the intermediate point pitch from the temporally matching point according to the designated intonation level; and
synthesizing a speech signal from the modified point pitch pattern.
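The steps of the first aspect might be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the function name `slope_line_intonation` is hypothetical, and "adjusting the distance" is assumed here to mean multiplying the distance from the pitch slope line by a coefficient derived from the intonation level.

```python
def slope_line_intonation(point_times, point_pitches, coefficient):
    """Scale each point pitch's distance from the straight line joining
    the first and last point pitches of the word.

    Assumption of this sketch: the intonation level is applied as a
    multiplicative coefficient on that distance.
    """
    if not point_times or len(point_times) != len(point_pitches):
        raise ValueError("need equally long, non-empty time and pitch lists")
    t_first, t_last = point_times[0], point_times[-1]
    p_first, p_last = point_pitches[0], point_pitches[-1]
    modified = []
    for t, p in zip(point_times, point_pitches):
        if t_last == t_first:
            on_line = p_first  # single-point word: the line is one point
        else:
            # temporally matching point on the pitch slope line
            on_line = p_first + (p_last - p_first) * (t - t_first) / (t_last - t_first)
        modified.append(on_line + (p - on_line) * coefficient)
    return modified
```

Because the first and last point pitches lie on the slope line, their distance from it is zero and they are returned unchanged for any coefficient.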
This aspect of the invention achieves the first object stated above, and to some extent the fourth object. The first point pitch of each word is left unchanged, and other point pitches near the beginning of the word are not greatly increased.
According to a second aspect of the invention, the method includes the following steps:
obtaining an original point pitch pattern of a word to be synthesized;
generating a simplified pitch pattern by classifying each point pitch in the original point pitch pattern as high or low;
calculating a high pitch shift and a low pitch shift according to the designated intonation level;
adding the high pitch shift to each high point pitch in the original point pitch pattern, and adding the low pitch shift to each low point pitch in the original point pitch pattern, thereby obtaining a modified point pitch pattern; and
synthesizing a speech signal from the modified point pitch pattern.
In this aspect of the invention, the simplified pitch pattern may be generated according to the accent type of the word and the dependent or independent character of the second point pitch, thereby achieving the second object stated above.
The high and low pitch shifts may have equal magnitude and opposite sign, so that the third object is substantially achieved.
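A sketch of the second aspect with shifts of equal magnitude and opposite sign is given below for illustration. Both function names are hypothetical. The high/low classification by accent type follows the accent description given earlier (type zero: low then high; type one: high then low; type n: low, high through the n-th mora, then low) and is an assumption of this sketch; the special handling of a dependent second mora is not modeled, and the mapping from intonation level to `shift_hz` is left to the caller.

```python
def simplified_pattern(num_morae, accent_type):
    """Classify each mora as high (True) or low (False) from the accent
    type alone. Dependent second morae are NOT modeled in this sketch."""
    if accent_type == 0:
        return [i > 0 for i in range(num_morae)]   # low, then high throughout
    if accent_type == 1:
        return [i == 0 for i in range(num_morae)]  # high, then low throughout
    # type n: first mora low, morae 2..n high, remaining morae low
    return [0 < i < accent_type for i in range(num_morae)]


def shift_high_low(point_pitches, is_high, shift_hz):
    """Add +shift_hz to high point pitches and -shift_hz to low ones.
    A positive shift_hz strengthens the intonation; a negative one weakens it."""
    return [p + shift_hz if high else p - shift_hz
            for p, high in zip(point_pitches, is_high)]
```

With this choice the average pitch of a word is preserved exactly only when it has equally many high and low point pitches; otherwise it is preserved approximately.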
Alternatively, the high pitch shift may be set to zero when the word as a whole is high-pitched, and the low pitch shift may be set to zero when the word as a whole is low-pitched, thereby achieving the fourth object. Whether the word as a whole is high-pitched or low-pitched can be determined by comparing the maximum and minimum point pitches in the original point pitch pattern with a predetermined speech pitch.
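One possible reading of this alternative is sketched below. The comparison rule is an assumption of the sketch: the word is treated as high-pitched when its maximum point pitch lies farther above the predetermined speech pitch than its minimum lies below it, and as low-pitched in the opposite case. The function name `select_shifts` and the parameter `reference_pitch` are hypothetical.

```python
def select_shifts(point_pitches, shift_hz, reference_pitch):
    """Return (high_shift, low_shift) for a word.

    Assumed rule (not specified in the text): compare how far the maximum
    lies above reference_pitch with how far the minimum lies below it.
    """
    above = max(point_pitches) - reference_pitch
    below = reference_pitch - min(point_pitches)
    if above > below:
        return (0, -shift_hz)       # high-pitched word: do not move high pitches
    if above < below:
        return (shift_hz, 0)        # low-pitched word: do not move low pitches
    return (shift_hz, -shift_hz)    # balanced word: shift both ways
```

Under this rule a word that is already high is modified only by moving its low point pitches, which keeps its high point pitches from drifting farther from the reference pitch.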
According to a third aspect of the invention, the method includes the following steps:
obtaining an original point pitch pattern of a word to be synthesized;
designating an invariant pitch representing a typical pitch level of the synthesized speech;
calculating a constant value according to the invariant pitch;
modifying each point pitch in the original point pitch pattern according to the designated intonation level;
further modifying each point pitch by adding the calculated constant value; and
synthesizing a speech signal from the twice-modified point pitch pattern.
The constant value is calculated so that a point pitch having the invariant pitch in the original point pitch pattern also has the invariant pitch in the final modified point pitch pattern. The third object is thereby achieved. The first, third, and fourth objects are also achieved to some extent.
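The third aspect might be sketched as follows, assuming that the first modification multiplies each point pitch by a coefficient k derived from the intonation level. That multiplicative form, the function name `invariant_pitch_intonation`, and the parameter names are assumptions of this sketch. Under that assumption the required constant is the invariant pitch times (1 - k), since k times the invariant pitch plus this constant equals the invariant pitch.

```python
def invariant_pitch_intonation(point_pitches, coefficient, invariant_pitch):
    """Multiply each point pitch by the coefficient, then add a constant
    chosen so that a point pitch equal to invariant_pitch is unchanged.

    Assumption of this sketch: the intonation modification is a plain
    multiplication by `coefficient`.
    """
    constant = invariant_pitch * (1.0 - coefficient)
    return [p * coefficient + constant for p in point_pitches]
```

For hypothetical point pitches of 100, 140, 120, and 80 Hz with an invariant pitch of 120 Hz, a coefficient of 1.5 yields 90, 150, 120, and 60 Hz: the pattern spreads out around 120 Hz instead of rising as a whole.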