1. Field of the Invention
The present invention relates to a speech synthesis apparatus that synthesizes speech by rule, and in particular to a speech synthesis apparatus that improves control of phoneme duration when a vowel is devoiced, using a text-to-speech conversion technique that outputs as speech a sentence mixing Chinese characters (called Kanji) and the Japanese syllabary (Kana) as used in everyday reading and writing.
2. Description of the Related Art
According to the text-to-speech conversion technique, Kanji and Kana characters used in our daily reading and writing are input and are then converted into speech to be output. Using this technique, there is no limitation on vocabulary to be output. Thus, the text-to-speech conversion technique is expected to be applied to various technical fields as an alternative technique to recording-reproducing speech synthesis.
When Kanji and Kana characters used in our daily reading and writing are input to a conventional speech synthesis apparatus, a text analysis module included therein generates a string of phonetic and prosodic symbols (hereinafter, referred to as an intermediate language) from the character information. The intermediate language describes how to read the input sentence, accents, intonation and the like as a character string. A prosody generation module then determines synthesizing parameters from the intermediate language generated by the text analysis module. The synthesizing parameters include the pattern of phoneme, the duration of the phoneme and the fundamental frequency (pitch of voice, hereinafter simply referred to as pitch) and the like. The synthesizing parameters determined are output to a speech generation module. The speech generation module generates a synthesized waveform by referring to the various synthesizing parameters generated in the prosody generation module and a voice segment dictionary in which phonemes are stored, and then outputs synthesized sound through a speaker.
Next, a conventional process conducted by the prosody generation module is described in detail. The conventional prosody generation module includes an intermediate language analysis module, a pitch contour generation module, a devoicing determination module, a phoneme power determination module, a phoneme duration calculation module and a duration modification module.
The intermediate language input to the prosody generation module is a string of phonetic characters with the position of an accent, the position of a pause or the like indicated. From this string, the parameters required for generating a waveform (hereinafter, referred to as waveform-generating parameters), such as the time-variant change of the pitch (hereinafter, referred to as a pitch pattern), the duration of each phoneme (hereinafter, referred to as a phoneme duration), and the power of speech, are determined. The intermediate language input is subjected to analysis of the character string in the intermediate language analysis module. In the analysis, a word boundary is determined based on a symbol indicating a word's end in the intermediate language, and a mora position of an accent nucleus is obtained based on an accent symbol.
The accent nucleus is the position at which the accent falls. A word having an accent nucleus at the first mora is referred to as a word of accent type one, while a word having an accent nucleus at the n-th mora is referred to as a word of accent type n. These words are referred to as accented words. On the other hand, a word having no accent nucleus (for example, "shin-bun" and "pasokon", which mean a newspaper and a personal computer in Japanese, respectively) is referred to as a word of accent type zero or an unaccented word.
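The accent-type classification above can be sketched as follows. This is a minimal illustration, not the apparatus's actual analysis: the notation (a list of morae in which the mora carrying the accent nucleus ends with an apostrophe) and the function name `accent_type` are assumptions made for the example, since the text does not specify the format of the intermediate language.

```python
def accent_type(morae):
    """Return the accent type of a word given as a list of morae.

    Hypothetical notation: the mora carrying the accent nucleus is
    marked with a trailing apostrophe. A word with no marked mora is
    unaccented and has accent type 0.
    """
    for position, mora in enumerate(morae, start=1):
        if mora.endswith("'"):
            return position  # nucleus at the n-th mora -> accent type n
    return 0  # no accent nucleus -> accent type zero


def is_accented(morae):
    """True for an accented word, False for an unaccented word."""
    return accent_type(morae) != 0
```

With this notation, `accent_type(["shi", "n", "bu", "n"])` classifies the unaccented example word from the text as type zero, while a word whose second mora is marked is classified as type two.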
The pitch contour generation module determines a parameter for each response function based on a phrase symbol, the accent symbol and the like described in the intermediate language. In addition, if the intonation (the magnitude of the intonation) or an entire voice pitch is set by a user, the pitch contour generation module modifies the magnitude of a phrase command and/or that of an accent command in accordance with the user's setting.
The devoicing determination module determines whether or not a vowel is to be devoiced based on a phonetic symbol and the accent symbol in the intermediate language. The devoicing determination module then sends the determination result to the phoneme power determination module and the phoneme duration calculation module. Devoicing the vowel will be described in detail later.
The phoneme duration calculation module calculates the duration of each phoneme from the phonetic character string and sends the calculation result to the duration modification module. The phoneme duration is calculated by using rules or a statistical analysis such as Quantification theory (type one), depending on the type of the adjacent phoneme. In a case where the user sets a speech rate, the duration modification module linearly stretches or shrinks the phoneme duration depending on the set speech rate. Note, however, that such stretching or shrinking is normally performed only for the vowel.
The phoneme duration stretched or shrunk depending on the speech rate by the duration modification module is sent to the speech generation module.
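The conventional duration modification described above can be sketched as follows. This is a minimal illustration under stated assumptions: the speech rate is modeled as a relative factor (1.0 is normal, 0.5 is half speed), durations are in milliseconds, and the `Phoneme` type and `modify_durations` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    symbol: str
    duration_ms: float
    is_vowel: bool


def modify_durations(phonemes, rate):
    """Linearly stretch or shrink vowel durations for a relative speech rate.

    rate is an assumed relative factor: 1.0 is normal speed, values below
    1.0 are slower (longer durations), values above 1.0 are faster.
    Consonant durations are left unchanged, as is normally the case.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    result = []
    for p in phonemes:
        duration = p.duration_ms / rate if p.is_vowel else p.duration_ms
        result.append(Phoneme(p.symbol, duration, p.is_vowel))
    return result
```

For example, at half speed a 100 ms vowel becomes 200 ms while a preceding 60 ms consonant stays at 60 ms.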
The phoneme power determination module calculates the amplitude value of the waveform in order to send the calculated value to the speech generation module. The phoneme power is a power transition in a period corresponding to a rising portion of the phoneme in which the amplitude gradually increases, in a period corresponding to a steady state, and in a period corresponding to a falling portion of the phoneme in which the amplitude gradually decreases. The phoneme power is calculated from coefficient values in the form of a table.
The waveform generating parameters described above are sent to the speech generation module which generates the synthesized waveform.
Next, devoicing the vowel is described in detail.
When a person utters a word, air pushed out of the lungs is used as a sound source by creating an opening and closing movement of the vocal cords. Changes in resonance characteristics of the vocal tract occur by moving the chin, the tongue and lips in order to represent various phonemes. The pitch corresponds to the period of vibration of the vocal cords, and a change of the pitch therefore expresses the accents and the intonation. In addition to sounds generated by the vibration of the vocal cords, there are other types of sounds. A fricative, that is, a sound like noise, is generated by turbulence caused when air passes through a narrow space formed by a portion of the vocal tract and the tongue. Moreover, a plosive is generated by blocking the vocal tract with the tongue or the lips to temporarily stop the airflow and then releasing the airflow so as to generate an impulse-like sound.
The phonemes accompanied by vibration of the vocal cords, namely the vowels, the plosives "/b, d, g/", the fricatives "/j, z/", and nasal consonants and liquids such as "/m, n, r/", are referred to as voiced sounds, while the phonemes not accompanied by vibration of the vocal cords, for example the plosives "/p, t, k/" and the fricatives "/s, h, f/", are referred to as voiceless sounds. In particular, consonants are classified into voiced consonants, which are accompanied by vibration of the vocal cords, and voiceless consonants, which are not. In the case of a voiced sound, a periodical waveform is generated by the vibration of the vocal cords. On the other hand, a noise-like waveform is generated in the case of a voiceless sound.
In common language, when the word "kiku" (that is, the Japanese word meaning chrysanthemum) is naturally uttered, for example, the first vowel "i" in the word "kiku" is uttered using only breath without vibrating the vocal cords. This is a devoiced vowel.
In the text-to-speech conversion system, it is necessary to devoice vowels where appropriate in order to improve the audible quality of the speech. Whether a vowel is to be devoiced is determined by the devoicing determination module. When the devoicing determination module determines that a certain vowel is to be devoiced, the vowel is subjected to a special process in the phoneme power determination module and the phoneme duration calculation module.
The devoiced vowel is sent to the speech generation module with a phoneme power of 0 and a phoneme duration of 0, unlike a normal vowel. In this case, the phoneme duration calculation module adds the duration of the devoiced vowel to a duration of an associated consonant in order to prevent the duration of the devoiced vowel from being deleted. The speech generation module then generates the synthesized waveform using only the phoneme of the consonant without using the phoneme of the vowel.
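The special handling of a devoiced vowel described above can be sketched as follows. This is a minimal illustration, assuming each phoneme is represented as a dictionary with `duration_ms` and `power` keys; the representation and the function name `apply_devoicing` are hypothetical.

```python
def apply_devoicing(consonant, vowel):
    """Fold a devoiced vowel into its associated consonant.

    Both arguments are dicts with 'duration_ms' and 'power' keys (an
    assumed representation). The vowel's duration is added to the
    consonant so that the total length is preserved, and the vowel is
    passed on with a duration of 0 and a power of 0.
    """
    merged_consonant = dict(
        consonant,
        duration_ms=consonant["duration_ms"] + vowel["duration_ms"],
    )
    silent_vowel = dict(vowel, duration_ms=0.0, power=0.0)
    return merged_consonant, silent_vowel
```

Because the vowel's duration moves to the consonant rather than being discarded, the overall timing of the syllable is unchanged; only the consonant segment is used for waveform generation.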
The devoicing determination is normally performed in accordance with the following rules.
(1) A vowel "/i/" or "/u/" between voiceless consonants (including silence) is to be devoiced.
(2) However, if there is an accent nucleus, the above vowel should not be devoiced.
(3) However, if a previous vowel to the above vowel has already been devoiced, the above vowel should not be devoiced.
(4) If the above vowel appears at the end of a question, it should not be devoiced.
Note that the above-mentioned rules are derived from general tendencies, and therefore devoicing does not always occur in accordance with them in actual utterance. Moreover, the above rules are shown only as an example because devoicing rules vary between individuals. Furthermore, in some cases, a vowel that fulfills rule (1) but is not devoiced because of rule (2), (3) or (4) may be processed in a manner similar to the process for the devoiced vowel. For example, the duration of the vowel may be shortened or the amplitude value may be decreased.
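Rules (1) to (4) can be sketched as a single predicate. This is a minimal illustration of the example rules only: the set of voiceless consonants is limited to the ones named in the text, and the silence marker `"sil"`, the function name `should_devoice` and its parameters are assumptions made for the example.

```python
# Voiceless consonants named in the text; a real rule set may include more.
VOICELESS = {"p", "t", "k", "s", "h", "f"}
SILENCE = "sil"  # hypothetical marker for a pause or utterance boundary


def should_devoice(prev, vowel, nxt,
                   has_accent_nucleus=False,
                   prev_vowel_devoiced=False,
                   ends_question=False):
    """Apply the example devoicing rules (1)-(4) to one vowel."""
    voiceless_context = VOICELESS | {SILENCE}
    # Rule (1): /i/ or /u/ between voiceless consonants (including silence).
    if vowel not in ("i", "u"):
        return False
    if prev not in voiceless_context or nxt not in voiceless_context:
        return False
    # Rule (2): a vowel carrying the accent nucleus is not devoiced.
    if has_accent_nucleus:
        return False
    # Rule (3): no devoicing directly after an already devoiced vowel.
    if prev_vowel_devoiced:
        return False
    # Rule (4): no devoicing at the end of a question.
    if ends_question:
        return False
    return True
```

For the word "kiku", the first vowel satisfies rule (1) (`should_devoice("k", "i", "k")` is true), and once it has been devoiced the final "u" is blocked by rule (3) even though it sits between "k" and silence.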
Next, stretching or shrinking the waveform in the case of the devoiced vowel is described. The waveform stretching or shrinking is normally performed only in a period corresponding to a vowel having a periodical component. However, when the vowel is devoiced, the waveform stretching or shrinking is performed in a period corresponding to a consonant because the phoneme of the devoiced vowel is not used. For the phoneme of a vowel (voiced sound), the waveform stretching or shrinking is realized by overlapping an impulse response waveform generated by the vibration of the vocal cords, after shifting the response waveform by a repeat pitch. On the other hand, for the phoneme of a consonant (voiceless sound), the waveform stretching or shrinking is realized by inverting the waveform and then connecting the waveform at its termination to the inverted waveform.
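The two stretching methods above can be sketched on plain lists of samples. This is a simplified illustration under stated assumptions: "inverting" is read here as time reversal, so that the reversed copy joins the original without a jump at the connection point, and the pitch period is a fixed number of samples. The function names are hypothetical.

```python
def overlap_add(response, pitch_period, n_periods):
    """Voiced sound: sum copies of an impulse response, each shifted by
    one pitch period (in samples) relative to the previous copy."""
    if pitch_period <= 0 or n_periods <= 0:
        raise ValueError("pitch_period and n_periods must be positive")
    out = [0.0] * (pitch_period * (n_periods - 1) + len(response))
    for k in range(n_periods):
        offset = k * pitch_period
        for i, value in enumerate(response):
            out[offset + i] += value
    return out


def stretch_voiceless(samples, target_len):
    """Voiceless sound: append alternately reversed and forward copies of
    the waveform until target_len samples exist, then truncate.
    Assumes 'inverting' means time reversal."""
    if not samples:
        raise ValueError("samples must not be empty")
    out = []
    segment = list(samples)
    while len(out) < target_len:
        out.extend(segment)
        segment = segment[::-1]  # next copy runs in the opposite direction
    return out[:target_len]
```

More periods in `overlap_add` lengthen a voiced sound without altering its pitch. In `stretch_voiceless`, the same noise segment is reused repeatedly, which gives some intuition for why extreme stretching of a consonant degrades its distinctness.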
According to the conventional duration control method for controlling the duration in the case of devoicing a vowel, the waveform is stretched or shrunk in a period corresponding to the consonant when the vowel is devoiced. Therefore, when the speech rate is made extremely slow, the distinctness of the consonant for which the waveform stretching or shrinking is performed is noticeably degraded.
In addition, there is another problem where the rhythm of speech is damaged because the duration of the consonant is made extremely long, making the synthesized speech difficult to hear.
It is an object of the present invention to provide a speech synthesis apparatus that can reduce degradation of the quality of a phoneme of a devoiced vowel in the case of a slow speech rate so as to generate synthesized speech of good audible quality.
It is another object of the present invention to provide a speech synthesis apparatus that can reduce the degradation of the quality of a phoneme of a devoiced vowel in the case of a slow speech rate, and can produce synthesized speech that has an undamaged rhythm of speech and is easy to hear and understand.
According to an aspect of the present invention, a speech synthesis apparatus includes: a text analyzer operable to generate a phonetic and prosodic symbol string from character information of input text; a word dictionary storing a reading and an accent of a word; a voice segment dictionary storing a phoneme that is a basic unit of speech; a prosody generator operable to generate synthesizing parameters including at least a phoneme, a duration of the phoneme and a fundamental frequency for the phonetic and prosodic symbol string, the prosody generator including a vowel devoicing determining means operable to determine whether or not a vowel devoicing process is to be performed and a duration modifying means operable to modify the duration of the phoneme depending on the speech rate set by a user, the vowel devoicing determining means determining that the vowel devoicing process is not to be performed when the set speech rate is slower than a predetermined rate; and a waveform generator operable to generate a synthesized waveform by performing waveform overlap-add with reference to the synthesizing parameters generated by the prosody generator and the voice segment dictionary.
In one embodiment of the present invention, the vowel devoicing determining means includes: a first determining means operable to make a first determination of devoicing a vowel using information from the input text, such as a character type and the accent, as a criterion; and a second determining means operable to make a final determination of devoicing the vowel based on the result of the determination by the first determining means and the speech rate set by the user.
In another embodiment of the present invention, a threshold value used for determining that the vowel devoicing process is not performed by the vowel devoicing determining means can be set by the user.
In still another embodiment of the present invention, a threshold value used by the vowel devoicing determining means for determining that the vowel devoicing process is not performed is half of a normal speech rate.
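The two-stage determination in this aspect can be sketched as follows. This is a minimal illustration, not the claimed implementation: the speech rate is assumed to be expressed in the same unit as the normal rate (here a relative value with 1.0 as normal), and the function name `final_devoicing_decision` and its parameters are hypothetical. The default threshold ratio of 0.5 corresponds to the embodiment in which the threshold is half of the normal speech rate, and the parameter can be overridden to model a user-set threshold.

```python
def final_devoicing_decision(first_decision, speech_rate,
                             normal_rate=1.0, threshold_ratio=0.5):
    """Second determination: combine the text-based first determination
    with the speech rate set by the user.

    Devoicing is suppressed whenever the set speech rate is slower than
    the threshold (normal_rate * threshold_ratio). Otherwise the result
    of the first determination is kept.
    """
    threshold = normal_rate * threshold_ratio
    if speech_rate < threshold:
        return False  # too slow: keep the vowel voiced
    return first_decision
```

At a slow rate the vowel is then synthesized as an ordinary voiced vowel, so the stretching falls on the vowel rather than on the preceding voiceless consonant.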
According to another aspect of the present invention, a speech synthesis apparatus includes: a text analyzer operable to generate a phonetic and prosodic symbol string from character information of an input text; a word dictionary storing a reading and accent of a word; a voice segment dictionary storing a phoneme that is a unit of speech; a prosody generator operable to generate synthesizing parameters including at least a phoneme, a duration of the phoneme and a fundamental frequency for the phonetic and prosodic symbol string, the prosody generator including a vowel devoicing determining means operable to determine whether or not a vowel devoicing process is performed and a duration modifying means operable to modify the duration of the phoneme depending on the speech rate set by a user and the result of the determination by the vowel devoicing determining means, wherein the duration modifying means does not stretch the duration of the phoneme for a voiceless sound beyond a predetermined limitation value; and a waveform generator operable to generate a synthesized waveform by performing waveform overlap-add with reference to the synthesizing parameters generated by the prosody generator and the voice segment dictionary.
In one embodiment of the present invention, the duration modifying means has a changeable limitation value depending on the type of the voiceless consonant.
In another embodiment of the present invention, the duration modifying means has a changeable limitation value depending on the length of the phoneme stored in the voice segment dictionary.
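The limited stretching in this aspect can be sketched as follows. This is a minimal illustration under stated assumptions: the speech rate is a relative factor (1.0 is normal), durations are in milliseconds, and every name and numeric value below (the per-consonant table, the `max_factor` multiplier) is hypothetical and chosen only to show the two ways of deriving the limitation value.

```python
# Hypothetical per-consonant limitation values in milliseconds.
LIMIT_MS_BY_CONSONANT = {"s": 200.0, "k": 120.0}


def limit_from_segment(stored_ms, max_factor=2.0):
    """Derive a limitation value from the length of the phoneme stored
    in the voice segment dictionary (max_factor is an assumed value)."""
    return stored_ms * max_factor


def modify_voiceless_duration(duration_ms, rate, limit_ms):
    """Stretch or shrink a voiceless phoneme for a relative speech rate,
    without stretching it beyond limit_ms.

    The limit only caps stretching; it is never allowed to cut the
    duration below the unmodified value (an assumption of this sketch).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    scaled = duration_ms / rate
    return min(scaled, max(limit_ms, duration_ms))
```

For example, a 100 ms "/s/" at a quarter of the normal rate would scale to 400 ms but is held at its 200 ms limit, while at double speed it shrinks to 50 ms as usual.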