1. Field of the Invention
The present invention relates to a speech rule synthesis apparatus and method for synthesizing speech by connecting the parameters of speech segments in accordance with rules.
2. Related Background Art
A speech rule synthesis apparatus is available as an apparatus for generating speech from character train data. In accordance with the information in the character train data, a feature parameter (e.g., LPC, PARCOR, LSP, or Mel Cepstrum; to be referred to as a parameter hereinafter) of a speech segment registered in a speech segment file is extracted and combined with a driver sound source signal (i.e., an impulse train in a voiced speech period and noise in a voiceless speech period) in accordance with the rate at which synthesized speech is generated. The combined result is supplied to a speech synthesizer to obtain synthesized speech. The types of speech segments generally used are the CV (consonant-vowel) segment, the VCV (vowel-consonant-vowel) segment, and the CVC (consonant-vowel-consonant) segment.
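The combination of a driver sound source with a parameter can be illustrated by a minimal sketch. It assumes LPC coefficients as the parameter and a direct-form all-pole filter as the synthesizer; the function names, the pitch period, and the sign convention of the coefficients are hypothetical choices made for illustration and are not taken from any particular apparatus.

```python
import random

def driver_source(voiced, n_samples, pitch_period=80, seed=0):
    """Return one excitation frame: an impulse train for a voiced period,
    uniform white noise for a voiceless period (illustrative assumption)."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, lpc):
    """Apply y[n] = x[n] - sum_k lpc[k-1] * y[n-k], k = 1..len(lpc).
    The sign convention is an assumption of this sketch."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out

# Voiced frame: impulses every 4 samples, shaped by a one-pole filter.
voiced_frame = all_pole_filter(driver_source(True, 8, pitch_period=4), [-0.5])
# Voiceless frame: the same filter driven by noise instead.
voiceless_frame = all_pole_filter(driver_source(False, 8), [-0.5])
```

Only the excitation differs between the two frames; the parameter (here the single coefficient) is what carries the spectral shape of the segment.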
In order to synthesize speech by connecting speech segments, parameters must be interpolated. In a conventional technique, the speech segments are simply connected by a straight line over the interpolation period, even where a parameter changes abruptly, so that spectral information inherent to the speech segments is lost and the resultant speech may be altered.
In the conventional technique, a speech segment is a period extracted from speech uttered as a word or sentence. For this reason, depending on the conditions under which human speech is synthesized from speech segments, speech powers vary greatly, and a gap is formed between the connected speech segments. As a result, the synthesized speech sounds strange.
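The straight-line connection used by the conventional technique during the interpolation period can be sketched as follows. The function name and the representation of a parameter as a list of floats are assumptions made for illustration.

```python
def interpolate_linear(p_end, p_start, n_frames):
    """Return n_frames parameter vectors lying on a straight line between
    the last frame of one segment (p_end) and the first frame of the next
    (p_start). The two boundary frames themselves are not included."""
    frames = []
    for i in range(1, n_frames + 1):
        t = i / (n_frames + 1)  # fraction of the way toward p_start
        frames.append([(1 - t) * a + t * b for a, b in zip(p_end, p_start)])
    return frames

# Three interpolated frames between two 2-dimensional parameter vectors.
bridge = interpolate_linear([0.0, 2.0], [4.0, 0.0], 3)
```

Every interpolated frame is a blend of only the two boundary vectors, so any spectral detail that the segments carried inside the interpolation period does not appear in the result.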
In a conventional method, when speech segments are to be connected in accordance with a mora length that changes with the utterance speed of synthesized speech, a vowel, a consonant, and the transition portion between the vowel and the consonant are not considered separately, and the entire speech segment data is expanded/compressed (reduced) at a uniform rate.
However, when parameters are simply expanded/reduced and connected to coincide with a syllable-beat-point pitch, vowels, whose lengths tend to change with the utterance speed, fricative phonemes such as /S/ and /F/, and plosive phonemes such as /P/ and /T/ are all expanded/reduced uniformly without being discriminated from each other. The resultant synthesized speech is unclear and difficult to understand.
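The uniform expansion/reduction described above can be illustrated with a short sketch. Nearest-frame repetition and dropping is assumed here in place of whatever resampling a given conventional apparatus used, and the frame labels are hypothetical stand-ins for parameter vectors.

```python
def stretch_uniform(frames, target_len):
    """Expand or reduce a whole segment to target_len frames at one uniform
    rate, repeating or dropping frames regardless of phoneme type."""
    if not frames or target_len <= 0:
        return []
    n = len(frames)
    return [frames[min(n - 1, i * n // target_len)] for i in range(target_len)]

# A CV segment: 2 consonant frames followed by 4 vowel frames.
segment = ["s", "s", "a", "a", "a", "a"]
slow = stretch_uniform(segment, 12)  # consonant doubled along with the vowel
fast = stretch_uniform(segment, 3)   # consonant cut to a single frame
```

In this sketch the consonant portion grows from 2 to 4 frames at the slower rate and shrinks to 1 frame at the faster rate, exactly in proportion to the vowel, which is the lack of discrimination the paragraph above describes.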
Durations of Japanese syllables are almost equal to each other. When speech segments are to be combined, parameters are interpolated to uniform syllable-beat-point pitches, and the resultant synthesized speech rhythm becomes unnatural.
A vowel may become voiceless depending on the preceding and following phonemes. For example, when the word "issiki" is produced, the vowel "i" between "s" and "k" becomes voiceless. In a conventional technique, this is achieved by rule synthesis as follows: when the vowel /i/ of the syllable "shi" is to be synthesized, the driver sound source signal is switched from the impulse train used for synthesizing a voiced sound to the noise used for synthesizing a voiceless sound, without changing the parameter, thereby obtaining a voiceless sound.
As a result, a feature parameter of a voiced sound, which is meant to be synthesized with an impulse sound source, is forcibly synthesized with a noise sound source, and the synthesized speech becomes unnatural.
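The conventional source-switching approach can be sketched as below. The devoicing rule is a deliberately simplified assumption (a high vowel flanked on both sides by voiceless consonants), and the phoneme sets, function names, and one-letter-per-phoneme spelling are hypothetical; an actual rule set would be more detailed.

```python
VOICELESS_CONSONANTS = {"k", "s", "t", "h", "p", "f"}  # illustrative subset
HIGH_VOWELS = {"i", "u"}

def devoiced_positions(phonemes):
    """Indices of high vowels that have a voiceless consonant on both sides
    (simplified assumption about when a vowel becomes voiceless)."""
    return [
        i for i in range(1, len(phonemes) - 1)
        if phonemes[i] in HIGH_VOWELS
        and phonemes[i - 1] in VOICELESS_CONSONANTS
        and phonemes[i + 1] in VOICELESS_CONSONANTS
    ]

def source_plan(phonemes):
    """Pair each phoneme with its driver source. The parameter of a devoiced
    vowel is left untouched; only the source is switched to noise."""
    devoiced = set(devoiced_positions(phonemes))
    plan = []
    for i, ph in enumerate(phonemes):
        voiced = ph not in VOICELESS_CONSONANTS and i not in devoiced
        plan.append((ph, "impulse" if voiced else "noise"))
    return plan

plan = source_plan(list("issiki"))
```

For "issiki" the second "i" is assigned the noise source while its voiced-vowel parameter is reused as is, which is the mismatch the paragraph above identifies as the cause of unnatural speech.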
For example, when a rule synthesis apparatus using VCV segments as speech segments has six vowels and 25 consonants, 900 speech segments (6 × 25 × 6) must be prepared, and a large storage capacity is required. As a result, the apparatus becomes bulky.
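The figure of 900 follows from enumerating every vowel-consonant-vowel combination. The sketch below uses placeholder labels (V1 to V6, C1 to C25), since the actual phoneme inventory is not listed here; the function name is likewise hypothetical.

```python
from itertools import product

def vcv_inventory(vowels, consonants):
    """Enumerate every vowel-consonant-vowel combination as a label."""
    return [v1 + c + v2 for v1, c, v2 in product(vowels, consonants, vowels)]

vowels = [f"V{i}" for i in range(1, 7)]        # six placeholder vowels
consonants = [f"C{i}" for i in range(1, 26)]   # 25 placeholder consonants
inventory = vcv_inventory(vowels, consonants)
print(len(inventory))  # 6 * 25 * 6 = 900
```

Because the vowel count enters the product twice, the inventory grows quadratically with the number of vowels, which is why the storage requirement becomes large.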
There are three types of accents, i.e., a strongest stress start type, a strongest stress center type, and a flat type. For example, each of the strongest stress start and center type accents has three magnitudes, and the flat type accent has two magnitudes. The accent corresponding to the input text is limited to at most three magnitudes, fixed by the accent type. Accent information is prestored in a dictionary.
In a conventional technique, the type of accent cannot be changed at the time of text input, and a desired accent is difficult to output.
A conventional arrangement is also available that has no dictionary of accent information for the input text and instead requires the text to be input together with its accent information. However, this arrangement requires difficult operations. It is not easy to understand the rising and falling of the accents by observing only an input text. In addition, accents of languages other than Japanese do not coincide with the Japanese accent types and are difficult to produce.