The present invention relates to a voice synthesizing method and to a voice synthesizer and a system which perform the method. More particularly, the invention relates to a voice synthesizing method which converts stereotypical sentences having nearly fixed contents into synthesized voice, a voice synthesizer which executes the method, and a method of producing the data necessary to achieve the method and the voice synthesizer. In particular, the invention is used in a communication network that comprises portable terminal devices each having a voice synthesizer and data communication means which is connectable to the portable terminal devices.
In general, voice synthesis is a scheme of generating a voice wave from phonetic symbols (voice element symbols) indicating the contents to be voiced, a time-series pattern of pitches (fundamental-frequency pattern), which are physical measures of the intonation of voices, and the duration and power (voice element intensity) of each voice element. Hereinafter, the three parameters, namely the fundamental-frequency pattern, the duration of a voice element and the voice element intensity, are generically called “prosodic parameters”, and the combination of a voice element symbol and the prosodic parameters is generically called “prosody data”.
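As an illustration only, the terms defined above can be pictured as a simple data structure. The following minimal sketch assumes one record per voice element; the names `VoiceElement`, `duration_ms`, `intensity` and `f0_hz`, the units, and the sample values are hypothetical and are not taken from any particular synthesizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceElement:
    """One entry of prosody data: a voice element symbol plus its prosodic parameters."""
    symbol: str          # voice element (phonetic) symbol
    duration_ms: float   # duration of the voice element (unit assumed: milliseconds)
    intensity: float     # voice element intensity (power), arbitrary scale
    f0_hz: List[float]   # fundamental-frequency samples over the element (assumed: Hz)


def total_duration_ms(prosody_data: List[VoiceElement]) -> float:
    """Sum the durations of all voice elements in an utterance."""
    return sum(element.duration_ms for element in prosody_data)


# Hypothetical prosody data for two voice elements; an F0 of 0.0 is
# assumed here to mark an unvoiced element.
example = [
    VoiceElement("k", 60.0, 0.5, [0.0]),
    VoiceElement("a", 120.0, 0.9, [180.0, 175.0]),
]
print(total_duration_ms(example))
```

In this picture, the list as a whole corresponds to the prosody data of an utterance, and the three numeric fields of each record correspond to the prosodic parameters.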
Typical methods of generating voice waves are a parameter synthesizing method, which uses parameters to drive a filter that imitates the vocal tract characteristics of each voice element, and a wave concatenation method, which generates waves by extracting pieces indicative of the characteristics of individual voice elements from a voice wave uttered by a human and connecting them. Producing “prosody data” is important in voice synthesis. These voice synthesizing methods can generally be used for most languages, including Japanese.
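The wave concatenation method can be sketched roughly as follows. This is a minimal illustration, assuming that each extracted piece is a list of samples and that adjacent pieces are joined with a short linear crossfade; the crossfade and the function name `concatenate_units` are assumptions made for the example and are not prescribed by the method described above.

```python
from typing import List


def concatenate_units(units: List[List[float]], overlap: int) -> List[float]:
    """Connect waveform pieces, blending `overlap` samples at each joint."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = min(overlap, len(out), len(unit))
        start = len(out) - n
        for i in range(n):
            # Weight rises across the joint: the old piece fades out
            # while the new piece fades in.
            w = (i + 1) / (n + 1)
            out[start + i] = (1.0 - w) * out[start + i] + w * unit[i]
        # Append the part of the new piece that lies after the joint.
        out.extend(unit[n:])
    return out


# Two hypothetical four-sample pieces joined with a two-sample overlap.
wave = concatenate_units([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], overlap=2)
print(len(wave))
```

With `overlap=0` the pieces are simply placed end to end; a nonzero overlap shortens the result by that many samples per joint.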
Voice synthesis needs to acquire, in some way, the prosodic parameters corresponding to the contents of a sentence to be voice-synthesized. In a case where the voice synthesizing technology is applied to reading out electronic mail, electronic newspapers or the like, for example, an arbitrary sentence should be subjected to language analysis to identify the boundaries between words or phrases, and the accent type of each phrase should be determined, after which the prosodic parameters should be acquired from accent information, syllable information or the like. Those basic methods relating to automatic conversion have already been established and can be achieved by a method disclosed in “A Morphological Analyzer For A Japanese Text To Speech System Based On The Strength Of Connection Between Words” (in the Journal of the Acoustical Society of Japan, Vol. 51, No. 1, 1995, pp. 3–13).
Of the prosodic parameters, the duration of a syllable (voice element) varies due to various factors, including the context in which the syllable (voice element) is located. The factors that influence the duration include restrictions on articulation, such as the type of the syllable, timing, the importance of a word, indication of the boundary of a phrase, the tempo in a phrase and the overall tempo, and linguistic restrictions, such as the meaning of the syntax. A typical way to control the duration of a voice element is to statistically analyze the degree of influence of each factor on duration data that is actually observed, and to use a rule acquired by the analysis. For example, “Phoneme Duration Control for Speech Synthesis by Rule” (The Transaction of the Institute of Electronics, Information and Communication Engineers, 1984/7, Vol. J67-A, No. 7) describes a method of computing the prosodic parameters. Of course, computation of the prosodic parameters is not limited to this method.
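The idea of deriving a duration rule from observed data can be illustrated with a deliberately simple additive model. This sketch is an assumption made for illustration and is not the method of the cited paper: each factor level receives an offset equal to its mean deviation from the overall mean duration, and a predicted duration is the overall mean plus the offsets of the applicable levels. The factor names, levels and duration values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One observation: (factor name -> level, observed duration in ms).
Observation = Tuple[Dict[str, str], float]


def fit_additive_rule(observations: List[Observation]):
    """Estimate the overall mean and a per-level offset for every factor."""
    grand_mean = sum(duration for _, duration in observations) / len(observations)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for factors, duration in observations:
        for name, level in factors.items():
            sums[(name, level)] += duration - grand_mean
            counts[(name, level)] += 1
    offsets = {key: sums[key] / counts[key] for key in sums}
    return grand_mean, offsets


def predict_duration(grand_mean: float, offsets, factors: Dict[str, str]) -> float:
    """Apply the rule: overall mean plus the offset of each known factor level."""
    return grand_mean + sum(
        offsets.get((name, level), 0.0) for name, level in factors.items()
    )


# Hypothetical observed durations for four contexts.
observed = [
    ({"type": "vowel", "position": "phrase_final"}, 140.0),
    ({"type": "vowel", "position": "medial"}, 100.0),
    ({"type": "consonant", "position": "phrase_final"}, 80.0),
    ({"type": "consonant", "position": "medial"}, 40.0),
]
mean, offsets = fit_additive_rule(observed)
print(predict_duration(mean, offsets, {"type": "vowel", "position": "phrase_final"}))
```

Factor levels that never appear in the observations contribute no offset, so an unseen context falls back toward the overall mean. A practical system would use more factors and a more careful statistical analysis than this averaging.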
While the above-described voice synthesizing method is a method of converting an arbitrary sentence to prosodic parameters, that is, a text voice synthesizing method, there is another method of computing prosodic parameters for the case of synthesizing a voice corresponding to a stereotypical sentence whose contents to be synthesized are predetermined. Voice synthesis of a stereotypical sentence, such as a sentence used in voice-based information notification or in a voice announcement service using a telephone, is not as complex as voice synthesis of an arbitrary sentence. It is therefore possible to store prosody data corresponding to the structures or patterns of sentences in a database, search the stored patterns at the time of computing the prosodic parameters, and use the prosodic parameters of a pattern similar to the pattern in question. This method can significantly improve the naturalness of a synthesized voice as compared with a synthesized voice acquired by the text voice synthesizing method. For example, Japanese Patent Laid-open No. 249677/1999 discloses a prosodic-parameter computing method which uses this approach.
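The database search described above can be sketched as a nearest-pattern lookup. This is a minimal illustration under stated assumptions: sentence patterns are stored as token sequences with placeholder slots such as `<TIME>`, similarity is measured with Python's standard `difflib.SequenceMatcher`, and the stored values are mere labels standing in for prosody data. The pattern contents, the similarity measure and the name `find_closest_pattern` are hypothetical and are not taken from the cited publication.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence, Tuple

# Hypothetical database: sentence pattern -> identifier of stored prosody data.
PATTERN_DB: Dict[Tuple[str, ...], str] = {
    ("the", "time", "is", "<TIME>"): "prosody_time_announcement",
    ("you", "have", "<N>", "new", "messages"): "prosody_message_count",
}


def find_closest_pattern(tokens: Sequence[str], db: Dict[Tuple[str, ...], str]):
    """Return the stored pattern most similar to `tokens`, with its prosody data."""
    query = tuple(tokens)

    def similarity(pattern: Tuple[str, ...]) -> float:
        # ratio() is 1.0 for identical sequences and 0.0 when nothing matches.
        return SequenceMatcher(None, pattern, query).ratio()

    best = max(db, key=similarity)
    return best, db[best]


pattern, prosody = find_closest_pattern(["you", "have", "<N>", "messages"], PATTERN_DB)
print(pattern, prosody)
```

Because the query need not match a stored pattern exactly, the prosodic parameters of the nearest pattern can be reused and then adjusted for the differing portion.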
The intonation of a synthesized voice depends on the quality of prosodic parameters. The speech style of a synthesized voice, such as an emotional expression or a dialect, can be controlled by adequately controlling the intonation of a synthesized voice.
The conventional voice synthesizing schemes involving stereotypical sentences are mainly used in voice-based information notification or in voice announcement services using a telephone. In the actual usage of those schemes, however, synthesized voices are fixed to one speech style, and multifarious voices, such as dialects and voices in foreign languages, cannot be freely synthesized as desired. There are demands for installing dialects or the like into devices which call for an element of amusement, such as cellular phones and toys, and a scheme for providing voices in foreign languages is essential to the internationalization of the devices.
However, the conventional technology has not been developed in consideration of arbitrarily converting voice contents to a given dialect or expression at the time of voice synthesis. Further, the conventional technology makes it hard for a third party other than the system user and operator to freely prepare the prosody data. Furthermore, a device with considerably limited computational resources, such as a cellular phone, cannot synthesize voices with various speech styles.