The present invention relates to a device that synthesizes speech, and more particularly to a speech synthesis technique that generates speech for text including a fixed part and a variable part by combining recorded speech with rule-based synthetic speech.
Generally, recorded speech refers to speech created from recordings of a human voice, and rule-based synthetic speech refers to speech synthesized from characters or code strings that represent pronunciation. Rule-based speech synthesis first performs linguistic analysis on input text to generate intermediate code carrying phonemic transcription and prosodic transcription information. It then determines prosody parameters such as a fundamental frequency pattern (the oscillation period of the vocal cords, which corresponds to the pitch of the voice) and phoneme duration (the length of each phoneme, which corresponds to speaking rate), and generates a speech waveform matched to the prosody parameters by waveform generation processing. As a method of generating a speech waveform from the prosody parameters, concatenative speech synthesis, which joins speech units corresponding to phonemes and syllables, is widely used.
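The joining of speech units mentioned above can be illustrated with a minimal sketch. It assumes that each unit is already available as a list of samples and that adjacent units are blended with a short linear crossfade; the function name `crossfade_concatenate` and the crossfade itself are illustrative assumptions, not details taken from any particular synthesizer.

```python
def crossfade_concatenate(units, overlap):
    """Join unit waveforms, blending `overlap` samples at each joint.

    `units` is a list of sample lists (hypothetical speech units).
    The linear crossfade is an assumed smoothing step for illustration.
    """
    output = list(units[0])
    for unit in units[1:]:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(output), len(unit))
        start = len(output) - n
        for i in range(n):
            weight = (i + 1) / (n + 1)  # rises from near 0 to near 1
            output[start + i] = (1 - weight) * output[start + i] + weight * unit[i]
        output.extend(unit[n:])
    return output


# Two toy "units": a constant 1.0 segment followed by a constant 0.0 segment.
joined = crossfade_concatenate([[1.0] * 4, [0.0] * 4], overlap=2)
print(joined)
```

With an overlap of zero the function reduces to plain concatenation, which makes the effect of the blending step easy to compare.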
The flow of general rule-based synthesis is as follows. In linguistic analysis, two kinds of information are generated from the input text as intermediate code: phonemic transcription information, which represents a sequence of phonemes (the minimum units that distinguish the meaning of speech) and syllables (units of speech sound formed by joining about one to three phonemes), and prosodic transcription information, which represents accent (information that specifies the strength of pronunciation) and intonation (information indicating questions and the speaker's feelings). To generate the intermediate code, dictionary-based linguistic processing and morphological analysis are applied. Next, prosody parameters such as fundamental frequency patterns and phoneme durations are determined so as to conform to the prosodic transcription information of the intermediate code. The prosody parameters are generated based on a prosody model trained in advance on real voice and on heuristics (heuristically determined control rules). Finally, a speech waveform matched to the prosody parameters is generated by waveform generation processing.
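The three stages described above can be sketched as follows. This is a toy illustration under stated assumptions, not an actual synthesizer: the dictionary contents, the numeric pitch and duration rules, the 16 kHz sample rate, and all function and class names are hypothetical, and a sine tone stands in for real waveform generation.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16000  # Hz; assumed value for this sketch

# Hypothetical toy dictionary: word -> (phonemes, index of the accented phoneme)
TOY_DICTIONARY = {
    "turn": (["t", "er", "n"], 1),
    "left": (["l", "eh", "f", "t"], 1),
}


@dataclass
class IntermediateCode:
    phonemes: list      # phonemic transcription information
    accented: list      # prosodic transcription: one accent flag per phoneme
    is_question: bool   # prosodic transcription: intonation


@dataclass
class ProsodyParams:
    f0_hz: list         # fundamental frequency per phoneme
    duration_s: list    # duration per phoneme


def linguistic_analysis(text):
    """Stage 1: dictionary lookup producing intermediate code."""
    is_question = text.strip().endswith("?")
    phonemes, accented = [], []
    for word in text.lower().strip(" ?.!").split():
        word_phonemes, accent_index = TOY_DICTIONARY[word]  # KeyError if unknown
        for i, phoneme in enumerate(word_phonemes):
            phonemes.append(phoneme)
            accented.append(i == accent_index)
    return IntermediateCode(phonemes, accented, is_question)


def determine_prosody(code):
    """Stage 2: heuristic rules (assumed) mapping the code to F0 and duration."""
    n = len(code.phonemes)
    f0_hz, duration_s = [], []
    for i, is_accented in enumerate(code.accented):
        position = i / max(n - 1, 1)
        f0 = 130.0 - 30.0 * position          # gradual pitch fall over the utterance
        if is_accented:
            f0 += 20.0                        # raise pitch on accented phonemes
        if code.is_question and i == n - 1:
            f0 += 40.0                        # final rise for a question
        f0_hz.append(f0)
        duration_s.append(0.12 if is_accented else 0.08)
    return ProsodyParams(f0_hz, duration_s)


def generate_waveform(prosody):
    """Stage 3: a sine tone per phoneme, standing in for waveform generation."""
    samples, phase = [], 0.0
    for f0, duration in zip(prosody.f0_hz, prosody.duration_s):
        for _ in range(round(duration * SAMPLE_RATE)):
            phase += 2.0 * math.pi * f0 / SAMPLE_RATE  # keep phase continuous
            samples.append(math.sin(phase))
    return samples


code = linguistic_analysis("Turn left?")
prosody = determine_prosody(code)
waveform = generate_waveform(prosody)
print(code.phonemes, len(waveform))
```

Keeping the intermediate code and the prosody parameters as separate data structures mirrors the separation of stages in the description: the prosody rules can be changed without touching the linguistic analysis.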
Since rule-based synthesis can output any input text as speech, it allows a more flexible speech guidance system to be built than one that uses only recorded speech. However, because the quality of rule-based synthetic speech is poorer than that of real voice, introducing rule-based synthetic speech into a speech guidance system that uses recorded speech, such as an on-vehicle car navigation system, has conventionally posed a quality problem.
Accordingly, to realize a speech guidance system that uses rule-based synthetic speech, a method is used that employs previously recorded speech for the fixed part and rule-based synthetic speech for the variable part, thereby combining the high quality of recorded speech with the flexibility of rule-based synthetic speech.
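The division into a fixed part and a variable part can be sketched as follows. This is a minimal illustration assuming that a guidance message is written as a template whose placeholders mark the variable part; the function name `plan_segments`, the template syntax, and the segment labels are hypothetical and are not taken from the cited methods.

```python
from string import Formatter


def plan_segments(template, values):
    """Split a message template into recorded and synthetic segments.

    Literal text is treated as the fixed part (to be played from recorded
    speech); each placeholder is treated as the variable part (to be
    produced by rule-based synthesis).
    """
    plan = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        if literal.strip():
            plan.append(("recorded", literal.strip()))
        if field is not None:
            plan.append(("synthetic", values[field]))
    return plan


plan = plan_segments(
    "Turn right onto {street} in 300 meters",
    {"street": "Main Street"},
)
for source, text in plan:
    print(source, text)
```

A playback stage would then fetch a stored recording for each "recorded" segment and run rule-based synthesis for each "synthetic" segment, joining the results in order.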
However, speech output that combines recorded speech and rule-based synthetic speech has had the problem that the discontinuity of timbre and prosody between the two is perceptible, so that even though the recorded parts are of high quality, the overall quality is perceived as poor.
As a method of eliminating prosodic discontinuity, a method has been disclosed that uses characteristics of the recorded speech to set parameters for the rule-based synthetic speech (e.g., Japanese Patent Application Laid-Open No. 11-249677). Another disclosed method extends the portion produced by rule-based synthesis, taking into account the prosodic continuity between the fixed part and the variable part (e.g., Japanese Patent Application Laid-Open No. 2005-321520).