1. Field of the Invention
Text-to-speech synthesis is to artificially create a speech signal from arbitrary text. The text-to-speech synthesis is normally implemented in three stages, i.e., a language processing unit, prosodic processing unit, and speech synthesis unit.
2. Description of the Related Art
Input text undergoes morphological analysis, syntactic parsing, and the like in the language processing unit, and then undergoes accent and intonation processes in the prosodic processing unit to output phoneme string and prosodic features or suprasegmental features (pitch or fundamental frequency, duration or phoneme duration time, power, and the like). Finally, the speech synthesis unit synthesizes a speech signal from the phoneme string and the prosodic features. Hence, a speech synthesis method used in the text-to-speech synthesis must be able to generate synthetic speech of an arbitrary phoneme symbol string with arbitrary prosodic features.
Conventionally, as such speech synthesis method, feature parameters having small synthesis units (e.g., CV, CVC, VCV, and the like (V=vowel, C=consonant)) are stored (these parameters will be referred to as typical speech units), and are selectively read out. And the fundamental frequencies and duration of these speech units are controlled, then these segments are connected to generate synthetic speech. In this method, the quality of synthetic speech largely depends on the stored typical speech units.
As a method of automatically and easily generating typical speech units suitably used in speech synthesis, for example, a technique called context-oriented clustering (COC) is disclosed (e.g., See Japanese Patent No. 2,583,074). In COC, a large number of pre-stored speech units are clustered based on their phonetic environments, and typical segments are generated by fusing speech units for respective clusters.
The principle of COC is to divide a large number of speech units assigned with phoneme names and environmental information (information of phonetic environments) into a plurality of clusters that pertain to phonetic environments on the basis of distance scales between speech units, and to determine the centroids of respective clusters as typical speech units. Note that the phonetic environment is a combination of factors which form an environment of the speech unit of interest, and the factors include the phoneme name, preceding phoneme, succeeding phoneme, second succeeding phoneme, fundamental frequency, duration, power, presence/absence of stress, position from an accent nucleus, time from breath pause, utterance speed, emotion, and the like of the speech unit of interest.
Since phonemes in actual speech undergo phonological changes depending on phonetic environments, typical segments are stored for a plurality of respective clusters that pertain to phonetic environments, thus allowing generation of natural synthetic speech in consideration of the influence of phonetic environments.
As a method of generating typical speech units with higher quality, a technique called a closed loop training method is disclosed (e.g., see Japanese Patent No. 3,281,281). The principle of this method is to generate typical speech units that minimize distortions from natural speech on the level of synthetic speech which is generated by changing the fundamental frequencies and duration. This method and COC have different schemes for generating typical speech units from a plurality of speech units: the COC fuses segments using centroids, but the closed loop training method generates segments that minimize distortions on the level of synthetic speech.
Also, a segment selection type speech synthesis method, which synthesizes speech by directly selecting a speech segment string from a large number of speech units using the input phoneme string and prosodic information (information of prosodic features) as a target, is known. The difference between this method and the speech synthesis method that uses typical speech units is to directly select speech units from a large number of pre-stored speech units on the basis of the phoneme string and prosodic information of input target speech without generating typical speech units. As a rule upon selecting speech units, a method of defining a cost function which outputs a cost that represents a degree of deterioration of synthetic speech generated upon synthesizing speech, and selecting a segment string to minimize the cost is known. For example, a method of digitizing deformation and concatenation distortions generated upon editing and concatenating speech units into costs, selecting a speech unit sequence used in speech synthesis based on the costs, and generating synthetic speech based on the selected speech unit sequence is disclosed (e.g., see Jpn. Pat. Appln. KOKAI Publication No. 2001-282278). By selecting an appropriate speech unit sequence from a large number of speech units, synthetic speech which can minimize deterioration of sound quality upon editing and concatenating segments can be generated.
The speech synthesis method that uses typical speech units cannot cope with variations of input prosodic features (prosodic information) and phonetic environments since limited typical speech units are prepared in advance, thus there occurs deteriorating sound quality upon editing and concatenating segments.
On the other hand, the speech synthesis method that selects speech units can suppress deterioration of sound quality upon editing and concatenating segments since it can select them from a large number of speech units. However, it is difficult to formulate a rule that selects a speech unit sequence that sounds naturally as a cost function. As a result, since an optimal speech unit sequence cannot be selected, the sound quality of synthetic speech deteriorates. The number of speech units used in selection is too large to practically eliminate defective segments in advance. Since it is also difficult to reflect a rule that removes defective segments in design of a cost function, defective segments are accidentally mixed in a speech unit sequence, thus deteriorating the quality of synthetic speech.