This is a continuation of International Application PCT/JP2003/005492, with an international filing date of Apr. 28, 2003.
1. Field of the Invention
The present invention relates to a speech synthesis system in which the most appropriate combination of speech segments is found among stored speech segments based on synthesis parameters, and the selected segments are concatenated to generate a speech waveform.
2. Background Information
Speech synthesis technology is finding practical application in such fields as speech portal services and car navigation. Commonly, speech synthesis technology involves storing speech waveforms or parameterized speech waveforms, and appropriately concatenating and processing these to achieve a desired speech synthesis. The speech units to be concatenated are called synthesis units, and in previous speech synthesis technology, the primary method employed was to use a fixed-length synthesis unit.
For example, when a syllable is used as the synthesis unit, the synthesis units for the synthesis target “yamato” would be “ya”, “ma” and “to”. When a vowel-consonant-vowel concatenation (commonly called VCV) is used as the synthesis unit, joining at the midpoint of a vowel is assumed; the synthesis units for “yamato” would be “Qya”, “ama”, “ato”, and “oQ”, with “Q” signifying no sound.
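The two fixed-length decompositions described above can be sketched in a few lines. This is only an illustration: the function name `vcv_units`, the romanized spelling, and the assumption that every syllable ends in a vowel are introduced here and are not part of the cited prior art.

```python
def vcv_units(syllables, silence="Q"):
    """Build VCV synthesis units from a list of romanized syllables.

    Assumes (for illustration only) that each syllable ends in a vowel,
    so the last character of a syllable is the vowel shared with the
    next unit. `silence` marks the absence of sound at either end.
    """
    units = []
    prev_vowel = silence
    for syl in syllables:
        # Each unit starts at the previous vowel and ends at this syllable's vowel.
        units.append(prev_vowel + syl)
        prev_vowel = syl[-1]
    # Closing unit: final vowel followed by silence.
    units.append(prev_vowel + silence)
    return units


syllables = ["ya", "ma", "to"]      # syllable units for "yamato"
print(syllables)                    # ['ya', 'ma', 'to']
print(vcv_units(syllables))         # ['Qya', 'ama', 'ato', 'oQ']
```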
Currently, however, the predominant method is to store a large inventory of speech data such as sentences and words spoken by a person and, in accordance with the text input for synthesis, to select and concatenate the speech segments that match the input over the longest span or that are least likely to sound discontinuous when concatenated (see, for example, Japanese Laid-open Patent Publication H10-49193). In this case, synthesis units are selected dynamically based on the input text and the speech data inventory. Methods of this type are collectively called corpus-based speech synthesis.
Because the same syllable can have different acoustical characteristics depending on the sounds before and after it, when a given sound is to be synthesized, a more natural speech synthesis is obtained by using speech segments whose preceding and following sounds match over a wider range. Further, it is common to provide interpolatory segments for the purpose of making smooth joins when concatenating speech units. Because these interpolatory segments are artificially created speech segments that do not naturally exist, they lead to deterioration of speech quality. If the synthesis unit is lengthened, more appropriate speech segments can be used and the interpolatory segments that cause speech quality deterioration can be made smaller, enabling improved quality of synthesized speech. However, preparing a database of all long speech units would result in a huge amount of data. For this reason, making synthesis units a fixed length presents difficulties, and corpus-based methods as discussed above are prevalent.
FIG. 1 shows the configuration of a prior art example.
A speech segment storage unit 13 stores a large quantity of speech data such as sentences and words spoken by a person as speech waveforms or as parameterized waveforms. The speech segment storage unit 13 also stores index information for searching the stored speech segments.
Synthesis parameters are input into a speech segment selection unit 11. The synthesis parameters, obtained as a result of analyzing the input text, include a speech unit sequence (synthesis target phoneme sequence), a pitch frequency pattern, individual speech unit durations (phoneme durations) and a power fluctuation pattern. The speech segment selection unit 11 selects the most appropriate combination of speech segments from the speech segment storage unit 13 based on the input synthesis parameters. A speech synthesis unit 12 generates and outputs a speech waveform corresponding to the synthesis parameters using the combination of speech segments selected by the speech segment selection unit 11.
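As a rough illustration of the synthesis parameters listed above, they might be held in a record such as the following. The class name, field names, units, and the consistency check are all assumptions made for this sketch; the prior art does not specify a data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SynthesisParameters:
    """Hypothetical container for the output of input text analysis."""
    phonemes: List[str]        # synthesis target phoneme sequence
    pitch_hz: List[float]      # pitch frequency pattern (sampling rate assumed)
    durations_ms: List[float]  # one duration per phoneme
    power: List[float]         # power fluctuation pattern

    def __post_init__(self):
        # Each phoneme needs exactly one duration (an assumed convention).
        if len(self.durations_ms) != len(self.phonemes):
            raise ValueError("expected one duration per phoneme")


params = SynthesisParameters(
    phonemes=list("yamato"),
    pitch_hz=[120.0, 125.0, 130.0, 128.0, 122.0, 115.0],
    durations_ms=[60.0, 90.0, 70.0, 90.0, 80.0, 110.0],
    power=[0.4, 0.8, 0.6, 0.8, 0.3, 0.7],
)
```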
In a corpus-based method as described above, an evaluation function is established for the purpose of selecting the most appropriate speech segments from the speech segment inventory in the speech segment storage unit 13.
For example, let us suppose that the following two selections are possible as a speech segment combination satisfying the synthesis target phoneme sequence “yamato”:
    (1) “yama”+“to”
    (2) “ya”+“mato”
These two speech segment combinations have the same synthesis unit lengths, as (1) is a combination of four phonemes plus two phonemes, and (2) is a combination of two phonemes plus four phonemes. However, in the case of (1) the point of connection between the synthesis units is between “a” and “t”, and in the case of (2), the point of connection between the synthesis units is between “a” and “m”. The “t” sound, which is an unvoiced plosive, contains a silent portion; if such an unvoiced plosive is made the connection point, there is less likelihood of discontinuity in the synthesized speech. Therefore, in this case, combination (1), which offers “t” as a connection point between synthesis units, is the appropriate choice.
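The reasoning above can be sketched as a toy evaluation function that enumerates the combinations covering the target and prefers joins that fall on an unvoiced plosive. The cost values (0 and 1), the set of plosives, the one-letter-per-phoneme spelling, and all function names are assumptions for illustration; the actual evaluation function of a given system may differ.

```python
UNVOICED_PLOSIVES = {"p", "t", "k"}  # assumed set, romanized one letter per phoneme


def segmentations(target, inventory):
    """Yield every way to cover `target` with segments from `inventory`."""
    if not target:
        yield []
        return
    for seg in inventory:
        # Skip empty entries so the recursion always makes progress.
        if seg and target.startswith(seg):
            for rest in segmentations(target[len(seg):], inventory):
                yield [seg] + rest


def join_cost(left, right):
    """Illustrative cost: a join is free if the right segment starts with
    an unvoiced plosive, whose silent portion hides the discontinuity."""
    return 0.0 if right[0] in UNVOICED_PLOSIVES else 1.0


def total_cost(combo):
    """Sum the join costs over all adjacent segment pairs."""
    return sum(join_cost(a, b) for a, b in zip(combo, combo[1:]))


def best_combination(target, inventory):
    """Return the lowest-cost covering combination."""
    return min(segmentations(target, inventory), key=total_cost)


inventory = ["yama", "to", "ya", "mato"]
print(best_combination("yamato", inventory))  # ['yama', 'to']
```

With this inventory, both combinations are found, and the one joined at “t” wins because its single join costs nothing under the assumed cost.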
When combination (1), i.e., “yama”+“to”, is selected, if the speech segment storage unit 13 has a plurality of speech segments for “to”, selection of a “to” having the phoneme “a” directly before it would be most appropriate for the speech segment sequence to be synthesized.
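A minimal sketch of this context-based choice follows. It assumes that each stored candidate records the phoneme that preceded it in the original recording; the `Candidate` type, its fields, and the sample words are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    phonemes: str       # the stored segment, e.g. "to"
    left_context: str   # phoneme directly before it in the recording
    source: str         # recorded word it was taken from (illustrative)


def pick_by_left_context(candidates, wanted_left):
    """Prefer a candidate whose recorded left context matches the target;
    fall back to the first candidate if none matches."""
    matches = [c for c in candidates if c.left_context == wanted_left]
    return matches[0] if matches else candidates[0]


candidates = [
    Candidate("to", "o", "koto"),
    Candidate("to", "a", "hato"),
    Candidate("to", "i", "hito"),
]
# In "yama" + "to", the phoneme before "to" is "a".
chosen = pick_by_left_context(candidates, "a")
print(chosen.source)  # hato
```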
Each selected speech segment is converted to the pitch frequency pattern and phoneme duration determined in accordance with the input synthesis parameters. In general, because excessive pitch frequency conversion or phoneme duration conversion causes voice quality to deteriorate, it is preferable that speech segments having a pitch frequency and phoneme duration close to the targeted pitch frequency and phoneme duration are selected from the speech segment storage unit 13.
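The preference for segments that need little conversion can be expressed as a distance between each candidate and the target values. The sketch below is one possible formulation under stated assumptions: the relative-difference cost, the equal weights, and the names `Segment`, `conversion_cost` and `closest_segment` are not taken from the prior art.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    pitch_hz: float      # average pitch frequency of the stored segment
    duration_ms: float   # duration of the stored segment


def conversion_cost(seg, target_pitch_hz, target_duration_ms,
                    w_pitch=1.0, w_dur=1.0):
    """Weighted sum of relative pitch and duration differences (assumed form).
    A larger value means more conversion, hence more quality loss."""
    pitch_term = abs(seg.pitch_hz - target_pitch_hz) / target_pitch_hz
    dur_term = abs(seg.duration_ms - target_duration_ms) / target_duration_ms
    return w_pitch * pitch_term + w_dur * dur_term


def closest_segment(segments, target_pitch_hz, target_duration_ms):
    """Return the segment requiring the least conversion."""
    return min(
        segments,
        key=lambda s: conversion_cost(s, target_pitch_hz, target_duration_ms),
    )


stored = [Segment("to#1", 120.0, 180.0), Segment("to#2", 200.0, 90.0)]
best = closest_segment(stored, target_pitch_hz=125.0, target_duration_ms=170.0)
print(best.label)  # to#1
```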