1. Field of the Invention
The present invention relates to a speech synthesis system and speech synthesis method which synthesize speech from a text.
2. Description of the Related Art
Text-to-speech synthesis artificially generates a speech signal from an arbitrary text. It is generally implemented in three stages, performed by a language processing unit, a prosodic processing unit, and a speech synthesis unit.
First of all, the language processing unit performs morphological analysis, syntax analysis, and the like on an input text. The prosodic processing unit then performs accent and intonation processes and outputs phoneme string/prosodic information (information on prosodic features such as a fundamental frequency, phoneme duration, and power). Finally, the speech synthesis unit synthesizes a speech signal from the phoneme string/prosodic information. Hence, a speech synthesis method used for the speech synthesis must be able to generate synthetic speech of an arbitrary phoneme symbol string with arbitrary prosodic features.
Conventionally, as such a speech synthesis method, the following speech unit selection type speech synthesis method is known. First of all, this method divides an input phoneme string into a plurality of synthesis units (a synthesis unit string). Using the input phoneme string/prosodic information as a target, the method selects, for each of the plurality of synthesis units, a speech unit from a large quantity of speech units stored in advance. Speech is then synthesized by concatenating the selected speech units between synthesis units. For example, in the speech unit selection type speech synthesis method disclosed in JP-A 2001-282278 (KOKAI), the degree of deterioration caused when speech is synthesized is expressed as a cost, and speech units are selected so as to reduce the cost calculated based on a predefined cost function. For example, this method quantifies, as a cost, the deformation distortion and concatenation distortion that are caused when speech units are edited and concatenated, and selects a speech unit string used for speech synthesis on the basis of the cost. The method then generates synthetic speech on the basis of the selected speech unit string.
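The cost-based selection described above can be illustrated with a minimal sketch. It assumes that the total cost of a speech unit string is the sum of a per-unit cost (standing in for deformation distortion) and a cost between adjacent units (standing in for concatenation distortion). The names `select_units`, `target_cost`, and `concat_cost` are hypothetical and are not taken from JP-A 2001-282278 (KOKAI).

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    concat_cost: Callable[[str, str], float],
) -> Tuple[float, List[str]]:
    """Return (total cost, speech unit string) with the smallest summed cost.

    candidates[i] lists the stored speech units usable for synthesis unit i.
    Both cost functions are assumptions made for this illustration.
    """
    # best[u] = (cost, string) of the cheapest string that ends in unit u.
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        nxt = {}
        for u in candidates[i]:
            # Cheapest predecessor of u, counting the concatenation cost.
            prev = min(best, key=lambda p: best[p][0] + concat_cost(p, u))
            cost = best[prev][0] + concat_cost(prev, u) + target_cost(i, u)
            nxt[u] = (cost, best[prev][1] + [u])
        best = nxt
    return min(best.values(), key=lambda v: v[0])
```

In this sketch a unit with a low per-unit cost can still be rejected if it concatenates poorly with its neighbors, which is the reason the selection is made over the whole string rather than unit by unit.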
In such a speech unit selection type speech synthesis method, in order to improve sound quality, it is very important to cover various phonetic environments and as many variations of prosodic features as possible by preparing more speech units. It is, however, difficult in terms of cost (or price) to store a large amount of speech unit data entirely in an expensive storage medium (e.g., a memory device) with a high access speed. In contrast, if a large amount of speech unit data is stored entirely in a storage medium (e.g., a hard disk) with a relatively low cost (or price) and a low access speed, it takes too much time to acquire data. This makes it impossible to perform real-time processing.
Waveform data accounts for most of the size of speech unit data. Under these circumstances, there is known a method of storing waveform data with a high frequency of use in a memory device and other waveform data in a hard disk, and sequentially selecting speech units from the start on the basis of a plurality of sub-costs including a cost (access speed cost) associated with the speed of access to the storage device storing the waveform data. For example, the method disclosed in JP-A 2005-266010 (KOKAI) can achieve relatively high sound quality because it allows the use of a large number of speech units distributed between a memory and a hard disk. In addition, since this method preferentially selects speech units whose waveform data are stored in the memory with a high access speed, the method can shorten the time required to generate synthetic speech as compared with a method of acquiring all waveform data from the hard disk.
Although the method disclosed in JP-A 2005-266010 (KOKAI) can shorten the time required to generate synthetic speech on average, in a specific unit of processing only speech units whose waveform data are stored in the hard disk may be selected. This makes it impossible to properly control the worst value of the generation time per unit of processing. A speech synthesis application which synthesizes speech and immediately uses the synthetic speech online generally repeats the operation of playing back the synthetic speech generated for a given unit of processing by using an audio device, and generating synthetic speech for the next unit of processing (and sending it to the audio device) during the playback. With this operation, synthetic speech is generated and played back online. In such an application, if the generation time of synthetic speech in a given unit of processing exceeds the time taken to play back synthetic speech for the preceding unit of processing, sound interruption occurs between units of processing. This may greatly degrade sound quality. It is therefore necessary to properly control the worst value of the time required to generate synthetic speech per unit of processing. In addition, according to the method disclosed in JP-A 2005-266010 (KOKAI), speech units whose waveform data are stored in the memory are selected more often than necessary. This may result in failure to achieve optimal sound quality.
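The interruption condition described above can be written as a simple check. The following is a minimal sketch that assumes exactly one unit of processing is generated while the preceding one is played back, with no additional buffering. The function name `find_interruptions` and the timing values in the usage note are hypothetical.

```python
from typing import List, Sequence


def find_interruptions(
    gen_times: Sequence[float], play_times: Sequence[float]
) -> List[int]:
    """Return the indices k of units of processing that start late.

    Unit k is generated while unit k-1 is played back, so sound is
    interrupted before unit k when gen_times[k] > play_times[k-1].
    Unit 0 has no preceding playback and is never reported.
    """
    return [
        k
        for k in range(1, len(gen_times))
        if gen_times[k] > play_times[k - 1]
    ]
```

For instance, with generation times of 0.2, 0.5, and 0.3 seconds and playback times of 0.4, 0.6, and 0.5 seconds, only the second unit of processing (index 1) is reported, because its 0.5 second generation time exceeds the 0.4 second playback of the first unit. Even a single such unit causes an audible gap, which is why the worst value rather than the average matters.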
Under the restriction concerning the acquisition of speech unit data from storage media with different data acquisition speeds for a synthesis unit string (for example, the upper limit value of the number of times of acquisition of data from a hard disk per unit of processing), there is available a method of selecting an optimal speech unit string concerning the synthesis unit string. This method can reliably suppress the upper limit of the generation time of synthetic speech per unit of processing, and can generate synthetic speech with as high sound quality as possible within a predetermined generation time.
An optimal speech unit string under the above restriction can be searched for efficiently by a dynamic programming method that takes the restriction into consideration. If, however, there are many speech units, the search still requires much calculation time. Therefore, a means for further speeding up the processing is required. A search under a restriction, in particular, requires a larger amount of calculation than a search without any restriction, and hence speeding up the processing is especially necessary.
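One way to take the restriction into consideration in dynamic programming is sketched below, under the assumption that the restriction is an upper limit `max_slow` on the number of speech units acquired from the slow storage medium per unit of processing. The search state pairs the last selected unit with the number of slow-medium units used so far. All names (`select_units_limited`, `Unit`, `max_slow`) are hypothetical and this is not presented as the specific method referred to above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Unit = Tuple[str, bool]   # (hypothetical unit id, True if on the slow medium)
State = Tuple[Unit, int]  # (last unit, slow-medium units used so far)


def select_units_limited(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    concat_cost: Callable[[Unit, Unit], float],
    max_slow: int,
) -> Tuple[float, List[Unit]]:
    """Cheapest speech unit string that uses at most max_slow slow units."""
    best: Dict[State, Tuple[float, List[Unit]]] = {}
    for u in candidates[0]:
        n = int(u[1])
        if n <= max_slow:
            best[(u, n)] = (target_cost(0, u), [u])
    for i in range(1, len(candidates)):
        nxt: Dict[State, Tuple[float, List[Unit]]] = {}
        for u in candidates[i]:
            for (p, n), (cost, path) in best.items():
                m = n + int(u[1])
                if m > max_slow:
                    continue  # extending this string would break the limit
                c = cost + concat_cost(p, u) + target_cost(i, u)
                if (u, m) not in nxt or c < nxt[(u, m)][0]:
                    nxt[(u, m)] = (c, path + [u])
        best = nxt
    if not best:
        raise ValueError("no speech unit string satisfies the restriction")
    return min(best.values(), key=lambda v: v[0])
```

Compared with an unrestricted search, the number of states per synthesis unit grows by a factor of up to `max_slow + 1`, which illustrates why the restricted search requires a larger amount of calculation.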
As a speeding up means, it is conceivable to perform a beam search with reference to a total cost as an evaluation reference for a speech unit string. In this case, in the process of sequentially developing speech unit strings for each synthesis unit by the dynamic programming method, W speech unit strings are selected in ascending order of total cost at the time point when the speech unit strings are developed up to a given synthesis unit, and only strings from the selected W speech unit strings are developed for the next synthesis unit.
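The beam search described above can be sketched as follows. This is a simplified illustration that keeps the W cheapest partial speech unit strings after each synthesis unit and omits the merging of strings that a full dynamic programming search would perform. The names `beam_search`, `target_cost`, `concat_cost`, and `width` (standing for W) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def beam_search(
    candidates: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    concat_cost: Callable[[str, str], float],
    width: int,
) -> Tuple[float, List[str]]:
    """Return the cheapest (total cost, string) found with beam width W."""
    # Keep the `width` cheapest strings for the first synthesis unit.
    beam = sorted(
        ((target_cost(0, u), [u]) for u in candidates[0]),
        key=lambda h: h[0],
    )[:width]
    for i in range(1, len(candidates)):
        # Develop only the strings that survived the previous step.
        developed = [
            (cost + concat_cost(path[-1], u) + target_cost(i, u), path + [u])
            for cost, path in beam
            for u in candidates[i]
        ]
        beam = sorted(developed, key=lambda h: h[0])[:width]
    return beam[0]
```

Because strings outside the beam are discarded for good, a narrow beam can miss the string with the lowest total cost. The calculation amount per synthesis unit is bounded by W times the number of candidate units, which is the source of the speed-up.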
The following problem arises when this method is applied to a beam search under the above restriction. In the first half of the process of sequentially developing speech unit strings, only speech unit strings including many speech units stored in a storage medium with a low access speed may be selected because of their low total cost. In this case, in the second half of the process, only speech units stored in a storage medium with a high access speed are allowed to be selected in order to satisfy the restriction. This problem arises especially when most of the speech units are stored in a storage medium with a low access speed and the proportion of speech units stored in a storage medium with a high access speed is very low. As a consequence, sound quality unevenness occurs in the generated synthetic speech, resulting in a deterioration in sound quality as a whole.