Text-to-speech synthesis artificially generates a speech signal from an arbitrary sentence (text data). For example, this technique is disclosed in JP-A (Kokai) No. 08-254993 (page 4 and FIG. 1). A speech synthesis apparatus that realizes such text-to-speech synthesis is composed of three elements, i.e., a language processing unit, a prosody processing unit, and a speech synthesis unit.
First, in the language processing unit, an input text is morphologically and syntactically analyzed. Next, in the prosody processing unit, the accent and intonation of the analyzed text are processed, and information such as a phoneme sequence, a fundamental frequency, and a phoneme segmental duration is calculated. Last, in the speech synthesis unit, synthesized speech is generated by concatenating speech unit data (feature parameters and speech waveforms) based on the fundamental frequency and the phoneme segmental duration calculated by the prosody processing unit. In this case, the speech unit data is stored in advance for each synthesis unit (for example, a phoneme or a syllable), which serves as the speech connection unit from which the synthesized speech is generated.
In one method for synthesizing high quality speech, a large number of speech unit data is stored in advance, suitable speech unit data is selected from the stored speech unit data according to the prosody and phoneme environment of the input text, and synthesized speech is generated by modifying and concatenating the selected speech unit data. This method is disclosed in JP-A (Kokai) No. 2001-282278 (page 3 and FIG. 2). In this method, a cost function that estimates the degree of quality distortion of the synthesized speech (generated by modifying and concatenating speech units) is defined in advance. By selecting, from a large number of speech units, a plurality of speech units that minimize the cost function, synthesized speech of high quality can be realized.
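The cost-based selection described above can be illustrated by the following minimal sketch. It is not the method of JP-A (Kokai) No. 2001-282278; it assumes, for illustration only, that the cost function is the sum of a per-unit target cost and a concatenation cost between adjacent units, and that the lowest-cost sequence is found by dynamic programming. The names `Unit`, `Target`, `target_cost`, `concat_cost`, and `select_units`, as well as the weights, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Unit:
    unit_id: str
    f0: float        # fundamental frequency in Hz
    duration: float  # phoneme segmental duration in seconds


@dataclass(frozen=True)
class Target:
    f0: float
    duration: float


def target_cost(target: Target, unit: Unit) -> float:
    # Assumed distortion from modifying the unit to match the target prosody.
    return abs(target.f0 - unit.f0) / 100.0 + abs(target.duration - unit.duration) * 10.0


def concat_cost(prev: Unit, nxt: Unit) -> float:
    # Assumed distortion from joining two units (pitch mismatch at the boundary).
    return abs(prev.f0 - nxt.f0) / 100.0


def select_units(targets: List[Target],
                 candidates: List[List[Unit]]) -> Tuple[List[Unit], float]:
    """Return the candidate sequence with the lowest accumulated cost."""
    # best[i][j]: lowest accumulated cost of a path ending at candidates[i][j]
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, back_row = [], []
        for u in candidates[i]:
            costs = [best[i - 1][k] + concat_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], u))
            back_row.append(k_min)
        best.append(row)
        back.append(back_row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total
```

In a real system the candidate lists would come from the stored speech unit database, and the cost terms would be tuned to reflect perceived distortion; both are outside the scope of this sketch.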
In the above speech synthesis method, if an expensive semiconductor memory such as RAM is used as the memory medium to store the large number of speech unit data, the cost becomes high. Accordingly, a large capacity memory medium such as a hard disk drive (HDD) is often used. However, when the speech unit data is stored on the HDD, it takes a long time to read the speech unit data from the HDD. As a result, the processing time becomes long, and real time processing is difficult.
In order to solve this problem, a partial copy of the speech unit data on the HDD is placed in another memory, and a plurality of speech units are selected under the condition that speech units in the memory are easier to select. As a result, the number of accesses to the HDD is reduced and the processing time is shortened. This technique is disclosed in JP-A (Kokai) No. 2005-266010. This speech unit selection is realized by designing the cost function so that its value is increased by a penalty when a speech unit is selected from the HDD.
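The effect of such a penalty can be illustrated by the following minimal sketch. The actual cost function of JP-A (Kokai) No. 2005-266010 is not reproduced here; the sketch assumes a simple additive penalty, and the names `Candidate`, `penalized_cost`, `select_unit`, and `hdd_penalty` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    unit_id: str
    base_cost: float  # estimated distortion, ignoring where the unit is stored
    in_memory: bool   # True if a copy is in the fast memory, False if only on the HDD


def penalized_cost(c: Candidate, hdd_penalty: float) -> float:
    # Assumption: the penalty is simply added when the unit must be read from the HDD.
    return c.base_cost + (0.0 if c.in_memory else hdd_penalty)


def select_unit(candidates: List[Candidate], hdd_penalty: float) -> Candidate:
    # Pick the candidate with the lowest penalized cost.
    return min(candidates, key=lambda c: penalized_cost(c, hdd_penalty))


candidates = [
    Candidate("mem_unit", base_cost=0.5, in_memory=True),
    Candidate("hdd_unit", base_cost=0.25, in_memory=False),
]
without_penalty = select_unit(candidates, hdd_penalty=0.0)  # HDD unit has the lower cost
with_penalty = select_unit(candidates, hdd_penalty=0.5)     # memory unit now has the lower cost
```

With a penalty of zero the unit on the HDD is chosen because its base cost is lower; with the penalty applied, the unit in memory is chosen instead, which avoids an HDD access at the price of a higher base cost.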
In the above technique, speech unit data on the HDD is less likely to be selected because of the penalty in the cost function, and the number of accesses to the HDD is reduced. In this case, even if the speech unit most suitable for quality is stored on the HDD, another speech unit stored in the memory is often selected instead. Accordingly, in comparison with a cost function without the penalty, speech quality falls. Furthermore, a memory to store the partial copy of the speech unit data is necessary, and the hardware cost increases.