(1) Field of the Invention
The present invention relates to a speech synthesizer that provides synthetic speech of high and stable quality.
(2) Description of the Related Art
As a conventional speech synthesizer that provides a strong sense of real speech, a device has been proposed which uses a waveform concatenation system, in which waveforms are selected from a large-scale element database and concatenated (for example, see Patent Reference 1: Japanese Laid-Open Patent Publication No. 10-247097 (paragraph 0007; FIG. 1)). FIG. 1 is a diagram showing a typical configuration of a waveform concatenation-type speech synthesizer.
The waveform concatenation-type speech synthesizer is an apparatus which converts inputted text into synthetic speech, and includes a language analysis unit 101, a prosody generation unit 201, a speech element database (DB) 202, an element selection unit 104, and a waveform concatenating unit 203.
The language analysis unit 101 linguistically analyzes the inputted text, and outputs phonetic symbols and accent information. The prosody generation unit 201 generates, for each phonetic symbol, prosody information such as a fundamental frequency, duration time length, and power, based on the phonetic symbol and accent information outputted by the language analysis unit 101. The speech element DB 202 stores pre-recorded speech waveforms. The element selection unit 104 is a processing unit which selects an optimum speech element from the speech element DB 202 based on the prosody information generated by the prosody generation unit 201. The waveform concatenating unit 203 concatenates the elements selected by the element selection unit 104, thereby generating synthetic speech.
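The selection and concatenation steps described above can be illustrated with a minimal sketch. The class names, the toy waveforms, and in particular the weighted distance used to compare prosody are assumptions made for illustration only; the source does not specify how the "optimum" element is scored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Prosody:
    f0_hz: float        # fundamental frequency
    duration_ms: float  # duration time length
    power_db: float     # power


@dataclass(frozen=True)
class SpeechElement:
    phoneme: str
    prosody: Prosody
    waveform: List[float]  # pre-recorded samples (toy values in this sketch)


def prosody_distance(a: Prosody, b: Prosody) -> float:
    # Hypothetical cost: scaled absolute differences. The scaling
    # constants are arbitrary assumptions, not taken from the source.
    return (abs(a.f0_hz - b.f0_hz) / 100.0
            + abs(a.duration_ms - b.duration_ms) / 50.0
            + abs(a.power_db - b.power_db) / 10.0)


def select_element(db: Dict[str, List[SpeechElement]],
                   phoneme: str, target: Prosody) -> SpeechElement:
    # Only elements already stored in the DB can be chosen, so the best
    # candidate may still be far from the target prosody.
    return min(db[phoneme],
               key=lambda e: prosody_distance(e.prosody, target))


def concatenate(elements: List[SpeechElement]) -> List[float]:
    # Join the selected waveforms end to end (no smoothing in this sketch).
    out: List[float] = []
    for element in elements:
        out.extend(element.waveform)
    return out
```

If the DB holds no element near the target prosody, `select_element` still returns the least distant one, however large that distance is.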
In addition, as a speech synthesis device that provides stable speech quality, an apparatus which generates parameters by learning statistical models and synthesizes speech is known (for example, Patent Reference 2: Japanese Laid-Open Patent Publication No. 2002-268660 (paragraphs 0008 to 0011; FIG. 1)). FIG. 2 is a diagram showing a configuration of a speech synthesizer which uses a Hidden Markov Model (HMM) speech synthesis system, which is a speech synthesis system based on a statistical model.
The speech synthesizer is composed of a learning unit 100 and a speech synthesis unit 200. The learning unit 100 includes a speech DB 202, an excitation source spectrum parameter extraction unit 401, a spectrum parameter extraction unit 402, and an HMM learning unit 403. The speech synthesis unit 200 includes a context-dependent HMM file 301, a language analysis unit 101, a from-HMM parameter generation unit 404, an excitation source generation unit 405, and a synthesis filter 303.
The learning unit 100 has a function for causing the context-dependent HMM file 301 to learn from speech information stored in the speech DB 202. Many pieces of speech information are prepared in advance and stored as samples in the speech DB 202. As shown by the example in the diagram, each piece of speech information consists of a speech signal to which labels (arayuru (“every”), nuuyooku (“New York”), and so on) identifying parts of the waveform, such as phonemes, have been added. The excitation source spectrum parameter extraction unit 401 and the spectrum parameter extraction unit 402 extract an excitation source parameter sequence and a spectrum parameter sequence, respectively, for each speech signal retrieved from the speech DB 202. The HMM learning unit 403 uses the labels and time information retrieved from the speech DB 202 along with the speech signal to perform HMM learning processing on the excitation source parameter sequence and the spectrum parameter sequence. The learned HMM is stored in the context-dependent HMM file 301. A multi-spatial distribution HMM is used to model the parameters of the excitation source. The multi-spatial distribution HMM is an HMM extended so that the dimensionality of the parameter vector is allowed to differ from one observation to the next; pitch including a voiced/unvoiced flag is an example of a parameter sequence in which the dimensionality changes in this way. In other words, the parameter vector is one-dimensional when voiced, and zero-dimensional when unvoiced. The learning unit performs learning based on this multi-spatial distribution HMM. More specific examples of label information are indicated below; each HMM holds these as attribute names (contexts).
phonemes (previous, current, following)
mora position of current phoneme within accent phrase
parts of speech, conjugate forms, conjugate type (previous, current, following)
mora length and accent type within accent phrase (previous, current, following)
position of current accent phrase and voicing or lack thereof before and after
mora length of breath groups (previous, current, following)
position of current breath group
mora length of the sentence
Such HMMs are called context-dependent HMMs.
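The variable-dimension pitch observation used by the multi-spatial distribution HMM (one-dimensional when voiced, zero-dimensional when unvoiced) can be sketched as follows. The use of a single Gaussian for the voiced space, the space weight, and the function name are assumptions for illustration; the source does not give the output distribution.

```python
import math
from typing import Optional


def msd_log_likelihood(pitch: Optional[float],
                       voiced_weight: float,
                       mean: float,
                       var: float) -> float:
    """Log-likelihood of one pitch observation for a two-space state.

    pitch is None for an unvoiced frame (zero-dimensional observation)
    and a float for a voiced frame (one-dimensional observation).
    Assumes 0 < voiced_weight < 1 and var > 0.
    """
    if pitch is None:
        # Zero-dimensional space: there is no value to score, so only
        # the weight of the unvoiced space contributes.
        return math.log(1.0 - voiced_weight)
    # One-dimensional space: weight of the voiced space times a Gaussian.
    gauss = -0.5 * (math.log(2.0 * math.pi * var)
                    + (pitch - mean) ** 2 / var)
    return math.log(voiced_weight) + gauss
```

Representing an unvoiced frame as `None` rather than as a dummy pitch value of zero keeps the "no dimensions" case distinct from a voiced frame whose value happens to be small.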
The speech synthesis unit 200 has a function for generating read-aloud type speech signal sequences from an arbitrary piece of electronic text. The language analysis unit 101 analyzes the inputted text and converts it to label information, which is a phoneme array. The from-HMM parameter generation unit 404 searches the context-dependent HMM file 301 based on the label information outputted by the language analysis unit 101, and concatenates the obtained context-dependent HMMs to construct a sentence HMM. The excitation source generation unit 405 generates excitation source parameters from the obtained sentence HMM by using a parameter generation algorithm. In addition, the from-HMM parameter generation unit 404 generates a sequence of spectrum parameters. Then, the synthesis filter 303 generates synthetic speech.
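The lookup-and-concatenate step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the context key is reduced to the (previous, current, following) phoneme, each state carries one scalar mean and a fixed duration, every required context is assumed to be present in the file, and parameter generation simply repeats each state mean. The source does not specify the parameter generation algorithm, so that last step is a stand-in only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical, reduced context: (previous, current, following) phoneme.
Context = Tuple[str, str, str]


@dataclass(frozen=True)
class State:
    mean: float    # mean of a scalar spectrum parameter (assumed)
    duration: int  # frames spent in this state (assumed fixed)


def build_sentence_hmm(phonemes: List[str],
                       hmm_file: Dict[Context, List[State]]) -> List[State]:
    # Pad with silence so the first and last phonemes have neighbours.
    padded = ["sil"] + list(phonemes) + ["sil"]
    sentence: List[State] = []
    for i in range(1, len(padded) - 1):
        context = (padded[i - 1], padded[i], padded[i + 1])
        # Look up the context-dependent HMM and append its states.
        sentence.extend(hmm_file[context])
    return sentence


def generate_parameters(sentence: List[State]) -> List[float]:
    # Stand-in for the parameter generation algorithm: emit each
    # state's mean for the number of frames the state lasts.
    frames: List[float] = []
    for state in sentence:
        frames.extend([state.mean] * state.duration)
    return frames
```

Because every frame is derived from learned state statistics rather than from a stored waveform, the output is defined for any phoneme sequence whose contexts exist in the file.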
Moreover, the method of Patent Reference 3 (Japanese Laid-Open Patent Publication No. 9-62295 (paragraphs 0030 to 0031; FIG. 1)) can be given as an example of a method of combining real speech waveforms and parameters. FIG. 3 is a diagram showing a configuration of a speech synthesizer according to Patent Reference 3.
In the speech synthesizer of Patent Reference 3, a phoneme symbol analysis unit 1 is provided, the output of which is connected to a control unit 2. In addition, a personal information DB 10 is provided in the speech synthesizer, and is connected with the control unit 2. Furthermore, a natural speech element channel 12 and a synthetic speech element channel 11 are provided in the speech synthesizer. A speech element DB 6 and a speech element readout unit 5 are provided within the natural speech element channel 12. Similarly, a speech element DB 4 and a speech element readout unit 3 are provided within the synthetic speech element channel 11. The speech element readout unit 5 is connected with the speech element DB 6. The speech element readout unit 3 is connected with the speech element DB 4. The outputs of the speech element readout unit 3 and the speech element readout unit 5 are connected to two inputs of a mixing unit 7, and the output of the mixing unit 7 is inputted into an oscillation control unit 8. The output of the oscillation control unit 8 is inputted into an output unit 9.
Various types of control information are outputted from the control unit 2. The control information includes a natural speech element index, a synthetic speech element index, mixing control information, and oscillation control information. The natural speech element index is inputted into the speech element readout unit 5 of the natural speech element channel 12. The synthetic speech element index is inputted into the speech element readout unit 3 of the synthetic speech element channel 11. The mixing control information is inputted into the mixing unit 7. The oscillation control information is inputted into the oscillation control unit 8.
This method mixes synthetic speech elements, which are based on parameters created in advance, with recorded natural speech elements; the natural speech elements and synthetic speech elements are mixed in CV units (units that combine a consonant and a vowel, and which correspond to one syllable in Japanese) while the mixing ratio is changed over time. It is thus possible to reduce the amount of stored information compared with the case where only natural speech elements are used, and to obtain synthetic speech with a lower amount of computation.
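Mixing two elements while the ratio changes over time can be illustrated with a short sketch. The linear ramp, the equal-length requirement, and the function name are assumptions for illustration; the source states only that the ratio changes temporally within a CV unit.

```python
from typing import List


def mix_elements(natural: List[float],
                 synthetic: List[float],
                 start_ratio: float,
                 end_ratio: float) -> List[float]:
    """Mix two equal-length elements sample by sample.

    The share of the natural element moves linearly (assumed) from
    start_ratio at the first sample to end_ratio at the last sample.
    """
    if len(natural) != len(synthetic):
        raise ValueError("elements must have the same length")
    n = len(natural)
    mixed: List[float] = []
    for i, (nat, syn) in enumerate(zip(natural, synthetic)):
        # Position within the unit, from 0.0 to 1.0.
        t = i / (n - 1) if n > 1 else 0.0
        ratio = start_ratio + (end_ratio - start_ratio) * t
        mixed.append(ratio * nat + (1.0 - ratio) * syn)
    return mixed
```

With `start_ratio=1.0` and `end_ratio=0.0`, the unit begins as purely natural speech and ends as purely synthetic speech, so the character of the output differs between the start and the end of the unit.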
However, with the configuration of the above-mentioned conventional waveform concatenation-type speech synthesizer, only speech elements stored in the speech element DB 202 in advance can be used in speech synthesis. In other words, in the case where there are no speech elements whose prosody resembles the prosody generated by the prosody generation unit 201, speech elements whose prosody differs considerably from the generated prosody must be selected. Therefore, there is a problem in that the sound quality decreases locally. Moreover, this problem becomes even more apparent in the case where a sufficiently large speech element DB 202 cannot be built.
On the other hand, with the configuration of the conventional speech synthesizer based on statistical models (Patent Reference 2), synthesis parameters are generated statistically, based on context labels for the phonetic symbols and accent information outputted from the language analysis unit 101, by using a hidden Markov model (HMM) learned statistically from the pre-recorded speech DB 202. It is thus possible to obtain synthetic speech of stable quality for all phonemes. However, with statistical learning based on hidden Markov models, there is a problem in that the subtle properties of each speech waveform (such as microproperties, which are subtle fluctuations in phonemes that affect the naturalness of the synthesized speech) are lost through the statistical processing; as a result, the sense of real speech in the synthetic speech decreases, and the speech becomes lifeless.
Moreover, with the conventional parameter integration method (Patent Reference 3), the synthetic speech elements and the natural speech elements are mixed at a ratio that changes over time, and thus there is a problem in that it is difficult to obtain consistent quality over the entire time period, and the quality of the speech changes over time.
An object of the present invention, which has been conceived in light of these problems, is to provide synthetic speech of high and stable quality.