In recent years, speech synthesis techniques have been used in various apparatuses, such as car navigation systems. The following methods are available for synthesizing a speech waveform.
(1) Speech Synthesis Based on Source-Filter Models
Feature parameters of speech, such as formants and the cepstrum, are used to configure a speech synthesis filter. The filter is excited by an excitation signal derived from the fundamental frequency and voiced/unvoiced information, so as to obtain a synthetic sound.
(2) Speech Synthesis Based on Waveform Processing
A speech waveform unit such as a diphone or a triphone is modified to have a desired prosody (fundamental frequency, duration and power) and is then connected to other units. The PSOLA (Pitch Synchronous Overlap and Add) method is a representative example.
(3) Speech Synthesis by Concatenation of Waveform
Speech waveform units such as syllables, words and phrases are connected.
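As an illustration of method (1) above, the following is a minimal sketch of source-filter synthesis, not a method taken from any of the cited documents. It assumes a single two-pole resonator standing in for the speech synthesis filter, an impulse train as the voiced excitation and white noise as the unvoiced excitation. The function names, the 16 kHz sampling rate and the formant values are hypothetical choices made only for this example.

```python
import math
import random


def make_excitation(n_samples, f0_hz, voiced, sample_rate=16000, seed=0):
    """Impulse train at the fundamental period if voiced, white noise otherwise."""
    if voiced:
        period = int(round(sample_rate / f0_hz))
        return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def resonator_coeffs(formant_hz, bandwidth_hz, sample_rate=16000):
    """Denominator coefficients (a1, a2) of a two-pole resonator for one formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius, < 1
    theta = 2.0 * math.pi * formant_hz / sample_rate     # pole angle
    return -2.0 * r * math.cos(theta), r * r


def all_pole_filter(x, a1, a2):
    """Apply y[n] = x[n] - a1*y[n-1] - a2*y[n-2] to the excitation x."""
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample - a1 * y1 - a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


# Hypothetical parameters: 100 Hz pitch, one formant at 700 Hz (assumed values).
excitation = make_excitation(800, f0_hz=100.0, voiced=True)
a1, a2 = resonator_coeffs(formant_hz=700.0, bandwidth_hz=100.0)
synthetic = all_pole_filter(excitation, a1, a2)
```

A practical synthesizer would update the filter coefficients frame by frame from stored feature parameters. The sketch keeps them fixed so that the excitation and filter roles stay visible.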
In general, (1) speech synthesis based on source-filter models and (2) speech synthesis based on waveform processing are suited to apparatuses with limited storage capacity, because these methods require less storage for the set of speech feature parameters or the set of speech waveform units (segment set) than (3) speech synthesis by concatenation of waveform. Method (3) uses longer speech waveform units than methods (1) and (2). It therefore requires from more than ten MB to several hundred MB of storage for the segment set per speaker, and so is suited to apparatuses with abundant storage capacity, such as a general-purpose computer.
To generate a high-quality synthetic sound by speech synthesis based on source-filter models or speech synthesis based on waveform processing, it is necessary to create the segment set in consideration of differences in phoneme environment. For instance, a segment set that depends on the phoneme context and takes the surrounding phoneme environment into account (a triphone set) yields a higher-quality synthetic sound than a segment set that does not (a monophone set). Although the numbers differ somewhat depending on the language and on the definition of the monophone, a segment set contains several tens of segments in the case of the monophone, from several hundred to well over a thousand in the case of the diphone, and from several thousand to several tens of thousands in the case of the triphone. Consequently, when speech synthesis runs on an apparatus with limited resources, such as a cell phone or a home electric appliance, a constraint on the storage capacity of a ROM or the like may make it necessary to reduce the number of segments in a segment set that takes the phoneme environment into account, such as a triphone set or a diphone set.
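The growth in segment count described above can be sketched with simple arithmetic. The example assumes a hypothetical inventory of 40 phonemes and treats every combination as possible, so the figures are combinatorial upper bounds rather than counts from any real language. The ROM size and the bytes per segment are likewise assumed values chosen only for illustration.

```python
def max_unit_counts(n_phonemes):
    """Upper bounds on the number of distinct units for a phoneme inventory."""
    return {
        "monophone": n_phonemes,
        "diphone": n_phonemes ** 2,
        "triphone": n_phonemes ** 3,
    }


def segments_that_fit(rom_bytes, bytes_per_segment):
    """Number of whole segments that fit in a given storage budget."""
    return rom_bytes // bytes_per_segment


# Assumed values: 40 phonemes, 1 MB of ROM, 2,000 bytes per stored segment.
counts = max_unit_counts(40)
budget = segments_that_fit(rom_bytes=1_000_000, bytes_per_segment=2_000)

for unit, n in counts.items():
    verdict = "fits" if n <= budget else "needs reduction"
    print(f"{unit}: at most {n} segments, {verdict} in a budget of {budget}")
```

Under these assumptions only the monophone set fits without reduction, which is the situation that motivates clustering the diphone or triphone set.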
There are two approaches to reducing the number of segments in the segment set: performing clustering on the set of voice units from which the segment set is created (the entire speech database for training), and applying clustering to a segment set that has already been created by some method.
As for the former approach, in which the segment set is created by clustering the entire speech database for training, the following methods are available: a method that performs data-driven clustering, taking the phoneme environment into account, on the entire speech database for training, obtains a centroid pattern for each cluster, and selects among these patterns at synthesis time (Japanese Patent No. 2583074, for instance); and a method that performs knowledge-based clustering, taking the phoneme environment into account, by grouping identifiable phoneme sets (Japanese Patent Laid-Open No. 9-90972, for instance).
As for the latter approach, in which clustering is applied to a segment set created by some method, there is a method that reduces the number of segments by applying an HMnet to a segment set in units of CV or VC prepared in advance (Japanese Patent Laid-Open No. 2001-92481, for instance).
These conventional methods have the following problems.
First, in the technique of Japanese Patent No. 2583074, the clustering is based only on a distance measure between phoneme patterns (segment set), without using specialized linguistic, phonological or phonetic knowledge. Consequently, there are cases where a centroid pattern is generated from phonologically dissimilar (unidentifiable) segment sets. A synthetic sound generated from such a centroid pattern suffers from problems such as a lack of intelligibility. In other words, rather than simply clustering phoneme environments such as triphones, it is necessary to perform the clustering by identifying phonologically similar triphones.
Japanese Patent Laid-Open No. 9-90972 discloses a clustering technique that takes the phoneme environment into account by grouping identifiable phoneme sets, in order to deal with the problems of Japanese Patent No. 2583074. More precisely, however, the technique used in Japanese Patent Laid-Open No. 9-90972 is knowledge-based clustering: it identifies a preceding phoneme of a long vowel with a preceding phoneme of a short vowel, identifies a succeeding phoneme of a long vowel with a succeeding phoneme of a short vowel, represents a preceding phoneme by one short vowel if the phoneme is an unvoiced stop, and represents a succeeding phoneme by one unvoiced stop if the succeeding phoneme is an unvoiced stop. The applied knowledge is very simple and is applicable only when the unit of speech is the triphone. Consequently, the technique of Japanese Patent Laid-Open No. 9-90972 cannot be applied to segment sets other than triphone sets, such as diphone sets, cannot deal with languages other than Japanese, and cannot produce a segment set with a desired number of segments (a scalable segment set).
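Knowledge-based clustering of this kind can be sketched as a fixed rewriting of triphone contexts. The example below is not the procedure of Japanese Patent Laid-Open No. 9-90972. It assumes one reading of two of the rules above, namely that a long-vowel context is treated as the corresponding short vowel and that any succeeding unvoiced stop is replaced by a single representative stop. The phoneme symbols, the choice of "t" as the representative and the function names are hypothetical.

```python
# Assumed phoneme notation: a trailing ":" marks a long vowel.
LONG_TO_SHORT = {"a:": "a", "i:": "i", "u:": "u", "e:": "e", "o:": "o"}
UNVOICED_STOPS = {"p", "t", "k"}
STOP_REPRESENTATIVE = "t"  # arbitrary choice of the single representative stop


def merge_context(triphone):
    """Rewrite the left and right contexts of a (prev, center, next) triphone."""
    prev, center, nxt = triphone
    prev = LONG_TO_SHORT.get(prev, prev)
    nxt = LONG_TO_SHORT.get(nxt, nxt)
    if nxt in UNVOICED_STOPS:
        nxt = STOP_REPRESENTATIVE
    return (prev, center, nxt)


def cluster_triphones(triphones):
    """Group triphones whose rewritten contexts coincide."""
    clusters = {}
    for tri in triphones:
        clusters.setdefault(merge_context(tri), []).append(tri)
    return clusters


triphones = [
    ("a", "k", "i"), ("a:", "k", "i"), ("a", "k", "i:"),
    ("o", "s", "p"), ("o", "s", "k"), ("o", "s", "a"),
]
clusters = cluster_triphones(triphones)
for key, members in clusters.items():
    print(key, "<-", members)
```

Because the rules are fixed, the number of resulting clusters is determined by the rule table and cannot be adjusted to a target size, which illustrates the lack of scalability noted above.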
“English Speech Synthesis based on Multi-level context Oriented Clustering Method” by Nakajima (IEICE, SP92-9, 1992) (hereafter, “Non-Patent Document 1”) and “Speech Synthesis by a Syllable as a Unit of Synthesis Considering Environment Dependency—Generating Phoneme Clusters by Environment Dependent Clustering” by Hashimoto and Saito (Acoustical Society of Japan Lecture Articles, p. 245-246, September 1995) (hereafter, “Non-Patent Document 2”) disclose a method that combines clustering based on the phonological environment with clustering based on the phoneme environment, in order to deal with the problems in Japanese Patent No. 2583074 and Japanese Patent Laid-Open No. 9-90972. According to Non-Patent Document 1 and Non-Patent Document 2, these methods allow clustering that identifies phonologically similar triphones, application to segment sets other than the triphone set, handling of languages other than Japanese, and creation of scalable segment sets. In Non-Patent Document 1 and Non-Patent Document 2, however, the segment set is determined by performing the clustering on all of the speech segments for training. Consequently, spectral distortion within a cluster is considered, but spectral distortion at a connection point between segments (concatenation distortion) is not. In addition, an appropriate selection result may not be obtained: Non-Patent Document 2 describes that a selection was made with an emphasis on consonants rather than vowels, resulting in lower sound quality of the vowels. In other words, a segment set selected by an automatic technique is not necessarily optimal, and the sound quality can often be improved by manually replacing some of its segments with other segments.
For this reason, what is required is a method that applies the clustering to the segment set rather than to the entire set of speech segments for training.
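The difference between the two kinds of distortion can be sketched as follows. The example assumes that each segment is a sequence of spectral feature frames and that squared Euclidean distance serves as the distortion measure. Both are simplifying assumptions made for this illustration, and the function names and numeric values are hypothetical rather than taken from the cited documents.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def intra_cluster_distortion(vectors):
    """Mean squared distance of the cluster members from their centroid."""
    c = centroid(vectors)
    return sum(sq_dist(v, c) for v in vectors) / len(vectors)


def concatenation_distortion(left_segment, right_segment):
    """Spectral mismatch between the last frame of one segment and the first of the next."""
    return sq_dist(left_segment[-1], right_segment[0])


# Assumed toy data: two cluster members, and two segments to be joined.
cluster_members = [[1.0, 2.0], [3.0, 2.0]]
left = [[0.0, 0.0], [1.0, 1.0]]
right = [[4.0, 5.0], [4.0, 4.0]]

within = intra_cluster_distortion(cluster_members)
join = concatenation_distortion(left, right)
print("intra-cluster distortion:", within)
print("concatenation distortion:", join)
```

A clustering criterion that minimizes only the first quantity can select segments that sit close to their cluster centroids and still join poorly with their neighbours.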
Japanese Patent Laid-Open No. 2001-92481 discloses a method of reducing the number of segments by applying the HMnet to a selected segment set in units of CV or VC. However, the HMnet used in this method is a form of context clustering based on a maximum likelihood criterion, called a sequential state division method. As a result, the obtained HMnet may have a number of phoneme sets shared in one state, but how the phoneme sets are shared is completely data-dependent. Unlike Japanese Patent Laid-Open No. 9-90972 or Non-Patent Document 1 and Non-Patent Document 2, the identifiable phoneme sets are not grouped, and the clustering is not performed with such groups as a constraint. Consequently, unidentifiable phoneme sets may be shared in the same state, and so the same problem as in Japanese Patent No. 2583074 occurs.
In addition, there is the following problem relating to the creation of a segment set for multiple speakers. Japanese Patent No. 2583074 discloses a method of performing the clustering with a speaker factor added to the phoneme environment factors. However, the feature parameter used for the clustering is speech spectral information, which does not include prosody information such as voice pitch (fundamental frequency). Consequently, when this technique is applied to multiple speakers whose prosody information differs considerably, such as when creating a segment set for a male speaker and a female speaker, the clustering ignores the prosody information; that is, it does not take into account the prosody information that applies at the time of speech synthesis.
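The effect of leaving prosody out of the clustering feature can be sketched with a toy distance computation. The example assumes two segments with identical spectral features but different fundamental frequencies. The prosody-aware distance, with its log-F0 term and weight, is a hypothetical measure introduced only to make the contrast visible and is not taken from any of the cited documents.

```python
import math


def spectral_distance(a, b):
    """Euclidean distance between the spectral feature vectors of two segments."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a["spectrum"], b["spectrum"])))


def prosody_aware_distance(a, b, f0_weight=1.0):
    """Spectral distance plus a weighted gap in log fundamental frequency (assumed form)."""
    log_f0_gap = abs(math.log(a["f0_hz"]) - math.log(b["f0_hz"]))
    return spectral_distance(a, b) + f0_weight * log_f0_gap


# Assumed toy segments: the same spectrum, one octave apart in pitch.
male = {"spectrum": [1.0, 0.5, 0.2], "f0_hz": 110.0}
female = {"spectrum": [1.0, 0.5, 0.2], "f0_hz": 220.0}

print("spectral only:", spectral_distance(male, female))
print("with log-F0 term:", prosody_aware_distance(male, female))
```

With a purely spectral measure the two segments are indistinguishable and would fall into the same cluster, even though their pitch differs by an octave.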