1. Field of the Invention
The present invention relates generally to a speech synthesis method for text-to-speech synthesis, and more particularly to a speech synthesis method for generating a speech signal from information such as a phoneme symbol string, a pitch and a phoneme duration.
2. Description of the Related Art
A method of artificially generating a speech signal from a given text is called "text-to-speech synthesis." Text-to-speech synthesis is generally carried out in three stages comprising a speech processor, a phoneme processor and a speech synthesis section. An input text is first subjected to morphological analysis and syntax analysis in the speech processor, and then to processing of accents and intonation in the phoneme processor. Through this processing, information such as a phoneme symbol string, a pitch and a phoneme duration is output. In the final stage, the speech synthesis section synthesizes a speech signal from information such as the phoneme symbol string, the pitch and the phoneme duration. Thus, the speech synthesis method for use in text-to-speech synthesis is required to speech-synthesize a given phoneme symbol string with a given prosody.
According to the operational principle of a speech synthesis apparatus for speech-synthesizing a given phoneme symbol string, basic characteristic parameter units (hereinafter referred to as "synthesis units") such as CV, CVC and VCV (V=vowel; C=consonant) are stored in a storage and selectively read out. The read-out synthesis units are connected, with their pitches and phoneme durations being controlled, whereby speech synthesis is performed. Accordingly, the stored synthesis units substantially determine the quality of the synthesized speech.
In the prior art, the synthesis units are prepared manually, relying on the skill of the persons preparing them. In most cases, synthesis units are sifted out from speech signals by trial and error, which requires a great deal of time and labor. Jpn. Pat. Appln. KOKAI Publication No. 64-78300 ("SPEECH SYNTHESIS METHOD") discloses a technique called "context-oriented clustering (COC)" as an example of a method of automatically and easily preparing synthesis units for use in speech synthesis.
The principle of COC will now be explained. Labels giving the phoneme names and phonetic contexts are attached to a number of speech segments. The labeled speech segments are classified into a plurality of clusters relating to the phonetic contexts on the basis of the distance between the speech segments. The centroid of each cluster is used as a synthesis unit. The phonetic context refers to a combination of all factors constituting the environment of a speech segment. The factors are, for example, the phoneme name of the speech segment, a preceding phoneme, a subsequent phoneme, a further subsequent phoneme, a pitch period, power, the presence/absence of stress, the position from an accent nucleus, the time from a breathing spell, the speed of speech, feeling, etc. The phoneme elements of each phoneme in actual speech vary depending on the phonetic context. Thus, if a synthesis unit is stored for each of the clusters relating to the phonetic context, natural speech can be synthesized in consideration of the influence of the phonetic context.
As has been described above, in text-to-speech synthesis, it is necessary to synthesize a speech by altering the pitch and duration of each synthesis unit to predetermined values. Owing to the alteration of the pitch and duration, the quality of the synthesized speech becomes slightly lower than the quality of the speech signal from which the synthesis unit was sifted out.
On the other hand, in the case of the COC, the clustering is performed on the basis of only the distance between speech segments. Thus, the effect of altering the pitch and duration at the time of synthesis is not considered at all. As a result, the clusters and the synthesis units obtained by the COC are not necessarily appropriate in terms of the quality of the synthesized speech obtained by actually altering the pitch and duration.
An object of the present invention is to provide a speech synthesis method capable of efficiently enhancing the quality of a synthesis speech generated by text-to-speech synthesis.
Another object of the invention is to provide a speech synthesis method suitable for obtaining a high-quality synthesis speech in text-to-speech synthesis.
Still another object of the invention is to provide a speech synthesis method capable of obtaining a synthesis speech with less spectral distortion due to alteration of the fundamental frequency.
The present invention provides a speech synthesis method wherein synthesis units, which will have less distortion with respect to a natural speech when they become a synthesis speech, are generated in consideration of influence of alteration of a pitch or a duration, and a speech is synthesized by using the synthesis units, thereby generating a synthesis speech close to a natural speech.
According to a first aspect of the invention, there is provided a speech synthesis method comprising the steps of: generating a plurality of synthesis speech segments by changing at least one of a pitch and a duration of each of a plurality of second speech segments in accordance with at least one of a pitch and a duration of each of a plurality of first speech segments; selecting a plurality of synthesis units from the second speech segments on the basis of a distance between the synthesis speech segments and the first speech segments; and generating a synthesis speech by selecting predetermined synthesis units from the synthesis units and connecting the predetermined synthesis units to one another.
The first and second speech segments are extracted from a speech signal as speech synthesis units such as CV, VCV and CVC. The speech segments represent extracted waves or parameter strings extracted from the waves by some method. The first speech segments are used for evaluating a distortion of a synthesis speech. The second speech segments are used as candidates of synthesis units. The synthesis speech segments represent synthesis speech waves or parameter strings generated by altering at least the pitch or duration of the second speech segments.
The distortion of the synthesis speech is expressed by the distance between the synthesis speech segments and the first speech segments. Thus, the speech segments, which reduce the distance or distortion, are selected from the second speech segments and stored as synthesis units. Predetermined synthesis units are selected from the synthesis units and are connected to generate a high-quality synthesis speech close to a natural speech.
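The selection step of the first aspect can be illustrated by the following minimal sketch. It is not the procedure of the embodiments: it assumes that speech segments are parameter strings (frames by dimensions), alters only the duration (by linear interpolation) and not the pitch, uses a mean squared distance, and picks units greedily. The names `match_duration`, `distortion` and `select_units` are hypothetical.

```python
import numpy as np

def match_duration(segment, n_frames):
    """Linearly resample a (frames x dims) parameter string to n_frames frames.
    Stand-in for the pitch/duration alteration; only duration is changed here."""
    segment = np.asarray(segment, dtype=float)
    src = np.linspace(0.0, len(segment) - 1, n_frames)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, len(segment) - 1)
    w = (src - lo)[:, None]
    return (1.0 - w) * segment[lo] + w * segment[hi]

def distortion(first, second):
    """Mean squared distance between a first segment and the synthesis
    segment obtained by altering a second segment to the first's duration."""
    first = np.asarray(first, dtype=float)
    synth = match_duration(second, len(first))
    return float(np.mean((synth - first) ** 2))

def select_units(first_segments, second_segments, n_units):
    """Greedily pick n_units second segments so that the summed distortion,
    with each first segment scored against its best chosen unit, is small.
    Greedy choice is an assumption of this sketch and need not be optimal."""
    d = np.array([[distortion(f, s) for s in second_segments]
                  for f in first_segments])
    chosen = []
    best = np.full(len(first_segments), np.inf)
    for _ in range(n_units):
        totals = [np.inf if j in chosen else np.minimum(best, d[:, j]).sum()
                  for j in range(d.shape[1])]
        j = int(np.argmin(totals))
        chosen.append(j)
        best = np.minimum(best, d[:, j])
    return chosen, float(best.sum())
```

In this sketch the first speech segments serve only to evaluate distortion, while the returned indices point into the second speech segments, which are the candidates stored as synthesis units.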
According to a second aspect of the invention, there is provided a speech synthesis method comprising the steps of: generating a plurality of synthesis speech segments by changing at least one of a pitch and a duration of each of a plurality of second speech segments in accordance with at least one of a pitch and a duration of each of a plurality of first speech segments; selecting a plurality of synthesis units from the second speech segments using information regarding a distance between the synthesis speech segments and the first speech segments; forming a plurality of phonetic context clusters using the information regarding the distance and the synthesis units; and generating a synthesis speech by selecting those of the synthesis units, which correspond to at least one of the phonetic context clusters which includes phonetic contexts of input phonemes, and connecting the selected synthesis units.
The phonetic contexts are factors constituting environments of speech segments. The phonetic context is a combination of factors, for example, a phoneme name, a preceding phoneme, a subsequent phoneme, a further subsequent phoneme, a pitch period, power, the presence/absence of stress, the position from an accent nucleus, the time from a breathing spell, the speed of speech, and feeling. The phonetic context cluster is a set of phonetic contexts, for example, "phoneme of segment=/ka/; preceding phoneme=/i/ or /u/; and pitch frequency=200 Hz."
According to a third aspect of the invention, there is provided a speech synthesis method comprising the steps of: generating a plurality of synthesis speech segments by changing at least one of a pitch and a duration of each of a plurality of second speech segments in accordance with at least one of the pitch and duration of each of a plurality of first speech segments labeled with phonetic contexts; generating a plurality of phonetic context clusters on the basis of a distance between the synthesis speech segments and the first speech segments; selecting a plurality of synthesis units corresponding to the phonetic context clusters from the second speech segments on the basis of the distance; and generating a synthesis speech by selecting those of the synthesis units, which correspond to the phonetic context clusters including phonetic contexts of input phonemes, and connecting the selected synthesis units.
According to the first to third aspects, the synthesis speech segments are generated and then spectrum-shaped. The spectrum-shaping is a process for synthesizing a "modulated" clear speech and is achieved by, e.g., filtering by means of an adaptive post-filter for performing formant emphasis or pitch emphasis.
In this way, the speech synthesized by connecting the synthesis units is spectrum-shaped, and the synthesis speech segments are similarly spectrum-shaped, thereby generating the synthesis units, which will have less distortion with respect to a natural speech when they become a final synthesis speech after spectrum shaping. Thus, a "modulated" clearer synthesis speech is obtained.
In the present invention, speech source signals and information on combinations of coefficients of a synthesis filter for receiving the speech source signals and generating a synthesis speech signal may be stored as synthesis units. In this case, if the speech source signals and the coefficients of the synthesis filter are quantized and the quantized speech source signals and information on combinations of the coefficients of the synthesis filter are stored, the number of speech source signals and coefficients of the synthesis filter, which are stored as synthesis units, can be reduced. Accordingly, the calculation time needed for learning synthesis units is reduced and the memory capacity needed for actual speech synthesis is decreased.
Moreover, at least one of the number of the speech source signals stored as the synthesis units and the number of the coefficients of the synthesis filter stored as the synthesis units can be made less than the total number of speech synthesis units or the total number of phonetic context clusters. Thereby, a high-quality synthesis speech can be obtained.
According to a fourth aspect of the invention, there is provided a speech synthesis method comprising the steps of: prestoring information on a plurality of speech synthesis units including at least speech spectrum parameters; selecting predetermined information from the stored information on the speech synthesis units; generating a synthesis speech signal by connecting the selected predetermined information; and emphasizing a formant of the synthesis speech signal by a formant emphasis filter whose filtering coefficient is determined in accordance with the spectrum parameters of the selected information.
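A formant emphasis filter whose coefficients follow the spectrum parameters can be sketched as below. The specific form H(z) = A(z/beta) / A(z/alpha), with 0 < beta < alpha < 1, is a commonly used adaptive post-filter and is an assumption of this sketch; the text above does not specify the filter structure. The function names and the default values of `alpha` and `beta` are likewise hypothetical, and `a` is taken to be the linear prediction polynomial with a[0] = 1.

```python
import numpy as np

def bandwidth_expand(a, gamma):
    """Return the coefficients of A(z/gamma): a[j] scaled by gamma**j."""
    a = np.asarray(a, dtype=float)
    return a * gamma ** np.arange(len(a))

def formant_emphasis(signal, a, alpha=0.8, beta=0.5):
    """Filter signal through H(z) = A(z/beta) / A(z/alpha).
    Assumes a[0] == 1. With beta < alpha the gain is raised near the
    formant peaks described by the linear prediction polynomial a."""
    num = bandwidth_expand(a, beta)   # zeros, pulled well inside the unit circle
    den = bandwidth_expand(a, alpha)  # poles, closer to the unit circle
    x = np.asarray(signal, dtype=float)
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for j in range(len(num)):
            if n - j >= 0:
                acc += num[j] * x[n - j]
        for j in range(1, len(den)):
            if n - j >= 0:
                acc -= den[j] * y[n - j]
        y[n] = acc
    return y
```

Because both the numerator and the denominator are derived from the spectrum parameters of the selected synthesis unit, the filter adapts as the selected unit changes. Setting `alpha` equal to `beta` makes the filter an identity, which is a convenient check.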
According to a fifth aspect of the invention, there is provided a speech synthesis method comprising the steps of: generating linear prediction coefficients by subjecting a reference speech signal to a linear prediction analysis; producing a residual pitch wave from a typical speech pitch wave extracted from the reference speech signal, using the linear prediction coefficients; storing information regarding the residual pitch wave as information of a speech synthesis unit in a voiced period; and synthesizing a speech, using the information of the speech synthesis unit.
According to a sixth aspect of the invention, there is provided a speech synthesis method comprising the steps of: storing information on a residual pitch wave generated from a reference speech signal and a spectrum parameter extracted from the reference speech signal; driving a vocal tract filter having the spectrum parameter as a filtering coefficient, by a voiced speech source signal generated by using the information on the residual pitch wave in a voiced period, and by an unvoiced speech source signal in an unvoiced period, thereby generating a synthesis speech; and generating the residual pitch wave from a typical speech pitch wave extracted from the reference speech signal, by using a linear prediction coefficient obtained by subjecting the reference speech signal to linear prediction analysis.
A speech synthesis apparatus shown in FIG. 1, according to a first embodiment of the present invention, mainly comprises a synthesis unit training section 1 and a speech synthesis section 2. It is the speech synthesis section 2 that actually operates in text-to-speech synthesis. The speech synthesis is also called "speech synthesis by rule." The synthesis unit training section 1 performs learning in advance and generates synthesis units.
More specifically, the residual pitch wave can be generated by filtering the speech pitch wave through a linear prediction inverse filter whose characteristics are determined by a linear prediction coefficient.
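The inverse filtering described here can be sketched as follows. The sketch assumes the autocorrelation method with the Levinson-Durbin recursion for the linear prediction analysis, which the text does not specify, and the names `lpc`, `inverse_filter` and `synthesis_filter` are hypothetical. The reference signal is assumed to be non-zero.

```python
import numpy as np

def lpc(reference, order):
    """Linear prediction analysis (autocorrelation method, Levinson-Durbin).
    Returns a[0..order] with a[0] = 1, so that A(z) = sum_j a[j] z^-j."""
    x = np.asarray(reference, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def inverse_filter(pitch_wave, a):
    """Residual pitch wave e[n] = sum_j a[j] * x[n-j] (inverse filter A(z))."""
    x = np.asarray(pitch_wave, dtype=float)
    return np.convolve(x, a)[:len(x)]

def synthesis_filter(residual, a):
    """All-pole filter 1/A(z); recovers the pitch wave from its residual."""
    e = np.asarray(residual, dtype=float)
    y = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * y[n - j]
        y[n] = acc
    return y
```

In use, `lpc` would be applied to the reference speech signal and `inverse_filter` to the typical speech pitch wave extracted from it; driving `synthesis_filter` with the stored residual then returns the pitch wave.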
In this context, the typical speech pitch wave refers to a non-periodic wave extracted from a reference speech signal so as to reflect spectrum envelope information of a quasi-periodic speech signal wave. The spectrum parameter refers to a parameter representing a spectrum or a spectrum envelope of a reference speech signal. Specifically, the spectrum parameter is an LPC coefficient, an LSP coefficient, a PARCOR coefficient, or a cepstrum coefficient.
If the residual pitch wave is generated by using the linear prediction coefficient from the typical speech pitch wave extracted from the reference speech signal, the spectrum of the residual pitch wave is complementary to the spectrum of the linear prediction coefficient in the vicinity of the formant frequency of the spectrum of the linear prediction coefficient. As a result, the spectrum of the voiced speech source signal generated by using the information on the residual pitch wave is emphasized near the formant frequency.
Accordingly, even if the spectrum of a voiced speech source signal departs from the peak of the spectrum of the linear prediction coefficient due to a change of the fundamental frequency of the synthesis speech signal with respect to the reference speech signal, the spectrum distortion that would otherwise make the amplitude of the synthesis speech signal at the formant frequency far smaller than that of the reference speech signal is reduced. In other words, a synthesis speech with less spectrum distortion due to a change of the fundamental frequency can be obtained.
In particular, if pitch synchronous linear prediction analysis synchronized with the pitch of the reference speech signal is adopted as the linear prediction analysis for the reference speech signal, the spectrum width of the spectrum envelope of the linear prediction coefficient becomes relatively large at the formant frequency. Accordingly, even if the spectrum of a voiced speech source signal departs from the peak of the spectrum of the linear prediction coefficient due to a change of the fundamental frequency of the synthesis speech signal with respect to the reference speech signal, the spectrum distortion that would otherwise make the amplitude of the synthesis speech signal at the formant frequency far smaller than that of the reference speech signal is similarly reduced.
Furthermore, in the present invention, a code obtained by compression-encoding a residual pitch wave may be stored as information on the residual pitch wave, and the code may be decoded for speech synthesis. Thereby, the memory capacity needed for storing information on the residual pitch wave can be reduced, and a great deal of residual pitch wave information can be stored with a limited memory capacity. For example, inter-frame prediction encoding can be adopted as compression-encoding.
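As one illustration of inter-frame prediction encoding, the following sketch predicts each residual pitch wave from the previously decoded one and stores only the quantized difference. The uniform quantizer, the step size, the equal length of all residual pitch waves and the function names are assumptions of this sketch and are not taken from the text.

```python
import numpy as np

def encode_interframe(frames, step):
    """Encode a (n_frames x length) array of residual pitch waves.
    Each frame is predicted by the previously decoded frame, and the
    prediction error is uniformly quantized to integer codes."""
    frames = np.asarray(frames, dtype=float)
    codes = np.empty(frames.shape, dtype=np.int64)
    acc = np.zeros(frames.shape[1], dtype=np.int64)  # decoded frame / step
    for i, frame in enumerate(frames):
        codes[i] = np.round((frame - acc * step) / step).astype(np.int64)
        acc = acc + codes[i]
    return codes

def decode_interframe(codes, step):
    """Rebuild the frames by accumulating the quantized differences."""
    return np.cumsum(np.asarray(codes, dtype=np.int64), axis=0) * step
```

Because the encoder predicts from the decoded frame rather than from the original one, the quantization error does not accumulate: in this sketch every decoded sample stays within about half a step of the original. When successive residual pitch waves are similar, the codes are small integers, which is what makes the stored representation compact.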
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.