1. Field of the Invention
The present invention relates to a speech segment preparing method, a speech synthesizing method, and apparatus therefor, applicable to telephone inquiry services, speech information guide systems, speech rule synthesizing apparatus for personal computers, and the like.
2. Related Art of the Invention
A speech rule synthesizing technology for converting text into speech can be utilized, for example, for listening to an explanation or an electronic mail while doing another task on a personal computer or the like, or for proofreading by ear a manuscript written on a word processor. Moreover, by incorporating an interface using speech synthesis into a device such as an electronic book, text stored on a floppy disk, CD-ROM, or the like can be read without using a liquid crystal display or the like.
The speech synthesizing apparatus used for such purposes is required to be small and inexpensive. Hitherto, the parameter synthesizing method, the compressed recording and reproducing method, and others have been used for such applications. However, since the conventional speech synthesizing methods require special hardware such as a DSP (digital signal processor) or a memory of large capacity, applications for such uses have rarely been attempted.
There are two methods of converting text into speech. In one, a rule for a chain of phonemes is made by a model, and speech is synthesized while the parameters are varied by the rule according to an objective text. In the other, speech is analyzed in small phoneme chain units such as the CV unit and VCV unit (C standing for a consonant, and V for a vowel), all necessary phoneme chains are collected from actual speech and stored as segments, and speech is synthesized by connecting the segments according to an objective text. Herein, the former is called the parameter synthesizing method, and the latter the connection synthesizing method.
A representative parameter synthesizing method is the formant synthesizing method. This method separates the speech forming process into a speech source model of vocal cord vibration and a transfer function model of the vocal tract, and synthesizes the desired speech by varying the parameters of the two models over time. A representative parameter used in the formant synthesizing method is the formant, that is, the position of a peak of the speech spectrum on the frequency axis. These parameters are generated by using rules based on phonetic findings and a table storing representative values of the parameters.
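The separation into a speech source and a vocal tract filter can be illustrated with a minimal sketch. It assumes a second-order digital resonator of the kind commonly used in formant synthesizers to realize each formant; the text above does not specify a filter structure, so this choice, the function names, and the formant values in the usage note are all illustrative assumptions.

```python
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2].

    Assumed structure: a two-pole resonator with center frequency
    freq_hz, bandwidth bw_hz, sampling rate fs, and unity gain at DC.
    """
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c

def resonate(x, freq_hz, bw_hz, fs):
    """Filter the sample list x through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = a * s + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_train(n, period):
    """Crude speech source model: one unit pulse per pitch period."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def synthesize_vowel(formants, f0, fs, n):
    """Pass the source through one resonator per (frequency, bandwidth) pair."""
    x = impulse_train(n, round(fs / f0))
    for freq_hz, bw_hz in formants:
        x = resonate(x, freq_hz, bw_hz, fs)
    return x
```

For example, `synthesize_vowel([(700, 80), (1200, 90)], f0=100, fs=10000, n=400)` produces 400 samples of a vowel-like sound from two hypothetical formants. Each output sample costs several multiplications per formant, which gives a sense of the computational cost of the vocal tract filter.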
The parameter synthesizing method has a high computational cost, for example in calculating the vocal tract transfer function, and a DSP or the like is indispensable for real-time synthesis. Moreover, a multitude of rules are involved in parameter control, which makes it difficult to improve the speech quality. On the other hand, the table and rules are small in data quantity, and hence a small memory capacity is sufficient.
By contrast, the connection synthesizing method is available in the following two types, depending on the format in which the segments are stored. One is the parameter connection method, in which the segments are converted into PARCOR coefficients or LSP parameters by using a speech model. The other is the waveform connection method, in which the speech waveforms are accumulated directly without using a speech model.
In the parameter connection method, the speech is segmented into small units such as the CV syllable, CVC, and VCV, converted into parameters such as PARCOR coefficients, accumulated in the memory, and reproduced as required. Since the memory format is the speech parameter, the pitch or time length can be changed easily when synthesizing, so that the segments can be connected smoothly. Besides, the required memory capacity is relatively small. A shortcoming, however, is that the amount of calculation for synthesizing is relatively large, and hence dedicated hardware such as a DSP is required. Moreover, since the speech model is imperfect, there is a limit to the sound quality of the speech reproduced from the parameters.
As the waveform connection method, on the other hand, known methods include accumulating the speech directly in the memory, and compressing and coding the speech before accumulating it in the memory and reproducing it when necessary. For compressive coding, μ-law coding, ADPCM, and others are used. With the waveform connection method, it is possible to synthesize the speech at higher fidelity than in the parameter connection method.
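As an illustration of the compressive coding mentioned here, the following is a minimal sketch of μ-law companding using the continuous logarithmic formula with μ = 255. The value of μ, the uniform 8-bit quantizer, and the function names are assumptions made for this example; it is not the segmented encoding of any particular telephony standard.

```python
import math

MU = 255  # assumption: the value commonly paired with 8-bit mu-law coding

def mu_law_compress(x, mu=MU):
    """Compress a sample in [-1, 1] logarithmically into [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

def encode_8bit(x):
    """Compress, then quantize uniformly to an integer code 0..255."""
    y = mu_law_compress(x)
    return int(round((y + 1.0) / 2.0 * 255))

def decode_8bit(code):
    """Map an integer code 0..255 back to a sample in [-1, 1]."""
    y = code / 255 * 2.0 - 1.0
    return mu_law_expand(y)
```

Because the compression curve is steep near zero, low-amplitude samples receive finer quantization steps than they would under uniform 8-bit coding, which is why such coding preserves fidelity at a reduced bit rate.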
When the contents of the speech to be synthesized are limited to a few varieties, the speech may be recorded in sentence units, word units, or syllable units, and edited as appropriate. For synthesizing an arbitrary text, however, the speech must be accumulated in still smaller segments, as in the parameter connection method. Unlike in parameter synthesis, it is difficult to change the pitch or time length, and therefore, for synthesis of high quality, segments having various pitches and time lengths must be prepared.
Hence, the memory capacity of each segment is more than ten times that of the parameter connection method, and a still larger memory capacity is needed if high quality is desired. The dominant factors increasing the memory capacity are the complexity of the phoneme chain units used as segments, and the preparation of segments to cover variations of pitch and time length.
As the phoneme chain unit, as mentioned above, the CV unit or the VCV unit may be considered. The CV unit is a combination of one consonant and one vowel, corresponding to one syllable of the Japanese language. Assuming 26 consonants and 5 vowels, there are 130 types of CV unit. When CV units are connected, the continuous waveform change from a preceding vowel to a consonant cannot be expressed, and naturalness is sacrificed. The VCV unit is a CV unit extended to include the preceding vowel. Hence, there are 650 types of VCV unit, five times as many as CV units.
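The unit counts above follow directly from the assumed inventory of 26 consonants and 5 vowels, as this short sketch shows. The function name is hypothetical.

```python
def count_units(n_consonants, n_vowels):
    """Return (number of CV units, number of VCV units).

    A CV unit pairs one consonant with one vowel; a VCV unit
    prefixes a CV unit with one preceding vowel.
    """
    cv = n_consonants * n_vowels
    vcv = n_vowels * cv
    return cv, vcv

cv_units, vcv_units = count_units(26, 5)
```

Since every VCV segment must be stored separately, the segment inventory, and with it the memory capacity, grows by the number of vowels when moving from CV units to VCV units.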
Concerning the pitch and time length, in the waveform connection method, unlike in the parameter connection method, it is difficult to change the pitch and time length of segments once they are prepared. Accordingly, segments covering these variations must be prepared beforehand from speech uttered at various pitches and time lengths, which increases the memory capacity.
Thus, synthesizing speech at high quality by the waveform connection method requires a large memory capacity, several times to scores of times that of the parameter synthesizing method. In principle, however, speech of extremely high quality can be synthesized by using a memory device of large capacity.
Therefore, the waveform connection method is superior as a high-quality speech synthesizing method, but it has two problems: the intrinsic pitch and time length of a speech segment cannot be controlled, and a memory device of large capacity is needed.
To solve these problems, the PSOLA (Pitch Synchronous Overlap Add) method has been proposed (Japanese Patent Publication No. 3-501896), in which the speech waveform is cut out with a window function in synchronism with the pitch, and the cut-out waveforms are overlapped at a desired pitch period when synthesizing.
In this method, the cut-out position is chosen so that the peak of the excitation pulse caused by closure of the glottis lies at the center of the window function. The window function should be of a shape that attenuates to 0 at both ends (for example, a Hanning window). The window length is twice the synthesized pitch period when the synthesized pitch period is shorter than the original pitch period of the speech waveform, and, conversely, twice the original pitch period when the synthesized pitch period is longer. The time length can also be controlled by decimating or repeating the cut-out pitch waveforms.
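The windowing and overlapping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the cited publication: the pitch period is taken as constant, the pitch marks are given as sample indices, and the function names and the rule for choosing which pitch waveform to repeat or decimate are hypothetical.

```python
import math

def window_length(orig_period, synth_period):
    """Twice the shorter of the original and synthesized pitch periods."""
    return 2 * min(orig_period, synth_period)

def cut_pitch_waveform(speech, mark, length):
    """Cut `length` samples centered on pitch mark `mark`.

    Each sample is weighted by a Hanning window that is 0 at the
    start and peaks at the center, where the pitch mark lies.
    """
    half = length // 2
    out = []
    for n in range(length):
        i = mark - half + n
        s = speech[i] if 0 <= i < len(speech) else 0.0
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
        out.append(s * w)
    return out

def psola(speech, marks, orig_period, synth_period, n_periods):
    """Overlap-add pitch waveforms at intervals of synth_period.

    Choosing n_periods larger or smaller than len(marks) repeats or
    decimates pitch waveforms, which changes the time length.
    """
    length = window_length(orig_period, synth_period)
    out = [0.0] * (n_periods * synth_period + length)
    for k in range(n_periods):
        # Assumed selection rule: map output period k linearly onto the marks.
        src = marks[min(len(marks) - 1, k * len(marks) // n_periods)]
        start = k * synth_period
        for n, v in enumerate(cut_pitch_waveform(speech, src, length)):
            out[start + n] += v  # overlap-add
    return out
```

When the synthesized pitch period equals the original one, adjacent windows sum to 1 and the interior of the waveform is reproduced unchanged; a shorter `synth_period` raises the pitch, and a longer one lowers it. Note that the inner loop evaluates a cosine and two multiplications for every sample of every pitch waveform.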
As a result, a waveform of arbitrary pitch and time length can be synthesized from one speech segment, so that a synthesized sound of high quality can be obtained with a small memory capacity.
In this method, however, the problem is that the quantity of calculation is large when synthesizing the speech. This is because the pitch waveform must be cut out by using the window function at the time of synthesis, so that calculation of the trigonometric function and multiplication are performed frequently.
For example, the operations necessary for synthesizing one sample of waveform are as follows. Generating one sample of a pitch waveform requires one memory read for reading out the speech segment, one calculation of the trigonometric function for the Hanning window function, one addition (for giving a direct-current offset to the trigonometric function), one multiplication for calculating the angle to be given to the trigonometric function, and one multiplication for applying the window to the speech waveform by using the value of the trigonometric function. Since a synthesized waveform is produced by overlapping two pitch waveforms, one sample of synthesized waveform requires two memory accesses, two calculations of the trigonometric function, four multiplications, and three additions (two for the direct-current offsets and one for overlapping the two pitch waveforms) (see FIG. 19).
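The tally above can be checked with a small sketch that adds up the per-sample operations for the two overlapped pitch waveforms. The function and key names are hypothetical, and the counts are simply those enumerated in the text.

```python
from collections import Counter

def ops_per_pitch_sample():
    """Operations for one windowed sample of one pitch waveform."""
    return Counter(
        memory_read=1,  # read the speech segment sample
        trig=1,         # trigonometric function for the Hanning window
        add=1,          # direct-current offset of the window
        mul=2,          # angle calculation, then window times sample
    )

def ops_per_synth_sample(overlap=2):
    """Operations for one output sample built from `overlap` pitch waveforms."""
    total = Counter()
    for _ in range(overlap):
        total += ops_per_pitch_sample()
    # Summing `overlap` windowed samples takes overlap - 1 further additions.
    total["add"] += overlap - 1
    return total
```

Calling `ops_per_synth_sample()` yields two memory reads, two trigonometric calculations, four multiplications, and three additions, matching the figures stated above. These operations recur for every output sample, so the total load scales with the sampling rate.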
Incidentally, to prevent an increase in the number of phoneme chain units, a hybrid method has been proposed (Japanese Patent Application No. 6-050890). In this method, basically, segments are composed of CV units only, and the portion where the waveform changes from a vowel to a consonant is generated by the parameter synthesizing method. Therefore, the number of types of phoneme chain unit is about 130, and the operation rate of the parameter synthesizing portion can be lowered, so that the calculation cost can be kept low as compared with the pure parameter synthesizing method.
In the hybrid method, however, the calculation cost of the parameter synthesizing portion is still high. Furthermore, when the parameters are synthesized in real time or change at high speed, harmful noise may be caused by limited calculation precision or by the transient characteristics of the synthesis transfer function (the so-called filter). Accordingly, plopping, crackling, or other unusual sounds may be generated in the midst of the synthesized sound, and the sound quality deteriorates.