This invention relates to a speech synthesizing method and apparatus and, more particularly, to a speech synthesizing method and apparatus for controlling the power of synthesized speech.
A conventional speech synthesizing method for obtaining desired synthesized speech divides a pre-recorded phoneme unit into a plurality of sub-phoneme units and subjects those sub-phoneme units to processing such as interval modification, repetition and thinning out, thereby obtaining a composite sound having a desired duration and fundamental frequency.
FIGS. 5A to 5D are diagrams schematically illustrating a method of dividing a speech waveform into sub-phoneme units. A speech waveform shown in FIG. 5A is divided into sub-phoneme units of the kind illustrated in FIG. 5C using an extracting window function of the kind shown in FIG. 5B. Here an extracting window function synchronized to the pitch interval of original speech is applied to the portion of the waveform that is voiced (the latter half of the speech waveform), and an extracting window function having an appropriate interval is applied to the portion of the waveform that is unvoiced.
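The windowed extraction described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the Hann window shape, the fixed half width, and the names `hann`, `extract_units`, `marks` and `half_width` are hypothetical and do not come from the description. In a pitch-synchronous arrangement the marks in the voiced portion would lie one pitch interval apart and the window width would follow the local pitch; the sketch uses a single width for simplicity.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2). Assumed window shape."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def extract_units(waveform, marks, half_width):
    """Cut one windowed sub-phoneme unit around each mark.

    Each unit covers the samples from mark - half_width to mark + half_width.
    Samples that fall outside the waveform are treated as zero.
    """
    window = hann(2 * half_width + 1)
    units = []
    for m in marks:
        unit = []
        for k in range(-half_width, half_width + 1):
            idx = m + k
            sample = waveform[idx] if 0 <= idx < len(waveform) else 0.0
            unit.append(sample * window[k + half_width])
        units.append(unit)
    return units


# Hypothetical input: a constant waveform with two marks five samples apart.
units = extract_units([1.0] * 20, marks=[5, 10], half_width=5)
```

With marks spaced `half_width` samples apart, adjacent Hann windows sum to one, so superposing the unmodified units at their original positions approximately restores the original waveform away from its edges.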
The duration of synthesized speech can be shortened by thinning out the sub-phoneme units obtained by the window function, that is, by using only some of them. The duration of synthesized speech can be lengthened, on the other hand, by using these sub-phoneme units repeatedly.
By reducing the interval of the sub-phoneme units in the voiced portion, it is possible to raise the fundamental frequency of synthesized speech. Widening the interval of the sub-phoneme units, on the other hand, makes it possible to lower the fundamental frequency of synthesized speech.
Desired synthesized speech of the kind indicated in FIG. 5D is obtained by superposing the sub-phoneme units again after the repetition, thinning out and interval modification described above.
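The repetition, thinning out, interval modification and superposition described above can be sketched as follows. This is a minimal sketch, not the procedure of any particular synthesizer: the function names `select_units` and `overlap_add`, the nearest-index selection rule and the constant `interval` are assumptions made for illustration only.

```python
def select_units(units, target_count):
    """Choose target_count units by index mapping.

    A target_count smaller than len(units) thins the units out (shorter
    duration); a larger target_count repeats units (longer duration).
    """
    n = len(units)
    return [units[(i * n) // target_count] for i in range(target_count)]


def overlap_add(units, interval):
    """Superpose units whose start positions are `interval` samples apart.

    In a voiced portion, a smaller interval shortens the period and raises
    the fundamental frequency; a larger interval lowers it.
    """
    if not units:
        return []
    length = interval * (len(units) - 1) + max(len(u) for u in units)
    out = [0.0] * length
    for i, unit in enumerate(units):
        start = i * interval
        for k, sample in enumerate(unit):
            out[start + k] += sample
    return out


# Hypothetical units: four constant-valued units of three samples each.
demo_units = [[float(i)] * 3 for i in range(4)]
thinned = select_units(demo_units, 2)   # keeps units 0 and 2
repeated = select_units(demo_units, 6)  # uses units 0, 0, 1, 2, 2, 3
wide = overlap_add([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], 2)
narrow = overlap_add([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], 1)
```

Here the unit count sets the duration and the spacing sets the pitch, so the two properties can be adjusted separately.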
Control of the power of synthesized speech is performed in the following manner. Where a target phoneme average power p0 is given, the average power p of the synthesized speech obtained through the above-described procedure is determined, and that synthesized speech is multiplied by √(p0/p) to obtain synthesized speech having the desired average power. It should be noted that power is defined as the square of the amplitude or as a value obtained by integrating the square of the amplitude over a suitable interval. The volume of a composite sound is large if the power is large and small if the power is small.
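Because multiplying the amplitude by a gain g multiplies the power by g squared, a gain of √(p0/p) brings the average power from p to p0. The following sketch illustrates this, taking average power as the mean of the squared samples. The function names `average_power` and `scale_to_power`, and the handling of a silent input, are assumptions made for illustration.

```python
import math


def average_power(samples):
    """Mean of the squared amplitude over all samples."""
    return sum(s * s for s in samples) / len(samples)


def scale_to_power(samples, target_power):
    """Multiply every sample by sqrt(p0 / p) so the average power becomes p0."""
    p = average_power(samples)
    if p == 0.0:
        # A silent signal cannot be scaled to a nonzero power; return a copy.
        return list(samples)
    gain = math.sqrt(target_power / p)
    return [gain * s for s in samples]


# Hypothetical synthesized waveform with average power p = 2.5.
synthesized = [1.0, -1.0, 2.0, -2.0]
controlled = scale_to_power(synthesized, 10.0)  # gain = sqrt(10 / 2.5) = 2
```

Note that the single gain is applied to every sample, so voiced and unvoiced portions are scaled by the same factor.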
FIGS. 6A to 6E are diagrams useful in describing ordinary control of the power of synthesized speech. The speech waveform, extracting window function, sub-phoneme units and synthesized waveform in FIGS. 6A to 6D correspond to those of FIGS. 5A to 5D, respectively. FIG. 6E illustrates power-controlled synthesized speech obtained by multiplying the synthesized waveform of FIG. 6D by √(p0/p).
With the method of power control described above, however, unvoiced portions and voiced portions are enlarged by the same magnification and, as a result, there are instances where the unvoiced portions develop abnormal noise-like sounds. This leads to a decline in the quality of synthesized speech.