The present invention relates to a speech synthesizing method and apparatus and, more particularly, to power control on synthesized speech in a speech synthesizing process.
As a speech synthesizing method of obtaining desired synthesized speech, a method of generating synthesized speech by editing and concatenating speech segments in units of phonemes or CV/VC, VCV (C: Consonant; V: vowel), and the like is known. FIGS. 10A to 10D are views for explaining CV/VC and VCV as speech segment units. As shown in FIGS. 10A to 10D, CV/VC is a unit with a speech segment boundary set in each phoneme, and VCV is a unit with a speech segment boundary set in a vowel.
FIGS. 11A to 11D are views schematically showing an example of a method of changing the duration length and fundamental frequency of one speech segment. As shown in FIG. 11C, a speech waveform 1101 of one speech segment shown in FIG. 11A is divided into a plurality of small speech segments 1103 by a plurality of window functions 1102 in FIG. 11B. In this case, for a voiced sound portion (a voiced sound region in the second half of a speech waveform), a window function having a time width synchronous with the pitch of the original speech is used. For an unvoiced sound portion (an unvoiced sound region in the first half of the speech waveform), a window function having an appropriate time width (longer than that for a voiced sound portion) is used.
By repeating a plurality of small speech segments obtained in this manner, thinning out some of them, and changing the intervals, the duration length and fundamental frequency of synthesized speech 1104 can be changed as shown in FIG. 11D. For example, the duration length of synthesized speech can be reduced by thinning out small speech segments, and can be increased by repeating small speech segments. The fundamental frequency of synthesized speech can be increased by reducing the intervals between small speech segments of a voiced sound portion, and can be decreased by increasing the intervals between the small speech segments. By superimposing a plurality of small speech segments obtained by such repetition, thinning out, and interval changes, synthesized speech having a desired duration length and fundamental frequency can be obtained.
Power control for such synthesized speech can be performed as follows. Synthesized speech having a desired average power can be obtained by obtaining an estimated value p0 of the average power of speech segments (corresponding to a target average power) and an average power p of the synthesized speech obtained by the above procedure, and multiplying the synthesized speech obtained by the above procedure by (p/p0)1/2. That is, power control is executed in units of speech segments.
The above power control method suffers the following problems.
The first problem is associated with mismatching between a power control unit and a speech segment unit.
To perform stable power control, power control must be performed in units of periods of time with a certain length. In addition, a power variation needs to be small within a power control unit. As a unit that satisfies these conditions, a phoneme or the like may be used. However, the above unit like CV/VC or VCV has a phoneme boundary with a large variation within a speech segment, and hence the power variation is large in each speech segment. Therefore, this unit is not suitable as a power control unit.
A voiced sound portion greatly differs in power from an unvoiced sound portion. Basically, since a voiced/unvoiced sound can be uniquely determined from a phoneme type, the above difference poses no problem if the average power value of each type of phoneme is estimated. A close examination, however, reveals that there are exceptions to the relationship between phoneme types and voice/unvoiced sounds, and mismatching may occur. In addition, a phoneme boundary may differ from a voiced/unvoiced sound boundary by several msec to ten-odd msec. This is because a phoneme type and phoneme boundary are mainly determined by a vocal tract shape, whereas a voiced/unvoiced sound is determined by the presence/absence of vocal cord vibrations.
The present invention has been made in consideration of the above problems, and has as its object to perform proper power control even if a phoneme unit with power greatly varying within a speech segment is set as a unit for waveform edition.
In order to achieve the above object, according to the present invention, there is provided a speech synthesizing method comprising the division step of acquiring partial speech segments by dividing a speech segment in a predetermined unit with a phoneme boundary, the estimation step of estimating a power value of each partial speech segment obtained in the division step on the basis of a target power value, the changing step of changing the power value of each of the partial speech segments on the basis of the power value estimated in the estimation step, and the generating step of generating synthesized speech by using the partial speech segments changed in the changing step.
In order to achieve the above object, according to the present invention, there is provided a speech synthesizing apparatus comprising division means for acquiring partial speech segments by dividing a speech segment in a predetermined unit with a phoneme boundary, estimation means for estimating a power value of each partial speech segment obtained by the division means on the basis of a target power value, changing means for changing the power value of each of the partial speech segments on the basis of the power value estimated by the estimation means, and the generating means for generating synthesized speech by using the partial speech segments changed by the changing means.
Preferably, in changing the power value of each of the partial speech segments, for each of the partial speech segments, a corresponding reference power value is acquired, an amplitude change magnification is calculated on the basis of the power value estimated in the estimation step and the acquired reference power value, and a change to the estimated power value is made by changing an amplitude of the partial speech segment in accordance with the calculated amplitude change magnification. More specifically, an amplitude value of the partial speech segment is changed by using, as an amplitude change magnification, s being obtained by
s=(p/q)1/2
where p is the power value estimated in the estimation step, and q is the acquired reference power value.
Preferably, in estimating the power of each partial speech segment, whether each of the partial speech segments is a voiced or unvoiced sound is determined, and if it is determined that the partial speech segment is a voiced sound, a power value is estimated by using a parameter value for a voiced speech segment, and if it is determined that the speech segment is an unvoiced sound, a power value is estimated by using a parameter value of an unvoiced speech segment. Since parameter values suited for voiced and unvoiced sounds are used, power control can be performed more properly.
Preferably, in estimating the power value of each partial speech segment, a power estimation factor for each of the partial speech segments is acquired, and a parameter value corresponding to the acquired power estimation factor is acquired in accordance with the determination result on a voiced/unvoiced sound to estimate the power value. Preferably, the power estimation factor includes one of a phoneme type of the partial speech segment, a mora position of a synthesis target word of the partial speech segment, a mora count of the synthesis target word, and an accent type.
Preferably, a power estimation factor for a voiced sound is acquired if it is determined that the partial speech segment is a voiced sound, and a power estimation factor for an unvoiced sound is acquired if it is determined that the partial speech segment is an unvoiced sound. Since different power estimation factors can be used depending on whether a partial speech segment is a voiced or unvoiced sound, power control can be performed more properly.
Preferably, the amplitude of each partial speech segment is changed on the basis of the estimated power value and the acquired reference power value, and the reference power value corresponding to a partial speech segment of an unvoiced sound is set to relatively large. Since the amplitude magnification of a partial speech segment as an unvoiced sound can be relatively reduced, power control can be realized while high sound quality is maintained.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.