The present invention relates generally to text-to-speech conversion systems and more particularly to the “training” of such systems.
In concatenative speech synthesis systems, small portions of natural speech are spliced together to form synthetic speech waveforms. Each of the portions of original speech has associated with it the original prosody (pitch and duration) contour that was uttered by the speaker. However, when small portions of natural speech arising from different utterances in the database are concatenated, the resulting synthetic speech does not tend to have natural-sounding prosody (i.e., pitch, which is instrumental in the perception of intonation and stress in a word).
A typical approach to combating this problem involves specifying a desired prosodic contour and then either imposing this contour on the synthetic speech using digital signal processing techniques or selecting segments whose prosody is naturally close to that contour. In this connection, a set of training data (i.e., speech utterances) is collected to provide the set of segments available for concatenation, as well as the data from which to infer the model of prosodic variation used to specify the desired prosodic contour. Typically, those data are provided by a single speaker. However, it has been found that the collection of such data from a single speaker imposes significant limitations on the subsequent efficacy of the text-to-speech system involved.
A need has thus been recognized in connection with facilitating the enrollment of training data for a text-to-speech system in a manner that overcomes the disadvantages and shortcomings of conventional efforts in this regard.
In accordance with at least one presently preferred embodiment of the present invention, multiple speakers are utilized in obtaining training data. Further, this will preferably involve suitable normalization of the data from each speaker to transform that data to mimic a canonical target speaker. For example, in building a prosodic model, the pitch values for a given utterance are divided by the average pitch over that utterance, yielding relative pitches which are comparable across multiple speakers; a value less than one implies a lowering of the pitch during that portion of the utterance while a value greater than one implies an elevation in pitch.
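By way of illustration only, the relative-pitch normalization just described may be sketched as follows. The function name `relative_pitch` and the sample pitch values are hypothetical, and the sketch assumes that each utterance is supplied as a list of pitch values in Hz taken from voiced portions only.

```python
from statistics import fmean


def relative_pitch(pitch_hz):
    """Divide each pitch value by the average pitch over the utterance.

    Hypothetical helper: `pitch_hz` is assumed to hold the pitch values
    (in Hz) of the voiced portions of a single utterance.
    """
    if not pitch_hz:
        raise ValueError("utterance has no pitch values")
    average = fmean(pitch_hz)
    return [value / average for value in pitch_hz]


# Two made-up utterances with the same contour shape but different
# absolute pitch ranges, as might come from two different speakers.
low_voice = [100.0, 120.0, 80.0]
high_voice = [200.0, 240.0, 160.0]

print(relative_pitch(low_voice))
print(relative_pitch(high_voice))
```

In this example both utterances yield the same relative contour, with values above one marking an elevation in pitch and values below one marking a lowering, which is what makes the data comparable across speakers.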
Broadly contemplated in accordance with at least one embodiment of the present invention are significant differences in comparison with some conventional efforts, in which the user is able to choose from several available voices, such as a man, woman, or child. In that case, completely separate systems are built, each of which relies on training data from a single speaker, i.e. the target voice. A switch may then be used to select one of the systems. However, in accordance with at least one embodiment of the present invention, a single system is built which relies on data from multiple speakers.
In one aspect, the present invention provides a method of constructing a model for use in a text-to-speech synthesis system, the method comprising the steps of: obtaining a set of features and a first corresponding observation value from a first training speaker; obtaining the set of features and a second corresponding observation value from a second training speaker; and pooling the first and second corresponding observation values to obtain the model.
In another aspect, the present invention provides a method of constructing a model for use in a text-to-speech synthesis system, the method comprising the steps of: obtaining a set of features and a corresponding observation value from a first training speaker; repeating the step of obtaining a set of features and a corresponding observation value for each of a plurality of additional speakers; and pooling the corresponding observation values, from the first speaker and the additional speakers, to obtain the model.
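By way of illustration only, the pooling of observation values from several training speakers may be sketched as follows. The form of the model is not fixed by the foregoing; this sketch assumes, purely for the sake of example, that the model is the mean of the pooled observation values for each set of features. The feature labels, the sample values, and the names `pool_observations` and `build_model` are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def pool_observations(speakers):
    """Group observation values from all speakers by their set of features."""
    pooled = defaultdict(list)
    for observations in speakers:
        for features, value in observations:
            pooled[features].append(value)
    return pooled


def build_model(speakers):
    """Assumed model form: mean pooled observation value per feature set."""
    pooled = pool_observations(speakers)
    return {features: fmean(values) for features, values in pooled.items()}


# Each speaker contributes (features, observation value) pairs. The
# observation values here are made-up relative pitches.
speaker_a = [
    (("stressed", "phrase_final"), 0.75),
    (("unstressed", "phrase_medial"), 1.0),
]
speaker_b = [
    (("stressed", "phrase_final"), 1.25),
    (("stressed", "phrase_medial"), 1.5),
]

model = build_model([speaker_a, speaker_b])
print(model[("stressed", "phrase_final")])
```

A set of features observed for only one speaker still contributes to the model, so that additional speakers may widen the coverage of the model as well as add observations for feature sets already seen.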
In an additional aspect, the present invention provides a method for enrolling training data for a text-to-speech synthesis system, the method comprising the steps of: collecting speech data from at least two speakers; ascertaining at least one characteristic relating to the speech data of each speaker; and creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.
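By way of illustration only, the enrollment steps just recited may be sketched as follows. The sketch assumes that the ascertained characteristic is the mean pitch of each speaker and that the transformation is a simple scaling toward the mean pitch of a canonical target speaker; both choices, the target value of 150 Hz, and the names `speaker_mean_pitch` and `transform_to_target` are hypothetical.

```python
from statistics import fmean


def speaker_mean_pitch(utterances):
    """Ascertain one characteristic of a speaker: mean pitch over all utterances."""
    return fmean(value for utterance in utterances for value in utterance)


def transform_to_target(utterances, target_mean_hz):
    """Scale a speaker's pitch values so that their mean matches the target."""
    scale = target_mean_hz / speaker_mean_pitch(utterances)
    return [[value * scale for value in utterance] for utterance in utterances]


# Speech data collected from two speakers (made-up pitch values in Hz).
collected = {
    "speaker_a": [[100.0, 120.0], [80.0, 100.0]],
    "speaker_b": [[200.0, 240.0], [160.0, 200.0]],
}

TARGET_MEAN_HZ = 150.0  # assumed mean pitch of the canonical target speaker

enrolled = {
    name: transform_to_target(utterances, TARGET_MEAN_HZ)
    for name, utterances in collected.items()
}
print(enrolled)
```

After the transformation, the data from both speakers occupy the same pitch range and may be treated as though they had been uttered by the target speaker.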
In a further aspect, the present invention provides an apparatus for constructing a model for use in a text-to-speech synthesis system, the apparatus comprising: an obtaining arrangement which obtains a set of features and a first corresponding observation value from a first training speaker; the obtaining arrangement being adapted to obtain the set of features and a second corresponding observation value from a second training speaker; and a pooling arrangement which pools the first and second corresponding observation values to obtain the model.
In another aspect, the present invention provides an apparatus for constructing a model for use in a text-to-speech synthesis system, the apparatus comprising: an obtaining arrangement which obtains a set of features and a corresponding observation value from a first training speaker; the obtaining arrangement being adapted to further obtain a set of features and a corresponding observation value for each of a plurality of additional speakers; and a pooling arrangement which pools the corresponding observation values, from the first speaker and the additional speakers, to obtain the model.
In an additional aspect, the present invention provides an apparatus for enrolling training data for a text-to-speech synthesis system, the apparatus comprising: a collector arrangement which collects speech data from at least two speakers; an ascertaining arrangement which ascertains at least one characteristic relating to the speech data of each speaker; and a target range creator which creates a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.
In a further aspect, the present invention provides a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for constructing a model for use in a text-to-speech synthesis system, the method comprising the steps of: obtaining a set of features and a first corresponding observation value from a first training speaker; obtaining the set of features and a second corresponding observation value from a second training speaker; and pooling the first and second corresponding observation values to obtain the model.
Furthermore, in an additional aspect, the present invention provides a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for enrolling training data for a text-to-speech synthesis system, the method comprising the steps of: collecting speech data from at least two speakers; ascertaining at least one characteristic relating to the speech data of each speaker; and creating a target range of speech data via transforming the at least one characteristic relating to the speech data of each speaker.
For a better understanding of the present invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the invention will be pointed out in the appended claims.