The present invention relates generally to improvements in text to speech systems. More particularly, the invention relates to the use of speaker specific characteristics in developing durational models for text to speech systems.
A text to speech system receives text inputs, typically words and sentences, and converts these inputs into spoken words and sentences. A typical text to speech system performs text analysis to predict phone sequences, duration modeling to predict the length of each phone, intonation modeling to predict pitch contours and synthesis to combine the results of the different analyses and modules in order to create speech sounds. A significant element in any text to speech system, and the element which is addressed by the present invention, is duration modeling.
In order to construct a duration model, the system typically processes a body of training data, established by having a target speaker read a body of selected material to the system. The text to speech system analyzes the body of material in order to construct the model of the speaking style of the target speaker. By reading a substantial body of text, properly selected to include instances of each phone in all contexts in which it occurs, the target speaker is able to expose the system to a comprehensive example of his or her speaking style, so that the system can develop a model which accurately reflects the numerous different parameters which make up the speaker's speaking style.
When a source speaker reads the training corpus to a text to speech system, the system is trained to learn simultaneously the characteristics of the language and the speaker's individual speech characteristics. With prior art systems, no distinction is made between the language specific component and the speaker specific component. Therefore, training the system to learn a new speaker's characteristics requires that the system repeat the entire training process for every new speaker. The training process is not shortened or simplified by the information previously provided by training the system to a prior speaker. The system must be trained anew, using a large training corpus, in order to enable the system to mimic that speaker. Proper training of a system may involve reading several hours of text to the system. The speech resulting from reading the text must be processed in order to train the system.
The need to fully train a system for each speaker represents a significant obstacle to more widespread use of prior art systems. One use for a text to speech system which could mimic the user's own voice might include a voice email system which transmitted text and voice characteristics. The text could be converted to speech mimicking the sender's voice at the receiving station. Transmitting text together with voice characteristics would consume considerably less bandwidth than transmitting a recording of the sender's voice. Moreover, transmitting text together with voice characteristics would always make a sender's voice available for sending, even in cases in which it was impossible or inconvenient to make a true recording of the sender's voice. Another application might involve wireless telephony. A near end telephone could convert speech to text and transmit this text, together with the speaker's voice characteristics, to the far end. At the far end, the text and voice characteristics could be used to reconstruct speech mimicking the speaker's voice. Using such a system would save considerable bandwidth and would allow transmissions which sounded like the voices of the speakers. However, the need for prior art systems to be trained for each speaker presents a significant obstacle to such uses of speaker specific systems. The typical user does not wish to spend hours reading training sentences.
In order to reduce the need to develop a new model for each new target speaker, some prior art systems have simply measured the speaking rate of the target speaker and changed the speaking rate of the source model to conform to the speaking rate of the target speaker. This process has not yielded an accurate model of the target speaker because changing the speaking rate shortens or lengthens all sound classes equally. Different speakers who share the same speaking rate typically do not have the same duration for all sound classes. Moreover, a difference in speaking rate between one speaker and another is typically not distributed uniformly among all sound classes. To take a simplified example, a first speaker may have a comparatively longer duration for fricatives, relative to other sound classes, than does a second speaker. Changing the speaking rate of the first speaker by shortening the duration of all sound classes will not replicate the speaking style of the second speaker, because fricatives will remain comparatively long relative to other sound classes under the uniform shortening process resulting from changing the speaking rate.
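The limitation of uniform rate adjustment described above can be illustrated with a minimal Python sketch. The sound classes, the duration values, and the names `source`, `target`, `uniform_rate_scale`, and `fricative_to_vowel_ratio` are hypothetical and chosen only for illustration; they are not taken from any particular prior art system.

```python
# Hypothetical mean durations in milliseconds for three sound classes.
# All values are illustrative assumptions, not measured data.
source = {"vowel": 100.0, "fricative": 120.0, "stop": 60.0}
target = {"vowel": 90.0, "fricative": 80.0, "stop": 55.0}


def uniform_rate_scale(source_durations, target_durations):
    """Scale every sound class by one factor so that the overall
    speaking rate (total duration) matches that of the target."""
    factor = sum(target_durations.values()) / sum(source_durations.values())
    return {cls: dur * factor for cls, dur in source_durations.items()}


def fricative_to_vowel_ratio(durations):
    """Relative length of fricatives compared to vowels."""
    return durations["fricative"] / durations["vowel"]


scaled = uniform_rate_scale(source, target)

# A single scaling factor cancels out of any ratio between classes, so the
# scaled model keeps the source speaker's relative fricative length.
print("source ratio:", fricative_to_vowel_ratio(source))
print("scaled ratio:", fricative_to_vowel_ratio(scaled))
print("target ratio:", fricative_to_vowel_ratio(target))
```

In this sketch the scaled model matches the target speaker's total duration, yet its fricative-to-vowel ratio is still that of the source speaker, which is the mismatch the simplified example in the text describes.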
There exists, therefore, a need for a text to speech system which can be trained to create a duration model mimicking the speaking style of a target speaker based on differences between the target speaker and a source speaker, and which employs a relatively small training corpus to identify the differences between the target speaker and the source speaker.
In one aspect, a process of target speaker training according to the present invention advantageously comprises developing a source model using a large body of training data, the source model reflecting language specific characteristics as well as speaker specific characteristics for a source speaker. In order to develop a target model for a specific target speaker, a smaller body of training data is selected to yield information about the speaker specific characteristics of the target speaker. This smaller body of training data is used to develop a training corpus which is then processed to produce modification parameters identifying differences between the durational characteristics of the target speaker and those of the source speaker. The modification parameters are then applied to the source model to produce a target model.
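The general flow summarized above may be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the form of the modification parameters is not specified in this passage, so the sketch assumes one multiplicative factor per sound class. The names `SOUND_CLASS`, `toy_source_model`, `estimate_modification_parameters`, and `make_target_model`, together with all duration values, are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from phone to sound class (assumption for illustration).
SOUND_CLASS = {"s": "fricative", "a": "vowel", "t": "stop"}

# Hypothetical base durations in milliseconds for the toy source model.
BASE_MS = {"s": 120.0, "a": 100.0, "t": 60.0}


def toy_source_model(phone, stressed):
    """Stand-in for a source model trained on a large corpus: predicts a
    duration from the phone and one contextual feature (stress)."""
    return BASE_MS[phone] * (1.2 if stressed else 1.0)


def estimate_modification_parameters(observations, source_model, sound_class):
    """Estimate one factor per sound class from a small target corpus.

    observations: iterable of (phone, stressed, observed_duration_ms).
    Each factor is the target speaker's total observed duration divided by
    the source model's total predicted duration for the same items.
    """
    observed = defaultdict(float)
    predicted = defaultdict(float)
    for phone, stressed, duration_ms in observations:
        cls = sound_class[phone]
        observed[cls] += duration_ms
        predicted[cls] += source_model(phone, stressed)
    return {cls: observed[cls] / predicted[cls] for cls in observed}


def make_target_model(source_model, params, sound_class):
    """Apply the modification parameters to the source model. Classes
    without target data fall back to the unmodified source prediction."""
    def target_model(phone, stressed):
        factor = params.get(sound_class[phone], 1.0)
        return source_model(phone, stressed) * factor
    return target_model


# Small hypothetical target corpus: three observed phone durations.
target_observations = [("s", False, 90.0), ("s", True, 108.0), ("a", False, 95.0)]

params = estimate_modification_parameters(
    target_observations, toy_source_model, SOUND_CLASS
)
target_model = make_target_model(toy_source_model, params, SOUND_CLASS)
```

Because the parameters in this sketch modify the source model's predictions rather than replace them, the contextual behavior learned from the large source corpus (here, the stress effect) carries over to the target model, while each sound class is adjusted separately from only a few target observations.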
A more complete understanding of the present invention, as well as further features and advantages of the invention, will be apparent from the following Detailed Description and the accompanying drawings.