The present invention relates generally to speech synthesis. More particularly, the invention relates to a system and method for personalizing the output of the speech synthesizer to resemble or mimic the nuances of a particular speaker after enrollment data has been supplied by that speaker.
In many applications using text-to-speech (TTS) synthesizers, it would be desirable to have the output voice of the synthesizer resemble the characteristics of a particular speaker. Much of the effort spent in developing speech synthesizers today has been on making the synthesized voice sound as human as possible. While strides continue to be made in this regard, the present day synthesizers produce a quasi-natural speech sound that represents an amalgam of the allophones contained within the corpus of speech data used to construct the synthesizer. Currently, there is no effective way of producing a speech synthesizer that mimics the characteristics of a particular speaker, short of having that speaker spend hours recording examples of his or her speech to be used to construct the synthesizer. While it would be highly desirable to be able to customize or personalize an existing speech synthesizer using only a small amount of enrollment data from a particular speaker, that technology has not heretofore existed.
Most present day speech synthesizers are designed to convert information, typically in the form of text, into synthesized speech. Usually, these synthesizers are based on a synthesis method and associated set of synthesis parameters. The synthesis parameters are usually generated by manipulating concatenation units of actual human speech that has been pre-recorded, digitized, and segmented so that the individual allophones contained in that speech can be associated with, or labeled to correspond to, the text used during recording. Although there are a variety of different synthesis methods in popular use today, one illustrative example is the source-filter synthesis method. The source-filter method models human speech as a collection of source waveforms that are fed through a collection of filters. The source waveform can be a simple pulse or sinusoidal waveform, or a more complex, harmonically rich waveform. The filters modify and color the source waveforms to mimic the sound of articulated speech.
In a source-filter synthesis method, there is generally an inverse correlation between the complexity of the source waveform and the filter characteristics. If a complex waveform is used, usually a fairly simple filter model will suffice. Conversely, if a simple source waveform is used, typically a more complex filter structure is used. There are examples of speech synthesizers that have exploited the full spectrum of source-filter relationships, ranging from simple source, complex filter to complex source, simple filter. For purposes of explaining the principles of the invention, a glottal source, formant trajectory filter synthesis method will be illustrated here. Those skilled in the art will recognize that this is merely exemplary of one possible source-filter synthesis method; there are numerous others with which the invention may also be employed. Moreover, while a source-filter synthesis method has been illustrated here, other synthesis methods, including non-source-filter methods are also within the scope of the invention.
In accordance with the invention, a personalized speech synthesizer may be constructed by providing a base synthesizer employing a predetermined synthesis method and having an initial set of parameters used by that synthesis method to generate synthesized speech. Enrollment data is obtained from a speaker, and that enrollment data is used to modify the initial set of parameters to thereby personalize the base synthesizer to mimic speech qualities of the speaker.
In accordance with another aspect of the invention, the initial set of parameters may be decomposed into speaker dependent parameters and speaker independent parameters. The enrollment data obtained from the new speaker is then used to adapt the speaker dependent parameters and the resulting adapted speaker dependent parameters are then combined with the speaker independent parameters to generate a set of personalized synthesis parameters for use by the speech synthesizer.
In accordance with yet another aspect of the invention, the previously described speaker dependent parameters and speaker independent parameters may be obtained by decomposing the initial set of parameters into two groups: context independent parameters and context dependent parameters. In this regard, parameters are deemed context independent or context dependent, depending on whether there is detectable variability within the parameters in different contexts. When a given allophone sounds differently, depending on what neighboring allophones are present, the synthesis parameters associated with that allophone are decomposed into identifiable context dependent parameters (those that change depending on neighboring allophones). The allophone is also decomposed into context independent parameters that do not change significantly when neighboring allophones are changed.
The present invention associates the context independent parameters with speaker dependent parameters; it associates context dependent parameters with speaker independent parameters. Thus, the enrollment data is used to adapt the context independent parameters, which are the re-combined with the context dependent parameters to form the adapted synthesis parameters. In the preferred embodiment, the decomposition into context independent and context dependent parameters results in a smaller number of independent parameters than dependent ones. This difference in number of parameters is exploited because only the context independent parameters (fewer in number) undergo the adaptation process. Excellent personalization results are thus obtained with minimal computational burden.
In yet another aspect of the invention, the adaptation process discussed above may be performed using a very small amount of enrollment data. Indeed, the enrollment data does not even need to include examples of all context independent parameters. The adaptation process is performed using minimal data by exploiting an eigenvoice technique developed by the assignee of the present invention. The eigenvoice technique involves using the context independent parameters to construct supervectors that are then subjected to a dimensionality reduction process, such as principle component analysis (PCA) to generate an eigenspace. The eigenspace represents, with comparatively few dimensions, the space spanned by all context independent parameters in the original speech synthesizer. Once generated, the eigenspace can be used to estimate the context independent parameters of a new speaker by using even a short sample of that new speaker's speech. The new speaker utters a quantity of enrollment speech that is digitized, segmented, and labeled to constitute the enrollment data. The context independent parameters are extracted from that enrollment data and the likelihood of these extracted parameters is maximized given the constraint of the eigenspace.
The eigenvoice technique permits the system to estimate all of the new speaker's context independent parameters, even if the new speaker has not provided a sufficient quantity of speech to contain all of the context independent parameters. This is possible because the eigenspace is initially constructed from the context independent parameters from a number of speakers. When the new speaker's enrollment data is constrained within the eigenspace (using whatever incomplete set of parameters happens to be available) the system infers the missing parameters to be those corresponding to the new speaker's location within the eigenspace.
The techniques employed by the invention may be applied to virtually any aspect of the synthesis method. A presently preferred embodiment applies the technique to the formant trajectories associated with the filters of the source-filter model. That technique may also be applied to speaker dependent parameters associated with the source representation or associated with other speech model parameters, including prosody parameters, including duration and tilt. Moreover, if the eigenvoice technique is used, it may be deployed in an iterative arrangement, whereby the eigenspace is trained iteratively and thereby improved as additional enrollment data is supplied.
For a more complete understanding of the invention, its objects and advantages, refer to the following description and to the accompanying drawings.