The present invention relates generally to text-to-speech synthesis. More particularly, the invention relates to a method for personalizing a synthesizer and for developing a database of speech units for use by a text-to-speech synthesizer.
Text-to-speech synthesis systems convert an input string of text into synthesized speech using speech modeling parameters or digitally sampled concatenative sound units to generate data strings that are played back through an audio system to mimic the sound of human speech. The model parameters or concatenative units are usually developed or trained in advance using recordings of actual human speech as the starting point. These parameters or units, however, can mimic human speech only to a limited degree, because the training typically relies on recordings from a single individual.
Developing a sufficiently rich body of spoken text can be very time-consuming and expensive. Examples of actual human speech must be recorded and labeled, and the resulting set of recordings needs to include at least one instance of every speech unit type needed for synthesis of all attested phoneme strings in the target language. This means, for example, that in a diphone synthesizer, the database must contain recorded examples of every allowed sequence of two allophones. Because data collection and analysis involve significant labor, it is desirable to minimize the size of the database. Ideally this means collecting the smallest set of utterances containing the desired material. However, in planning the recording sessions it is also necessary to consider other factors. Many unit types may have different pronunciations, depending on the phonemes adjacent to those they contain. If the resulting synthesizer is to reproduce these effects, then all such variants must be attested.
For example, in the English language the diphone sequence /kae/ is pronounced differently in "cat" than in "can", due to the nasalizing effects of the following /n/ in the latter word. A high quality synthesizer must contain examples of both types of /kae/.
In addition to variations due to adjacent phonemes, other variations may be attributed to syllable boundaries and word boundaries. Moreover, some contexts may simply produce better sound units than others. For example, sound units taken from secondary stressed syllables can be used to synthesize both secondary and primary stressed syllables. The converse is not necessarily true. Thus sound units taken from contexts which have primary stress in the original utterance may only be usable for synthesizing syllables which also have primary stress. Finally, synthesis developers may find that certain types of utterances produce better sound units than others. For example, when human speakers read simple words in isolation, the recordings often do not produce good sound units for synthesis. Similarly, very long sentences may also be problematic. Therefore complex words and short phrases are preferred.
The task of assembling a collection of suitable text words and phrases for use in a synthesis database recording session has heretofore been daunting. Most developers compile a collection of sentences and words for preselected speakers to read, and this collection is usually considerably larger than would be needed if the text requirements were analyzed systematically. Collecting text in this way, based on preselected speakers, limits the ability to produce synthesized speech: although the result mimics the sound of human speech, the range of vocal qualities it can reproduce depends heavily on those speakers. Most synthesis system designers have approached the problem more as an art than as a science, which yields a limited ability to produce mimicked speech personalized to sound like a particular human.
The present invention seeks to formalize the development of recorded content for text-to-speech synthesis through a set of procedures which, if followed, produce a minimal recording text list which contains all necessary unit types for a given language, with all desired variants of each, from optimal contexts in optimal types of utterances. The invention further seeks to personalize the synthesized speech to more closely mimic a particular speaker based on the minimal recording text list.
The personalizer represents one important aspect of the invention, in which an original set of recorded sound units, stored as allophones, diphones and/or triphones (generally referred to here as snippets) in a database, is compared with the sound units of a new or target speaker. In a preferred embodiment, allophones from different contexts are compared with allophones from the original set of recorded sound units. This is done by acoustic alignment of the respective allophones, followed by a closeness comparison. The closeness comparison may be performed using the same components as are used for automatic speech recognition.
When the comparison is performed, some allophones from the recorded set and from the new speaker will be sufficiently close, acoustically, so that no modification of those allophones is required. However, other allophones may differ substantially between the originally recorded set and the new target speaker. The personalizer employs a threshold comparison system to separate the allophones that are acoustically close from those that are not. The personalizer then focuses on the allophones that are not acoustically close. These "far" allophones will be altered to make the synthesizer sound more like the target speaker.
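The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_by_closeness`, the use of Euclidean distance over fixed-length feature vectors, and the single scalar threshold are all assumptions made for the example.

```python
import math

def split_by_closeness(reference, target, threshold):
    """Partition allophones into acoustically "close" and "far" sets.

    reference, target: dicts mapping an allophone label to a fixed-length
    acoustic feature vector (e.g. averaged spectral features, assumed to
    have been produced by the acoustic alignment step).
    """
    close, far = set(), set()
    for label, ref_vec in reference.items():
        if label not in target:
            continue  # no sample from the new speaker yet
        if math.dist(ref_vec, target[label]) <= threshold:
            close.add(label)
        else:
            far.add(label)
    return close, far
```

In practice the distance measure would come from the speech recognition components mentioned above; the scalar threshold simply stands in for that machinery here.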
The set of "far" allophones can be compared against a source of text using an exhaustive search algorithm, to identify all passages of text that contain representative examples of the "far" allophones. However, the presently preferred embodiment uses a greedy selection algorithm to identify passages of text that best represent the "far" allophones. The greedy selection algorithm thus generates a customized training text which the target speaker then reads while the system captures examples of that speaker's "far" allophones. Once examples of the "far" allophones have been collected, they are substituted for those of the original set, or are otherwise used to transform the sound units used by the synthesizer, so that the synthesizer will now sound like the target speaker.
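A greedy selection of this kind can be sketched as a set-cover heuristic: repeatedly pick the sentence covering the most still-uncovered "far" allophones. The function name `greedy_select` and the dict-of-sets representation are illustrative assumptions, not the patent's actual data structures.

```python
def greedy_select(sentences, far_allophones):
    """Greedily choose sentences that together cover the "far" allophones.

    sentences: dict mapping sentence text -> set of allophone labels
    it contains (assumed precomputed by a phonetic analyzer).
    """
    uncovered = set(far_allophones)
    chosen = []
    while uncovered:
        # pick the sentence covering the most still-uncovered allophones
        best = max(sentences, key=lambda s: len(sentences[s] & uncovered))
        gain = sentences[best] & uncovered
        if not gain:
            break  # remaining allophones are not attested in this corpus
        chosen.append(best)
        uncovered -= gain
    return chosen
```

The greedy heuristic does not guarantee the absolute minimum script, but it typically yields a far shorter reading list than exhaustive enumeration at a fraction of the cost.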
The target speaker utters each allophone in a given context, such as a neutral context (e.g. the vowel surrounded by the letters 't' or 's'). Using knowledge of the target speaker's allophones in this given context, the system determines which allophones are "far" from those of the synthesizer. While it is possible to simply substitute these known "far" allophones for those of the synthesizer, there typically will remain many other contexts of that allophone for which the system has no uttered data from the target speaker. Therefore, to develop a richer representation of the target speaker's allophones, the system determines what additional contexts or environments are needed to develop a complete assessment of the allophone in question and generates additional text for the target speaker to read. The generated text is specifically designed using the greedy algorithm to optimally obtain examples of the allophones in question from other contexts. In this way the "far" allophones may be pulled closer to those of the target speaker across all contexts.
The additional contexts are selected by rules designed to group or cluster contexts into related classes. In designing the system, related classes of contexts are determined by analyzing the data from the original synthesizer and then assuming that all speakers (including the target speaker) share the same classes. For example, the data may show that the letter 'a' in the context of adjacent fricatives behaves acoustically in the same way, and such contexts would thus be clustered together. To do this a closeness metric may be applied, such as the closeness metric defined for triphones in developing the original synthesizer. Such a metric would "reach over" the vowels and thus "sense" the context influence. This information would be used to cluster vowels into groups that are influenced in similar ways by a given context.
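The clustering step can be sketched with a simple single-link grouping: contexts whose measured influence vectors fall within a closeness threshold of an existing cluster member join that cluster. The function `cluster_contexts`, the one-dimensional feature vectors, and the threshold-based linkage are assumptions for illustration; the patent's actual metric is the triphone closeness metric mentioned above.

```python
import math

def cluster_contexts(context_vectors, threshold):
    """Group contexts (e.g. adjacent phonemes) whose influence on a
    vowel is acoustically similar.

    context_vectors: dict mapping a context label to a feature vector
    summarizing that context's measured influence.
    """
    clusters = []
    for label, vec in context_vectors.items():
        placed = False
        for cluster in clusters:
            # join the first cluster with a near-enough existing member
            if any(math.dist(vec, context_vectors[m]) <= threshold
                   for m in cluster):
                cluster.append(label)
                placed = True
                break
        if not placed:
            clusters.append([label])
    return clusters
```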
Although the preferred embodiment originally collects neutral context allophones from the target speaker, the final synthesizer product may be based on snippets comprising sound units of different sizes, including diphones, triphones and allophones in various contexts. In theory, the neutral context allophones of the target speaker that are sufficiently close to the original synthesizer do not have to be trained further. The same holds true for larger sound units such as diphones and triphones that contain these "close" allophones. On the other hand, when neutral context allophones are discovered to be "far," related larger sound units such as diphones and triphones will also need to be corrected. The text generated by the greedy algorithm elicits speech from the target speaker to improve these larger sound units as well.
The personalization process can be performed once as described above, or many times through iteration. In the iterative approach, the target speaker reads the generated text; allophones are extracted from this speech, processed, and used to modify the synthesizer and to generate new text for reading. The target speaker then provides additional speech samples from the new text, a closeness comparison is again performed, and further text is generated. Each time the target speaker reads the generated text, the synthesizer and its set of sound units are more closely tuned to that speaker's speech. The process proceeds iteratively until no "far" allophones remain when the closeness comparison is performed.
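The iterative loop above can be sketched as follows. The function `personalize`, the `record` callback standing in for a recording-and-extraction session, and the distance threshold are all hypothetical names introduced for this example.

```python
import math

def personalize(synth_units, record, threshold, max_rounds=5):
    """Iteratively tune a unit inventory toward a target speaker.

    synth_units: dict mapping allophone label -> feature vector
                 (the synthesizer's current units).
    record: callback that, given the allophone labels to elicit,
            returns the target speaker's examples as a like dict
            (stands in for text generation, reading, and extraction).
    """
    for _ in range(max_rounds):
        spoken = record(list(synth_units))
        far = [a for a in synth_units
               if a in spoken
               and math.dist(synth_units[a], spoken[a]) > threshold]
        if not far:
            break  # every unit is acoustically close: tuning is done
        # substitute the target speaker's examples for the "far" units
        for a in far:
            synth_units[a] = spoken[a]
    return synth_units
```

Each pass substitutes the speaker's own examples for the remaining "far" units, so the loop converges once the closeness comparison finds nothing left to correct.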
While implementation may vary, the presently preferred system employs a lexicon compiler/analyzer, a parser, a phoneme-to-unit utility, a closeness comparator, a required snippets selector and an optimal set selection algorithm. The lexicon compiler/analyzer produces a database of phonetically analyzed words, with their corresponding phoneme strings, including prosodic boundaries (syllable boundaries plus the stronger boundaries which occur between elements of complex words). The parser extracts phrases suitable for recording from text corpora. The phoneme-to-unit utility determines which sound units (i.e. snippets) can be extracted from a recording of each word or phrase, and what context features each would have. The phoneme-to-unit utility marks any snippets which occur in environments which make them unsuitable as sources for the speech unit database. The closeness comparator determines required snippets based on snippets selected from the text database and allophones obtained from a new speaker. The required snippets are useful in providing voice personalized data so that a unique human sound may be synthesized based on a particular user. The set selector examines the inventory of words and phrases analyzed by the preceding modules and determines a minimal subset which can contain a desired number of tokens for each unit type (defined in terms of phonemes contained in the unit as well as context features applied to them) in optimal environments. The above described modules can be implemented using an exhaustive search, a greedy algorithm, or other appropriate means.
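The set selector's job, choosing a small phrase list that supplies a desired number of tokens per unit type, can be sketched as a weighted greedy cover. The function `minimal_subset` and its phrase-to-token-list representation are illustrative assumptions; the actual selector would also weigh the environment quality of each token.

```python
def minimal_subset(phrases, unit_types, tokens_needed):
    """Pick a small set of phrases giving each unit type the desired
    number of tokens.

    phrases: dict mapping phrase text -> list of unit-type labels it
    yields (a label repeats when a phrase yields several tokens).
    """
    need = {u: tokens_needed for u in unit_types}
    chosen = []

    def gain(phrase):
        # tokens this phrase would contribute toward the remaining need
        counts = {}
        for u in phrases[phrase]:
            counts[u] = counts.get(u, 0) + 1
        return sum(min(c, need.get(u, 0)) for u, c in counts.items())

    while any(n > 0 for n in need.values()):
        candidates = [p for p in phrases if p not in chosen]
        if not candidates:
            break
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # the corpus cannot supply the remaining tokens
        chosen.append(best)
        for u in phrases[best]:
            if need.get(u, 0) > 0:
                need[u] -= 1
    return chosen
```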
The greedy selection algorithm used in the above personalizer may also be used upon acoustically labeled, previously recorded speech, such as transcribed speeches, books on tape, closed caption broadcasts, and the like, to generate new synthesizers or synthesizers that sound like the recorded speech. Examples of acoustically labeled recorded speech may be obtained via broadcast media or over the internet. The algorithm identifies the best or most reliable examples of recorded speech: those that will best represent each allophone in context. Once these allophones are identified, they may be analyzed to extract source-filter synthesis model components to construct a synthesizer. Thus, for example, the identified allophones may be analyzed to extract the formant trajectories and glottal pulse information, which is then used to develop the new synthesizer.