1. Technical Field
The present invention relates generally to speech synthesis. More particularly, the present invention relates to a speech synthesizer customization system that is able to override speech synthesis data at all hierarchical levels of a dynamic data structure.
2. Discussion
As the quality of the output of speech synthesizers continues to increase, more and more applications are beginning to incorporate synthesis technologies. For example, car navigation systems, as well as devices for the vision impaired are beginning to incorporate speech synthesizers. As the, popularity of speech synthesis increases, however, a number of limitations with regard to conventional approaches have become apparent.
A particular difficulty relates to the fact that size and development cost considerations limit the vocabulary with which conventional synthesizers are able to deal. Briefly, FIGS. 1 and 2 illustrate that the typical synthesizer will have a dynamic data structure with hierarchical levels, wherein the dynamic data structure includes a linguistic tree 20 and an acoustic tree 22. The linguistic tree 20 typically contains syntactic and linguistic objects for the sentence being synthesized, while the acoustic tree 22 holds prosodic and acoustic objects for that sentence. Thus, during synthesis of a sentence, the two hierarchical tree-like structures are xe2x80x9cbuilt upxe2x80x9d (or populated) based on the input text. It will be appreciated that usually, a tree has nodes such that a xe2x80x9cparentxe2x80x9d node has xe2x80x9cbranchesxe2x80x9d to each of its xe2x80x9cchildxe2x80x9d nodes. The linguistic tree 20 and the acoustic tree 22 are referred to as tree-like structures because, here, a parent node only has access to the first child and last child, while the rest of the children are contained in a list. Furthermore, each child has access to the corresponding parent. Nevertheless, the levels of the tree structures constitute a hierarchy.
The above tree structures and node information for a particular sentence are built up in real time by various synthesis modules, with the assistance of a fixed (or standard) database. For example, a parsing module typically generates clauses and phrases from the sentence being synthesized, while a phoneticizer uses the standard database to build up morphs and phonemes from the words in the sentence. Syllabification and allophone rules contained in the standard database generate syllables and allophones from words, morphs, and phonemes. Prosody algorithms generate prosodic phrases, prosodic words, etc. from all previous information.
As shown in FIG. 3, the standard database 24 typically therefore contains tables with information to be placed in the nodes of the trees 20, 22. This is especially true for contemporary xe2x80x9cconcatenation synthesisxe2x80x9d. It should be noted that the standard database 24 is also naturally hierarchical, since the data stored in the standard database 24 is intended to supply information for various level nodes in the dynamic trees 20, 22. Furthermore, data at higher levels of the database 24 may refer to lower level data (or vice versa). For example, information about a certain kind of phrase may refer to sequences of words and their corresponding dictionary information below. In this manner, data is shared (and memory conserved) by possible multiple references to the same data item. Roughly speaking, the standard database 24 is a relational database.
It is important to note that the above-described database 24 is designed for general unlimited synthesis, and has significant space and development cost problems. Because of these normal limitations, the size and complexity of the database 24 is typically limited. As a result, in order to tailor a given synthesizer to a particular application, it has been found that a user database is often necessary. In fact, synthesizers routinely provide xe2x80x9cuser dictionariesxe2x80x9d which are loaded into the synthesizer and are application specific. Often, markup languages allow commands to be embedded in the input text in order to alter the synthesized speech from the standard result. For example, one approach involves inserting high and low tone marks (including numeric values), into the text to indicate where, and how much to raise an intonation peak.
While the above-described conventional approaches to user databases are useful in some circumstances, a number of difficulties remain. For example, the subsequently generated speech synthesis data cannot be uniformly overridden at all hierarchical levels of the dynamic data structure. Rather, the conventional synthesizer deals with a maximum of one or two hierarchical levels, and each with different mechanisms. Furthermore, some of the hierarchical levels (such as diphone) are essentially inaccessible to text markup due to the inability to achieve the required level of granularity in linear text.
It is also important to note that conventional user database approaches are not able to override speech synthesis data within the normal synthesis sequence of computation. Imagine, for example, that we want to specify a new user supplied diphone A-B, but only if the requested stress level on A is 2 and certain kinds of allophones are found in the surrounding context of what is to be synthesized. It will be appreciated that certain conditions are only known after a complex set of allophone rules are applied (thus determining the allophone stream) and after a prosody module has selected words to de-emphasize, which in turn affects the stress level on a given phoneme. Under conventional approaches, this conditional information cannot practically be known in advance of synthesis. It is therefore virtually impossible to automatically xe2x80x9cmarkupxe2x80x9d the input text at every place where the customized diphone should be used. Simply put, user defined conditions cannot currently be based on internal states of the synthesis process, and are therefore severely limited under the traditional text markup process.
Another concern is that conventional user databases are typically not organized around the same hierarchical levels as the dynamic data structures and therefore provide inflexible control over where and what is modified during the synthesis.
The above and other objectives are provided by a speech synthesizer customization system in accordance with the present invention. The customization system has a template management tool for generating templates based on customization data from a user and replicated dynamic synthesis data from a text-to-speech (TTS) synthesizer. The replicated dynamic synthesis data is arranged in a dynamic data structure having hierarchical levels. The customization system further includes a user database that supplements a standard database of the synthesizer. The tool populates the user database with the templates such that the templates enable the user database to uniformly override subsequently generated speech synthesis data at all hierarchical levels of the dynamic data structure. The use of a tool therefore provides a mechanism for organizing, tuning, and maintaining hierarchical and multi-dimensionally sparse sets of user templates. Furthermore, providing a mechanism for uniformly overriding speech synthesis data reduces processing overhead and provides a more xe2x80x9cnaturalxe2x80x9d user database.
Further in accordance with the present invention, a user database is provided. The user database has a plurality of templates for overriding speech synthesis data of a TTS synthesizer. The speech synthesis data is arranged in a dynamic data structure having hierarchical levels. The user database further includes a hierarchical data structure organizing the templates such that the templates enable the user database to uniformly override subsequent generated speech synthesis data at all hierarchical levels of the dynamic data structure.
In another aspect of the invention, a method for customizing a synthesizer is provided. The method includes the step of generating templates based on customization data from a user and associated replicated dynamic synthesis data from the synthesizer. A standard database of the synthesizer is supplemented with a user database. The method further provides for populating the user database with the templates such that the templates enable the user database to uniformly override subsequently generated speech synthesis data at a plurality of a hierarchical levels in the dynamic data structure.