As shown in FIG. 1, numeral 100, text-to-speech synthesizer systems typically convert text to speech using the following four-step process. Step 1 is the tokenization step, in which a text stream (101) is tokenized into text tokens by a text tokenizer (102). Step 2 is the lexicon access step, in which each text token is looked up in a lexicon (104) by a lexicon accessor (103). The lexicon consists of a static lexicon (105) that contains pronunciations for specified words, and a dynamic lexicon (106) that contains a procedure for generating pronunciations for words that are not stored in the static lexicon. Because some words (e.g., "live") have more than one pronunciation, the lexicon access step retrieves at least one pronunciation token for each text token. Step 3 is the disambiguation step, in which all pronunciation ambiguities are resolved by a disambiguator (107), resulting in a one-to-one mapping between text tokens and pronunciation tokens. Finally, Step 4 is the speech synthesis step, in which the disambiguated list of pronunciation tokens is passed to a speech synthesizer (108), which pronounces it.
An example of the application of the above process is presented in FIG. 2, numeral 200. The text stream (201) is input to the tokenization step, which yields a list of text tokens (202) as its output. The list of text tokens is input to the lexicon access step, which yields a list of pronunciation tokens (203) as its output. As can be seen in the figure, several text tokens have more than one pronunciation token associated with them; e.g., "live" (which may be pronounced [layv] or [lihv]) and the abbreviation "St." (which may be pronounced [seynt] or [striyt]). The list of pronunciation tokens is input to the disambiguation step, which yields a list of disambiguated pronunciation tokens (204) as its output. The list of disambiguated pronunciation tokens is then input to the speech synthesizer, which pronounces it.
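The four steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular synthesizer: the function names, the regular expression, and the two context rules in `disambiguate` are assumptions made for the example. Only the pronunciations of "live" and "St." are taken from the text; unknown words simply fall back to their spelling.

```python
import re

# Only these two entries come from the text; unknown words are handled by a
# placeholder fallback that stands in for the dynamic lexicon.
TOY_LEXICON = {
    "live": ["layv", "lihv"],
    "st.": ["seynt", "striyt"],
}


def tokenize(text):
    """Step 1: split the text stream into text tokens (toy tokenizer)."""
    return re.findall(r"[A-Za-z]+\.?", text)


def lexicon_access(tokens):
    """Step 2: retrieve at least one candidate pronunciation per token."""
    return [(tok, TOY_LEXICON.get(tok.lower(), [tok.lower()])) for tok in tokens]


def disambiguate(candidates):
    """Step 3: keep exactly one pronunciation per token (hypothetical rules)."""
    result = []
    for i, (tok, prons) in enumerate(candidates):
        if len(prons) == 1:
            result.append(prons[0])
            continue
        prev_tok = candidates[i - 1][0].lower() if i > 0 else ""
        has_next = i + 1 < len(candidates)
        if tok.lower() == "live":
            # Assumed rule: verb reading after a pronoun or "to".
            is_verb = prev_tok in {"i", "we", "you", "they", "to"}
            result.append("lihv" if is_verb else "layv")
        elif tok.lower() == "st.":
            # Assumed rule: "Saint" before a following name, else "Street".
            result.append("seynt" if has_next else "striyt")
        else:
            result.append(prons[0])
    return result


def synthesize(pronunciations):
    """Step 4: stand-in for the speech synthesizer; joins the tokens."""
    return " ".join(pronunciations)


def text_to_speech(text):
    return synthesize(disambiguate(lexicon_access(tokenize(text))))
```

Calling `text_to_speech("We live on Main St.")` walks the input through all four steps; after the disambiguation step, each text token is paired with exactly one pronunciation token.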
The static lexicon contains words, pronunciations, and information (such as part-of-speech tags and word frequencies) that is useful to the disambiguator in disambiguating word pronunciations according to their context. The dynamic lexicon contains procedures (e.g., an orthographic analysis routine) that can generate a plausible pronunciation for a word from its orthographic form. For the speech synthesis system to operate in real time on computing platforms with limited memory and cycle time, the static lexicon must be organized to permit high-speed random access while using minimal storage.
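The division of labor between the two lexicons might be sketched as follows. This is a minimal sketch under stated assumptions: the `Entry` layout, the part-of-speech tags, the frequency counts, and the letter-by-letter `dynamic_lookup` routine are all hypothetical placeholders, and a sorted array searched with binary search is just one simple way to obtain fast random access without a large index structure.

```python
from bisect import bisect_left
from typing import NamedTuple


class Entry(NamedTuple):
    word: str
    pronunciation: str
    pos: str        # part-of-speech tag, useful to the disambiguator
    frequency: int  # made-up count, for illustration only


# Static lexicon: a sorted array, so a lookup is a binary search.
STATIC_ENTRIES = sorted([
    Entry("live", "layv", "ADJ", 10),
    Entry("live", "lihv", "VERB", 90),
])
STATIC_KEYS = [entry.word for entry in STATIC_ENTRIES]


def static_lookup(word):
    """Return every stored entry for `word` (possibly none)."""
    i = bisect_left(STATIC_KEYS, word)
    found = []
    while i < len(STATIC_KEYS) and STATIC_KEYS[i] == word:
        found.append(STATIC_ENTRIES[i])
        i += 1
    return found


def dynamic_lookup(word):
    """Placeholder orthographic analysis: one symbol per letter."""
    return [Entry(word, "-".join(word), "UNK", 0)]


def lookup(word):
    """Consult the static lexicon first, then fall back to the dynamic one."""
    word = word.lower()
    return static_lookup(word) or dynamic_lookup(word)
```

Because the dynamic lexicon always produces some pronunciation, the static lexicon only needs to hold words whose stored information cannot be generated procedurally.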
Standard file-compression techniques are not suitable for compressing the static lexicon because they do not allow random access to the compressed data. In addition, compression methods specifically developed for compressing the entries of conventional dictionaries are also unsuitable because of the differences in structure and usage between such dictionaries and the static lexicon. One crucial distinction between the static lexicon and a conventional dictionary is that the static lexicon need not contain any information that can be generated by the dynamic lexicon.
Additionally, the information stored in the static lexicon differs substantially from the information typically stored in the entries of a conventional dictionary, and is therefore poorly served by compression methods that do not exploit its specific regularities. One example of such a regularity is the application of morphologically based suffix-stripping rules to words regardless of the semantic consequences of such rules. For example, "ing" is a common English suffix with a predictable pronunciation. A rule that stripped "ing" without exception would be problematic in a conventional dictionary because it would complicate the accurate matching of word forms to their meanings; e.g., stripping "ing" from "bearing" would have to be optional in order to identify both the nominal base form (bearing) and the verbal base form (bear), while stripping it from "bring" would have to be prohibited entirely. In the static lexicon of a text-to-speech synthesis system, by contrast, such rules may lead to the identification of plausible pronunciations, so they are good candidates for exploitation in the compression of such a lexicon.
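One way such a suffix-stripping regularity could be exploited for compression is sketched below. This is a hedged illustration rather than the encoding method itself: the pronunciation strings for "bear", "bring", and the suffix "ing" are invented for the example, and the `compress` and `lookup` functions are hypothetical. The idea is to drop any entry whose pronunciation can be rebuilt from a stored base form plus the suffix pronunciation, and to keep explicit entries only where the rule fails.

```python
# Assumed pronunciation of the suffix; not taken from the text.
SUFFIX_PRON = {"ing": "ihng"}


def lookup(word, lexicon):
    """Return a stored pronunciation, or derive one by suffix stripping."""
    if word in lexicon:
        return lexicon[word]
    for suffix, suffix_pron in SUFFIX_PRON.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            base_pron = lookup(word[: -len(suffix)], lexicon)
            if base_pron is not None:
                return base_pron + suffix_pron
    return None  # a full system would hand the word to the dynamic lexicon


def compress(full):
    """Keep only the entries that suffix stripping cannot reproduce."""
    kept = {}
    # Shorter words first, so base forms are stored before derived forms.
    for word in sorted(full, key=len):
        if lookup(word, kept) != full[word]:
            kept[word] = full[word]
    return kept


# Made-up pronunciations, for illustration only.
FULL_LEXICON = {
    "bear": "behr",
    "bearing": "behrihng",
    "bring": "brihng",
}

COMPRESSED = compress(FULL_LEXICON)
```

Here "bearing" is dropped because it can be rebuilt from "bear", whether it is read as a noun or a verb, while "bring" stays as an explicit entry because stripping "ing" leaves no stored base form. Every word in `FULL_LEXICON` still resolves to the same pronunciation through `lookup(word, COMPRESSED)`.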
Hence, there is a need for a method, system and device for encoding the pronunciations stored in the lexicon of a text-to-speech synthesis system so that the storage requirements of the lexicon are substantially reduced without adversely impacting the rapid random access of pronunciations from the lexicon.