Generally speaking, a “speech synthesizer” is a computer device or system for generating audible speech from written text. That is, a written form of a string or sequence of characters (e.g., a sentence) is provided as input, and the speech synthesizer generates the spoken equivalent or audible characterization of the input. The generated speech output is not merely a literal reading of each input character, but a language dependent, in-context verbalization of the input. If the input was the phone number (508) 691-1234 given in response to a prior question of “What is your phone number?”, the speech synthesizer does not produce the reading “parenthesis, five hundred eight, close parenthesis, six hundred ninety-one . . . .” Instead, the speech synthesizer recognizes the context and supporting punctuation and produces the spoken equivalent “five (pause) zero (pause) eight (pause) six . . .” just as an English-speaking person normally pronounces a phone number.
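By way of illustration only, the in-context reading of a phone number may be sketched as follows. The pattern `PHONE_RE`, the table `DIGIT_WORDS`, and the function `verbalize_phone` are hypothetical names introduced for this sketch, which assumes the input has already been identified as a North American phone number; it is not the method of any particular synthesizer.

```python
import re

# Spoken form of each decimal digit, indexed by the digit's value.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Hypothetical pattern for numbers written like "(508) 691-1234".
PHONE_RE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")


def verbalize_phone(text):
    """Return a digit-by-digit reading, or None if text is not a phone number."""
    match = PHONE_RE.fullmatch(text.strip())
    if match is None:
        return None
    digits = "".join(match.groups())
    # Punctuation is dropped; each digit is read separately with a pause between.
    return " (pause) ".join(DIGIT_WORDS[int(d)] for d in digits)


print(verbalize_phone("(508) 691-1234"))
```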
Historically the first speech synthesizers were formed of a dictionary, engine and digital vocalizer. The dictionary served as a look-up table. That is, the dictionary cross referenced the text or visual form of a character string (e.g., word or other unit) and the phonetic pronunciation of the character string/word. In linguistic terms the visual form of a character string unit (e.g., word) is called a “grapheme” and the corresponding phonetic pronunciation is termed a “phoneme”. The phonetic pronunciation or phoneme of character string units is indicated by symbols from a predetermined set of phonetic symbols.
The engine is the working or processing member that searches the dictionary for a character string unit (or combination thereof) matching the input text. In basic terms, the engine performs pattern matching between the sequence of characters in the input text and the sequence of characters in “words” (character string units) listed in the dictionary. Upon finding a match, the engine obtains from the dictionary entry (or combination of entries) of the matching word (or combination of words), the corresponding phoneme or combination of phonemes. To that end, the purpose of the engine is thought of as translating a grapheme (input text) to a corresponding phoneme (the corresponding symbols indicating pronunciation of the input text).
Typically the engine employs a binary search through the dictionary for the input text. The dictionary is loaded into the computer processor physical memory space (RAM) along with the speech synthesizer program. The memory footprint, i.e., the physical memory space in RAM needed while running the speech synthesizer program, thus must be large enough to hold the dictionary. As the dictionary portion of today's speech synthesizers continues to grow in size, the memory footprint is problematic due to the limited available memory (RAM and ROM) in some/most applications.
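A minimal sketch of such a binary-search look-up is given below, assuming the dictionary is held in memory as a list of (grapheme, phoneme) pairs sorted by grapheme. The entries, the phonetic symbols, and the name `lookup` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left

# Hypothetical in-memory dictionary: (grapheme, phoneme) pairs sorted by grapheme.
DICTIONARY = sorted([
    ("cat", "k ae t"),
    ("dog", "d ao g"),
    ("walk", "w ao k"),
])
# Parallel list of graphemes, used as the sorted search keys.
GRAPHEMES = [grapheme for grapheme, _ in DICTIONARY]


def lookup(word):
    """Binary-search the sorted graphemes; return the phoneme string or None."""
    index = bisect_left(GRAPHEMES, word)
    if index < len(GRAPHEMES) and GRAPHEMES[index] == word:
        return DICTIONARY[index][1]
    return None


print(lookup("dog"))
print(lookup("bird"))
```

Because the whole list must be resident for the search to run, the sketch also shows why the memory footprint grows with the number of entries.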
The digital vocalizer receives the phoneme data generated by the engine. Based on the phoneme data together with timing and stress data, the digital vocalizer generates sound signals for “reading” or “speaking” the input text. Typically, the digital vocalizer employs a sound and speaker system for producing the audible characterization of the input text.
To improve on memory requirements of speech synthesizers, another design was developed. In that design, the dictionary is replaced by a rule set. Alternatively, the rule set is used in combination with the dictionary instead of completely substituting therefor. At any rate, the rule set is a group of statements in the form
IF (condition)-then-(phonemic result)
Each such statement determines the phoneme for a grapheme that matches the IF condition. Examples of rule-based speech synthesizers are DECTALK by Digital Equipment Corporation of Maynard, Mass. and TrueVoice by Centigram Communications of San Jose, Calif. Though the use of rule sets reduces the number of entries required in a dictionary for a speech synthesizer system, the dictionaries remain relatively large in size (i.e., number of entries) compared to other parts of the system requiring memory. This is problematic because dictionaries must be completely stored in memory during the speech synthesis process to ensure fast and efficient look-up of entries if needed.
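For illustration, a rule set of the above form may be sketched as an ordered list of (condition, phonemic result) pairs applied left to right over the input. The rules, the phonetic symbols, and the first-match-wins ordering below are assumptions made for this sketch and are not taken from DECTALK, TrueVoice, or any other product.

```python
# Hypothetical rules: IF the text at the current position starts with the
# grapheme on the left, THEN emit the phoneme symbol on the right.
# Longer graphemes are listed first so that they take precedence.
RULES = [
    ("sh", "sh"),
    ("ee", "iy"),
    ("s", "s"),
    ("h", "hh"),
    ("e", "eh"),
    ("p", "p"),
    ("t", "t"),
]


def apply_rules(text, rules=RULES):
    """Translate text to a phoneme string using the first rule that matches."""
    phonemes = []
    position = 0
    while position < len(text):
        for grapheme, phoneme in rules:
            if text.startswith(grapheme, position):
                phonemes.append(phoneme)
                position += len(grapheme)
                break
        else:
            # No rule covers this character; a real system would fall back
            # to a dictionary entry or a default rule.
            raise ValueError(f"no rule matches at position {position}")
    return " ".join(phonemes)


print(apply_rules("sheep"))
print(apply_rules("step"))
```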
These and other problems exist in speech synthesizer technology. New solutions have been attempted but with little success. As a result, highly accurate and/or memory space efficient speech synthesizers are yet to come.
Dictionaries used by text-to-speech synthesis systems may grow to become quite large. Dictionary size depends on how many words or word portions in a particular language are determined to be too complex, too difficult or too time consuming to translate into phonemes by rule set processing alone. Such words or word portions are candidates to be included as entries in the dictionary. However, certain problems are encountered when large dictionaries are used in text-to-speech synthesis systems as mentioned above.
The invention recognizes the problems with prior art text-to-speech synthesis systems that use dictionaries and provides an apparatus to reduce the overall size of the dictionaries used in such systems. Specifically, the invention uses a two phase dictionary reduction process to eliminate entries that are not required in the dictionary. In phase one, any entries in the dictionary with respective phonemes that can be fully generated by rules in a rule set are marked or indicated to be deleted from the dictionary. In phase two, any entries in the dictionary, called root word entries, that can provide phonemes for the text-to-speech translation process of larger (longer) entries are marked or indicated to be saved in the dictionary, and the entries of longer character strings that can be translated using the shorter root word entries in conjunction with rules are indicated to be deleted from the dictionary. After phase one and/or phase two are complete, the invention aggregates the entries marked to be saved or removes the entries marked to be deleted and the resulting set of entries is stored as the reduced dictionary.
Phase one or phase two of the invention each may be performed independently, followed by the aggregation step. Alternatively, phase one may be followed by phase two and then by the aggregation process.
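The aggregation step may be sketched as follows, assuming the dictionary is a mapping from grapheme string to phoneme string and the marks are a mapping from grapheme string to "save" or "delete". Both representations and the name `aggregate` are hypothetical. The sketch takes the "remove the entries marked to be deleted" alternative, so unmarked entries are kept.

```python
def aggregate(dictionary, marks):
    """Return the reduced dictionary: every entry not marked "delete"."""
    return {
        grapheme: phoneme
        for grapheme, phoneme in dictionary.items()
        if marks.get(grapheme) != "delete"
    }


# Hypothetical entries and marks, as either phase might leave them.
full = {"cat": "k ae t", "walk": "w ao k", "walking": "w ao k ih ng"}
marks = {"walk": "save", "walking": "delete"}
print(aggregate(full, marks))
```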
In order for embodiments of phase one to determine if the phoneme of an entry in the dictionary can be fully generated (and hence the dictionary entry can be fully matched) by using the rule set, the invention apparatus generates a rule-based phoneme string for the grapheme string of the subject entry and then determines if the rule-based phoneme string matches the corresponding phoneme string of the entry. If there is a match, the subject entry is indicated to be deleted from the dictionary, thus reducing overall dictionary size. Since rules alone can produce the required phoneme string for the subject entry, the invention recognizes that there is no need for the entry to remain in the dictionary.
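By way of example only, the phase one comparison may be sketched as below. The letter-to-sound table `LETTER_SOUNDS` stands in for a real rule set, and it, the phonetic symbols, and the function names are hypothetical.

```python
# Hypothetical stand-in for the rule set: one sound per letter.
LETTER_SOUNDS = {"c": "k", "a": "ae", "t": "t", "o": "ao", "n": "n", "e": "eh"}


def rule_phonemes(grapheme):
    """Generate a rule-based phoneme string for a grapheme string."""
    return " ".join(LETTER_SOUNDS.get(ch, ch) for ch in grapheme)


def phase_one(dictionary, rules=rule_phonemes):
    """Return graphemes whose stored phoneme string the rules reproduce exactly."""
    to_delete = set()
    for grapheme, phoneme in dictionary.items():
        if rules(grapheme) == phoneme:
            # Rules alone suffice, so the entry is indicated for deletion.
            to_delete.add(grapheme)
    return to_delete


entries = {"cat": "k ae t", "one": "w ah n"}
print(phase_one(entries))
```

In this sketch "cat" is indicated for deletion because the rules regenerate its stored phoneme string, while "one" stays because the rules yield a different string.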
Embodiments of phase one may also check if the grapheme string of a dictionary entry is a homograph. If so, the preferred embodiment skips to the next entry in the dictionary for processing. A homograph is a word that can be pronounced two different ways but which has one spelling, such as “abstract”, “wind”, and “record”. Due to multiple pronunciations, homograph dictionary entries are skipped since they may have more than one associated phoneme string. During text-to-speech processing, the correct phoneme string is selected from a homograph dictionary entry based on the context of surrounding language in the text being translated.
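A minimal sketch of the homograph check follows, assuming each dictionary entry stores a list of phoneme strings so that a homograph is any entry with more than one. That storage layout, the phonetic symbols, and the function names are assumptions of the sketch.

```python
def is_homograph(pronunciations):
    """An entry with more than one phoneme string is treated as a homograph."""
    return len(pronunciations) > 1


def non_homograph_entries(dictionary):
    """Yield (grapheme, phoneme) for entries that phase one may process."""
    for grapheme, pronunciations in dictionary.items():
        if is_homograph(pronunciations):
            continue  # skip to the next entry, leaving the homograph in place
        yield grapheme, pronunciations[0]


entries = {
    "record": ["r eh k er d", "r ih k ao r d"],  # noun and verb readings
    "cat": ["k ae t"],
}
print(list(non_homograph_entries(entries)))
```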
Embodiments of phase two determine if dictionary entries, referred to as root word entries, are required in the dictionary. This is accomplished by the invention combining grapheme and phoneme strings of the root word entry from the dictionary with respective grapheme and phoneme portions of an affix rule of an affix rule set of the speech synthesis system. This step of combining forms a grapheme combination and phoneme combination pair. Phase two then determines if the grapheme combination and phoneme combination pair exists as another matching entry in the dictionary, and if so, indicates the root word entry to be saved in the dictionary. The matching entry is thus marked for removal/deletion. Thus, phase two saves root words in the dictionary that can be used to assist in the translation of another longer word (the matching entry) in conjunction with rule-based processing, and removes the matching entries from the dictionary which can be correctly translated with a combination of rule processing and root word phonemes.
To create the grapheme combination and phoneme combination pair, embodiments of phase two select and process each root word entry in the dictionary. Specifically for each root word entry, the invention combines the grapheme string of the root word entry with the grapheme portion of the affix rule to form a grapheme combination, and combines the phoneme string of the root word entry with the phoneme portion of the affix rule to form a phoneme combination. Then phase two determines if the grapheme combination exists as a matching grapheme string in an entry in the dictionary. If so, the invention obtains the corresponding phoneme string as a matching phoneme string for the matching entry. Then, phase two determines if the phoneme combination matches the matching phoneme string, and if so, indicates the root word entry to be saved in the dictionary. Thus, the root words that are saved in the dictionary are root words that can be used in the translation of the other matching entries. Phase two also determines if the matching entry has been indicated to be saved in the dictionary. If not, the invention indicates the matching entry to be deleted from the dictionary. As such, phase two reduces the dictionary size by determining which entries rely on phonemes of root words, and saves the root words and deletes entries that can be matched by the root words and rule processing.
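The phase two steps above may be sketched as follows. The sketch assumes that each affix rule is a suffix given as a (grapheme portion, phoneme portion) pair, that combining means appending, and that the dictionary maps grapheme strings to phoneme strings; these representations, the example entries, and the name `phase_two` are hypothetical. Prefix rules are not handled.

```python
def phase_two(dictionary, affix_rules):
    """Return (saved, deleted) sets of graphemes after root word analysis."""
    saved, deleted = set(), set()
    for root, root_phonemes in dictionary.items():
        for affix_grapheme, affix_phonemes in affix_rules:
            # Form the grapheme combination and phoneme combination pair.
            grapheme_combo = root + affix_grapheme
            phoneme_combo = root_phonemes + " " + affix_phonemes
            # Does the pair exist as another entry in the dictionary?
            if dictionary.get(grapheme_combo) == phoneme_combo:
                saved.add(root)
                # Delete the matching entry unless it is already saved.
                if grapheme_combo not in saved:
                    deleted.add(grapheme_combo)
    return saved, deleted


entries = {
    "walk": "w ao k",
    "walking": "w ao k ih ng",
    "sing": "s ih ng",
    "sang": "s ae ng",
}
suffixes = [("ing", "ih ng"), ("s", "z")]
print(phase_two(entries, suffixes))
```

Here "walk" is saved because "walk" plus the "ing" rule reproduces the stored entry for "walking", which is therefore indicated for deletion. "sing" and "sang" receive no mark because no combination matches another entry. Since the saved check is made when each match is found, the outcome of this sketch can depend on the order in which entries are visited.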
By using either phase one or phase two alone, or phase one followed by phase two, the invention reduces the number of entries in a dictionary. To that end, the invention computer apparatus forms a reduced (i.e., smaller in size) dictionary. The reduced dictionary is adaptable to text-to-speech synthesis applications requiring minimal storage space, entry search time, and dictionary load time.