Generally speaking, a "speech synthesizer" is a computer device or system for generating audible speech from written text. That is, a written form of a string or sequence of characters (e.g., a sentence) is provided as input, and the speech synthesizer generates the spoken equivalent or audible characterization of the input. The generated speech output is not merely a literal reading of each input character, but a language dependent, in-context verbalization of the input. If the input was the phone number (508) 691-1234 given in response to a prior question of "What is your phone number?", the speech synthesizer does not produce the reading "parenthesis, five hundred eight, close parenthesis, six hundred ninety-one . . . " Instead, the speech synthesizer recognizes the context and supporting punctuation and produces the spoken equivalent "five (pause) zero (pause) eight (pause) six . . . " just as an English-speaking person normally pronounces a phone number.
Historically the first speech synthesizers were formed of a dictionary, engine and digital vocalizer. The dictionary served as a look-up table. That is, the dictionary cross referenced the text or visual form of a character string (e.g., word or other unit) and the phonetic pronunciation of the character string/word. In linguistic terms the visual form of a character string unit (e.g., word) is called a "grapheme" and the corresponding phonetic pronunciation is termed a "phoneme". The phonetic pronunciation or phoneme of character string units is indicated by symbols from a predetermined set of phonetic symbols.
The engine is the working or processing member that searches the dictionary for a character string unit (or combination thereof) matching the input text. In basic terms, the engine performs pattern matching between the sequence of characters in the input text and the sequence of characters in "words" (character string units) listed in the dictionary. Upon finding a match, the engine obtains from the dictionary entry (or combination of entries) of the matching word (or combination of words), the corresponding phoneme or combination of phonemes. To that end, the purpose of the engine is thought of as translating a grapheme (input text) to a corresponding phoneme (the corresponding symbols indicating pronunciation of the input text).
Typically the engine employs a binary search through the dictionary for the input text. The dictionary is loaded into the computer processor physical memory space (RAM) along with the speech synthesizer program. The memory footprint, i.e., the physical memory space in RAM needed while running the speech synthesizer program, thus must be large enough to hold the dictionary. Where the dictionary portion of today's speech synthesizers continue to grow in size, the memory footprint is problematic due to the limited available memory (RAM and ROM) in some/most applications.
The digital vocalizer receives the phoneme data generated by the engine. Based on the phoneme data together with timing and stress data, the digital vocalizer generates sound signals for "reading" or "speaking" the input text. Typically, the digital vocalizer employs a sound and speaker system for producing the audible characterization of the input text.
To improve on memory requirements of speech synthesizers, another design was developed. In that design, the dictionary is replaced by a rule set. Alternatively, the rule set is used in combination with the dictionary instead of completely substituting therefor. At any rate, the rule set is a group of statements in the form
IF (condition)-then-(phonemic result)
Each such statement determines the phoneme for a grapheme that matches the IF condition. Examples of rule-based speech synthesizers are DECTALK by Digital Equipment Corporation of Maynard, Mass. and TrueVoice by Centigram Communications of San Jose, Calif. Though the use of rule sets reduces the number of entries required in a dictionary for a speech synthesizer system, the dictionaries remain relatively large in size (i.e., number of entries) compared to other parts of the system requiring memory. This is problematic because dictionaries must be completely stored in memory during the speech synthesis process to ensure fast and efficient look-up of entries if needed.
These and other problems exist in speech synthesizer technology. New solutions have been attempted but with little success. As a result, highly accurate and/or memory space efficient speech synthesizers are yet to come.