As the popularity of computer systems and computer use grows, new technologies and applications are being developed which make computers appear to act more like people than machines. One such area of technology is computer speech development and processing. Speech synthesis, the ability of a computer to "read" and "speak" text to a user, is a complex task requiring large amounts of computer processing power and intricate programming steps.
In known computer speech synthesis systems, various mechanisms have been developed to allow a computer to "speak" text that is input or stored within the computer. These systems convert text into speech signals which are electronically "spoken" or vocalized through a speaker.
Generally, current speech synthesis systems are formed of a dictionary, a search (or processing) engine, and a digital vocalizer. The dictionary serves as a look-up table. That is, the dictionary cross references the text or visual form of a character string (e.g., word or other unit) and the phonetic pronunciation of the character string/word. In linguistic terms, the visual form of a character string unit (e.g., word) is generally called a "grapheme" and the corresponding phonetic pronunciation is termed a "phoneme." The phonetic pronunciation or phoneme of a character string unit is indicated by symbols from a predetermined set of phonetic symbols. To date, there is little standardization of phoneme symbol sets and usage of the same in speech synthesizers.
The engine is the working or processing member that searches the dictionary for a character string unit (or combination thereof) matching the input text. In basic terms, the engine performs pattern matching between the sequence of characters in the input text and the sequence of characters in "words" (character string units) listed in the dictionary. Upon finding a match, the engine obtains from the dictionary entry (or combination of entries) of the matching word (or combination of words), the corresponding phonemes or combination of phonemes. To that end, the purpose of the engine is thought of as translating a grapheme (input text) to a corresponding phoneme (the corresponding symbols indicating pronunciation of the input text).
Typically the engine employs a binary search through the dictionary for the input text. The dictionary is loaded into the computer processor physical memory space (RAM) along with the speech synthesizer program. The memory footprint, i.e., the physical memory space in RAM needed while running the speech synthesizer program, thus must be large enough to hold the dictionary. Where the dictionary portion of today's speech synthesizers continue to grow in size, the memory footprint is problematic due to the limited available memory (RAM and ROM) in some/most applications.
Thus to further improve speech synthesizers, another design was developed. In that design, the dictionary is replaced by a rule set. Alternatively, the rule set is used in combination with the dictionary instead of completely substituting therefor. At any rate, the rule set may be represented as a group of statements in the form:
IF (condition)-then-(phonemic result).
Each such statement (or rule) determines the phoneme for a grapheme that matches the IF condition. Examples of rule-based speech synthesizers are DECtalk by Digital Equipment Corporation of Maynard, Mass. and TrueVoice by Centigram Communications of San Jose, Calif.
In a rule-based speech synthesis system, each rule of the rule set is considered with respect to the input text. Processing typically proceeds one word or unit at a time from the beginning to the end of the original text. Each word or input text unit is then processed in right to left fashion. If the rule conditions ("If-Condition" part of the rule) match any portion of the input text, then the engine determines that the rule applies. As such, the engine stores the corresponding phoneme data (i.e., phonemic result) from the rule in a buffer. The engine similarly processes each succeeding rule in the rule set against the input text (i.e., remainder parts thereof for which phoneme data is needed). After processing all the rules of the rule set, the buffer holds the phoneme data corresponding to the input text.
The resultant phoneme data from the buffer is then used by the digital vocalizer to electronically produce an audible characterization of the phonemes that have been strung together in the buffer. The digital vocalizer generates electrical sound signals for each phoneme together with appropriate pausing and emphasis based on relations and positions to other phonemes in the buffer.
The generated electrical sound signals are converted by a transducer (e.g., a loudspeaker) to sound waves that "speak" the input text. To the listening user, the computer system (i.e., speech synthesizer) appears to be "reading" or "speaking" the input text.
There are a limited number of phonemes for any particular language, such as English. The entire set of phonemes for a language generally represents each sound utterance that can be made when speaking words in that language. Character arrangements for words in a language, however, may exist in an almost infinite number of arrangements. Sometimes, as input text is scanned, certain arrangements of characters will not match any rule in the rule set. In this case, the dictionary is maintained of words and portions of words which do not fit or easily match rules within the rule set. The engine first consults the dictionary to lookup an appropriate phoneme representation for the input text or portion(s) thereof. From the subject dictionary entry, the phoneme or phonemes for that portion of the input text are placed into the buffer. If certain words/character strings do not match entries in the dictionary, then the speech synthesizer applies the rules to obtain a phonetic pronunciation for those words.