Speech synthesis is useful in any system where a written word is to be presented orally. It is possible to store a phonetic transcription of a number of words in a pronunciation dictionary, and play an oral representation of the phonetic transcription when the corresponding written word is recognised in the dictionary. However, such a system has the drawback that it can only output words that are held in the dictionary. A word that is not in the dictionary cannot be output, because no phonetic transcription is stored for it. While more words may be stored in the dictionary, along with their phonetic transcriptions, this increases the size of the dictionary and the storage required for the associated phonetic transcriptions. Furthermore, it is simply impossible to add all possible words to the dictionary, because the system may be presented with new words and words from foreign languages.
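By way of illustration only, the dictionary-only approach described above can be sketched as follows. The dictionary entries, the phoneme notation and the name `lookup_transcription` are hypothetical and are not taken from any particular system; the sketch merely shows that a word absent from the dictionary yields no transcription.

```python
from typing import Optional

# Hypothetical toy pronunciation dictionary: written word -> phonetic transcription.
# The entries and the phoneme notation are illustrative assumptions only.
PRONUNCIATIONS = {
    "speech": "s p iy1 ch",
    "system": "s ih1 s t ax m",
}


def lookup_transcription(word: str) -> Optional[str]:
    """Return the stored transcription, or None if the word is not in the dictionary."""
    return PRONUNCIATIONS.get(word.lower())


if __name__ == "__main__":
    for word in ("Speech", "zeitgeist"):
        transcription = lookup_transcription(word)
        if transcription is None:
            # A dictionary-only system has nothing to output for this word.
            print(f"{word}: no transcription stored")
        else:
            print(f"{word}: {transcription}")
```

In such a sketch, every additional word that is to be supported costs one more stored entry, which is the storage growth noted above.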
Therefore, it is advantageous to attempt to predict the phonetic transcription of words in the pronunciation dictionary, for two reasons. Firstly, phonetic transcription prediction will ensure that words that are not held in the dictionary will receive a phonetic transcription. Secondly, words whose phonetic transcriptions are predictable can be stored in the dictionary without their corresponding transcriptions, thus reducing the storage requirements of the system.
One important component of the phonetic transcription of a word is the location of the word's primary lexical stress (the syllable in the word which is pronounced with the most emphasis). A method of predicting the location of lexical stress is thus an important component of predicting the phonetic transcription of a word.
Two basic approaches to lexical stress prediction currently exist. The earlier of these approaches is based entirely on manually specified rules (e.g., Church, 1985; patent U.S. Pat. No. 4,829,580; Ogden, patent U.S. Pat. No. 5,651,095). Such rules have two principal drawbacks. Firstly, they are time-consuming to create and maintain, which is especially problematic when creating rules for a new language or moving to a new phoneme set (a phoneme is the smallest phonetic unit within a language that is capable of conveying distinct meaning). Secondly, manually specified rules are generally not robust, generating poor results for words that differ significantly from those used to develop the rules, such as proper names and loanwords (words originating from a language other than that of the dictionary).
The second approach to lexical stress prediction is to use the local context around a target letter, i.e. the identities of the letters on each side of the target letter, to determine the stress of the target letter, generally by some automatic technique such as decision trees or memory-based learning. This approach also has two drawbacks. Firstly, stress often cannot be determined from the local context alone (typically between 1 and 3 letters on each side) that is used by these models. Secondly, decision trees and especially memory-based learning are not low-memory techniques, and thus would be difficult to adapt for use in low-memory text-to-speech systems.
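By way of illustration only, the local context used by such models can be sketched as a fixed-width window of letters around each target letter. The function name `context_windows`, the padding character and the default window width are assumptions made for this sketch; the classifier that would consume these windows (for example a decision tree) is not shown.

```python
from typing import List, Tuple


def context_windows(word: str, width: int = 2, pad: str = "_") -> List[Tuple[str, str, str]]:
    """Return (left context, target letter, right context) for every letter in the word.

    `width` is the number of letters taken on each side of the target letter;
    positions beyond the word boundary are filled with `pad`.
    """
    padded = pad * width + word + pad * width
    windows = []
    for i, letter in enumerate(word):
        # The target letter sits at index i + width in the padded string.
        left = padded[i:i + width]
        right = padded[i + width + 1:i + 2 * width + 1]
        windows.append((left, letter, right))
    return windows


if __name__ == "__main__":
    for left, target, right in context_windows("stress", width=2):
        print(f"{left} [{target}] {right}")
```

Each window contains only the letters within `width` positions of the target, so any part of the word lying outside the window is unavailable to a model that relies on these windows alone.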
It is therefore an object of the invention to provide a low-memory text-to-speech system, and a further object of the invention to provide a method of preparing the same.