This invention relates to a novel text-to-speech synthesizer, to a speech synthesizing method and to products embodying the speech synthesizer or method, including voice recognition systems. The methods and systems of the invention are suitable for computer implementation, e.g. on personal computers, and other computerized devices, the invention also includes such computerized systems and methods.
Three different kinds of speech synthesizers have been described theoretically, namely articulatory, formant and concatenated speech synthesizers. Formant and concatenated speech synthesizers have been developed for commercial use.
The formant synthesizer was an early, highly mathematical speech synthesizer. The technology of formant synthesis is based on acoustic modeling employing parameters related to a speaker's vocal tract such as the fundament frequency, length and diameter of the vocal tract, air pressure parameters and so on. Formant-based speech synthesis may be fast and low cost, but the sound generated is esthetically unsatisfactory to the human ear. It is usually artificial and robotic or monotonous.
Synthesizing the pronunciation of a single word requires sounds that correspond to the articulation of consonants and vowels so that the word is recognizable. However, individual words have multiple ways of being pronounced, such as formal and informal pronunciations. Many dictionaries provide a guide not only to the meaning of a word, but also to its pronunciation. However, pronouncing each word in a sentence according to a dictionary's phonetic notations for the word results in monotonous speech which is singularly unappealing to the human ear.
To address this problem, prior to the present invention, many commercially available speech synthesizers employed a concatenative speech synthesis method. Basic speech units in the International Phonetic Alphabet (IPA) dictionary for example phonemes, diphones, and triphones, are recorded from an individual's pronunciations and are “concatenated”, or chained together to form synthesized speech. While the output concatenative speech quality may be better than that of formative speech, the audible experience in many cases is still unsatisfactory, owing to problems known as “glitches” which may be attributable to imperfect merges between adjacent speech units.
Other significant drawbacks of concatenated synthesizers are requirements for large speech unit databases and high computational power. In some cases, concatenated synthesis employing whole words and sometimes phrases of recorded speech, may make voice identity characteristics clearer. Nevertheless, the speech still suffers from poor prosody when one listens to sentences and paragraphs of “synthesized” speech using the longer prerecorded units. “Prosody” can be understood as involving the pace, rhythmic and tonal aspects of language. It may also be considered as embracing the qualities of properly spoken language that distinguish human speech from traditional concatenated and formant machine speech which is generally monotonous.
Known text-normalizers and text-parsers employed in speech synthesizers are word-by-word and, in the case of concatenated synthesis, sometimes phrase-by-phrase. The individual word approach, even with individual word stress, quickly becomes perceived as robotic. The concatenated approach, while having some improved voice quality, soon becomes repetitious, and glitches may result in misalignments of amplitudes and pitch.
The natural musicality of the human voice may be expressed as prosody in speech, the elements of which include the articulatory rhythm of the speech and changes in pitch and loudness. Traditional formant speech synthesizers cannot yield quality synthesized speech with prosodies relevant to the text to be pronounced and relevant to the listener's reason for listening. Examples of such prosodies are reportorial, persuasive, advocacy, human interest and others.
Natural speech has variations in pitch, rhythm, amplitude, and rate of articulation. The prosodic pattern is associated with surrounding concepts, that is, with prior and future words and sentences. Known speech synthesizers do not satisfactorily take account of these factors. Addison, et al. commonly owned U.S. Pat. Nos. 6,865,533 and 6,847,931 disclose and claim methods and systems employing expressive parsing.
The foregoing description of background art may include insights, discoveries, understandings or disclosures, or associations together of disclosures, that were not known to the relevant art prior to the present invention but which were provided by the invention. Some such contributions of the invention may have been specifically pointed out herein, whereas other such contributions of the invention will be apparent from their context. Merely because a document may have been cited here, no admission is made that the field of the document, which may be quite different from that of the invention, is analogous to the field or fields of the present invention.