The present invention relates to methods and apparatus for synthesizing speech from text.
A wide variety of electronic systems that convert text to speech sounds are known in the art. Usually the text is supplied in an electrical digitally coded format, such as ASCII, but in principle it does not matter how the text is initially presented. Every text-to-speech (TTS) system, however, must convert the input text to a phonetic representation, or pronunciation, that is then converted into sound. Thus, a TTS system can be characterized as a transducer between representations of the text. Much effort has been expended to make the output of TTS systems sound "more natural" viz more like speech from a human and less like sound from a machine.
A very simple system might use merely a fixed dictionary of word-to-phonetic entries. Such a dictionary would have to be very large in order to handle a sufficiently large number of words, and a high-speed processor would be necessary to locate and retrieve entries from the dictionary with sufficiently high speed.
To help avoid such drawbacks, other systems, such as that described in U.S. Pat. No. 4,685,135 to Lin et al., use a set of rules for conversion of words to phonetics. In the Lin system, phonetics-to-sound conversion is accomplished with allophones and linear predictive coding (LPC), and stress marks must be added by hand in the input text stream. Unfortunately, a system using a simplistic set of rules for converting words to phonetic representations will inevitably produce erroneous pronunciations for some words because many languages, including English, have no simple relationship between orthography and pronunciation. For example, the orthography, or spelling, of the English words "tough", "though", and "through" bears little relation to their pronunciation.
Accordingly, some systems, such as that described in U.S. Pat. No. 4,692,941 to Jacks et al., convert orthography to phonemes by first examining a keyword dictionary (giving pronouns, articles, etc.) to determine basic sentence structure, then checking an exception dictionary for common words that fail to follow the rules, and then reverting to the rules for words not found in the exception dictionary. In the system described in the Jacks et al. patent, the phonemes are converted to sound using a time-domain technique that permits manipulation of pitch. The patent suggests that inflection, speech and pause data can be determined from the keyword information according to standard rules of grammar, but those methods and rules are not provided, although the patent mentions a method of raising the pitch of words followed by question marks and lowering the pitch of words followed by a periods.
Another prior TTS system is described in U.S. Pat. No. 4,979,216 to Malsheen et al., which uses rules for conversion to phonetics and a large exception dictionary of 3000-5000 words. The basic sound unit is the phoneme or allophone, and parameters are stored as formants.
Such systems inevitably produce erroneous pronunciations because many languages have words, or character strings, that have several pronunciations depending on the grammatical roles the strings play in the text. For example, the English strings "record" and "invalid" both have two pronunciations in phrases such as "to record a new record" and "the invalid's invalid check".
In dealing with such problems, most prior TTS systems either avoid or treat secondarily the problem of varying the stress of output syllables. A TTS system could ignore stress variations, but the result would probably be unintelligible as well as sound unnatural. Some systems, such as that described in the Lin patent cited above, require that stress markers be inserted in the text by outside means: a laborious process that defeats many of the purposes of a TTS system.
"Stress" refers to the perceived relative force with which a sound, syllable, or word is uttered, and the pattern of stresses in a sequence of words is a highly complicated function of the physical parameters of frequency, amplitude, and duration. "Orthography" refers to the system of spelling used to represent spoken language.
In contrast to the approaches of prior TTS systems, it is believed that the accuracy of stress patterns can be even more important than the accuracy of phonetics. To achieve stress pattern accuracy, however, a TTS system must take into account that stress patterns also depend on grammatical role. For example, the English character strings "address", "export", and "permit" have different stress patterns depending on whether they are used as nouns or verbs. Applicant's TTS system considers stress (and phonetic accuracy in the presence of orthographic irregularities) to be so important that it uses a large dictionary and a natural-language parser, which determines the grammatical role each word plays in the sentence and then selects the pronunciation that corresponds to that grammatical role.
It should be appreciated that Applicant's system does more than merely make an exception dictionary larger; the presence of the grammatical information in the dictionary and the use of the parser result in a system that is fundamentally different from prior TTS systems. Applicant's approach guarantees that the basic glue of English is handled correctly in lexical stress and in phonetics, even in cases that would be ambiguous without the parser. The parser also provides information on sentence structure that is important for providing the correct intonation on phrases and clauses, i.e., for extending intonation and stress beyond individual words, to produce the correct rhythm of English sentences. The parser in Applicant's system enhances the accuracy of the stress variations in the speech produced among other reasons because it permits identification of clause boundaries, even of embedded clauses that are not delimited by punctuation marks.
Applicant's approach is extensible to all languages having a written form in a way that rule-based text-to-phonetics converters are not. For a language like Chinese, in which the orthography bears no relation to the phonetics, this is the only option. Also for languages like Hebrew or Arabic, in which the written form is only "marginally" phonetic (due, in those two cases, to the absence of vowels in most text), the combination of dictionary and natural-language parser can resolve the ambiguities in the text and provide accurate output speech.
Applicant's approach also offers advantages for languages (e.g., Russian, Spanish, and Italian) that may be superficially amenable to rule-based conversion (i.e., where rules might "work better" than for English because the orthography corresponds more closely to the phonetics). For such languages, the combination of a dictionary and parser still provides the information on sentence structure that is critical to the production of correct intonational patterns beyond the simple word level. Also for languages having unpredictable stress (e.g., Russian, English, and German), the dictionary itself (or the combination of dictionary and parser) resolves the stress patterns in a way that a set of rules cannot.
Most prior systems do not use a full dictionary because of the memory required; the Lin et al. patent suggests that a dictionary of English words requires 600 K bytes of RAM. Applicant's dictionary with phonetic and grammatical information requires only about 175 K bytes. Also, it is often assumed that a natural-language parser of English would be too time consuming for practical systems.
This invention is an innovative approach to the problem of text-to-speech synthesis, and can be implemented using only the minimal processing power available on MACINTOSH-type computers available from Apple Computer Corp. The present TTS system is flexible enough to adapt to any language, including languages such as English for which the relationship between orthography and phonetics is highly irregular. It will be appreciated that the present TTS system, which has been configured to run on Motorola M68000 and Intel 80386SX processors, can be implemented with any processor, and has increased phonetic and stress accuracy compared to other systems.
Applicant's invention incorporates a parser for a limited context-free grammar (as contrasted with finite-state grammars) that is described in Applicant's commonly assigned U.S. Pat. No. 4,994,966 for "System and Method for Natural Language Parsing by Initiating Processing prior to Entry of Complete Sentences" (hereinafter "the '966 patent"), which is hereby incorporated in this application by reference. It will be understood that the present invention is not limited in language or size of vocabulary; since only three or four bytes are needed for each word, adequate memory capacity is usually not a significant concern in current small computer systems.