1. Field of the Invention
The present invention relates to automated synthesis of human speech from computer readable text, such as that stored in databases or generated by data processing systems automatically or via a user. Such systems are under current consideration and are being placed in use, for example, by banks or telephone companies to enable customers to readily access information about accounts, telephone numbers, addresses and the like.
Text-to-speech synthesis is seen to be potentially useful to automate or create many information services. Unfortunately, to date, most commercial systems for automated synthesis remain too unnatural and machine-like for all but the simplest and shortest texts. Those systems have been described as sounding monotonous, boring, mechanical, harsh, disdainful, peremptory, fuzzy, muffled, choppy, and unclear. Synthesized isolated words are relatively easy to recognize, but when these are strung together into longer passages of connected speech (phrases or sentences) it is much more difficult to follow the meaning: studies have shown that the task is unpleasant and the effort is fatiguing (Thomas and Rossen, 1985).
This less-than-ideal quality seems paradoxical, because published evaluations of synthetic speech yield intelligibility scores that are very close to natural speech. For example, Greene, Logan and Pisoni (1986) found the best synthetic speech could be transcribed with 96% accuracy; the several studies that have used human speech tokens typically report intelligibility scores of 96% to 99% for natural speech. (For a review see Silverman, 1987). The majority of these evaluations focus on segmental intelligibility: the accuracy with which listeners can transcribe the consonants and (much less commonly) vowels of short isolated words.
However, segmental intelligibility does not always predict comprehension. A series of experiments (Silverman et al., 1990a, 1990b; Boogaart and Silverman, 1992) compared two high-end commercially available text-to-speech systems on application-like material such as news items, medical benefits information, and names and addresses. The result was that the system with the significantly higher segmental intelligibility had the lower comprehension scores. There is more to successful speech synthesis than just getting the phonetic segments right.
Although there may be several possible reasons for segmental intelligibility failing to predict comprehension, the invention offers an improved voice synthesis system that addresses the single most likely cause: synthesis of the text's prosody. Prosody is the organization imposed onto a string of words when they are uttered as connected speech. It primarily involves pitch, duration, loudness, voice quality, tempo and rhythm. In addition, it modulates every known aspect of articulation. These dimensions are effectively ignored in tests of segmental intelligibility, but when the prosody is incorrect then at best the speech will be difficult or impossible to understand (Huggins, 1978), at worst listeners will misunderstand it without being aware that they have done so.
The emphasis on segmental intelligibility in synthesis evaluation reflects long-standing assumptions that perception of speech is data-driven in a bottom-up fashion, and relatedly that the spectral modeling of vowels, consonants, and the transitions between them must therefore be the most impoverished and important component of the speech synthesis process. Consequently most research in speech synthesis is concerned with improving the spectral modeling at the segmental level.
In the present invention however, comprehensibility of the text synthesis is improved, inter alia, by addressing the prosodic treatment of the text, by adapting certain prosodic treatment rules exploiting a priori characteristics of the text to be synthesized, and by adopting prosodic treatment rules characteristic of the discourse, that is, the context within which the information in the text is sought by the user of the system. For example, as in the preferred embodiment discussed below, name and address information corresponding to user-inputted telephone numbers is desired by that user. The detailed description below will show how the text and context can be exploited to produce greater comprehensibility of the synthesized text.
2. Description of the Prior Art
In the prior art, typical text-to-speech systems are designed to cope with "unrestricted text" (Allen et al, 1987). Synthesis algorithms for unrestricted text typically assign prosodic features on the basis of syntax, lexical properties, and word classes. This often works moderately well for short simple declarative sentences, but in longer texts or dialogs the meaning is very difficult to follow. In a system designed for unrestricted text, it is difficult to infer the information structure of the text and how it relates to the prior knowledge of the speaker and hearer. The approach taken in these systems to generating the prosody has been to derive it from an impoverished (i.e., significantly more limited than the theoretical possibility) syntactic analysis of the text to be spoken. For example, prior art systems have prosody confined to simple rules designed into them, such as:
1. Content words receive pitch-related prominence, function words do not. Hence the prominences fall on the content words ("synthetic", "speech", "easy" and "understand") in a sentence such as:
synthetic speech is easy to understand
2. Small boundaries, marked with pitch falls and some lengthening of the syllables on the left, are placed wherever there is a content word on the left and a function word on the right. Hence the boundaries (indicated with |):
synthetic speech | is easy | to understand
3. Larger boundaries are placed at punctuation marks. These are accompanied by a short pause, and preceded by either a falling-then-rising pitch shape to cue non-finality in the case of a comma, or finality in the case of a period.
4. Pitch is relatively high at the start of a sentence, and declines over the duration of the sentence to end relatively lower at the end. The local pitch excursions associated with word prominences and boundaries are superposed onto this global downward trend. The global trend is called declination. It is reset at the start of every sentence, and may also be partially reset at punctuation marks within a sentence.
5. There are several ways in which minor deviations from the above principles can be implemented to add variety and interest to an intonation contour. For example in the MITalk system, which is the basis for the well-known DECtalk commercial product, the extent of prominence-lending pitch excursions on content words depends on lexical properties of the word: interrogative adjectives are assigned more emphasis (higher pitch targets), verbs are assigned the least (lower targets), and so on.
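The first, second and fourth rules above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function-word list, the use of upper case to stand in for pitch prominence, the linear shape of the declination line, and the 140 Hz and 100 Hz endpoints are all assumptions made for illustration.

```python
# Hypothetical, deliberately tiny function-word list; a real system would
# use a much fuller lexicon or a part-of-speech tagger.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "or"}


def is_content(word):
    """Treat any word not in the function-word list as a content word."""
    return word.lower() not in FUNCTION_WORDS


def mark_prominence(words):
    """Rule 1: content words get prominence (shown here in upper case)."""
    return [w.upper() if is_content(w) else w for w in words]


def mark_boundaries(words):
    """Rule 2: put '|' wherever a content word is followed by a function word."""
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i + 1 < len(words) and is_content(w) and not is_content(words[i + 1]):
            out.append("|")
    return out


def declination(n_words, start_hz=140.0, end_hz=100.0):
    """Rule 4: a baseline pitch per word, falling linearly over the sentence.

    The linear shape and the endpoint values are illustrative assumptions.
    """
    if n_words <= 1:
        return [start_hz] * n_words
    step = (end_hz - start_hz) / (n_words - 1)
    return [start_hz + i * step for i in range(n_words)]


words = "synthetic speech is easy to understand".split()
print(" ".join(mark_prominence(words)))
print(" ".join(mark_boundaries(words)))
print(declination(len(words)))
```

Because every decision in this sketch is made from the words themselves, it shows concretely why such rules cannot reflect what the speaker and listener already know.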
Different state-of-the-art synthesizers all use basically the same approach, each with their own embellishments, but the general approach is that the prosody is predicted from the intrinsic characteristics of the to-be-synthesized text. This is a necessary consequence of the decision to deal with unrestricted text. The problem with this approach is that prosody is not a lexical property of English words--English is not a tone language. Neither is prosody completely predictable from English syntax--prosody is not a redundant encoding of surface grammatical structure.
Rather, prosody is used by speakers to annotate the information structure of the text string. It depends on the prior mutual knowledge of the speaker and listener, and on the role a particular utterance takes within its particular discourse. It marks which words and concepts are considered by the speaker to be new in the dialogue, it marks which ones are topics and which ones are comments, it encodes the speaker's expectations about what the listener already believes to be true and how the current utterance relates to that belief, it segments a string of sentences into a block structure, it marks digressions, it indicates focused versus background information, and so on. This realm of information is of course unavailable in an unrestricted text-to-speech system, and hence such systems are fundamentally incapable of generating correct discourse-relevant prosody. This is a primary reason why prosody is a bottleneck in speech synthesis quality.
Commercially available synthesizers contain the capability to execute prosody from indicia or markers generated from the internal prosody rules. Many can also execute prosody from indicia supplied externally from a further source. All these synthesizers contain internal features to generate speech (such as in section 32 of the synthesizer 30 of FIG. 1) from indicia and text. In some, internally derived machine-interpretable prosody indicia based on the machine's internal rules (such as may be generated in section 31 of the synthesizer 30 of FIG. 1) are capable of being overridden, replaced or supplemented. Accordingly, one object of the invention in its preferred embodiment is achieved by providing synthesizer-understandable prosody indicia from a supplemental prosody processor, such as that illustrated as preprocessor 40 in FIG. 2, to supplant or override the internal prosody features. Since most real applications of language technology only deal with a constrained topic domain, the invention exploits these constraints to improve the prosody of synthetic speech. This is possible because within the constraints of a particular application one can make many assumptions about the type of text structures to expect, the reasons the text is being spoken, and the expectations of the listener, i.e., just the types of information that are necessary to determine the prosody. This indicates a further aim of the invention, namely, application-specific rules to improve the prosody in a given text-to-speech synthesis application.
There have been attempts made in the past to use the discourse constraints of an application context to generate prosody. Significant pieces of work include:
1. Steven Young and Frank Fallside (Young and Fallside, 1979, 1980) built an application that enabled remote access to status information about East Anglia's water supply system. Field personnel could make telephone calls to an automated system which would answer queries by generating text around numerical data and then synthesizing the resulting sentences. All the desired prosody markers were hand-generated along with the text, and hand-embedded within it, rather than being generated automatically from an automated analysis of the text.
2. Julia Hirschberg and Janet Pierrehumbert (1986) developed a set of principles for manipulating the prosody according to a block structure model of discourse in an automated tutor for vi (a standard text editor). The tutoring program incorporated text-to-speech synthesis to speak information to the student. Here too, however, the prosody was a result of hand-coding of text rather than of an automated text analysis.
3. Jim Davis (1988) built a navigation system that generated travel directions within the Boston metropolitan area. Users are presented with a map of Boston on a computer screen: they can indicate where they currently are, and where they would like to be. The system then generates the text for directions for how to get there. In one version of the system, elements of the discourse structure (such as given-versus-new information, repetition, and grouping of sentences into larger units) were embedded directly in the text by the designer to represent accent placement, boundary placement, and pitch range, rather than being generated by an automated marker generation scheme.
The inventor (see U.S. Pat. No. 4,908,867) has also developed a set of rules to incorporate some aspects of discourse structure into synthetic prosody to improve unrestricted text prosody. Some rules systematically varied pitch range to mark such phenomena as the scope of propositions, beginnings and ends of speaker turns, and hierarchical groupings of prosodic sentences. Other rules used a FIFO buffer of the roots of content words to model the listener's short-term memory for currently-evoked discourse concepts, in order to guide the placement of prominences. Still others used phrasal verbs to correct prosodic boundaries (to correctly distinguish, for instance, between "Turn on | a light" and "Turn | on the second exit"), and performed deaccenting in complex nominals (to give different prosodic treatment, for instance, to "Buildings Galore" as opposed to "Building Company"). These rules were put to a formal evaluation: they were used to synthesize a set of multi-sentence, multi-paragraph texts from a number of different application domains (such as news briefs, advertisements, and instructions for using machinery). Each text was designed such that the last sentence of one paragraph could alternatively be the first sentence of the next paragraph, with a consequent well-defined change in the overall meaning of the text. Twenty volunteers heard one or other version of each text, with the crucial difference marked by the prosody rules, and answered comprehension questions that focused on how they had understood the relevant aspects of the overall meaning. The prosody was found to predict the listeners' comprehension 84% of the time.
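The FIFO-buffer idea described above can be sketched as follows. This is an illustrative reconstruction only, not the rules of the cited patent: the buffer size, the crude suffix-stripping `root` function, the function-word list, and the decision to deaccent any content word whose root is still in the buffer are all assumptions.

```python
from collections import deque

# Hypothetical function-word list, as in any toy example of this kind.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "or"}


def root(word):
    """Crude stand-in for a stemmer: lower-case, drop punctuation, strip one suffix."""
    w = word.lower().strip(".,;:!?")
    for suffix in ("ing", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def place_prominences(words, memory_size=5):
    """Return (word, accented) pairs.

    A bounded FIFO of recent content-word roots models the listener's
    short-term memory. A content word is accented only if its root is
    not currently in that memory (i.e. it is treated as new).
    """
    memory = deque(maxlen=memory_size)  # oldest root drops out when full
    result = []
    for w in words:
        r = root(w)
        if r in FUNCTION_WORDS:
            result.append((w, False))
            continue
        result.append((w, r not in memory))
        memory.append(r)
    return result


words = "The valve is open. Valves open slowly.".split()
for word, accented in place_prominences(words):
    print(f"{word:10s} {'accent' if accented else '-'}")
```

In this toy input the second mentions of "valve" and "open" are left unaccented because their roots are still in the buffer, while "slowly" is accented as new. With a smaller `memory_size`, earlier roots would be forgotten sooner and their repetitions would be accented again.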
However, it remains unclear whether similar prosodic phenomena will influence perception of synthetic speech with real users rather than volunteers, on less controlled and more variable material, in a real-world application. This has theoretical implications--the importance of prosodic organization in models of speech production should reflect its pervasiveness in speech perception--as well as practical implications for effectively exploiting speech synthesis to facilitate remote access to information. For these reasons, this invention addresses prosodic modeling in the context of an existing information-provision service. As can be seen, no automated prosody generation feature (capable of automatically analyzing text) had yet been provided to exploit the particular characteristics of restricted text and the dialog with the user to improve the prosody performance of the then state-of-the-art synthesis devices.
Taking these considerations into account, a speech synthesis system according to the invention has been achieved with the general object of exploiting--for convenience--the existing commercially available synthesis devices, even though these had been designed for unrestricted text. As a specific object, the invention seeks to automatically apply prosodic rules to the text to be synthesized rather than those applied by the designed-in rules of the synthesizer device. More specifically, the invention has the object of utilizing prosody rules applied to an automated text analysis to exploit prosodic characteristics particular to and readily ascertainable from the type and format of the text itself, and from the context and purpose of the discourse involving end-user access to that text. Moreover, improved adaptive speaking rate and enhanced spelling features applicable to both restricted and unrestricted text are provided as a further object. The following discussion will make apparent how these objects may be achieved by the invention, particularly in the context of a preferred embodiment: a synthesized name and address application in a telephone system.