FIG. 1 of the accompanying drawings is a block diagram of an exemplary prior-art speech system comprising an input channel 11 (including speech recognizer 5) for converting user speech into semantic input for dialog manager 7, and an output channel 12 (including text-to-speech converter (TTS) 6) for receiving semantic output from the dialog manager for conversion to speech. The dialog manager 7 is responsible for managing a dialog exchange with a user in accordance with a speech application script, here represented by tagged script pages 15. This exemplary speech system is particularly suitable for use as a voice browser with the system being adapted to interpret mark-up tags, in pages 15, from, for example, four different voice markup languages, namely:
        dialog markup language tags that specify voice dialog behavior;
        multimodal markup language tags that extend the dialog markup language to support other input modes (keyboard, mouse, etc.) and output modes (e.g. display);
        speech grammar markup language tags that specify the grammar of user input; and
        speech synthesis markup language tags that specify voice characteristics, types of sentences, word emphasis, etc.
When a page 15 is loaded into the speech system, dialog manager 7 determines from the dialog tags and multimodal tags what actions are to be taken (the dialog manager being programmed to understand both the dialog and multimodal languages 19). These actions may include auxiliary functions 18 (available at any time during page processing) accessible through application program interfaces (APIs) and including such things as database lookups, user identity and validation, telephone call control, etc. When speech output to the user is called for, the semantics of the output are passed, with any associated speech synthesis tags, to output channel 12 where a language generator 23 produces the final text to be rendered into speech by text-to-speech converter 6 and output (generally via a communications link) to speaker 17. In the simplest case, the text to be rendered into speech is fully specified in the voice page 15 and the language generator 23 is not required for generating the final output text; however, in more complex cases, only semantic elements are passed, embedded in tags of a natural language semantics markup language (not depicted in FIG. 1) that is understood by the language generator. The TTS converter 6 takes account of the speech synthesis tags when effecting text-to-speech conversion, for which purpose it is cognizant of the speech synthesis markup language 25.
User speech input is received by microphone 16 and supplied (generally via a communications link) to an input channel of the speech system. Speech recognizer 5 generates text which is fed to a language understanding module 21 to produce semantics of the input for passing to the dialog manager 7. The speech recognizer 5 and language understanding module 21 work according to specific lexicon and grammar markup language 22 and, of course, take account of any grammar tags related to the current input that appear in page 15. The semantic output to the dialog manager 7 may simply be a permitted input word or may be more complex and include embedded tags of a natural language semantics markup language. The dialog manager 7 determines what action to take next (including, for example, fetching another page) based on the received user input and the dialog tags in the current page 15.
Any multimodal tags in the voice page 15 are used to control and interpret multimodal input/output. Such input/output is enabled by an appropriate recognizer 27 in the input channel 11 and an appropriate output constructor 28 in the output channel 12.
A barge-in control functional block 29 determines when user speech input is permitted over system speech output. Allowing barge-in requires careful management and must minimize the risk of extraneous noises being misinterpreted as user barge-in with a resultant inappropriate cessation of system output. A typical minimal barge-in arrangement in the case of telephony applications is to permit the user to interrupt only upon pressing a specific dual tone multi-frequency (DTMF) key, the control block 29 then recognizing the tone pattern and informing the dialog manager that it should stop talking and start listening. An alternative barge-in policy is to recognize user speech input only at certain points in a dialog, such as at the end of specific dialog sentences that do not themselves mark the end of the system's “turn” in the dialog. This can be achieved by having the dialog manager notify the barge-in control block 29 of the occurrence of such points in the system output, the block 29 then checking to see if the user starts to speak in the immediately following period. Rather than completely ignoring user speech during certain times, the barge-in control can be arranged to reduce the responsiveness of the input channel so that the risk of a barge-in being wrongly identified is minimized. If barge-in is permitted at any stage, it is preferable to require the recognizer to have ‘recognized’ a portion of user input before barge-in is determined to have occurred. Once barge-in is identified, the dialog manager can be set to stop immediately, to continue to the end of the next phrase, or to continue to the end of the system's turn.
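The barge-in policies described above can be illustrated with a minimal Python sketch. It is not part of the FIG. 1 system: the names `BargeInPolicy`, `StopMode` and `is_barge_in`, the default key, and the window length are all hypothetical choices made for illustration. The sketch treats a matching DTMF key as an unconditional interrupt, and otherwise accepts speech only within a short window following a point notified by the dialog manager, and only once the recognizer has produced some recognized input.

```python
from dataclasses import dataclass
from enum import Enum


class StopMode(Enum):
    """How the dialog manager reacts once barge-in has been identified."""
    IMMEDIATE = "immediate"
    END_OF_PHRASE = "end_of_phrase"
    END_OF_TURN = "end_of_turn"


@dataclass
class BargeInPolicy:
    # All fields are hypothetical configuration values, not taken from the text.
    dtmf_key: str = "#"            # key that always interrupts system output
    window_s: float = 1.5          # listening window after a notified point
    min_recognized_words: int = 1  # input the recognizer must have produced
    stop_mode: StopMode = StopMode.END_OF_PHRASE


def is_barge_in(policy, *, dtmf=None, speech_time=None,
                last_point_time=None, recognized_words=0):
    """Return True if the event should be treated as a user barge-in."""
    if dtmf is not None:
        # DTMF arrangement: only the designated key interrupts.
        return dtmf == policy.dtmf_key
    if speech_time is None or last_point_time is None:
        # No speech, or the dialog manager has not notified a barge-in point.
        return False
    in_window = 0.0 <= speech_time - last_point_time <= policy.window_s
    # Require some recognized input so that noise alone does not interrupt.
    return in_window and recognized_words >= policy.min_recognized_words


policy = BargeInPolicy()
print(is_barge_in(policy, dtmf="#"))
print(is_barge_in(policy, speech_time=10.5, last_point_time=10.0,
                  recognized_words=2))
```

The `stop_mode` field records which of the three stopping behaviors the dialog manager would apply; acting on it is left out of the sketch.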
Whatever its precise form, the speech system can be located at any point between the user and the speech application script server. It will be appreciated that whilst the FIG. 1 system is useful in illustrating typical elements of a speech system, it represents only one possible arrangement of the multitude of possible arrangements for such systems.
Because a speech system is fundamentally trying to do what humans do very well, most improvements in speech systems have come about as a result of insights into how humans handle speech input and output. Humans have become very adept at conveying information through the languages of speech and gesture. When listening to a conversation, humans are continuously building and refining mental models of the concepts being conveyed. These models are derived not only from what is heard, but also from how well the hearer thinks they have heard what was spoken. This distinction, between what and how well individuals have heard, is important. A measure of confidence in the ability to hear and distinguish between concepts is critical to understanding and the construction of meaningful dialogue.
In automatic speech recognition, there are clues to the effectiveness of the recognition process. The closer competing recognition hypotheses are to one another, the more likely there is confusion. Likewise, the further the test data is from the trained models, the more likely errors will arise. By extracting such observations during recognition, a separate classifier can be trained on correct hypotheses. Such a system is described in the paper “Recognition Confidence Scoring for Use in Speech Understanding Systems”, T J Hazen, T Buraniak, J Polifroni, and S Seneff, Proc. ISCA Tutorial and Research Workshop: ASR2000, Paris, France, September 2000. FIG. 2 of the accompanying drawings depicts the system described in the paper and shows how, during the recognition of a test utterance, a speech recognizer 5 is arranged to generate a feature vector 31 that is passed to a separate classifier 32 where a confidence score (or simply an accept/reject decision) is generated. This score is then passed on to the natural language understanding component 21 of the system.
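The general idea of deriving a confidence score from recognition-time observations can be sketched as follows. This is an illustrative Python sketch only, not the method of the cited paper: the two features (the score margin between the best and second-best hypotheses, and a distance from the trained models), the logistic scoring function, and the weights are all assumptions chosen to mirror the two clues mentioned above.

```python
import math


def confidence_features(nbest_scores, model_distance, max_margin=10.0):
    """Build a two-element feature vector for one recognized utterance.

    nbest_scores: log-scores of competing hypotheses, best first.
    model_distance: hypothetical measure of how far the utterance lies
        from the trained models (larger means further away).
    max_margin: margin assumed when there is no competing hypothesis.
    """
    if len(nbest_scores) > 1:
        margin = nbest_scores[0] - nbest_scores[1]
    else:
        margin = max_margin
    return [margin, model_distance]


def confidence_score(features, weights, bias):
    """Map a feature vector to a score in (0, 1) with a logistic function."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def accept(features, weights, bias, threshold=0.5):
    """Reduce the confidence score to an accept/reject decision."""
    return confidence_score(features, weights, bias) >= threshold


# Hypothetical weights: a wide margin raises confidence, while a large
# distance from the trained models lowers it.
WEIGHTS = [2.0, -0.5]
BIAS = 0.0

close_call = confidence_features([-10.0, -10.2], model_distance=3.0)
clear_win = confidence_features([-10.0, -14.0], model_distance=0.5)
print(accept(close_call, WEIGHTS, BIAS), accept(clear_win, WEIGHTS, BIAS))
```

In a real system the weights would be learned from labeled recognition results instead of being set by hand.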
So far as speech generation is concerned, the ultimate test of a speech output system is its overall quality (particularly intelligibility and naturalness) to a human. As a result, the traditional approach to assessing speech synthesis has been to perform listening tests, where groups of subjects score synthesized utterances against a series of criteria. The tests have two drawbacks: they are inherently subjective, and they are labor-intensive.
U.S. Pat. No. 5,966,691 describes a system that generates speech messages in response to the occurrence of certain events within the system. To provide a more natural effect the wording of the messages varies each time the messages are generated.
What is required is some way of making synthesized speech more adaptive to the overall quality of the speech output produced. In this respect, it may be noted that speech synthesis is usually carried out in two stages (see FIG. 3 of the accompanying drawings), namely:
        a natural language processing stage 35 where textual and linguistic analysis is performed to extract linguistic structure, from which sequences of phonemes and prosodic characteristics can be generated for each word in the text; and
        a speech generation stage 36 which generates the speech signal from the phoneme and prosodic sequences using either a formant or concatenative synthesis technique.
Concatenative synthesis works by joining together small units of digitized speech, and it is important that their boundaries match closely. As part of the speech generation process the degree of mismatch is measured by a cost function: the higher the cumulative cost function for a piece of dialog, the worse the overall naturalness and intelligibility of the speech generated. This cost function is therefore an inherent measure of the quality of the concatenative speech generation. It has been proposed in the paper “A Step in the Direction of Synthesizing Natural-Sounding Speech” (Nick Campbell; Information Processing Society of Japan, Special Interest Group 97-Spoken Language Processing-15-1) to use the cost function to identify poorly rendered passages and add closing laughter to excuse them.
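The use of a cumulative cost as a quality measure can be illustrated with a short Python sketch. It is an assumption-laden illustration and not the cost function of any particular synthesizer: each unit is represented by hypothetical feature vectors at its start and end boundaries, the mismatch at a join is taken to be the Euclidean distance between the facing boundaries, and a passage is flagged as poorly rendered when its cumulative cost exceeds a hand-picked threshold.

```python
import math


def join_cost(left_end, right_start):
    """Mismatch at one join: Euclidean distance between boundary features.

    The choice of Euclidean distance is an assumption for illustration.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(left_end, right_start)))


def cumulative_cost(units):
    """Sum the join costs over a sequence of units.

    units: list of (start_features, end_features) pairs in playback order.
    """
    return sum(join_cost(units[i][1], units[i + 1][0])
               for i in range(len(units) - 1))


def poorly_rendered(passages, threshold):
    """Return the indices of passages whose cumulative cost exceeds threshold."""
    return [i for i, units in enumerate(passages)
            if cumulative_cost(units) > threshold]


# Two toy passages: the first joins cleanly, the second has one bad join.
smooth = [((0.0, 0.0), (1.0, 1.0)), ((1.0, 1.0), (2.0, 2.0))]
rough = smooth + [((5.0, 2.0), (3.0, 3.0))]

print(cumulative_cost(smooth), cumulative_cost(rough))
print(poorly_rendered([smooth, rough], threshold=1.0))
```

A passage flagged in this way is one that a system could treat differently, for example by the closing-laughter approach of the cited paper.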
It is an object of the present invention to provide a way of improving the overall quality of synthesized speech.