In telephony applications, text-to-speech (TTS) systems may be used to produce speech output as part of an automatic dialog system. During a call session, an automatic dialog system typically first transcribes the caller's words through an automatic speech recognition (ASR) engine. A natural language understanding (NLU) unit in communication with the ASR engine then determines the meaning of the caller's words. From this meaning, a dialog manager identifies the requested information and retrieves it from a database. The retrieved information is passed to a natural language generation (NLG) block, which forms a sentence in response to the caller. The sentence is then output, or spoken, to the caller through a TTS speech synthesis system.
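The pipeline above can be sketched as a chain of stages, one per component. This is a minimal illustration only; every function name and the stub data are hypothetical, not part of any real dialog system API.

```python
# Hypothetical sketch of the ASR -> NLU -> dialog manager -> NLG -> TTS
# pipeline described above. All names and data here are illustrative stubs.

def transcribe(audio: str) -> str:
    # ASR: convert caller audio to text (stubbed as a pass-through).
    return audio

def understand(text: str) -> dict:
    # NLU: extract an intent and slots from the transcript (toy heuristic).
    return {"intent": "get_schedule", "city": text.split()[-1]}

def lookup(meaning: dict) -> dict:
    # Dialog manager: retrieve the requested information from a "database".
    flights = {"Boston": "Flight 42 departs at 9:00"}
    return {"answer": flights.get(meaning["city"], "no flights found")}

def generate(result: dict) -> str:
    # NLG: form a sentence in response to the caller.
    return f"I found the following: {result['answer']}."

def synthesize(sentence: str) -> bytes:
    # TTS: render the sentence as speech (stubbed as encoded text).
    return sentence.encode("utf-8")

def handle_turn(audio: str) -> bytes:
    # One caller turn flows through all five stages in sequence.
    return synthesize(generate(lookup(understand(transcribe(audio)))))
```

Because each stage consumes the previous stage's output, the stages must run sequentially for a given caller turn, which is the root of the latency issue discussed below.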
TTS systems are used in many real-world applications as part of automatic dialog systems. For example, a caller to an air travel system may communicate with such a system to receive air travel information, such as reservations, confirmations, and schedules, in the form of TTS-generated speech.
The information passed from the NLG block to the TTS speech synthesis system is time-critical. Unfortunately, the output incurs a compounded latency: the sum of the processing latencies of the ASR, NLU, and NLG stages. Delays between the end of the caller's statement and the spoken reply may lead to confusion or frustration on the part of the caller.
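Since the stages run in sequence, the delay the caller experiences is simply the sum of the per-stage latencies. The figures below are made-up illustrative values, not measurements of any real system.

```python
# Illustrative sketch of how sequential stage latencies compound into
# the delay the caller experiences. The millisecond figures are
# assumptions for illustration only.

STAGE_LATENCY_MS = {
    "ASR": 300,  # transcribing the caller's words
    "NLU": 150,  # extracting the meaning
    "NLG": 200,  # forming the reply sentence
}

def total_latency_ms(stages: dict) -> int:
    # The stages run one after another, so their latencies add up.
    return sum(stages.values())
```

Under these assumed figures, the caller would wait 650 ms before TTS synthesis even begins, which is why masking that gap matters.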
Typically, such delays, or latencies, are masked by playing “earcons”, such as, for example, music. Earcons inform the caller that the system is still processing. However, the caller may find them annoying or unnatural.
Therefore, it is desirable for an automatic dialog system to behave more like a human speaker, masking latency in a natural manner that does not confuse, frustrate, or annoy the caller.