1. Field of the Invention
This invention relates to a method and apparatus for using computer-generated voice. In particular, it relates to a method and apparatus for using computer-generated voice in an interactive voice response system.
2. Background of the Invention
A typical business interaction between a user and a business agent involves the agent talking to the user, asking questions, entering responses into a computer, and reading information to the user from a terminal screen. Such an interaction can be used to place a catalogue order; check an airline schedule; query a price; review an account balance; notify a customer; or record and retrieve a message. For logical processes, this interaction can be automated by replacing the agent with an interactive voice response (IVR) system able to play voice prompts and receive user input by speech recognition or from DTMF tones.
An interactive voice response system is typically implemented in a client-server configuration: the telephone interface and voice application run on the client machine, and voice data supply software, such as a text-to-speech engine or a voice prompt database, runs on a server, with a local area network connecting the two machines. When the voice application requires voice data, it requests the voice server to start streaming the voice data to the client. The client waits until a certain amount of voice data has accumulated in a buffer and then plays the voice data on an open telephone channel.
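By way of illustration only, the client-side buffering described above may be sketched as follows. The function name buffered_playback, the threshold parameter and the representation of voice data as byte chunks are assumptions made for this sketch and are not taken from any particular IVR product.

```python
from typing import Iterable, Iterator, List


def buffered_playback(chunks: Iterable[bytes], threshold: int) -> Iterator[bytes]:
    """Hold back streamed voice data until `threshold` bytes have arrived.

    Hypothetical helper: `chunks` stands in for the stream from the voice
    server, and each yielded chunk stands in for audio played on the
    telephone channel.
    """
    buffer: List[bytes] = []
    buffered = 0
    started = False
    for chunk in chunks:
        if started:
            # Playback is under way: pass each new chunk straight through.
            yield chunk
            continue
        buffer.append(chunk)
        buffered += len(chunk)
        if buffered >= threshold:
            # Enough data has accumulated: start playback with the backlog.
            started = True
            yield from buffer
            buffer.clear()
    if not started:
        # The stream ended before the threshold was reached: play what arrived.
        yield from buffer
```

In this sketch a larger threshold delays the start of playback but makes it less likely that the channel runs out of audio if the network stalls; the appropriate value would depend on the deployment.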
Voice applications often have both static and dynamic data. Whilst it would be recommended to develop such applications using either pure text-to-speech (TTS) technology or purely pre-recorded audio segments, in reality many applications use a mixture of pre-recorded audio segments and TTS sentences. This mixed approach typically increases the naturalness of the application because of the pre-recorded segments. Unfortunately, the vocal features of the pre-recorded segments often differ significantly from those of the TTS engine settings. These differences may be in the tone, the speed and the pitch of the voice. The overall result is an application in which the user can tell the pre-recorded segments from the synthesized ones, leading to user fatigue and potentially to miscomprehension.