1. Field of the Invention
The present invention relates to spoken dialog systems and more specifically to a system and method of building application-dependent text-to-speech custom voices.
2. Introduction
State-of-the-art spoken dialog systems include several components that enable the system to understand speech spoken by a user, generate a meaningful response, and then audibly speak the response. The basic components of such a system 100 are shown in FIG. 1. They typically include an automatic speech recognition (ASR) module 112 that receives speech from a user 110; a spoken language understanding (SLU) module 114 that receives text from the ASR module 112 and identifies a meaning or intent in the speech; a dialog management (DM) module 116 that receives the user intent and determines the substance of a response to the user; a language generation (LG) module 118 that generates the text of the response to the user; and a text-to-speech (TTS) module 120 that receives the text from the LG module 118 and generates the spoken response that the user 110 hears. The present invention relates to the TTS module and to the process of creating voices used by the TTS module to speak to the user.
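By way of illustration only, the flow of data through the modules described above may be sketched as follows. The class name, function names, and stub behaviors are hypothetical and are not taken from FIG. 1; each stub merely stands in for a real module so that the order of the hand-offs is visible.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogPipeline:
    """Hypothetical container for the five modules of a spoken dialog system."""
    asr: Callable[[bytes], str]   # speech -> recognized text
    slu: Callable[[str], str]     # text -> user intent
    dm: Callable[[str], str]      # intent -> substance of the response
    lg: Callable[[str], str]      # response substance -> response text
    tts: Callable[[str], bytes]   # response text -> spoken response

    def respond(self, user_audio: bytes) -> bytes:
        # Each module consumes the output of the module before it.
        text = self.asr(user_audio)
        intent = self.slu(text)
        action = self.dm(intent)
        reply_text = self.lg(action)
        return self.tts(reply_text)


# Stub modules (assumptions for illustration; real modules are far more complex).
def stub_asr(audio: bytes) -> str:
    # Pretend the audio bytes are already a transcript.
    return audio.decode("utf-8")


def stub_slu(text: str) -> str:
    return "book_flight" if "fly" in text.lower() else "unknown"


def stub_dm(intent: str) -> str:
    return {"book_flight": "ask_destination"}.get(intent, "ask_repeat")


def stub_lg(action: str) -> str:
    templates = {
        "ask_destination": "What is your destination city?",
        "ask_repeat": "Could you please repeat that?",
    }
    return templates[action]


def stub_tts(text: str) -> bytes:
    # Placeholder: a real TTS module would return synthesized audio.
    return text.encode("utf-8")


pipeline = DialogPipeline(stub_asr, stub_slu, stub_dm, stub_lg, stub_tts)
reply = pipeline.respond(b"I would like to fly somewhere")
```

In this sketch only the last stage, `stub_tts`, corresponds to the module to which the present invention relates; the other stages are included solely to show where that module sits in the chain.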
Generating a TTS voice is usually a costly and time-consuming process. For example, if a person desires to have their voice used as a TTS voice in a spoken dialog system, several steps are typically necessary to build the custom voice. First, a developer selects text material for reading by the person. The text may relate to a specific domain of the spoken dialog system. An example of such a system may be a travel reservation system. The person would then be given text that relates to the context or domain of travel reservations, e.g., “what is your destination city?” The process of creating the custom voice then involves recording a speech corpus of the person to obtain data from which to generate the custom voice. This typically involves recording 10-20 hours of the person speaking or reading the selected text, and processing the speech to obtain an inventory of speech units that can be concatenated together to create a TTS voice. This process is both computationally intensive and time consuming. For example, building such a custom voice may take a month or more. In addition, significant human expertise, professional interaction, and effort are required to create the custom voice.
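The notion of an inventory of speech units that are concatenated to produce speech may be illustrated by the following minimal sketch. It is a simplification under stated assumptions: the unit labels, the sample values, and the function name `concatenate_units` are hypothetical, and practical systems also select among many candidate units and smooth the joins between them, which is omitted here.

```python
from typing import Dict, List

# Hypothetical inventory: unit label -> recorded audio samples for that unit.
UnitInventory = Dict[str, List[float]]


def concatenate_units(inventory: UnitInventory, unit_sequence: List[str]) -> List[float]:
    """Join the stored samples for each requested unit, in order."""
    waveform: List[float] = []
    for label in unit_sequence:
        if label not in inventory:
            # A unit absent from the recorded corpus cannot be synthesized.
            raise KeyError(f"unit {label!r} is not in the inventory")
        waveform.extend(inventory[label])
    return waveform


# Toy data (made-up values, not real speech samples).
toy_inventory: UnitInventory = {
    "a": [0.1, 0.2],
    "b": [0.3],
}
toy_waveform = concatenate_units(toy_inventory, ["a", "b", "a"])
```

The `KeyError` branch reflects why a large recorded corpus is typically collected: the inventory can only produce speech from units that were actually recorded.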
The cost of such a process is prohibitive as well. The high cost to a potential buyer of collecting the speech, labeling the speech, and building the custom voice using the above-described approach prevents many companies from deploying a spoken dialog service. To avoid this cost, some companies use recorded prompts in a spoken dialog system. This approach, however, dramatically limits the flexibility and adaptability of the spoken dialog service to new questions and new interactions with users. Recording enough prompts to handle every scenario also becomes time consuming and cost prohibitive.
What is needed in the art is a more efficient and less expensive approach to generating a custom, in-domain TTS voice.