1. Field of the Invention
The present invention relates to voice interfaces for computer systems. More specifically, the present invention relates to a method and an apparatus that facilitate providing concatenative audio output from a computer system.
2. Related Art
Globalization of software applications is emerging as a business necessity in today's increasingly interconnected marketplace. This interconnectedness, coupled with a soft economy, provides a valuable opportunity to companies that can efficiently and effectively provide their software to the largest audience. Far too often, globalization is an afterthought in the application development cycle, consisting of ad hoc processes and frameworks grafted onto the final stages of the implementation process. Companies that undertake a globalization effort in this ad hoc fashion are likely to rewrite their applications for each language, or worse, fail to ship software in multiple languages altogether.
Nowhere is this truer than in the speech technology world. The unique challenges posed by voice application development are daunting even for single-language development. Adding multiple languages to the mix and trying to maintain the ideal of a single code base and simultaneous shipment for all languages only makes the task harder. A variety of methods and processes exist to facilitate globalization of screen-based applications, but unfortunately, these methods and processes on their own fall short of addressing the needs of voice application developers.
One challenge in globalizing voice applications is generating language- and locale-specific voice output from an application. The process of generating speech output from a computer typically involves using a text-to-speech (TTS) system to convert text to speech on a word-by-word basis.
Unfortunately, TTS systems have many drawbacks. The audio output from TTS systems is not realistic because words are disconnected and, quite often, are pronounced with improper inflection. This is because a TTS system does not use contextual information from the phrases that make up human speech to properly modulate pronunciation and inflection. An additional problem involves numbers, dates, times, and the like. A typical TTS system reads these data items as individual numbers rather than as a coherent unit. For example, a typical TTS system may read “125” as “one, two, five” rather than “one hundred and twenty-five.”
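The difference between the two readings of “125” can be illustrated with a minimal Python sketch. The function names read_digits and read_cardinal are hypothetical and are not drawn from any particular TTS system; the sketch assumes English wording and covers only the integers 0 through 999.

```python
# Illustrative sketch only: names and the 0-999 range are assumptions.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def read_digits(text: str) -> str:
    """Read a digit string one digit at a time, with no notion of the whole."""
    return ", ".join(ONES[int(ch)] for ch in text)


def read_cardinal(n: int) -> str:
    """Read an integer in the range 0-999 as a single coherent English number."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only handles 0 through 999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" and " + read_cardinal(rest) if rest else "")


print(read_digits("125"))    # one, two, five
print(read_cardinal(125))    # one hundred and twenty-five
```

Even this small example hard-codes English vocabulary and the English placement of “and,” which hints at why such logic tends to be tied to a single language and locale.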
A TTS system also tends to be language- and locale-specific. Hence, extensive rework is typically required to adapt the TTS system to a different locale. While many techniques exist to change visual interfaces from locale to locale, these techniques are not available to developers of speech interfaces.
Hence, what is needed is a method and an apparatus for supplying realistic concatenative audio from a computer system without the drawbacks described above.