In recent years, the field of text-to-speech (TTS) conversion has been largely researched, with text-to-speech technology appearing in a number of commercial applications. Recent progress in unit-selection speech synthesis and Hidden Markov Model (HMM) speech synthesis has led to considerably more natural-sounding synthetic speech, which thus makes such speech suitable for many types of applications.
However, relatively few of these applications provide text-to-speech features. One of the barriers to popularizing text-to-speech in such applications is the technical difficulties in installing, maintaining and customizing a text-to-speech engine. For example, when a user wants to integrate text-to-speech into an application program, the user has to search among text-to-speech engine providers, pick one from the available choices, buy a copy of the software, and install it on possibly many machines. Not only does the user or his or her team have to understand the software, but the installing, maintaining and customizing of a text-to-speech engine can be a tedious and technically difficult process.
For example, in current text-to-speech applications, text-to-speech engines need to be installed locally, and require tedious and technically difficult customization. As a result, users are often frustrated when configuring different text-to-speech engines, especially when what many users typically want to do is only occasionally convert a small piece of text into speech.
Further, once a user has made a choice of a text-to-speech engine, the user has limited flexibility in choosing voices. It is not easy to obtain an additional voice unless without paying for additional development costs.
Still further, each multiple high quality text-to-speech voice requires a relatively large amount of storage, whereby the huge amount of storage needed to install multiple high quality text-to-speech voices is another barrier to wider adoption of text-to-speech technology. It is basically not possible for an individual user or small entity to have multiple text-to-speech engines with dozens or hundreds voices for use in applications.