1. Field of the Invention
The present invention relates to text-to-speech systems and methods. Although phoneme creation and implementation has been used to create speech from text input as is known in the art, in the instant system and method a client/end-user is given the opportunity to build and upload data and recordings onto a web-based system that allows them to build and manage their voice for use in widespread applications.
2. Description of the Related Art
A speech synthesizer may be described as having three primary components: an engine, a language component, and a voice database. The engine runs the synthesis pipeline, using the language component to convert text into an internal specification that may be rendered using the voice database. The language component contains information about how to turn text into parts of speech and the base units of speech (phonemes), what script encodings are acceptable, how to process symbols, and how to structure the delivery of speech. The engine uses the phonemic output from the language component to determine which audio units from the voice database, representing the range of phonemes, work best for the given text. The units are then retrieved from the voice database and combined to create the audio of speech.
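By way of illustration only, the three-component arrangement described above may be sketched as follows. This is a minimal sketch and not any vendor's actual engine: the lexicon, the voice database contents, the `Unit` type, and the greedy join-cost selection are all assumptions introduced for the example, and integer lists stand in for recorded audio.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Unit:
    """One recorded audio unit for a phoneme (samples are placeholders)."""
    phoneme: str
    samples: List[int]


# Hypothetical language component: a tiny word-to-phoneme lexicon.
LEXICON: Dict[str, List[str]] = {
    "hi": ["HH", "AY"],
    "there": ["DH", "EH", "R"],
}

# Hypothetical voice database: each phoneme maps to candidate recorded units.
VOICE_DB: Dict[str, List[Unit]] = {
    "HH": [Unit("HH", [1, 2])],
    "AY": [Unit("AY", [10, 11]), Unit("AY", [3, 4, 5])],
    "DH": [Unit("DH", [6])],
    "EH": [Unit("EH", [7, 8])],
    "R": [Unit("R", [9])],
}


def text_to_phonemes(text: str) -> List[str]:
    """Language component: convert text into a flat phoneme sequence."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def select_units(phonemes: List[str], voice_db: Dict[str, List[Unit]]) -> List[Unit]:
    """Engine: greedily pick, for each phoneme, the candidate whose first
    sample is closest to the last sample of the previously chosen unit.
    This is a simplified stand-in for a real selection cost."""
    chosen: List[Unit] = []
    for phoneme in phonemes:
        candidates = voice_db[phoneme]
        if chosen:
            prev_end = chosen[-1].samples[-1]
            best = min(candidates, key=lambda u: abs(u.samples[0] - prev_end))
        else:
            best = candidates[0]
        chosen.append(best)
    return chosen


def synthesize(text: str) -> List[int]:
    """Run the pipeline: text -> phonemes -> selected units -> joined audio."""
    units = select_units(text_to_phonemes(text), VOICE_DB)
    audio: List[int] = []
    for unit in units:
        audio.extend(unit.samples)
    return audio


if __name__ == "__main__":
    print(synthesize("hi there"))
```

In this sketch the language component and the voice database are plain data, while the engine is the only part that contains logic, which mirrors the division of roles described above.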
Most deployments of text-to-speech occur on a single computer or in a cluster. In these deployments the text and the text-to-speech system reside on the same system. On major telephony systems the text-to-speech system may reside on a separate system from the text, but both are within the same local area network (LAN) and are in fact tightly coupled. The difference between consumer and telephony systems is that, for the consumer, the resulting audio is listened to on the system that did the synthesis, whereas on a telephony system the audio is distributed over an outside network (either a wide area network or the telephone system) to the listener.
For end-users of text-to-speech software, the software has typically resided on one of their computers. The two most commonly used consumer computer systems each provide a vendor-independent API for text-to-speech. On Windows it is called SAPI and on a Macintosh it is called Apple Speech Manager. These API layers allow the software and voice databases of all text-to-speech vendors to be used interchangeably on the user's computer. These interfaces provide a common abstraction for all vendors' locally installed software.
Client/server architectures in which the text, synthesis and audio are not tightly connected exist but are rare. For example, U.S. Pat. No. 6,625,576 describes a method and apparatus for performing text-to-speech conversion wherein a client/server environment partitions an otherwise conventional text-to-speech conversion algorithm. The text analysis portion of the algorithm is executed exclusively on a server while the speech synthesis portion is executed exclusively on a client which may be associated therewith.
U.S. Pat. No. 6,604,077 shows a system and method of operating an automatic speech recognition and text-to-speech service using a client-server architecture. Text-to-speech services are accessible at a client location remote from the main automatic speech recognition engine. U.S. Pat. No. 7,313,528 teaches a text-to-speech streaming data output to an end user using a distributed network system. The TTS server parses raw website data and converts the data to audible speech.
These client/server systems all focus on synthesis and thus the relationship (proximity) of text, engine and audio output.
The engine and language front-end are constructed from software. The voice database is built from recorded speech. To build a voice database, a voice talent reads predetermined text, and these readings are recorded. After the recording session(s), the recordings are put through a process of decomposition in which each phoneme is identified and labeled, along with some additional information. These units are then put into a database for retrieval during synthesis.
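As an illustrative sketch only, the decomposition and storage steps described above may be pictured as follows. The function name `build_voice_database`, the label format (phoneme, start index, end index), and the dictionary layout are assumptions made for this example; the phoneme boundaries are taken as already identified by a separate labeling step, which in practice is the difficult part.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A label marks one phoneme within a recording: (phoneme, start, end),
# where start and end are sample indices (end is exclusive).
Label = Tuple[str, int, int]
Recording = Tuple[List[int], List[Label]]


def build_voice_database(recordings: List[Recording]) -> Dict[str, List[dict]]:
    """Cut each labeled recording into phoneme units and index them by phoneme."""
    database: Dict[str, List[dict]] = defaultdict(list)
    for recording_id, (samples, labels) in enumerate(recordings):
        for phoneme, start, end in labels:
            database[phoneme].append({
                "recording": recording_id,      # which session reading it came from
                "start": start,                 # position within that recording
                "end": end,
                "samples": samples[start:end],  # the audio for this unit
            })
    return dict(database)


if __name__ == "__main__":
    # Two toy "recordings"; integers stand in for audio samples.
    recordings = [
        ([0, 1, 2, 3, 4, 5], [("HH", 0, 2), ("AY", 2, 6)]),
        ([9, 8, 7], [("AY", 0, 3)]),
    ]
    db = build_voice_database(recordings)
    for phoneme, units in db.items():
        print(phoneme, [unit["samples"] for unit in units])
```

Keeping the source recording and position with each unit is one way to retain the "additional information" mentioned above; a real voice database would typically store further attributes per unit.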
While the previous paragraph makes this process appear simple, it is in fact very complex and difficult. Due to this complexity, the process is typically very expensive. As a direct result, text-to-speech (TTS) vendors (companies that produce voice databases) produce only one or two voices in each language they support. The voices are chosen for their mass appeal and to minimize the risk of poor market acceptance. As an example, not including the Company submitting this patent, there are approximately 10 high quality U.S. English commercially available voice databases from the six (or so) TTS vendors. Each of these voices is very similar in its characteristics, and the voices are almost indistinguishable from vendor to vendor.
A complete, open source set of tools and documentation for producing new voices and languages is available to the public at the website for “festvox”. These tools allow users to build their own voices. There have also been other attempts to allow end-users to build voices. Due to the complexity involved, the results are rarely good enough to be considered commercially viable. Learning how to run these systems also requires a large investment of time.
Most users who would like to build their own voice do not want to use it in one of the traditional TTS markets. The traditional markets have been telephone systems and education. These domains have been satisfied with the limited selection and similarity of each vendor's offerings. Note that accessibility is also one of the traditional markets, and it is one market where users would prefer to have their own voice or one they closely identify with.
There is a burgeoning demand for variety. As an example, the entertainment industry is not interested in the bland, robotic voice of telephony systems. There are thousands of “interesting” voices that might serve different markets, and such distinction can never be created by one entity or program. The entertainment industry can be thought to include (but is not limited to) avatar-based messaging services and online games. There is also a growing demand for personalizing information as it is presented. A greater variety of available voices allows for more choice.
Phoneme sequence assemblage (as occurs during speech recognition and during the process of voice database building) done in different environments can lead to many different applications. Because open source tools do not provide communication or storage platforms, and because certain online environments have many other limitations, including end quality, stability, and graphical interfaces, it is beyond any single entity's internal ability to capture literally all voice characteristics at such a scale. The most practical way to build one's audible voice into a voice database, and to be able to apply that voice to literally any online environment, is to give as many voice-building tools to the end user as possible and to coordinate and instruct the building process remotely.
There is a need, then, for a network-based voice-building process that provides an abundance of tools and enhances the client's role. With such end-user interaction, the built voices can be highly customized to a level of the end-user's choosing and can be of extremely realistic quality, extending the applicability of voices to targeted areas.