1. Field of the Invention
The present invention relates generally to voice-to-text transcription and, more particularly, to voice-to-text transcription for pervasive devices, instant messengers, and web browsers in a distributed environment.
2. Background of the Invention
With the growing popularity of pervasive devices (e.g., palm-tops, personal digital assistants (PDAs), cellular telephones, smart-phones, etc.) and the increasing bandwidth for wired and wireless communications, it is becoming increasingly feasible to enable intelligent applications that provide more sophisticated services. Usually, these pervasive devices have the following features: they are physically small, have limited memory and computational power, and communicate wirelessly with other devices or systems.
Instant message clients, which include the AOL, MSN, and Yahoo instant message services and the like, are prevalent in the marketplace and provide real-time text communication among end-users. One efficient method of input is voice transcription. Rather than burdening the instant message client with transcription support, the transcription task can be dispatched to a server to reduce resource requirements and consumption at the client side.
Web-browser client devices, which include kiosks, personal computers, notebook computers, Internet appliances, and the like, are prevalent in the marketplace. Many web-browser client devices depend on remote resources for computation and storage functions and do not themselves have the capacity to store sophisticated software and run its applications.
One such sophisticated application is voice-to-text transcription, where a user can simply speak to the pervasive device, the instant message client (through a lightweight voice plug-in), or the web-browser client device, and the recorded audio stream is processed and transcribed to a text format. The versatile, memory-efficient text format can then be saved, transmitted to other devices, printed, or used for any of several other similar functions. However, accurately converting an audio voice stream to text is a complicated process. This process is further complicated by the varying dialects, inflections, accents, and other speech characteristics of users.
To obtain more accurate transcription results, the solution needs to be personalized for the end-user. Several prior-art techniques utilize stored, trained voice profiles. A trained voice profile is a conversion table that matches a user's vocal characteristics to known letter sounds. The profile is usually created by having the user utter a series of pre-selected words. The user's voice is then cross-referenced to the letter sounds. A transcription engine then employs the trained voice profile to produce a more accurate conversion from voice to text.
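By way of illustration only, the conversion-table notion of a trained voice profile can be sketched as follows. This is a minimal sketch under stated assumptions and not the method of any particular prior-art system: it assumes each utterance has already been reduced to a short numeric feature vector, that each training utterance is labeled with its known letter sound, and that matching is done by nearest average. The names train_profile and transcribe, the feature values, and the letter-sound labels are all hypothetical.

```python
from collections import defaultdict
from math import dist


def train_profile(samples):
    """Build a toy voice profile (hypothetical helper).

    samples: iterable of (letter_sound, feature_vector) pairs obtained
    while the user utters pre-selected words. The feature vectors are an
    assumption; real engines use far richer acoustic features.
    Returns a table mapping each letter sound to the user's average vector.
    """
    grouped = defaultdict(list)
    for sound, features in samples:
        grouped[sound].append(features)
    return {
        sound: tuple(sum(column) / len(column) for column in zip(*vectors))
        for sound, vectors in grouped.items()
    }


def transcribe(profile, frames):
    """Map each incoming feature vector to the closest letter sound in the profile."""
    return [
        min(profile, key=lambda sound: dist(profile[sound], frame))
        for frame in frames
    ]


# Illustrative training data: made-up feature vectors for three letter sounds.
training = [
    ("k", (1.0, 0.0)),
    ("k", (1.2, 0.2)),
    ("ae", (0.0, 1.0)),
    ("t", (5.0, 5.0)),
]
profile = train_profile(training)
print(transcribe(profile, [(1.05, 0.1), (0.1, 0.9), (4.8, 5.1)]))  # ['k', 'ae', 't']
```

Even in this toy form, the table grows with the number of letter sounds and the length of each feature vector, and every incoming frame is compared against every entry.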
As the resolution of the profile increases, so too do its size and the system resources it requires. Similarly, the more sophisticated the transcription engine, the more system resources are required to execute the transcription tasks. Consequently, it is impractical for a pervasive device, instant messenger, or web browser to store the trained voice profile and execute the transcription itself.
Several prior-art methods transmit audio-voice data from the pervasive devices or web browsers to a central server containing a transcription engine that performs the arduous computations needed for an accurate transcription service. However, as the number of users grows, so too does the demand on the central transcription server, which has finite resources available for the transcription tasks. Additionally, as the geographic distribution of the users expands, the use of a single centralized transcription server becomes impractical.
Accordingly, a need exists for a solution that enables sophisticated voice applications on low-end pervasive, instant message, and web-browser devices and that scales with both the number of end-users and their geographic locations.