1. Field of the Invention.
The present invention relates, in general, to voice recognition, and, more particularly, to software, systems and methods for performing voice and speech recognition over a distributed network.
2. Relevant Background
Voice and speech recognition systems are increasingly common interfaces for obtaining user input into computer systems. Speech recognition is used to provide enhanced services such as interactive voice response (IVR), automated phone attendants, voice mail, fax mail, and other applications. More sophisticated speech recognition systems are used for speech-to-text conversion systems used for dictation and transcription.
Voice and speech recognition systems are characterized by, among other things, their recognition accuracy, speed and vocabulary size. High speed, accurate, large vocabulary systems tend to be complex and so require significant computing resources to implement. Moreover, such systems have increased training demands to develop accurate models of users' speech patterns. In applications where computing resources are limited or the ability to train to a particular user's speech patterns is limited, speech recognition products tend to be slow and/or inaccurate. Currently, speech recognition enabled software applications must often compromise between complex but accurate solutions and simple but less accurate solutions. In many applications, however, the impracticality of meaningful training dictates that the application can only implement less accurate techniques.
Voice recognition is of two basic types, speaker-dependent and speaker-independent. A speaker-dependent system operates in environments where the system has relatively frequent contact with each speaker, where sizable vocabularies are involved, and where the cost of recognition errors is high. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker-adaptive or speaker-independent systems. In a speaker-dependent system, a user trains the system by, for example, providing speech samples and creating a correlation between the samples and text of what was provided, usually with some manual effort on the part of the speaker. Such systems often use a generic engine coupled with substantial data files, called voice models, that characterize a particular speaker for whom the system has been trained. The training process can involve significant effort to obtain high recognition rates. Moreover, the voice model files are tightly coupled to the recognition software so that it is difficult to port the training investment to other hardware/software platforms.
A speaker-independent system operates for any speaker of a particular type (e.g., American English). These systems are the most difficult and most expensive to develop, and their accuracy is lower than that of speaker-dependent systems. However, they are highly useful in a wide variety of applications where many users must use the system, such as answering services, interactive voice response (IVR) systems, call processing centers, data entry and the like. Such applications sacrifice the accuracy of speaker-dependent systems for the flexibility of enabling a heterogeneous group of speakers to use the system. Such applications are characterized in that high recognition rates are desirable, but the cost of recognition failure is relatively low.
A middle ground is sometimes defined as a speaker-adaptive system. A speaker-adaptive system dynamically adapts its operation to the characteristics of new speakers. These systems are more akin to speaker-dependent models, but allow the system to be trained over time. Adaptive systems can improve their vocabulary over time and result in complex but accurate speech models. Such systems still require significant training effort, however. As in speaker-dependent systems, the complex speech models cannot be readily ported to other systems.
Training methods tend to be very product specific. Moreover, the data structures in which the relationships between a user's speech and text are correlated tend to be product specific. Hence, the significant training effort applied to a first speech recognition program may not be reusable for any other program or system. In some cases, speakers must re-train systems between version updates of the same program. Temporary or permanent changes to a user's voice patterns affect performance and may require retraining. This significant training burden and lack of portability between products has worked against wide scale adoption of speech recognition systems.
Moreover, even where a user has trained one or more speaker-dependent systems, this training effort cannot be leveraged to improve the performance of the many speaker-independent systems that are encountered. The speaker-independent systems cannot, by design, access or use speaker-dependent speech models to improve their performance. Hence, a need exists for improved speech recognition systems, software and methods that enable portable speech models that can be used for a wide variety of tasks and leverage the training efforts across a wide variety of systems.
The dichotomy between speaker-dependent and speaker-independent technologies has resulted in an interesting dilemma in industry. Many of the applications that could benefit most from accurate speech recognition (e.g., interactive voice response systems) cannot afford the complexity of highly accurate speaker dependent systems, nor obtain the necessary voice models that would improve their accuracy. From a practical perspective, speakers will only invest the significant time required to develop a high quality voice model in applications where the result is worth the effort. The benefits realized by a business cannot compel individual speakers to submit to the necessary training regimens. Hence, these applications settle for speaker-independent solutions and invest heavily in improving the performance of such systems.
Increasingly, computer-implemented applications and services are targeting "thin clients" or computers with limited processing power and data storage capacity. Such devices are cost effective means of implementing user interfaces. Thin clients are becoming prominent in appliances such as televisions, telephones, Internet terminals and the like. However, the limited computing resources make it difficult to implement complex functionality such as voice and speech recognition. A need exists for voice processing systems, methods and software that can provide high quality voice processing services with reduced hardware requirements.
In the past, computers were used by one user, or perhaps a few users, to access a limited set of applications. As computers are used more frequently to provide interfaces to everyday appliances, the need to adapt user interfaces to multiple users becomes more pressing. Voice processing, in particular, represents a user input mode that is difficult to adapt to multiple users. In current systems, a voice model must be developed on and stored in each machine for each user. Not only does this tax the machine's resources, but it creates a burdensome need for each user to train each computer that they use.
Conversely, each user tends to access computer resources via a variety of computer-implemented interfaces and computing hardware. It is contemplated that any given user may wish to access voice-enabled television, voice-enabled software on a personal computer, voice-enabled automobile controls, and the like. The effort to train and maintain each of these systems individually becomes significant with only a few applications, and prohibitive with the large number of applications that could potentially become voice enabled.
Hence, a need exists for speech recognition systems, methods and software that provide increased accuracy with reduced cost. Moreover, there is a need for systems that require reduced effort on the part of the speaker. Further, a need exists for systems and software that enable users to leverage training effort across multiple, disparate speech-recognition enabled applications.
Briefly stated, the present invention involves a speech recognition system in which one or more speaker-dependent voice signatures are developed for each of a plurality of speakers. A plurality of configurable speech processing engines are deployed and integrated with computer applications. A session is initiated between the configurable engine and a particular speaker. The configurable engine identifies the user using voice recognition or other explicit or implicit user-identification methods. The configurable engine accesses a copy of the speaker dependent voice signature associated with the identified speaker to perform speaker-dependent speech recognition.
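By way of illustration only, the session flow described above (identify the speaker, obtain that speaker's voice signature, then perform speaker-dependent recognition) might be sketched as follows. The class names, the method names, and the representation of a signature as a mapping from acoustic tokens to words are hypothetical assumptions made for this sketch and are not taken from the specification; actual acoustic modeling and speaker identification are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VoiceSignature:
    """Hypothetical speaker-dependent signature: acoustic token -> word."""
    speaker_id: str
    token_to_word: Dict[str, str] = field(default_factory=dict)


class ConfigurableEngine:
    """Hypothetical engine that is configured per session with one signature."""

    def __init__(self, signatures: Dict[str, VoiceSignature]) -> None:
        # Stand-in for access to stored signatures, keyed by speaker id.
        self._signatures = signatures
        self._active: Optional[VoiceSignature] = None

    def start_session(self, speaker_id: str) -> bool:
        """Load the identified speaker's signature; False if none is stored.

        The speaker id is assumed to come from some explicit or implicit
        identification step that is outside the scope of this sketch.
        """
        self._active = self._signatures.get(speaker_id)
        return self._active is not None

    def recognize(self, tokens: List[str]) -> str:
        """Map acoustic tokens to text using the active speaker's signature."""
        if self._active is None:
            raise RuntimeError("no session with an identified speaker")
        mapping = self._active.token_to_word
        return " ".join(mapping.get(token, "<unk>") for token in tokens)
```

In this sketch, an engine embedded in any application would call `start_session` once the speaker is identified and then call `recognize` for each utterance; tokens absent from the signature are returned as `<unk>`.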
In another aspect, the present invention involves voice signatures that are configured to integrate with and be used by a plurality of disparate voice-enabled applications. The voice signature comprises a static data structure or a dynamically adapting data structure that represents a correlation between a speaker's voice patterns and language constructs. The voice signature is preferably portable across multiple computer hardware and software platforms. Preferably, a plurality of voice signatures are stored in a network accessible repository for access by voice-enabled applications as needed.
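As a non-limiting sketch, a portable voice signature and a repository of such signatures might be modeled as shown below. JSON is assumed here merely as one platform-neutral encoding, and the in-memory dictionary stands in for a network accessible repository; the field names, the `adaptive` flag, and both class names are hypothetical and are not prescribed by the specification.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class VoiceSignature:
    """Hypothetical correlation between voice patterns and language constructs."""
    speaker_id: str
    adaptive: bool = False  # False: static structure; True: may adapt over time
    patterns: Dict[str, str] = field(default_factory=dict)

    def adapt(self, pattern: str, construct: str) -> None:
        """Record a new correlation; permitted only for adaptive signatures."""
        if not self.adaptive:
            raise ValueError("static voice signature cannot be adapted")
        self.patterns[pattern] = construct

    def to_json(self) -> str:
        """Encode in a platform-neutral text form (assumed format: JSON)."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "VoiceSignature":
        """Rebuild a signature from its JSON encoding."""
        return cls(**json.loads(text))


class SignatureRepository:
    """In-memory stand-in for a network accessible signature repository."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def put(self, signature: VoiceSignature) -> None:
        """Store the encoded signature under its speaker id."""
        self._store[signature.speaker_id] = signature.to_json()

    def get(self, speaker_id: str) -> VoiceSignature:
        """Return a decoded copy; raises KeyError for an unknown speaker."""
        return VoiceSignature.from_json(self._store[speaker_id])
```

Because the repository holds only the encoded text, each voice-enabled application in this sketch receives its own decoded copy of the signature and does not depend on the platform on which the signature was trained.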