Speech recognition systems, which can generally be defined as a set of computer-implemented algorithms for converting a speech or voice signal into words, are used in a wide variety of applications and contexts. For instance, speech recognition technology is utilized in dictation applications for converting spoken words into structured, or unstructured, documents. Phones and phone systems, global positioning systems (GPS), and other special-purpose computers often utilize speech recognition technology as a means for inputting commands (e.g., command and control applications).
Speech recognition systems can be characterized by their attributes (e.g., speaking mode, speaking style, enrollment or vocabulary), which are often determined by the particular context of the speech recognition application. For example, speech recognition systems utilized in dictation applications often require that the user of the application enroll, or provide a speech sample, to train the system. Such systems are generally referred to as speaker-dependent systems. Speaker-dependent speech recognition systems support large vocabularies, but only one user. On the other hand, speaker-independent speech recognition systems support many users without any enrollment, or training, but accurately recognize only a limited vocabulary (e.g., a small list of commands, or ten digits). Speech recognition systems utilized in dictation applications may be designed for spontaneous and continuous speech, whereas speech recognition systems utilized for recognizing voice commands (e.g., voice dialing on mobile phones) may be designed to recognize isolated words.
Because speech recognition systems generally tend to be customized for use in a particular context or application, any user-specific, location-specific, or customized data (e.g., speaker-dependent enrollment/training data, or user-specific settings) that are generated for use with one speech recognition system or application are not easily shared with another speech recognition system or application. For example, many speech recognition systems in use with dictation applications can be improved over time with active or passive training. As errors are identified and corrected, the speech recognition system can “learn” from the identified errors, and prevent such errors from recurring. Other speech recognition systems benefit from user-specific settings, such as settings that indicate the gender of the speaker, nationality of the speaker, age of the speaker, etc. Unfortunately, such data (enrollment/training data and user settings) are not easily shared amongst speech recognition systems and/or applications.
In addition, speaker-dependent speech recognition systems do not work well in multi-user or conversational settings, where more than one person may contribute to a speech recording or voice signal. For instance, a typical speaker-dependent speech recognition system is invoked with a particular voice or user profile. Accordingly, that particular user profile is used to analyze the entire voice recording. If a recording includes recorded speech from multiple users, a conventional speaker-dependent speech recognition system uses only one user profile to analyze the recorded speech of all of the speakers.
Furthermore, even if a particular speech recognition system is configured to use multiple voice profiles, there does not exist a system or method for easily locating and retrieving the necessary voice profiles. For example, if a particular audio recording includes speech from persons A, B, C, D, E and F (where person A is the user of the speech recognition system) then it is necessary for the speech recognition system to locate and retrieve the voice profiles of persons A, B, C, D, E and F. It may be impractical to expect person A, the user of the speech recognition system, to ask each of persons B, C, D, E and F to provide a voice profile, particularly if person A is not acquainted with one or more of those persons. For example, if person A is directly acquainted with persons B and C, but only indirectly acquainted with persons D, E and F, then it may be awkward, or inconvenient, to ask persons D, E and F for their voice profiles. Moreover, as the number of people in the particular group increases, the time and energy required to gain access to the voice profiles become prohibitive. Consequently, there exists a need for an improved architecture for sharing voice profiles.