The modern communications era has brought about a tremendous expansion of wireline and wireless networks. Computer networks, television networks, and telephony networks are experiencing an unprecedented technological expansion, fueled by consumer demand. Wireless and mobile networking technologies have addressed related consumer demands, while providing more flexibility and immediacy of information transfer.
Current and future networking technologies continue to facilitate ease of information transfer and convenience to users. One area in which there is a demand to increase ease of information transfer relates to the delivery of services to a user of a mobile terminal. The services may be in the form of a particular media or communication application desired by the user, such as a music player, a game player, an electronic book, short messages, or email. The services may also be in the form of interactive applications in which the user may respond to a network device in order to perform a task, play a game, or achieve a goal. The services may be provided from a network server or other network device, or even from the mobile terminal itself, for example a mobile telephone, a mobile television, or a mobile gaming system.
In many applications, it is necessary for the user to receive audio information, such as oral feedback or instructions, from the network or mobile terminal, or for the user to give oral instructions or feedback to the network or mobile terminal. Such applications may provide for a user interface that does not rely on substantial manual user activity. In other words, the user may interact with the application in a hands-free or semi-hands-free environment. Examples of such applications include paying a bill, ordering a program, and requesting and receiving driving instructions. Other applications may convert speech into text or perform some other function based on recognized speech, such as dictating a document, a short message service (SMS) message, or an email. In order to support these and other applications, speech recognition applications, applications that produce speech from text, and other speech processing devices are becoming more common.
Speech recognition, which may be referred to as automatic speech recognition (ASR), may be conducted by numerous different types of applications that may convert recognized speech into text (e.g., a speech-to-text system). Current ASR and/or speech-to-text systems are typically based on Hidden Markov Models (HMMs), which are statistical models that describe speech patterns probabilistically. In some instances it may be desirable for speech models to ignore speaker characteristics such as gender, age, and accent. However, it is typically not feasible to ignore such characteristics, so speech models may model both speaker and environmental factors as well as the “pure” linguistic patterns desirable for recognition. Thus, for example, “Speaker Dependent” (SD) acoustic models that are trained for a specific speaker's voice are generally more accurate than “Speaker Independent” (SI) acoustic models that generalize over a population of different speakers. Pure SD models, however, may be inconvenient in that such models must be trained individually for each speaker. This may require that several hours' worth of transcribed speech recordings be available for a given speaker.
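As a rough illustration of what it means for an HMM to describe a pattern probabilistically, the following minimal sketch computes the likelihood of a short observation sequence with the standard forward algorithm. The two hidden states, the discrete "high"/"low" observation symbols, and all probability values are hypothetical and chosen only for illustration; practical ASR systems typically use continuous acoustic features and far larger models.

```python
from typing import Dict, List

# Hypothetical toy model: two hidden states and two observable symbols.
# Every number below is an illustrative assumption, not taken from any real system.
INITIAL: Dict[str, float] = {"V": 0.6, "C": 0.4}
TRANSITION: Dict[str, Dict[str, float]] = {
    "V": {"V": 0.7, "C": 0.3},
    "C": {"V": 0.4, "C": 0.6},
}
EMISSION: Dict[str, Dict[str, float]] = {
    "V": {"high": 0.9, "low": 0.1},
    "C": {"high": 0.2, "low": 0.8},
}


def forward_likelihood(
    observations: List[str],
    initial: Dict[str, float],
    transition: Dict[str, Dict[str, float]],
    emission: Dict[str, Dict[str, float]],
) -> float:
    """Return P(observations | model) using the forward algorithm."""
    if not observations:
        raise ValueError("observations must not be empty")
    states = list(initial)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs] * sum(alpha[p] * transition[p][s] for p in states)
            for s in states
        }
    # Sum over all possible final states.
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward_likelihood(["high", "low"], INITIAL, TRANSITION, EMISSION))
```

Under these assumptions, a recognizer could score the same observations against several candidate models and prefer the one with the highest likelihood. A model whose parameters were estimated from one speaker's recordings would be expected to assign higher likelihoods to that speaker's speech than a model averaged over many speakers.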
Accordingly, there may be a need to develop improved speech processing techniques that address the problems described above.