The present invention generally relates to adapting speech recognition to a user's speech, and more particularly to adapting speech recognition to a user's speech by concatenating utterances from the user.
Speech recognition by a computer, also known as automatic speech recognition (ASR) or speech-to-text (STT), may utilize two types of speech recognition models: an acoustic model and a language model. An acoustic model may rely on relationships between an audio signal and the phonetic units present in that audio signal. A language model may rely on relationships between words in a spoken sentence (i.e., word sequences in language). Speech recognition servers/systems may determine text based on the highest combined probability for both acoustic and language models. However, there may be a mismatch between the text determined by the models and the actual words in a user's speech. Such mismatches may increase for short utterances, resulting in deteriorated speech recognition accuracy.
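By way of non-limiting illustration, selecting text based on the highest combined probability may be sketched as follows. The function name `pick_best_transcript`, the `lm_weight` parameter, and the candidate probabilities are hypothetical and chosen only for this example; an actual system may combine the model scores differently.

```python
import math

def pick_best_transcript(candidates, lm_weight=1.0):
    """Return the candidate text with the highest combined score.

    candidates: list of (text, acoustic_log_prob, language_log_prob).
    lm_weight:  hypothetical scaling factor applied to the language model score.
    """
    def combined_score(candidate):
        _, acoustic_log_prob, language_log_prob = candidate
        # Adding log probabilities corresponds to multiplying probabilities.
        return acoustic_log_prob + lm_weight * language_log_prob

    return max(candidates, key=combined_score)[0]

# Illustrative (made-up) scores for two competing hypotheses.
candidates = [
    ("recognize speech", math.log(0.30), math.log(0.20)),
    ("wreck a nice beach", math.log(0.35), math.log(0.01)),
]

best = pick_best_transcript(candidates)
```

In this sketch the second hypothesis has the higher acoustic score, yet the first is selected because the language model considers its word sequence far more likely.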
To improve speech recognition accuracy, a speech recognition system may obtain “training” (or “enrollment”) speech from the user, which the system may use to adapt a general acoustic model and/or a general language model to the user's speech. System training may involve a user reading text or isolated vocabulary into the system. Such systems are known as “speaker-dependent” systems. Systems that do not use training are known as “speaker-independent” systems.
System training and/or adaptation may occur during a single user session or across multiple user sessions. In-session adaptation relies on long utterances from the user (e.g., a lecture), which the system may use to learn both acoustic information for the user and language context. Adaptation across multiple user sessions requires user identification to link multiple sessions by the user into a single, long utterance. Speech recognition systems utilizing adaptation across multiple user sessions may require large amounts of storage to store each user's utterances and/or adapted models, which may affect scalability of these systems.
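By way of non-limiting illustration, linking multiple sessions by user identification may be sketched as follows. The function name `concatenate_by_user` and the representation of an utterance as a list of audio samples are assumptions made only for this example.

```python
from collections import defaultdict

def concatenate_by_user(session_utterances):
    """Link utterances from multiple sessions into one long utterance per user.

    session_utterances: iterable of (user_id, samples) pairs in chronological
    order, where samples is a list of audio sample values (assumed format).
    """
    combined = defaultdict(list)
    for user_id, samples in session_utterances:
        # Append this session's audio to the user's accumulated utterance.
        combined[user_id].extend(samples)
    return dict(combined)

# Three illustrative sessions from two hypothetical users.
sessions = [
    ("alice", [0.1, 0.2]),
    ("bob", [0.5]),
    ("alice", [0.3]),
]

linked = concatenate_by_user(sessions)
```

Because the accumulated audio is retained for every identified user, the stored data grows with both the number of users and the number of sessions, which is the scalability concern noted above.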