Speech recognition (SR), also known as Automatic Speech Recognition (ASR), Speech to Text (S2T), or by other names, belongs to a large family of audio analysis techniques used for automatically identifying and extracting information from audio signals. Such techniques may include user recognition, user verification, user identification, emotion analysis, word spotting, and continuous speech recognition, which refers to transcribing spoken words into text.
Some SR engines require speaker-specific training, in which an individual speaker reads sections of text aloud into the SR system so that the system can learn the characteristics of the speaker's voice for future recognition. However, such training is not always feasible, and it is often necessary to transcribe the voices of unknown or unrecognized speakers, for whom even the language or the accent may not be known a priori. Systems that do not require such training may be referred to as “speaker independent”.
A main obstacle in recognizing speech is the computational complexity of current methods, which is tightly coupled to the recognition quality. Recognizing spoken words at high quality, i.e., with a low error rate, requires significant computing resources or significant processing time. Therefore, efficient methods are required in order to process large volumes of audio and retrieve the spoken words. For example, if a call center having hundreds or thousands of agents simultaneously speaking with customers is required to transcribe a significant part of the captured or recorded calls, then in order to obtain meaningful results with reasonable resources, processing an audio signal should take no more than a very small fraction of the length of the signal.
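The call-center constraint can be made concrete with a short calculation. The sketch below is illustrative only: the function name and all of the numbers (1000 concurrent calls, half of them transcribed, 10 processing workers) are hypothetical and do not come from the text. It computes the largest ratio of processing time to audio duration, often called the real-time factor, at which the workers still keep up with the incoming audio.

```python
def max_real_time_factor(concurrent_calls: int,
                         transcribed_fraction: float,
                         workers: int) -> float:
    """Largest processing-time / audio-duration ratio that keeps up with the load.

    All parameters are hypothetical illustration values, not figures from the text.
    """
    # Seconds of audio that must be transcribed for every second of wall-clock time.
    audio_seconds_per_second = concurrent_calls * transcribed_fraction
    # Each worker supplies one second of processing per wall-clock second.
    return workers / audio_seconds_per_second


rtf = max_real_time_factor(concurrent_calls=1000,
                           transcribed_fraction=0.5,
                           workers=10)
print(f"each audio second must be processed in at most {rtf} s")
```

Under these assumed numbers, each second of audio must be processed in at most 0.02 seconds, which illustrates why processing time must be only a small fraction of the signal length.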
One of the stages of common S2T methods is identifying the most probable phoneme sequence that may be obtained from the input audio signal. This stage is particularly time consuming, and its complexity may have a significant effect on the performance of the whole process.
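The text does not name the algorithm used for this stage. As an assumption for illustration, the sketch below uses Viterbi decoding over a hidden Markov model, a widely used way of finding the most probable state sequence. The two-phoneme inventory and all probabilities are made up for the example, and the function and variable names are hypothetical.

```python
import math


def viterbi(log_emissions, log_transitions, log_initial):
    """Return (most probable state path, its log-probability).

    log_emissions[t][s]   : log P(audio frame t | phoneme s)
    log_transitions[p][s] : log P(phoneme s follows phoneme p)
    log_initial[s]        : log P(sequence starts in phoneme s)
    """
    n_states = len(log_initial)
    scores = [log_initial[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s: an inner loop over all states.
            best_prev = max(range(n_states),
                            key=lambda p: scores[p] + log_transitions[p][s])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev]
                              + log_transitions[best_prev][s] + frame[s])
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(range(n_states), key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


# Toy model with made-up probabilities: two phonemes, three audio frames.
phonemes = ["a", "b"]
log = math.log
log_initial = [log(0.5), log(0.5)]
log_transitions = [[log(0.8), log(0.2)],
                   [log(0.2), log(0.8)]]
log_emissions = [[log(0.9), log(0.1)],
                 [log(0.6), log(0.4)],
                 [log(0.1), log(0.9)]]

best_path, best_score = viterbi(log_emissions, log_transitions, log_initial)
print([phonemes[s] for s in best_path])  # ['a', 'a', 'b']
```

For T frames and S states, this search performs on the order of T × S² operations, so its cost grows quickly with the size of the state space, which is consistent with the stage being a major contributor to overall processing time.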