The present disclosure generally relates to speech recognition, and more specifically to phrase spotting in speech.
Spoken phrase spotting is an important function in a variety of technologies, such as speech recognition systems and interactive dialog systems. Generally, confidence scores are used to determine whether a word was spotted, recognized or correctly understood. In automatic speech recognition (ASR) systems, for example, a confidence score represents how confident the speech recognition system is that the words were correctly identified.
Automatic speech recognition techniques of the art include phonetic-based phrase spotting that commonly uses posterior probability estimation (PPE) to provide a PPE confidence score. Generally, there are two major approaches for computing PPE scores.
One approach is based on acoustic measurements, and applies two phases to the provided speech. The first phase computes the best word hypotheses, and the second phase re-scores those hypotheses to compute a PPE score for each word in the best hypotheses. The first phase uses standard acoustic models, and the second phase uses acoustic models that normalize the log-likelihood functions. The acoustic approach requires the ASR system to provide a background model of the acoustic channel.
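By way of illustration only, the re-scoring phase of the acoustic approach may be sketched as follows. The sketch assumes that normalization takes the form of a per-frame log-likelihood ratio between the word model and the background model, and that the ratio is mapped to a score between 0 and 1 by a logistic function. Both choices, and the names normalized_word_score, to_confidence and slope, are assumptions made for this example and are not prescribed by the approach described above.

```python
import math


def normalized_word_score(word_loglik, background_loglik, num_frames):
    """Per-frame log-likelihood ratio of a word hypothesis.

    word_loglik:        log-likelihood of the segment under the word's model
    background_loglik:  log-likelihood of the same segment under the
                        background model of the acoustic channel
    num_frames:         number of acoustic frames in the segment
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return (word_loglik - background_loglik) / num_frames


def to_confidence(llr, slope=1.0):
    """Map a log-likelihood ratio to the open interval (0, 1).

    The logistic mapping and the slope parameter are illustrative
    assumptions; a deployed system would calibrate this step on data.
    """
    return 1.0 / (1.0 + math.exp(-slope * llr))


if __name__ == "__main__":
    # Hypothetical scores for a 10-frame word segment.
    llr = normalized_word_score(-100.0, -120.0, 10)
    print(llr, to_confidence(llr))
```

In this sketch, a hypothesis that the word model explains no better than the background model receives a ratio of zero and a score of 0.5, while a hypothesis that the word model explains better receives a higher score.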
Another approach for computing PPE scores is a lattice-based approach, which uses a result lattice graph to estimate all the possible paths by means of a forward-backward algorithm. Confidence scores can be assigned to each path, and those confidence scores can be used to normalize the hypothesis scores.
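By way of illustration only, a forward-backward computation over a small result lattice may be sketched as follows. The sketch assumes that the lattice is a directed acyclic graph whose nodes are numbered in topological order and whose edges carry a word label and a log-domain score; the edge representation and the names logaddexp and edge_posteriors are hypothetical and are chosen for this example. The posterior of an edge is the summed score of all paths passing through that edge, normalized by the summed score of all paths in the lattice.

```python
import math
from collections import defaultdict


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def edge_posteriors(edges, start, final):
    """Return (word, posterior) for every edge of the lattice.

    edges: list of (src, dst, word, log_score) tuples with src < dst,
           i.e. nodes are assumed to be numbered in topological order.
    """
    alpha = defaultdict(lambda: -math.inf)  # forward log-scores
    beta = defaultdict(lambda: -math.inf)   # backward log-scores
    alpha[start] = 0.0
    beta[final] = 0.0

    # Forward pass: visit edges in increasing order of source node.
    for src, dst, _, score in sorted(edges, key=lambda e: e[0]):
        alpha[dst] = logaddexp(alpha[dst], alpha[src] + score)

    # Backward pass: visit edges in decreasing order of destination node.
    for src, dst, _, score in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[src] = logaddexp(beta[src], score + beta[dst])

    total = alpha[final]  # log of the summed score of all complete paths
    if total == -math.inf:
        raise ValueError("final node is not reachable from start node")

    return [
        (word, math.exp(alpha[src] + score + beta[dst] - total))
        for src, dst, word, score in edges
    ]


if __name__ == "__main__":
    # Hypothetical three-node lattice with two competing first words.
    lattice = [
        (0, 1, "hello", math.log(0.6)),
        (0, 1, "yellow", math.log(0.4)),
        (1, 2, "world", 0.0),
    ]
    for word, posterior in edge_posteriors(lattice, start=0, final=2):
        print(word, round(posterior, 3))
```

In this sketch, the posteriors of competing edges that span the same portion of the lattice sum to one, so that each value can be read as a normalized score for the corresponding word hypothesis.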
In certain cases, the PPE score may be unreliable or inaccurate. It is therefore desirable to improve the reliability of the phrase spotting confidence score produced by an ASR engine, in order to obtain more reliable or accurate phrase spotting results.