Keyword detection is an important part of speech recognition, and its greatest obstacle is the number of spoken languages. A recognition engine achieves its most accurate recognition for a single specified language, so keyword detection is less accurate for audio in varying or multiple languages.
To address this, a keyword detection method based on audio samples has emerged, which does not require the language of the audio to be detected to be specified. The keyword detection method based on audio samples is described as follows:
First, train a neural network with audio data of a specific language to obtain a neural network that outputs phonetic-level posterior probabilities.
Next, when an audio sample to be detected is obtained, use the neural network to obtain a characteristic sequence corresponding to the audio sample. In particular, the characteristic sequence is a representation of the audio sample (e.g., one or more posterior probabilities obtained from the neural network).
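As a minimal sketch of what such a characteristic sequence might look like, the following assumes the network emits one vector of raw scores per audio frame and that a softmax turns each vector into posterior probabilities over phonetic classes. The softmax step, the function names, and the toy scores are illustrative assumptions, not details given in the description above.

```python
import math

def softmax(scores):
    """Convert one frame's raw scores into posterior probabilities (assumed output layer)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def characteristic_sequence(frame_scores):
    """Return one posterior vector per frame; the list is the characteristic sequence."""
    return [softmax(scores) for scores in frame_scores]

# Hypothetical raw scores for two frames over three phonetic classes.
frame_scores = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
sequence = characteristic_sequence(frame_scores)
```

Each element of `sequence` sums to one, so frames from different recordings can be compared on a common scale.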
Finally, shift a sliding window along the characteristic sequence in equal steps. After each shift, use the neural network to obtain the characteristic representation within the sliding window and then use a Dynamic Time Warping (DTW) algorithm to carry out a warping comparison. When a characteristic representation matches, output the detected keyword.
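The sliding-window comparison can be sketched as follows. This is an illustration under stated assumptions rather than the method's actual implementation: the posterior vectors are assumed to be precomputed for the whole stream, a keyword template (`query`) is assumed to be compared against each window, and the Euclidean frame distance, the length normalization, and the `threshold` value are all hypothetical choices.

```python
import math

def frame_distance(p, q):
    """Euclidean distance between two posterior vectors (assumed local cost)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dtw_distance(query, segment):
    """Length-normalized DTW cost between two sequences of frames."""
    n, m = len(query), len(segment)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], segment[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)

def detect_keyword(query, stream, window, hop=1, threshold=0.1):
    """Slide a fixed-size window over `stream`; return the start index of each matching window."""
    hits = []
    for start in range(0, len(stream) - window + 1, hop):
        segment = stream[start:start + window]
        if dtw_distance(query, segment) <= threshold:
            hits.append(start)
    return hits

# Toy posterior vectors over three phonetic classes.
A = [0.9, 0.05, 0.05]
B = [0.05, 0.9, 0.05]
C = [0.05, 0.05, 0.9]
query = [A, B, C]
stream = [C, C, A, A, B, C, C, A]
hits = detect_keyword(query, stream, window=4)
```

In this toy stream, the windows starting at frames 2 and 3 contain the template with one repeated frame, which the warping absorbs at zero cost, so both are reported as matches.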
This completes the description of the conventional keyword detection method based on audio samples.
In the aforementioned keyword detection method based on audio samples, the characteristic representation and the characteristic sequence are extracted by the neural network, which provides a degree of robustness. In addition, this method uses the DTW algorithm in combination with the sliding window technique to detect the keyword; however, the DTW algorithm was typically used in early-stage speech recognition and is mainly applicable to isolated-word speech recognition. The core idea of the DTW algorithm is that, using dynamic programming, it directly compares audio characteristics at the characteristic level, so the implementation is simple and the processing speed is high.
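The dynamic-programming comparison can be written as the standard DTW recurrence. The notation is assumed for illustration: $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_m)$ are the two characteristic sequences, $d(x_i, y_j)$ is a frame-level distance, and $D(i, j)$ is the minimum accumulated cost of aligning the first $i$ frames of $X$ with the first $j$ frames of $Y$.

```latex
\begin{aligned}
D(0,0) &= 0, \qquad D(i,0) = D(0,j) = \infty \quad (i, j \ge 1),\\
D(i,j) &= d(x_i, y_j) + \min\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\},\\
\mathrm{DTW}(X,Y) &= D(n,m).
\end{aligned}
```

Because every cell depends only on three neighbors, the table is filled in $O(nm)$ time, which is consistent with the simplicity and speed noted above.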
However, pronunciation changes with external factors such as age, emotion, environment, and health. Consequently, with the DTW algorithm, factors such as environmental noise frequently lead to a significant decrease in keyword detection accuracy.
Moreover, with the conventional keyword detection method, the training of the neural network is based on a single language. Thus, the conventional keyword detection method performs well when carrying out keyword detection on an audio sample of the specified language; however, once keyword detection is extended to other languages, it is hard to obtain equivalent keyword detection performance for audio in unspecified (e.g., untrained) languages.