In automatic speech recognition (ASR), a keyword is a word that carries a substantive meaning and is typically a noun or a phrase. Conversely, a filler word normally follows keywords and plays no substantive or meaningful role. A keyword is detected when the starting and ending time points of the keyword are identified in speech data received by an electronic device. As a result of keyword detection, the speech data is determined by a keyword detection system to include multiple keywords and filler words. Existing keyword detection systems are mainly implemented based on two models, i.e., a garbage model and a phoneme/syllable recognition model.
In a keyword detection system based on the garbage model, a decoding network is used to identify the keywords in the received speech data, and the words used in the decoding network include keywords and filler words that are linked according to a predetermined network structure. In accordance with the decoding network, the keyword detection system recognizes each part (e.g., frame) of the speech data as being associated with either a keyword or a filler word. Each recognized part of the speech data is further associated with a confidence score, and the keyword detection system uses the respective confidence score to determine whether the keyword is properly detected. Keywords that are determined to be properly detected are then outputted together with their position information within the speech data.
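The confidence-based filtering described above can be sketched as follows. This is a minimal illustration only: the `Segment` type, the `select_keywords` function, and the default threshold of 0.5 are hypothetical names and values chosen for the example, and the decoding network itself is assumed to have already produced the labeled segments.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One recognized part of the speech data (hypothetical structure)."""
    label: str         # recognized word
    is_keyword: bool   # True for a keyword, False for a filler word
    start: float       # starting time point, in seconds
    end: float         # ending time point, in seconds
    confidence: float  # confidence score assigned by the decoder


def select_keywords(
    segments: List[Segment], threshold: float = 0.5
) -> List[Tuple[str, float, float]]:
    """Keep keywords whose confidence meets the threshold.

    Returns (label, start, end) tuples, i.e. each accepted keyword
    together with its position information. The threshold value is
    an assumption made for this sketch.
    """
    return [
        (s.label, s.start, s.end)
        for s in segments
        if s.is_keyword and s.confidence >= threshold
    ]
```

In this sketch, filler segments are discarded regardless of their score, and a keyword segment with a low score is treated as not properly detected.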
On the other hand, a keyword detection system based on the phoneme/syllable recognition model detects keywords in the received speech data based on the entire context of the speech data. Specifically, a phoneme or syllable network is outputted for the received speech data, and the keywords of the speech data are detected from the phoneme or syllable network using a context search technique.
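As a rough illustration of searching recognized phonemes for keywords, the sketch below makes two simplifying assumptions that are not taken from any existing system: the phoneme network is reduced to a single best phoneme sequence, and the context search is replaced by exact matching against a pronunciation lexicon. The function name `find_keywords` and the lexicon format are hypothetical.

```python
from typing import Dict, List, Tuple


def find_keywords(
    phonemes: List[str], lexicon: Dict[str, List[str]]
) -> List[Tuple[str, int, int]]:
    """Locate keyword pronunciations inside a phoneme sequence.

    `phonemes` stands in for the recognizer output (a single sequence
    rather than a full network), and `lexicon` maps each keyword to
    its phoneme list. Returns (keyword, start_index, end_index)
    tuples ordered by start index, with end_index exclusive.
    """
    hits = []
    for word, pron in lexicon.items():
        n = len(pron)
        if n == 0:
            continue  # skip empty pronunciations
        for i in range(len(phonemes) - n + 1):
            if phonemes[i:i + n] == pron:
                hits.append((word, i, i + n))
    return sorted(hits, key=lambda h: h[1])
```

A search over a real phoneme or syllable network would also have to handle alternative paths and recognition errors, which this exact-match sketch deliberately leaves out.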
When more than one language is involved in speech recognition, existing keyword detection systems normally require two independent phases, i.e., a language recognition phase and a keyword detection phase. During the language recognition phase, a specific language is determined for the received speech data, and during the subsequent keyword detection phase, the keywords are then determined by a keyword detection engine associated with this specific language. The detected keywords are then combined and outputted as a recognition result from the keyword detection system.
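The two-phase arrangement described above can be sketched as follows. The function `detect_keywords_two_phase`, the `identify_language` callable, and the `engines` mapping are hypothetical names introduced for illustration; the speech data is represented by an opaque object that is simply passed through.

```python
from typing import Any, Callable, Dict, List, Tuple


def detect_keywords_two_phase(
    speech: Any,
    identify_language: Callable[[Any], str],
    engines: Dict[str, Callable[[Any], List[str]]],
) -> Tuple[str, List[str]]:
    """Run language recognition first, then language-specific detection.

    Phase 1 picks exactly one language for the whole input.
    Phase 2 runs only the keyword detection engine for that language.
    """
    language = identify_language(speech)  # language recognition phase
    if language not in engines:
        raise ValueError(f"no keyword detection engine for {language!r}")
    keywords = engines[language](speech)  # keyword detection phase
    return language, keywords
```

Because the second phase depends entirely on the single language chosen in the first phase, any error in `identify_language` is carried directly into the keyword results.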
However, the performance of an existing keyword detection system involving two or more languages is often bottlenecked by the language recognition phase. The accuracy of language recognition during the language recognition phase directly impacts the results of keyword detection in the keyword detection phase. In particular, accurate language recognition generally requires speech data of an extended length (for example, 3 to 5 seconds), and this requirement inevitably creates obstacles for performing the subsequent keyword detection in a streaming manner. Moreover, the existing keyword detection system is particularly inefficient when keywords of multiple languages are mixed together in one sentence (e.g., in speech data associated with “highhigh”), which causes inaccurate recognition of both languages and keywords. Therefore, there is a need for accurately detecting keywords in speech that involves two or more languages.