A typical method of keyword detection is a three stage process. The first stage is vocalization detection. Initially, an extremely low power “always-on” implementation continuously monitors ambient sound and determines whether a person begins to utter a possible keyword (typically by detecting human vocalization). When a possible keyword vocalization is detected, the second stage begins.
The second stage performs keyword recognition. This operation consumes more power because it is computationally more intensive than the vocalization detection. When the examination of an utterance (e.g., keyword recognition) is complete, the result can either be a keyword match (in which case the third stage will be entered) or no match (in which case operation of the first, lowest power stage resumes).
The third stage is used for analysis of any speech subsequent to the keyword recognition using automatic speech recognition (ASR). This third stage is a very computationally intensive process and, therefore, can greatly benefit from improvements to the signal to noise ratio (SNR) of the portion of the audio that includes the speech. The SNR is typically optimized using noise suppression (NS) signal processing, which may require obtaining audio input from multiple microphones.
Use of a digital microphone (DMIC) is well known. The DMIC typically includes a signal processing portion. A digital signal processor (DSP) is typically used to perform computations for detecting keywords. Having some form of digital signal processor (DSP), to perform the keyword detection computations, on the same integrated circuit (chip) as the signal processing portion of the DMIC itself may have system power benefits. For example, while in the first stage, the DMIC can operate from an internal oscillator, thus saving the power of supplying an external clock to the DMIC and the power of transmitting the DMIC data output, typically, a pulse density modulated (PDM) signal, to an external DSP device.
It is also known that implementing the subsequent stages of keyword recognition on the DMIC may not be optimal for the lowest power or system cost. The subsequent stages of keyword recognition are computationally intensive and, thus, consume significant dynamic power and die area. However, the DMIC signal processing chip is typically implemented using a process geometry having significantly higher dynamic power and larger area per gate or memory bit than the best available digital processes.
Finding an optimal implementation that takes advantage of the potential power savings of implementing the first stage of keyword recognition in the DMIC can be challenging due to conflicting requirements. To optimize power, the DMIC operates in an “always-on,” standalone manner, without transmitting audio data to an external device when no vocalization has been detected. When the vocalization is detected, the DMIC needs to provide a signal to an external device indicating this condition. Simultaneously with or subsequent to the occurrence of this condition, the DMIC needs to begin providing audio data to the external device(s) performing the subsequent stages. Optimally, the audio data interface is needed to meet the following requirements: transmitting audio data corresponding to times that significantly precede the vocalization detection, transmitting real-time audio data at an externally provided clock (sample) rate, and simplifying multi-microphone noise suppression processing. Additionally, latency associated with the real-time audio data for DMICs that implement the first stage of keyword recognition needs to be substantially the same as for conventional DMICs, the interface needs to be compatible with existing interfaces, the interface needs to indicate the clock (sample) rate used while operating with the internal oscillator, and no audio drop-outs should occur.
An interface with a DMIC that implement the first stage of keyword recognition can be challenging to implement largely due to the requirement to present audio data that is buffered significantly prior to the vocalization detection. This buffered audio data was previously acquired at a sample rate determined by the internal oscillator. Consequently, when the buffered audio data is provided along with real-time audio data as part of a single, contiguous audio stream, it can be difficult to make this real-time audio data have the same latency as in a conventional DMIC or difficult to use conventional multi-microphone noise suppression techniques.