This invention is in the field of active sensing of audio inputs. Embodiments are directed to the detection of particular features in sensed audio.
Recent advancements in semiconductor manufacturing and sensor technologies have enabled new capabilities in the use of low power networks of sensors and controllers to monitor environments and control processes. These networks are being envisioned for deployment in a wide range of applications, including transportation, manufacturing, biomedical, environmental management, safety, and security. Many of these low power networks involve machine-to-machine (“M2M”) communications over a wide-area network, such a network now often referred to as the “Internet of Things” (“IoT”).
The particular environmental attributes or events that are contemplated to serve as input to sensors in these networks are also wide-ranging, including conditions such as temperature, humidity, seismic activity, pressures, mechanical strain or vibrations, and so on. Audio attributes or events are also contemplated to be sensed in these networked systems. For example, in the security context, sensors may be deployed to detect particular sounds such as gunshots, glass breaking, human voices, footsteps, automobiles in the vicinity, animals gnawing power cables, weather conditions, and the like.
The sensing of audio signals or inputs is also carried out by such user devices as mobile telephones, personal computers, tablet computers, automobile audio systems, home entertainment or lighting systems, and the like. For example, voice activation of a software “app” is commonly available in modern mobile telephone handsets. Conventional voice activation typically operates by detecting particular features or “signatures” in sensed audio, and invoking corresponding applications or actions in response. Other types of audio inputs that can be sensed by these user devices include background sound, such as sound indicating whether the user is in an office environment, in a restaurant, or in a moving automobile or other conveyance, in response to which the device modifies its response or operation.
Low power operation is critical in low-power network devices and in battery-powered mobile devices, to allow for maximum flexibility and battery life, and minimum form factor. For example, it has been observed that some types of sensors, such as wireless environmental sensors deployed in the IoT context, can use a large fraction of their available power on environmental or channel monitoring while waiting for an anticipated event to occur. This is particularly true for acoustic sensors, considering the significant amount of power typically required in voice and sound recognition. Conventional sensors of this type typically operate according to a low power, or “sleep,” operating mode in which the back end of the sensor assembly (e.g., the signal transmitter circuitry) is effectively powered down pending receipt of a signal indicating the occurrence of the anticipated event. While this approach can significantly reduce power consumption of the sensor assembly, many low duty cycle systems in which each sensor assembly spends a very small amount of time performing data transmission still consume significant power during idle periods, so much so as to constitute a major portion of the overall power budget.
FIG. 1 illustrates a typical conventional sound recognition system 300, for example as applied to the detection of human speech. Sounds 310 from the surrounding environment are received by microphone 312 of recognition system 300, and are converted to an analog signal. Analog-to-digital converter (ADC) 322 in analog front end (AFE) stage 320 of system 300 converts this analog input signal to a digital signal, specifically in the form of a sequence of digital samples 324. As is fundamental in the art, the sampling rate of ADC 322 exceeds the Nyquist rate of twice the maximum frequency of interest. For typical human speech recognition systems for which sound signals of up to about 20 kHz are of interest, the sample rate will be at least 40 kHz.
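By way of illustration only, the sampling constraint described above may be sketched as follows. The function names are hypothetical and are not part of system 300; the second function shows the frequency at which a tone would appear if it were sampled below its Nyquist rate.

```python
def nyquist_rate_hz(max_freq_hz: float) -> float:
    """Return the Nyquist rate, i.e. twice the maximum frequency of interest."""
    return 2.0 * max_freq_hz


def alias_frequency_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Return the apparent frequency of a pure tone after sampling.

    A tone above half the sample rate folds back into the band
    from 0 Hz to sample_rate_hz / 2.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


if __name__ == "__main__":
    # 20 kHz of interest calls for a sample rate of at least 40 kHz.
    print(nyquist_rate_hz(20_000))
    # A 25 kHz tone sampled at 40 kHz is indistinguishable from a 15 kHz tone.
    print(alias_frequency_hz(25_000, 40_000))
```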
Digital logic 330 of system 300 converts digital samples 324 to sound information (D2I) in this conventional system 300. Digital logic 330 is typically realized by a general purpose microcontroller unit (MCU), a specialty digital signal processor (DSP), an application specific integrated circuit (ASIC), or another type of programmable logic, and in this arrangement partitions the samples into frames 340 and then transforms 342 the framed samples into information features using a defined transform function 344. These information features are then mapped to sound signatures (I2S) by pattern recognition and tracking logic 350.
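A minimal sketch of the framing and transformation steps described above is given below for illustration. Transform function 344 is not specified in this description, so the mean-energy transform used here is an assumption chosen only for simplicity, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Partition samples into non-overlapping frames of frame_len samples.

    A trailing partial frame, if any, is dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    last_start = len(samples) - frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, last_start + 1, frame_len)]


def mean_energy(frame: Sequence[float]) -> float:
    """Assumed stand-in for the transform function: mean squared amplitude."""
    return sum(x * x for x in frame) / len(frame)


def extract_features(
    samples: Sequence[float],
    frame_len: int,
    transform: Callable[[Sequence[float]], float] = mean_energy,
) -> List[float]:
    """Apply the transform to each frame, giving one feature per frame."""
    return [transform(frame) for frame in split_into_frames(samples, frame_len)]


if __name__ == "__main__":
    print(extract_features([1, 1, 2, 2, 9], frame_len=2))
```

Any other per-frame transform, for example a spectral transform, could be passed in place of mean_energy without changing the framing logic.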
Recognition logic 350 is typically implemented by one or more types of known pattern recognition techniques, such as a Neural Network, a Classification Tree, Hidden Markov models, Conditional Random Fields, Support Vector Machine, etc., and operates in a periodic manner as represented by time points t0 360, t1 361, t2 362, etc. For example, each information feature (e.g., feature 346) generated by transformation 342 is compared to a database 370 of pre-identified features. At each time step, recognition logic 350 attempts to find a match between a sequence of information features produced by transformation logic 342 and a sequence of sound signatures stored in database 370. Each candidate signature 352 that is identified is assigned a score value indicating the degree of match between it and features in database 370. Those signatures 352 having a score exceeding a threshold value are identified by recognition system 300 as a match with a known signature.
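The scoring and thresholding described above may be illustrated by the following sketch. It does not implement any of the pattern recognition techniques named above; it assumes, purely for illustration, a score of 1 / (1 + Euclidean distance) between a feature sequence and each stored signature. The database contents and all names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def match_score(features: Sequence[float], signature: Sequence[float]) -> float:
    """Assumed score in (0, 1]: 1.0 for identical sequences, lower as they differ."""
    if len(features) != len(signature):
        raise ValueError("feature and signature sequences must have equal length")
    return 1.0 / (1.0 + math.dist(features, signature))


def match_signatures(
    features: Sequence[float],
    database: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (name, score) for each stored signature whose score exceeds threshold."""
    matches = []
    for name, signature in database.items():
        score = match_score(features, signature)
        if score > threshold:
            matches.append((name, score))
    return matches


if __name__ == "__main__":
    # Hypothetical database of pre-identified signatures.
    database = {"glass": [1.0, 2.0, 3.0], "voice": [10.0, 10.0, 10.0]}
    print(match_signatures([1.0, 2.0, 3.0], database, threshold=0.5))
```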
Because the complex signal segmentation, signal transformation and final pattern recognition operations are performed in the digital domain in recognition system 300, high-performance and high-precision realizations of ADC 322 and the rest of analog front end (AFE) 320 are required to provide an adequate digital signal for the following complex digital processing. For example, audio recognition of a sound signal with an 8 kHz bandwidth by a typical conventional sound recognition system will require an ADC with 16-bit accuracy operating at a sample rate of 16 kSps (kilosamples per second) or higher. In addition, because the raw input signal 310 is essentially recorded by system 300, that signal could potentially be reconstructed from stored data, raising privacy and security issues.
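To give a sense of scale, the raw data rate implied by the 16-bit, 16,000 samples-per-second example above can be worked out as in the following sketch. It assumes a single audio channel and uncompressed samples; the function names are hypothetical.

```python
def raw_bit_rate_bps(bits_per_sample: int, sample_rate_sps: int) -> int:
    """Return the uncompressed single-channel data rate in bits per second."""
    return bits_per_sample * sample_rate_sps


def storage_bytes(bits_per_sample: int, sample_rate_sps: int, seconds: int) -> int:
    """Return the bytes needed to hold the raw samples for the given duration."""
    return raw_bit_rate_bps(bits_per_sample, sample_rate_sps) * seconds // 8


if __name__ == "__main__":
    # 16 bits at 16,000 samples per second is 256,000 bits per second.
    print(raw_bit_rate_bps(16, 16_000))
    # One hour of such samples occupies 115,200,000 bytes.
    print(storage_bytes(16, 16_000, 3600))
```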
Furthermore, to mitigate the problem of high power consumption in battery powered applications, system 300 may be toggled between normal detection and standby operational modes at some duty cycle. For example, from time to time the whole system may be turned on and run in full-power mode for detection, followed by intervals in low-power standby mode. However, such duty cycled operation increases the possibility of missing an event during the standby mode.
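The tradeoff described above may be illustrated with a simple model, offered as a sketch under stated assumptions rather than as a characterization of any particular system. The model assumes a strictly periodic on/off cycle, an event that begins at a uniformly random time within the cycle, an event no longer than one cycle, and detection whenever any part of the event overlaps an on interval. The names and power figures are hypothetical.

```python
def miss_probability(on_s: float, off_s: float, event_s: float) -> float:
    """Probability that an event falls entirely within a standby interval.

    The event is missed only if it starts in standby and ends before
    the next on interval begins.
    """
    period_s = on_s + off_s
    return max(0.0, off_s - event_s) / period_s


def average_power(on_s: float, off_s: float, active_mw: float, standby_mw: float) -> float:
    """Time-weighted average power over one on/off cycle, in milliwatts."""
    return (active_mw * on_s + standby_mw * off_s) / (on_s + off_s)


if __name__ == "__main__":
    # 1 s on, 3 s standby, 1 s event: half of all events are missed.
    print(miss_probability(on_s=1.0, off_s=3.0, event_s=1.0))
    # The same cycle with assumed 10 mW active and 2 mW standby power.
    print(average_power(on_s=1.0, off_s=3.0, active_mw=10.0, standby_mw=2.0))
```

Under this model, lengthening the standby interval lowers average power but raises the probability of a missed event, which is the difficulty noted above.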
By way of further background, U.S. Patent Application Publication No. 2015/0066498, published Mar. 5, 2015, commonly assigned herewith and incorporated herein by this reference, describes a low power sound recognition sensor configured to receive an analog signal that may contain a signature sound. In this sensor, the received analog signal is evaluated using a detection portion of the analog section to determine when background noise on the analog signal is exceeded. A feature extraction portion of the analog section is triggered to extract sparse sound parameter information from the analog signal when the background noise is exceeded. An initial truncated portion of the sound parameter information is compared to a truncated sound parameter database stored locally with the sound recognition sensor to detect when there is a likelihood that the expected sound is being received in the analog signal. A trigger signal is generated to trigger classification logic when the likelihood that the expected sound is being received exceeds a threshold value.
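The staged triggering described above may be sketched in software as follows, for illustration only. The referenced publication describes an analog section; this digital sketch assumes per-frame signal levels as the sound parameter information and an assumed likelihood of 1 / (1 + mean absolute difference). All names, values, and the likelihood measure are hypothetical.

```python
from typing import Optional, Sequence


def find_onset(levels: Sequence[float], noise_floor: float) -> Optional[int]:
    """Return the index of the first frame whose level exceeds the noise floor."""
    for index, level in enumerate(levels):
        if level > noise_floor:
            return index
    return None


def likelihood(initial: Sequence[float], reference: Sequence[float]) -> float:
    """Assumed likelihood in (0, 1] based on mean absolute difference."""
    if not reference:
        raise ValueError("reference must not be empty")
    mean_abs_diff = sum(abs(a - b) for a, b in zip(initial, reference)) / len(reference)
    return 1.0 / (1.0 + mean_abs_diff)


def should_trigger(
    levels: Sequence[float],
    noise_floor: float,
    truncated_reference: Sequence[float],
    threshold: float,
) -> bool:
    """Trigger classification only if the initial portion after onset is a likely match."""
    onset = find_onset(levels, noise_floor)
    if onset is None:
        return False  # Background noise was never exceeded.
    initial = levels[onset:onset + len(truncated_reference)]
    if len(initial) < len(truncated_reference):
        return False  # Not enough frames yet to compare.
    return likelihood(initial, truncated_reference) > threshold


if __name__ == "__main__":
    print(should_trigger([0.1, 0.2, 5.0, 6.0, 1.0], 1.0, [5.0, 6.0], 0.9))
```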
By way of further background, U.S. Patent Application Publication No. 2015/0066495, published Mar. 5, 2015, commonly assigned herewith and incorporated herein by this reference, describes a low power sound recognition sensor configured to receive an analog signal that may contain a signature sound. In this sensor, sparse sound parameter information is extracted from the analog signal and compared to a sound parameter reference stored locally with the sound recognition sensor to detect when the signature sound is received in the analog signal. A portion of the sparse sound parameter information is differential zero crossing (ZC) counts. Differential ZC rate may be determined by measuring a number of times the analog signal crosses a threshold value during each of a sequence of time frames to form a sequence of ZC counts and taking a difference between selected pairs of ZC counts to form a sequence of differential ZC counts.
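The differential ZC computation described above may be sketched as follows. The sketch operates on digital samples for illustration, whereas the referenced sensor operates on the analog signal. It assumes non-overlapping frames and pairs each frame with the frame a fixed lag earlier; the selection of pairs, the default threshold of zero, and all names are assumptions.

```python
from typing import List, Sequence


def zc_count(frame: Sequence[float], threshold: float = 0.0) -> int:
    """Count the times consecutive samples fall on opposite sides of the threshold."""
    count = 0
    for previous, current in zip(frame, frame[1:]):
        if (previous > threshold) != (current > threshold):
            count += 1
    return count


def zc_counts(samples: Sequence[float], frame_len: int, threshold: float = 0.0) -> List[int]:
    """Form a sequence of ZC counts, one per non-overlapping frame.

    Crossings that span a frame boundary are not counted, and a
    trailing partial frame is dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    last_start = len(samples) - frame_len
    return [
        zc_count(samples[i:i + frame_len], threshold)
        for i in range(0, last_start + 1, frame_len)
    ]


def differential_zc(counts: Sequence[int], lag: int = 1) -> List[int]:
    """Take the difference between each ZC count and the count lag frames earlier."""
    if lag <= 0:
        raise ValueError("lag must be positive")
    return [counts[i] - counts[i - lag] for i in range(lag, len(counts))]


if __name__ == "__main__":
    samples = [1, -1, 1, -1, 1, 1, 1, 1, 1, -1, -1, -1]
    counts = zc_counts(samples, frame_len=4)
    print(counts, differential_zc(counts))
```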