Technical Field
This invention relates to speech recognition, and more particularly to an integrated sensor-array processor and method for use in various speech-enabled applications.
Background Information
Throughout this application, various publications, patents and published patent applications are referred to by an identifying citation. The disclosures of the publications, patents and published patent applications referenced in this application are hereby incorporated by reference into the present disclosure.
Sensor reception of signals originating in a 3D environment is often corrupted by noise and interference. For example, the signal from a microphone that acquires speech audio from a human speaker in a noisy room will contain noise and interference. The noise and interference often limit the usability of the audio signal for many applications such as automatic speech recognition (ASR). For example, it is well known that ASR success rates are very low (<20%) for voices that are distant from the microphone (>1 m) in rooms with high reverberation. The performance is worse when interference from other locations simultaneously adds to the microphone sensor input signals. Such interference can be generated by air conditioning vents on the floor or ceiling, a fireplace fan, a set of surround speakers with music or speech playback signals, or even other human speakers talking simultaneously. This problem also occurs in other domains such as sonar, radar, and ultrasonic sensing.
Using an array of sensors may improve reception when the sensor signals are filtered using a weighted sum, e.g., using weights (or coefficients) designed to amplify the target signal by weighting the time-delay differences of the signal arrival. Because the sensor locations are spatially separated, the time delays can be used to separate, and either amplify or reduce, signals coming from different directions. An ideal filter may be able to amplify signals coming from a target location and completely reject interference signals coming from other locations. However, those skilled in the art will recognize that ideal filters can never be realized in practice due to fundamental signal processing and physics principles that limit the ability to completely separate signals in space or time.
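By way of illustration only, the weighted-sum filtering described above may be sketched as a simple delay-and-sum filter. The sketch assumes integer-sample arrival delays that are already known for the target direction. The function name delay_and_sum, the three-sensor geometry, and the test signals are hypothetical and are not taken from this disclosure.

```python
def delay_and_sum(channels, delays, weights=None):
    """Weighted sum of sensor channels after integer-sample delay alignment.

    channels: one equal-length list of samples per sensor
    delays:   assumed-known arrival delay (in samples) of the target at each sensor
    weights:  per-sensor weights; defaults to a uniform average (an assumption)
    """
    n_sensors = len(channels)
    if weights is None:
        weights = [1.0 / n_sensors] * n_sensors
    n = len(channels[0])
    out = [0.0] * n
    for ch, d, w in zip(channels, delays, weights):
        for t in range(n):
            src = t + d  # advance each channel by its own arrival delay
            if 0 <= src < n:
                out[t] += w * ch[src]
    return out


# Hypothetical test signals: a target pulse reaches the sensors at
# samples 2, 3 and 4, while an interfering pulse from another direction
# reaches all three sensors at sample 6.
channels = [
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1, 0],
]
out = delay_and_sum(channels, delays=[0, 1, 2])
target_peak = out[2]
interference_peak = max(out[4:])
```

In this sketch the target pulses are aligned and add coherently, while the interfering pulses are spread over different output samples and are therefore reduced but not eliminated, consistent with the observation that an ideal filter cannot be realized.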
Methods to improve sensor array filters include using transformations (transforms) that convert time-domain signals into the frequency domain and allow specific filters for each frequency component of the sensor input signals. A frequency-domain sensor array filter will have a set of sensor weights (coefficients) for each frequency bin of the transform. This isolates signal behavior and provides the ability to individually apply and tune the filtering and other processing to the signal energy at that specific frequency. This is known to significantly improve the performance of filtering and of other types of processing. However, the complexity and computational cost of frequency-domain processing can be significantly higher than those of processing in the time domain. In particular, the additional latency of frequency-domain processing versus time-domain processing is significant. For example, the Fourier Transform, and one of its embodiments, the Fast Fourier Transform (FFT), can add more than 2N samples of latency, where N is the number of time samples in the block that the FFT transforms into complex frequency data values (complex referring to the real and imaginary components), and the Inverse FFT requires another N samples to convert back into the time domain. In contrast, the latency of a time-domain filter can be as low as 0 or 1 sample (but with lower filtering performance).
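By way of illustration only, a frequency-domain sensor array filter with one weight per sensor per frequency bin may be sketched as follows. The sketch uses a direct (non-fast) discrete Fourier transform for brevity and processes a single block of N samples with circular delays. The function names dft, idft and filter_block, the two-sensor signals, and the phase-only weights are hypothetical and are not taken from this disclosure.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform of one block of N samples."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part (assumes a real-valued output)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def filter_block(blocks, weights):
    """Apply weights[s][k] to sensor s at bin k, then sum across sensors."""
    n = len(blocks[0])
    spectra = [dft(b) for b in blocks]
    combined = [sum(weights[s][k] * spectra[s][k] for s in range(len(blocks)))
                for k in range(n)]
    return idft(combined)


# Hypothetical two-sensor block: the target pulse arrives 2 samples later
# at the second sensor than at the first.
N = 8
blocks = [
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
delays = [0, 2]
# Assumed weights: a per-bin phase shift that undoes each sensor's delay,
# scaled by 1/2 so that the aligned sum has unit gain.
weights = [[0.5 * cmath.exp(2j * cmath.pi * k * d / N) for k in range(N)]
           for d in delays]
y = filter_block(blocks, weights)
```

Because each bin carries its own weight, the magnitude and phase applied to each sensor could be tuned separately at every frequency. The phase-only weights here are merely the simplest choice.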
Latency can be reduced by taking the FFT at a faster frame rate, allowing overlap of the signals in the blocks. For example, an FFT taken every N/4 samples would have 25% new samples and 75% older samples in its transform result. This can lower latency to 2*N/4 samples, but the computation cost increases 4×. Furthermore, other processing that may be used to improve filtering, such as adaptive filtering, multichannel acoustic echo cancellation, and source localization, would all have to operate at this higher rate.
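By way of illustration only, the arithmetic of the overlap example above may be sketched as follows. The function name overlap_tradeoff and the block size of 1024 samples are hypothetical. Extending the 2*N/4 latency figure to other overlap factors is an assumption made for this sketch.

```python
def overlap_tradeoff(n, overlap_factor):
    """Frame-rate trade-off for an N-sample transform taken every N/overlap_factor samples."""
    hop = n // overlap_factor  # new samples per transform
    return {
        "hop_samples": hop,
        "new_fraction": hop / n,          # share of each block that is new
        "latency_samples": 2 * hop,       # assumed generalization of the 2*N/4 figure
        "cost_multiplier": n // hop,      # transforms per N samples vs. no overlap
    }


# Hypothetical block size N = 1024 with a transform every N/4 samples.
result = overlap_tradeoff(1024, 4)
```

For these assumed values the sketch gives a hop of 256 samples, 25% new samples per block, a latency of 512 samples, and a 4× computation cost.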
The FFT example also illustrates a problem with uniform frequency spacing: every transform has N bins, meaning that the frequency resolution is the input sample rate divided by N. For many applications that require high resolution at some frequencies (i.e., 1024 to 16K), a particularly large computation cost is incurred when oversampling frame rates.
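By way of illustration only, the relationship between N, the uniform bin spacing, and the cost of an oversampled frame rate may be sketched as follows. The 16 kHz sample rate is an assumed value, the function names are hypothetical, and N*log2(N) is used only as a rough proxy for the cost of one FFT.

```python
import math


def bin_spacing_hz(sample_rate_hz, n):
    """Uniform frequency spacing of an N-bin transform: sample rate / N."""
    return sample_rate_hz / n


def fft_ops_per_second(sample_rate_hz, n, overlap_factor=1):
    """Rough cost proxy: transforms per second times N*log2(N) per transform."""
    hop = n / overlap_factor
    frames_per_second = sample_rate_hz / hop
    return frames_per_second * n * math.log2(n)


fs = 16000  # assumed sample rate in Hz, for illustration only
spacing_1k = bin_spacing_hz(fs, 1024)
spacing_16k = bin_spacing_hz(fs, 16384)
cost_ratio = fft_ops_per_second(fs, 1024, 4) / fft_ops_per_second(fs, 1024, 1)
```

Under these assumptions, N = 1024 gives a spacing of 15.625 Hz and N = 16384 gives a spacing of about 0.98 Hz at every frequency, whether or not that resolution is needed there. Oversampling the frame rate by 4 multiplies the cost proxy by 4.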
Accordingly, it would be advantageous to use more efficient and flexible transforms that allow non-uniform frequency spacing and frame rates across the frequency bins (referred to hereinbelow as “transform bins”). Furthermore, it would be advantageous to use a transform approach that reduces the computation cost of implementation in FPGA hardware, ASIC hardware, embedded DSP firmware, and/or software, including when higher frame rates and non-uniform frequency spacings are used. This may enable flexibility to tune the resolution using higher or lower frequency spacings where needed. This may also lead to a sensor array processing solution with relatively low latency while maintaining the advantages of transform-domain processing. The resulting improvements in transform-domain processing efficiency may enable other processing to be integrated more closely with the filtering to enhance performance while maintaining relatively low system latency.