This application, and the innovations and related subject matter disclosed herein, (collectively referred to as the “disclosure”) generally concern digital signal processing techniques and digital signal processors (DSPs) implementing such techniques. More particularly but not exclusively, this disclosure pertains to speech enhancers and speech enhancement techniques for improving speech components in an observed signal, speech recognition techniques, DSPs implementing such enhancement and/or recognition techniques, and systems incorporating such speech enhancers, speech recognition techniques and/or speech enhancement techniques. As but one particular example, objective measures of perceived speech quality can be used for automatically tuning a speech enhancement system applied to an automatically generated speech database. Such a speech enhancement system can be automatically tuned over a substantial number (e.g., thousands or tens of thousands) of combinations of operating conditions (e.g., noise levels and types, full-duplex speech patterns, room impulse response), making disclosed speech enhancers, techniques, and related systems suitable for use in a variety of real-world applications involving full-duplex communications. By contrast, conventional tuning approaches using expert listeners cannot, as a practical matter, be based on such large numbers of combinations of operating conditions given the time and effort required by manual tuning. Consequently, disclosed enhancers, enhancement techniques, and systems can save substantial resources over manual tuning procedures, and can speed development and deployment cycles.
Selecting the parameters of a single-microphone speech enhancement system for hands-free devices can be formulated as a large-scale nonlinear programming problem, allowing the parameters to be selected automatically. A conversational speech database can be automatically generated by modeling interactivity in telephone conversations, and perceptual objective quality measures can be used as optimization criteria for the automated tuning over the generated database. Objective tests can be performed by comparing the automatically tuned system based on objective criteria to a system tuned by expert human listeners. Evaluation results show that disclosed tuning techniques greatly improve enhanced speech quality, potentially saving resources over manual evaluation, speeding up development and deployment time, and guiding the speech enhancer design. A speech enhancement system tuned according to disclosed techniques can improve a perceived quality of a variety of speech signals across computing environments having different computational capabilities or limitations. Speech recognizers and digital signal processors based on such speech enhancement systems are also disclosed, together with related acoustic (e.g., communication) systems.
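The automated tuning described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the two-parameter `toy_enhancer`, the `toy_quality` score (negative mean squared error standing in for a perceptual objective quality measure), the synthetic `make_database` generator, and the plain random search in `tune` (standing in for whatever nonlinear programming solver is actually used).

```python
import random


def toy_enhancer(noisy, params):
    """Hypothetical stand-in for a speech enhancer with two tunable parameters."""
    gain, floor = params
    return [gain * (x - floor) for x in noisy]


def toy_quality(enhanced, clean):
    """Hypothetical objective score: negative mean squared error (higher is better)."""
    return -sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)


def make_database(num_scenarios, length, rng):
    """Generate synthetic (noisy, clean) pairs, one per simulated operating condition."""
    database = []
    for _ in range(num_scenarios):
        clean = [rng.uniform(-1.0, 1.0) for _ in range(length)]
        noise_level = rng.uniform(0.01, 0.1)  # each scenario gets its own noise level
        noisy = [0.5 * c + 0.3 + rng.gauss(0.0, noise_level) for c in clean]
        database.append((noisy, clean))
    return database


def mean_quality(params, database):
    """Average the objective score over every scenario in the database."""
    scores = [toy_quality(toy_enhancer(noisy, params), clean)
              for noisy, clean in database]
    return sum(scores) / len(scores)


def tune(database, initial, bounds, num_trials, rng):
    """Random search: keep the candidate with the best mean score, starting from `initial`."""
    best_params, best_score = initial, mean_quality(initial, database)
    for _ in range(num_trials):
        candidate = tuple(rng.uniform(lo, hi) for lo, hi in bounds)
        score = mean_quality(candidate, database)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


rng = random.Random(0)
database = make_database(num_scenarios=20, length=50, rng=rng)
initial = (1.0, 0.0)            # hypothetical hand-set starting point
bounds = [(0.0, 4.0), (0.0, 1.0)]  # hypothetical search range per parameter
best_params, best_score = tune(database, initial, bounds, num_trials=200, rng=rng)
```

Because the search starts from the hand-set parameters, the tuned score can never fall below the starting score on the same database. A practical system would replace the toy score with a perceptual objective measure and the random search with a solver suited to large parameter counts.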
Speech enhancers (SE) can serve as a preprocessing stage for a variety of different speech-centric applications, for example, mobile communication, speech recognition, and hearing aids. Speech enhancement can have a fundamental role in extending the usage of such devices to scenarios with severe acoustical disturbances. Given the substantial variety of applications and use scenarios, it is often impractical to design speech enhancers capable of covering all possible interferences. Thus, finding suitable values for the parameters associated with the speech enhancement system to fit a given scenario becomes a central aspect of the proper deployment of speech-centric applications in the real world. Conventional tuning procedures have relied on subjective listening tests. Although well-trained ears may remain a reliable approach for measuring perceived quality of a system, relying on manual tuning of speech enhancers can be very time-consuming and resource-intensive, commonly taking longer than the design and implementation phases associated with new speech enhancers. Further, the human component in conventional tuning procedures makes them error-prone and bound to cover only a relatively small number of the scenarios expected in use.
Automatic tuning of a speech enhancement system using measures such as word error rate or a perceptual objective quality measure can efficiently find optimized parameters for a given system instead of relying on a human expert for hand-tuning. However, when the speech enhancement system needs to be deployed on a target platform, the computational power of the platform is often limited. Past automatic tuning methods often have not taken this limitation of the target platform into consideration.
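One way to state the platform limitation just described is as a constrained optimization problem. The formulation below is only a sketch, and all of its symbols are assumptions introduced here rather than notation from the disclosure: $\mathbf{p}$ is the parameter vector, $y_i$ and $s_i$ are the observed and reference signals for the $i$-th of $N$ database entries, $\mathrm{SE}(\cdot)$ is the speech enhancer, $Q(\cdot)$ is the chosen objective measure (oriented so that higher is better), $C(\cdot)$ is a computational cost, $C_{\max}$ is the budget of the target platform, and $\mathbf{L}$ and $\mathbf{U}$ are parameter bounds.

```latex
\begin{aligned}
\hat{\mathbf{p}} \;=\; \arg\max_{\mathbf{p}} \quad
  & \frac{1}{N} \sum_{i=1}^{N} Q\bigl(\mathrm{SE}(y_i;\, \mathbf{p}),\, s_i\bigr) \\
\text{subject to} \quad
  & C(\mathbf{p}) \le C_{\max}, \\
  & \mathbf{L} \le \mathbf{p} \le \mathbf{U}.
\end{aligned}
```

Tuning without regard to the platform corresponds to dropping the first constraint, which can yield parameters that score well but cannot run in real time on the target.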
As but one particular example, speech recognition techniques for distant-talking control of music playback devices are disclosed, together with related DSPs and associated systems.
The human interaction paradigm with music playback devices has seen a dramatic shift with their increased portability and miniaturization. Well-established interaction media like remote controls are no longer an option, and new solutions are needed. Automatic speech recognition (ASR) interfaces offer a natural solution to this problem, considering also the hands-busy, mobility-required scenarios where these devices are typically used. These scenarios make the ASR technology embedded in these small devices particularly exposed to highly challenging conditions, due to the music playback itself, environmental noise, and general environmental acoustics, e.g., reverberation. In particular, the level of degradation in the input signal, and the consequent drop in ASR performance, can be very significant when the distance between user and microphone increases. In the past decade, the literature on distant-talking speech interfaces has suggested several solutions to the problem, e.g., the DICIT project. However, to the inventors' knowledge, the only available solutions to this problem rely heavily on microphone arrays having a plurality of microphones spaced apart from the loudspeakers to provide a relatively high signal-to-echo ratio, making their application infeasible in portable loudspeakers and other commercial applications.
Therefore, there remains a need for improved signal processing techniques to enhance speech. In particular, there remains a need for speech enhancers that can be tuned automatically. There also remains a need for objective measures of perceived sound quality as it relates to speech, and there remains a need for automatically generated databases of conversational speech. And, a need remains for digital signal processors implementing automatically tunable speech enhancers. There further remains a need for telephony systems, e.g., speaker phones, having such digital signal processors. A portable stereo speaker having a built-in microphone is but one example of many possible speaker phones. As well, a need remains for an optimization framework to tune a speech enhancement system to maximize the system's performance while constraining its computational complexity in accordance with one or more selected target platforms. There further remains a need for speech enhancement systems which can be tuned either as a speech recognizer front-end or as a full-duplex telephony system. In addition, a need exists for techniques to solve such a nonlinear optimization problem. Further, a need exists for such constrained optimization systems suitable for real-time implementation on one or more selected target platforms. As well, there remains a need for speech recognition techniques suitable for single-microphone systems. There also remains a need for speech-recognition techniques suitable for such systems having the microphone positioned in close proximity to one or more loudspeakers. Further, there remains a need for speech recognition techniques suitable for controlling playback of music on such systems. And, a need exists for DSPs and associated systems configured to implement such speech recognition techniques.