(1) Field of the Invention
The invention relates generally to means and methods of providing clear, high quality voice with a high signal-to-noise ratio to a speech recognition engine to improve its efficiency. In particular, the invention relates to means and methods for developing an adaptive noise reduction scheme that reduces background noise in the front end to improve the performance of a speech recognition engine.
Linguists, scientists and engineers have endeavored for many years to construct machines that can recognize human speech. Although in recent years this goal has begun to be realized in certain aspects, currently available systems have not been able to produce results that even closely emulate human performance. This inability to provide satisfactory speech recognition is primarily due to difficulties that are involved in extracting and identifying the individual sounds that make up human speech. These difficulties are exacerbated in noisy environments. Simplistically, speech may be considered as a sequence of sounds taken from a set of forty or so basic sounds called “phonemes”. Different sounds, or phonemes, are produced by varying the shape of the vocal tract through muscular control of the speech articulators (lips, tongue, jaw, etc.). A stream of a particular set of phonemes will collectively represent a word or a phrase. Thus, extraction of the particular phonemes contained within a speech signal is necessary to achieve voice recognition. This task becomes extremely difficult in noisy environments.
Most speech recognition systems use the Hidden Markov Model technique to recognize speech. Markov model speech pattern templates are formed for speech analysis systems by analyzing identified speech patterns to generate frame sequences of acoustic feature signals representative thereof. The speech pattern template is produced by iteratively generating succeeding Markov model signal sets starting with an initial Markov model signal set. In each iteration, the current Markov model signal set is compared with the preceding Markov model signal set to generate a signal corresponding to the similarity therebetween. The iterations are terminated when said similarity signal is equal to or smaller than a predetermined value, and the last formed Markov model signal set is taken as the template.
A speech recognition system recognizes a speech uttered as one of a plurality of stored reference patterns. In a known speech recognition system, a speech uttered is converted into an input speech signal by an electromechanical transducer such as a microphone. The input speech signal is analyzed by a pattern analyzer and is converted into a digital input pattern signal. The input pattern signal is memorized in an input memory as a memorized pattern. The memorized pattern is compared with each of the stored reference patterns in a reference memory, and a dissimilarity is produced therebetween. When a particular one of the reference patterns provides the minimum dissimilarity, the speech uttered is recognized as that particular reference pattern. Alternatively, when a specific one of the reference patterns provides a specific dissimilarity smaller than a predetermined threshold, the speech uttered is recognized as the specific reference pattern.
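The reference-pattern comparison described above can be illustrated with a minimal sketch. It is not taken from any particular recognizer: the Euclidean distance used as the dissimilarity measure, the fixed-length feature vectors, and the names `dissimilarity`, `recognize`, and `references` are all assumptions made for illustration. The sketch combines the two decision rules by selecting the minimum-dissimilarity pattern and, when a threshold is supplied, rejecting the utterance if even the best match is not below that threshold.

```python
import math

def dissimilarity(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(input_pattern, references, threshold=None):
    """Return the label of the reference pattern with minimum dissimilarity.

    If a threshold is given, return None when the best dissimilarity is
    not smaller than the threshold (the utterance is rejected).
    """
    best_label, best_score = None, math.inf
    for label, ref in references.items():
        score = dissimilarity(input_pattern, ref)
        if score < best_score:
            best_label, best_score = label, score
    if threshold is not None and best_score >= threshold:
        return None
    return best_label

# Hypothetical stored reference patterns (three features each).
references = {"call": [1.0, 0.0, 0.5], "home": [0.0, 1.0, 0.5]}

print(recognize([0.9, 0.1, 0.5], references))                  # close to "call"
print(recognize([5.0, 5.0, 5.0], references, threshold=1.0))   # far from both
```

Noise added to the input pattern increases its dissimilarity to every reference pattern, which is why a noisy input can be matched to the wrong pattern or rejected outright.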
In the actual recognition operation, the input speech signal is accompanied by noise due to the presence of background sounds. The input speech signal and the noise are collectively referred to as an input sound signal. Accordingly, the input pattern signal includes a noise component. In the worst case, this results in a failure to recognize the speech.
The present invention relates to a system for reducing the noise accompanying the speech uttered. It also relates to means and methods of providing clear, high quality voice with a high signal-to-noise ratio, in voice communication systems, devices, telephones, and methods, and more specifically, to systems, devices, and methods that automate control in order to correct for variable environment noise levels and reduce or cancel the environment noise prior to sending the voice communication over VoIP communication links.
As the popularity of VoIP communication systems increases, many users utilize them in a variety of environments. An increasingly popular trend is to equip mobile terminals with an external microphone and speaker, allowing for “hands-free” operation. In addition, it is known to include a speech recognition device, so that the user can, for example, place a call by speaking the voice command “Call home”. While speech recognition technology is increasingly sophisticated, a clear separation of the voice component of an audio signal from noise components, i.e., a high Signal-to-Noise Ratio (SNR), is required for acceptable levels of accuracy in the speech recognition task. However, moving the microphone away from the speaker's mouth, where it sits in a hand held unit, introduces significant noise into the audio input signal. Thus, a noise reduction operation must be performed on the audio signal prior to speech recognition to obtain satisfactory results.
(2) The Related Art
Voice communication devices such as Voice over Internet Packets/Protocols telephones and other communication devices have become ubiquitous; they show up in almost every environment. People now use, or attempt to use, such devices in a myriad of moderately noisy to excessively noisy environments such as airports, restaurants, bars, sporting events, movies, and concerts. The use of voice communication devices in noisy environments has led to difficulty for listeners in discerning a voice signal and has diminished network capacities as signal-to-noise ratios are lowered.
These systems and devices and their associated communication methods are referred to by a variety of names, such as but not limited to, voice over packets, or voice over Internet protocol or voice over Internet packets (VoIP), IP telephony, Internet telephony, and sometimes Digital IP phone.
These systems are used at home, office, inside a car, a train, at the airport, beach, restaurants and bars, on the street, and almost any other venue. As might be expected, these diverse environments have relatively higher and lower levels of background, ambient, or environmental noise. For example, there is generally less noise in a quiet home than there is in a crowded bar. If this noise, at sufficient levels, is picked up by the microphone, the intended voice communication degrades and though possibly not known to the users of the communication device, uses up more bandwidth or network capacity than is necessary, especially during non-speech segments in a two-way conversation when a user is not speaking.
Voice over Internet protocol routes voice conversations over the Internet or any other Internet Protocol (IP)-based network. The voice data flows over a general-purpose packet-switched network, instead of traditional dedicated, circuit-switched voice transmission lines. The protocols used to carry voice signals over the IP network are commonly referred to as Voice over IP or VoIP protocols. Voice over IP traffic might be deployed on any IP network, including for example, networks lacking a connection to the rest of the Internet, such as for instance on a private building-wide LAN.
The four most common quality issues affecting VoIP networks are latency, jitter, packet loss, and choppy, unintelligible speech.
Latency generally refers to the time that the data of a phone call takes to reach the service provider, which grows with the physical distance the call must travel. When a phone call is made with VoIP, the signal is cut into thousands of little pieces, called packets, and then sent through the Internet to the service provider. These packets travel so fast that the process of sending them and reassembling them for the phone at the other end of the conversation generally takes milliseconds.
Usually, most users are not affected by latency with their VoIP providers. If the roundtrip travel time of the packet takes more than 250 milliseconds the quality of the communication may experience some issues due to latency. Most commonly, this occurs when trying to make international calls. Latency can occur in both VoIP and traditional phone systems. Of course, a variety of other factors, including congestion, can add to the overall latency of a packet.
Many VoIP providers have established multiple hosts to reduce latency and provide a quick connection from any location. One of the benefits of using VoIP over traditional phone systems is that internet speed is constantly increasing, helping to keep latency down. Additionally, many VoIP companies provide service centers located in specific areas to ensure latency is low, regardless of your location.
When packets are received with a timing variation from when they were sent, a quality issue known as jitter may be noticed. When jitter occurs, participants on the call will notice a delay in the phone conversation. Many VoIP providers reduce or eliminate jitter by controlling for jitter and timing issues within their networking equipment. Although the overall delay affects the quality of a voice call, another key consideration is the difference between when packets are expected to arrive and when they actually arrive, which is the quantity that the term “jitter” describes. While it may not make a big difference if traditional data packets are received with timing variations between packets, such variations can seriously impact the quality of a voice conversation, where timing is everything. To compensate for the fact that voice packets can be received with variable rather than constant timing, VoIP endpoints implement what is known as a “dejitter buffer”, which changes the variable delay back to the expected constant delay.
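The effect of a dejitter buffer can be sketched in a few lines. The following is an illustrative fixed-delay buffer only, not the design of any particular VoIP endpoint; the function name `playout_schedule`, the 20 ms frame interval, and the 60 ms buffer depth are assumptions. Each packet is assigned a playout time on a constant grid, and a packet that arrives after its playout time is treated as lost.

```python
def playout_schedule(arrivals, frame_ms=20.0, buffer_ms=60.0):
    """Map variable arrival times onto a constant playout grid.

    arrivals: list of (sequence_number, arrival_time_ms) in arrival order.
    Returns (schedule, late) where schedule maps sequence number to its
    playout time and late lists packets that missed their playout time.
    """
    if not arrivals:
        return {}, []
    first_seq, first_arrival = arrivals[0]
    # Playout of the first packet is delayed by the buffer depth.
    base = first_arrival + buffer_ms
    schedule, late = {}, []
    for seq, arrival_ms in arrivals:
        play_at = base + (seq - first_seq) * frame_ms
        if arrival_ms <= play_at:
            schedule[seq] = play_at   # played at a constant 20 ms spacing
        else:
            late.append(seq)          # arrived too late to be played
    return schedule, late

# Hypothetical arrival times: packet 3 is delayed far beyond the buffer depth.
arrivals = [(0, 100.0), (1, 135.0), (2, 142.0), (3, 250.0)]
schedule, late = playout_schedule(arrivals)
print(schedule)  # {0: 160.0, 1: 180.0, 2: 200.0}
print(late)      # [3]
```

The buffer depth is a trade-off: a deeper buffer absorbs larger timing variations but adds the same amount to the overall delay of the call.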
Jitter is a variation in packet transit delay caused by queuing, contention and serialization effects on the path through the network. In general, higher levels of jitter are more likely to occur on either slow or heavily congested links. In order to facilitate later discussion we will define several types of jitter. Type A—constant jitter. This is a roughly constant level of packet to packet delay variation. Type B—transient jitter. This is characterized by a substantial incremental delay that may be incurred by a single packet. Type C—short term delay variation. This is characterized by an increase in delay that persists for some number of packets, and may be accompanied by an increase in packet to packet delay variation. Type C jitter is commonly associated with congestion and route changes.
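Packet to packet delay variation can be computed directly from send and receive timestamps. The sketch below is an illustration under stated assumptions: the timestamps are hypothetical, both lists are assumed to have the same length and order, and the running estimate uses a gain of 1/16, the smoothing used by the interarrival jitter estimator in the RTP specification (RFC 3550). The function names are chosen for illustration.

```python
def delay_variations(send_ms, recv_ms):
    """Absolute change in transit delay between consecutive packets."""
    transit = [r - s for s, r in zip(send_ms, recv_ms)]
    return [abs(b - a) for a, b in zip(transit, transit[1:])]

def smoothed_jitter(variations, gain=1 / 16):
    """Running jitter estimate: move a fraction `gain` toward each new sample."""
    jitter = 0.0
    for d in variations:
        jitter += (d - jitter) * gain
    return jitter

# Hypothetical timestamps: packets sent every 20 ms, third packet delayed.
send = [0, 20, 40, 60, 80]
recv = [50, 70, 130, 112, 131]

variations = delay_variations(send, recv)
print(variations)                  # [0, 40, 38, 1]
print(smoothed_jitter(variations))
```

In this example a single packet incurs a large incremental delay, which corresponds to the transient (Type B) case; a sustained run of elevated transit delays would instead correspond to the Type C case.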
In VoIP systems, Packet Loss can take place when a large amount of network traffic hits the same Internet connection. When talking on a VoIP system, Packet Loss can be identified with an echo or tin-like sound. Packet Loss is most commonly measured in percentages. For VoIP use, packet loss should not exceed 1%. A one percent packet loss will result in a skip or clipping approximately once every three minutes.
In modern VoIP environments, the speech is superposed with different levels of background noise. If the SNR is 6 dB, roughly 20% of the energy of the transmitted signal is noise. This results in choppy, unintelligible speech.
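The noise share of the transmitted energy at a given SNR can be checked numerically. The sketch assumes the conventional power-ratio definition of SNR in decibels, SNR = 10 log10(Ps/Pn), and assumes that speech and noise are uncorrelated so that their energies add; the function name is chosen for illustration.

```python
def noise_energy_fraction(snr_db):
    """Fraction of total energy that is noise, for an SNR given in dB.

    Assumes SNR_dB = 10 * log10(P_speech / P_noise) and that the total
    energy is P_speech + P_noise (uncorrelated speech and noise).
    """
    power_ratio = 10 ** (snr_db / 10)   # P_speech / P_noise
    return 1 / (1 + power_ratio)

for snr_db in (0, 6, 20):
    print(snr_db, round(noise_energy_fraction(snr_db), 3))
```

Under these assumptions, an SNR of 0 dB means that half of the energy is noise, an SNR of 6 dB corresponds to a noise share of about one fifth, and an SNR of 20 dB to about one percent.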
Significantly, in an on-going VoIP phone call or other communication from an environment having relatively higher environmental noise, it is sometimes difficult for the party at the other end of the conversation to hear what the party in the noisy environment is saying. That is, the ambient or environmental noise often “drowns out” the voice of the VoIP or wireline telephone user, so that the other party cannot hear what is being said or, even when the volume is sufficient, cannot understand the speech. This problem may exist even when the conversation uses a high data rate on the communication network.
Attempts to solve this problem have largely been unsuccessful. Both single microphone and two microphone approaches have been attempted. U.S. Pat. No. 7,242,765 granted to Hairston describes headset cellular telephones for voice dialing and controlling other aspects of the cell phones in an ambient noise environment, but does not deal with the cancellation of the ambient noise in VoIP environments.
U.S. Pat. No. 6,937,980 to Krasny et al describes noise cancellation for a speech recognition engine but uses a microphone array, which is difficult to implement in a VoIP phone.
U.S. Pat. No. 6,415,034 to Hietanen et al describes the use of a second background noise microphone located within an earphone unit or behind an ear capsule. Digital signal processing is used to create a noise canceling signal which enters the speech microphone. Unfortunately, the effectiveness of the method disclosed in the Hietanen et al patent is compromised by acoustical leakage, where the ambient or environmental noise leaks past the ear capsule and into the speech microphone. The Hietanen et al patent also relies upon complex, expensive, and power-consuming digital circuitry that may generally not be suitable for small portable battery powered devices such as pocketable cellular telephones.
Another example is U.S. Pat. No. 5,969,838 (the “Paritsky patent”), which discloses a noise reduction system utilizing two fiber optic microphones placed side by side. Unfortunately, the Paritsky patent discloses a system using light guides and other relatively expensive and/or fragile components not suitable for the rigors of VoIP phones and other VoIP devices. Neither Paritsky nor Hietanen addresses the need to increase capacity in VoIP phone-based communication systems.
U.S. Pat. No. 5,406,622 to Silverberg et al uses two adaptive filters, one driven by the handset transmitter to subtract speech from a reference value to produce an enhanced reference signal; and a second adaptive filter driven by the enhanced reference signal to subtract noise from the transmitter. Silverberg et al require accurate detection of speech and non-speech regions. Any incorrect detection will degrade the performance of the system.
Previous approaches in noise cancellation have included passive expander circuits used in the electret-type telephonic microphone. These, however, suppress only low level noise occurring during periods when speech is not present. Passive noise-canceling microphones are also used to reduce background noise. These have a tendency to attenuate and distort the speech signal when the microphone is not in close proximity to the user's mouth; and further are typically effective only in a frequency range up to about 1 kHz.
Active noise-cancellation circuitry to reduce background noise has been suggested which employs a noise-detecting reference microphone and adaptive cancellation circuitry to generate a continuous replica of the background noise signal that is subtracted from the total background noise signal before it enters the network. Most such arrangements are still not effective. They are susceptible to cancellation degradation because of a lack of coherence between the noise signal received by the reference microphone and the noise signal impinging on the transmit microphone. Their performance also varies depending on the directionality of the noise; and they also tend to attenuate or distort the speech.
Known frequency domain noise reduction techniques often introduce significant artifacts and aberrations into the speech audio component, making the speech recognition task more difficult. Hence there is a need in the art for a method of noise reduction or cancellation that is robust, suitable for VoIP use, and inexpensive to manufacture. The increased traffic in VoIP based communication systems has created a need in the art for means to provide a clear, high quality signal with a high signal-to-noise ratio.
There are several methods for performing noise reduction, but all can be categorized as types of filtering. In the related art, speech and noise are mixed into one signal channel, where they reside in the same frequency band and may have similar correlation properties. Consequently, filtering will inevitably have an effect on both the speech signal and the background noise signal. Distinguishing between voice and background noise signals is a challenging task. Speech components may be perceived as noise components and may be suppressed or filtered along with the noise components.
Even with the availability of modern signal-processing techniques, a study of single-channel systems shows that significant improvements in SNR are not obtained using a single channel or a one microphone approach. Surprisingly, most noise reduction techniques use a single microphone system and suffer from the shortcoming discussed above.
One way to overcome the limitations of a single microphone system is to use multiple microphones where one microphone may be closer to the speech signal than the other microphone. Exploiting the spatial information available from multiple microphones has led to substantial improvements in voice clarity or SNR in multi-channel systems. However, the current multi-channel systems use separate front-end circuitry for each microphone, and thus increase hardware expense and power consumption.
The two microphone solution provides a new means and methods of increasing SNR in hand-held devices that capture sound with multiple microphones but use the circuitry or hardware of a single channel system. Adaptive noise cancellation is one such powerful speech enhancement technique based on the availability of an auxiliary channel, known as reference path, where a correlated sample or reference of the contaminating noise is present. This reference input is filtered following an adaptive algorithm, in order to subtract the output of this filtering process from the main path, where noisy speech is present.
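The adaptive noise cancellation arrangement described above, in which a filtered reference input is subtracted from the main path, can be sketched with a textbook least-mean-squares (LMS) filter. This is a generic illustration, not the specific algorithm of any system discussed here: the LMS update rule, the filter length, the step size `mu`, and the synthetic signals are all assumptions. The reference input is assumed to contain noise only, with no speech leaking into it.

```python
import math
import random

def lms_cancel(primary, reference, taps=8, mu=0.01):
    """Subtract an adaptively filtered reference from the primary input."""
    weights = [0.0] * taps
    history = [0.0] * taps            # most recent reference samples
    output = []
    for d, x in zip(primary, reference):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - noise_estimate        # enhanced speech sample
        # LMS update: step the weights along the error gradient.
        weights = [w + 2 * mu * e * h for w, h in zip(weights, history)]
        output.append(e)
    return output

def mean_square(xs):
    return sum(x * x for x in xs) / len(xs)

# Synthetic test signals (hypothetical): a tone stands in for speech.
rng = random.Random(0)
n = 4000
speech = [0.5 * math.sin(2 * math.pi * 0.01 * k) for k in range(n)]
ref = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Noise reaching the main path is a filtered copy of the reference noise.
noise = [0.8 * ref[k] + (0.3 * ref[k - 1] if k > 0 else 0.0) for k in range(n)]
primary = [s + v for s, v in zip(speech, noise)]

enhanced = lms_cancel(primary, ref)

half = n // 2                         # measure after the filter has adapted
noise_before = mean_square(noise[half:])
noise_after = mean_square([e - s for e, s in zip(enhanced[half:], speech[half:])])
print(noise_before, noise_after)
```

Because the update is driven by the output itself, any speech that is correlated with the reference input would also be partly cancelled, which is the limitation that arises when the reference microphone picks up the speaker's voice.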
As with any system, two microphone systems also suffer from several shortfalls. The first shortfall is that, in certain instances, the available reference input to an adaptive noise canceller may contain low-level signal components in addition to the usual correlated and uncorrelated noise components. These signal components will cause some cancellation of the primary input signal. The maximum signal-to-noise ratio obtained at the output of such a noise cancellation system is equal to the noise-to-signal ratio present on the reference input.
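The stated relationship between the reference input and the best achievable output can be made concrete with a short calculation. The sketch simply restates that relationship in code; the function names and the example power values are hypothetical.

```python
import math

def to_db(power_ratio):
    """Convert a power ratio to decibels."""
    return 10 * math.log10(power_ratio)

def max_output_snr(ref_signal_power, ref_noise_power):
    """Upper bound on output SNR: the noise-to-signal ratio at the reference input."""
    return ref_noise_power / ref_signal_power

# Speech leaking into the reference at 1/100 of the noise power.
print(to_db(max_output_snr(1.0, 100.0)))   # 20.0 dB
# More leakage: speech at 1/10 of the noise power.
print(to_db(max_output_snr(1.0, 10.0)))    # 10.0 dB
```

In decibel terms the bound is the negative of the SNR at the reference input, so every additional decibel of speech picked up by the reference microphone lowers the attainable output SNR by one decibel.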
The second shortfall is that, for a practical system, both microphones should be worn on the body. This reduces the extent to which the reference microphone can be used to pick up the noise signal alone. That is, the reference input will contain both signal and noise. Any decrease in the noise-to-signal ratio at the reference input will reduce the signal-to-noise ratio at the output of the system. The third shortfall is that an increase in the number of noise sources or in room reverberation will reduce the effectiveness of the noise reduction system.