1. Field of the Invention
The present invention relates to an apparatus and a method for processing voice signals, and more particularly to such an apparatus and a method applicable to, for example, telecommunications devices and software treating voice signals for use in, e.g. telephones or teleconference systems.
2. Description of the Background Art
One available noise suppression scheme is the voice switch. It is based upon targeted voice section detection, in which the temporal sections of an input signal in which a targeted speaker is talking, i.e. “targeted voice sections”, are determined. Signals in targeted voice sections are output as they are, while signals in temporal sections other than targeted voice sections, i.e. “untargeted voice sections”, are attenuated. For example, when an input signal is received, a decision is made on whether or not the signal is in a targeted voice section. If the input signal is in a targeted voice section, then the gain is set to 1.0. Otherwise, the gain is set to an arbitrary positive value less than 1.0, and the input signal is multiplied by the gain so as to be attenuated into a corresponding output signal.
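By way of illustration only, the gain control described above may be sketched as follows. The function name voice_switch, the flag is_targeted and the default attenuation gain of 0.5 are assumptions introduced for this sketch and do not appear in the prior art described here. The decision on whether a frame belongs to a targeted voice section is assumed to be made elsewhere.

```python
def voice_switch(frame, is_targeted, attenuation_gain=0.5):
    """Pass a frame through in a targeted voice section, attenuate it otherwise.

    frame: sequence of samples (floats).
    is_targeted: result of the targeted voice section decision (made elsewhere).
    attenuation_gain: hypothetical value; any positive value below 1.0 qualifies.
    """
    if not 0.0 < attenuation_gain < 1.0:
        raise ValueError("attenuation_gain must be positive and less than 1.0")
    # Gain is 1.0 in a targeted voice section, otherwise the attenuation gain.
    gain = 1.0 if is_targeted else attenuation_gain
    return [gain * sample for sample in frame]
```

In this sketch a frame in an untargeted voice section is scaled down rather than muted, which corresponds to a gain that is positive but less than 1.0.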
Another available noise suppression scheme is the Wiener filter approach, which is disclosed in U.S. patent application publication No. US 2009/0012783 A1 to Klein. According to Klein, background noise components contained in input signals are suppressed as follows. Untargeted voice sections are determined, and noise characteristics are estimated from those sections for the respective frequencies. Wiener filter coefficients are then calculated, or estimated, based on the noise characteristics, and the input signal is multiplied by the Wiener filter coefficients.
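By way of illustration only, a per-frequency noise estimate and the resulting filter coefficients may be sketched as follows. This is a generic textbook form of a Wiener-type gain and is not asserted to be the method of Klein. The function names, the smoothing factor and the floor value are assumptions introduced for this sketch. The inputs are assumed to be per-frequency power values of one frame.

```python
def update_noise_power(noise_power, frame_power, smoothing=0.9):
    """Recursively average per-frequency power of a frame used for noise estimation.

    Intended to be called only for frames judged to contain noise.
    smoothing is a hypothetical parameter in the range 0.0 to 1.0.
    """
    return [smoothing * n + (1.0 - smoothing) * p
            for n, p in zip(noise_power, frame_power)]


def wiener_coefficients(frame_power, noise_power, floor=0.0):
    """Per-frequency gain: estimated clean power divided by observed power."""
    coefficients = []
    for p, n in zip(frame_power, noise_power):
        gain = (p - n) / p if p > 0.0 else floor
        # Clamp to [floor, 1.0] so that the filter never amplifies.
        coefficients.append(min(1.0, max(floor, gain)))
    return coefficients


def apply_coefficients(spectrum, coefficients):
    """Multiply each frequency component of the input by its coefficient."""
    return [c * x for c, x in zip(coefficients, spectrum)]
```

A frequency at which the observed power barely exceeds the estimated noise power receives a coefficient near zero, while a frequency dominated by signal receives a coefficient near 1.0.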
The voice switch and the Wiener filter can be applied to a voice signal processor for use in, e.g. a video conference system or a mobile phone system, to suppress noise to enhance the quality of voice communication.
In order to apply the voice switch and the Wiener filter, it is necessary to distinguish targeted voice sections from untargeted voice sections, which may include “disturbing voice” uttered by a person other than the targeted speaker and/or “background noise” such as office or street noises. As one example of an available distinction method, the targeted and untargeted voice sections may be distinguished by means of a property known as coherence. In this context, coherence may be defined as a physical quantity depending upon the arrival direction in which an input signal is received. In an application to cellular phones, for example, targeted voices are distinguishable from untargeted voices by arrival direction: the targeted voice, or speech sound, arrives from the front of a cellular phone set, whereas, among untargeted voices, disturbing voice tends to arrive from directions other than the front and background noise has no distinctive arrival direction. Accordingly, targeted voices can be discriminated from untargeted voices by focusing on the arrival directions thereof.
It will now briefly be described why coherence may be used in order to discriminate targeted voice sections from untargeted voice sections. In an ordinary detection of targeted voice sections, targeted voice sections may be discriminated from untargeted voice sections based on fluctuation in the level of an input signal. In this method, it is impossible to discriminate between disturbing voice and targeted voice and, therefore, disturbing voice cannot be suppressed by the voice switch. Thus, the untargeted voice suppression will be insufficient. By contrast, in a detection relying on coherence, discrimination is made using the arrival directions of input signals. Hence, it is possible to discriminate between targeted and disturbing voices, which arrive from directions distinct from each other. The untargeted voice suppression can thus effectively be attained by means of the voice switch.
When the voice switch is used together with the Wiener filter, more effective noise suppression could be attained than where the two measures are used separately, since the voice switch effectively suppresses untargeted voice sections and simultaneously the Wiener filter effectively suppresses noise components involved in targeted voice sections.
Although the voice switch and the Wiener filter are both classified as noise suppressing techniques, they differ in the noise sections to be detected for the purpose of optimal operation. It is sufficient for the voice switch to have the capability of detecting untargeted voice sections, which contain either or both of disturbing voice and background noise. By contrast, the Wiener filter has to detect, among untargeted voice sections, temporal sections containing only background noise, i.e. “background noise sections”. This is because, if a filter coefficient were adapted in a disturbing voice section, then the character of “voice” that the disturbing voice contains would also be reflected on a Wiener filter coefficient which should have been adapted to noise, thus causing even the voice components contained in the targeted voice to be suppressed so as to deteriorate the sound quality.
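By way of illustration only, the difference between the sections required by the two techniques may be sketched as follows. The enumeration Section and the two function names are assumptions introduced for this sketch. How each frame comes to be labeled with a section type is outside the sketch.

```python
from enum import Enum


class Section(Enum):
    """Hypothetical labels for the temporal section a frame belongs to."""
    TARGETED_VOICE = "targeted voice"
    DISTURBING_VOICE = "disturbing voice"
    BACKGROUND_NOISE = "background noise"


def voice_switch_attenuates(section):
    """The voice switch attenuates every untargeted voice section."""
    return section is not Section.TARGETED_VOICE


def wiener_may_adapt(section):
    """The Wiener filter coefficients are adapted only on background noise."""
    return section is Section.BACKGROUND_NOISE
```

A disturbing voice section is the case in which the two decisions differ: the voice switch attenuates the frame, but the frame is withheld from the adaptation of the Wiener filter coefficients.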
As described so far, when the voice switch and the Wiener filter are used in combination, their respectively optimal temporal sections would have to be detected. In spite of this, in the prior art, the same reference was applied to both the voice switch and the Wiener filter for detecting untargeted voice sections, raising a problem that a Wiener filter coefficient reflecting the characteristics of disturbing voice may deteriorate targeted voice.
This problem could be solved by using, in parallel, plural schemes for detecting untargeted voice sections which are respectively appropriate for the voice switch and the Wiener filter, to thereby detect the appropriate temporal sections. In this case, however, the amount of computation would be increased. In addition, adjustment would have to be made on plural parameters behaving differently from each other, raising a further problem that the user of the system would further be burdened with computation.