The term “closely-spaced” as used herein to describe the position of microphones relative to one another means that the distance between adjacent microphones in an array is very much less than the distance between a microphone and a sound source detected by the microphone. Furthermore, within the frequency bands of interest, the wavelengths of sound will be longer than the spacing between the microphones.
A known speech detector using two microphones makes use of binaural cues such as the inter-microphone level differences (ILD) to detect speech. In order to make use of ILD it is necessary to assume that the speech to be detected is louder on one microphone than the other. This assumption places a constraint on the positioning of the two microphones on a device such as a mobile phone.
It is known that many speech enhancement algorithms make use of such a detector in order to operate. These speech enhancement algorithms, that make use of more than one microphone, often rely on a generalised sidelobe canceller which consists of a beamformer to capture a target sound source, and a second stage adaptive filter to remove any undesired sounds from the beamformer output without attenuating the target sound source.
Such a building block relies heavily on the availability of a speech detector which can control the adaptation of the beamformer and second stage filter correctly.
If target speech is detected, then only the beamformer will adapt, while in the absence of the target speech, only the second stage adaptive filter will adapt.
Poor performance of such a known speech detector can lead to suppression of the target signal and reinforcement of interfering (for example background) sources. Such poor performance can result in a two microphone speech enhancement system that has a performance that is worse than that of a single microphone system.
It is known that the design of a speech detector is usually governed, inter alia, by the specific application and by design constraints. The way a speech detector is to be used in a specific application can be based on a priori information about the position of the speaker and any interfering sound sources.
In hearing aid applications, for example, the desired sound sources can be assumed to be located in front of the person wearing the hearing aid (a forward direction), while interfering sources are assumed to originate from behind the wearer of the hearing aid (a backward direction).
If a device in which the microphones are incorporated is positioned sideways on to a sound source, then the sound source is described as being a broadside sound source. Similarly, if the sound source is directed towards an end of the device containing the microphones, the sound source is described as being in the end fire position. When considering the position of a sound source with respect to a linear microphone array, and depending on the application, it is usual to describe sources directed towards one end of the array as being in the forward plane, and those directed towards the other end of the array as being in the backward plane.
The forward and backward planes are sometimes defined as the forward half plane and the backward half plane, since they each span an angle of 180°, whereas a whole plane spans 360°. Further, the location of a sound source is defined by θ, the azimuthal angle. This is the angle of incidence of the sound source relative to a central point of the array.
Design constraints such as the position of the microphones on the device also determine the information about desired/undesired sound sources that can be used, given a specific topology of the device, and the microphone positions on the device.
For example, in a known mobile phone having two microphones, a primary microphone is placed at the base of the device, and a secondary microphone is placed at the top and on a rear side of the device. The secondary microphone is thus further away from a user's mouth than the primary microphone.
With such a microphone topology, speech originating from the user of the mobile phone is in the near-field and is louder on the primary microphone than on the secondary microphone. Background noise and other noise interference sources are in the far field and are thus equally loud on both microphones. By exploiting the inter-microphone level difference (ILD) between the two microphones, the target speech may be properly detected.
In a known speech detector comprising a plurality of closely-spaced microphones, a common detection technique is to first apply differential processing to the microphone signals. This procedure produces forward and backward facing cardioid signals using two omnidirectional microphones, assuming that the microphones are closely spaced. If the target sound sources are assumed to originate from the forward direction, for example, then the ratio between the powers on the forward and backward cardioid microphones should be very large. For interfering sources originating from the backward direction, this ratio will be very small, while for diffuse noise, the ratio should be close to unity.
This forward-backward cardioid processing of microphone signals is a commonly used detection method with closely-spaced microphones. A problem with this type of detector is that it is not able to easily adapt to different microphone configurations or to different ways that the device may be handled by the user. In other words, this type of detector is not suitable in situations where the speech does not originate from the forward direction.
This can be a particular problem with mobile phones, for example, because a user may change the orientation of the phone relative to the mouth of the user, and thus speech will not necessarily always originate from a forward direction relative to the microphones.
Another problem with known speech detectors of this type is that it is necessary to match the power of each microphone within a particular tolerance. In other words, it is necessary to calibrate the microphones.
According to a first aspect of the present invention there is provided a method for detecting speech using a first microphone adapted to produce a first signal and a second microphone adapted to produce a second signal, the method comprising the steps of:
    (i) applying gain to the second signal to produce a normalised second signal, which signal is normalised relative to the first signal;
    (ii) constructing one or more signal components from the first signal and the normalised second signal;
    (iii) constructing an adaptive differential microphone (ADM) having a constructed microphone response constructed from the one or more signal components which response has at least one directional null;
    (iv) producing one or more ADM outputs from the constructed microphone response in respect to detected sound;
    (v) computing a ratio of a parameter of either a first signal component or a constructed microphone response to a parameter of an output of the ADM;
    (vi) comparing the ratio to an adaptive threshold value;
    (vii) detecting speech if the ratio is greater than or equal to the adaptive threshold value.
According to a second aspect of the present invention there is provided a speech detector comprising:
    a first microphone adapted to produce a first signal;
    a second microphone adapted to produce a second signal;
    an amplifier adapted to apply a gain to the second signal to produce a normalised second signal, which signal is normalised relative to the first signal;
    a first processor for constructing one or more signal components from the first and normalised second signals;
    a second processor for constructing an adaptive differential microphone (ADM) having a constructed microphone response comprising at least one directional null, the ADM producing one or more outputs in response to detected sound;
    a third processor for computing the ratio of a parameter of either a first signal component or a constructed microphone response to a parameter of an output of the ADM;
    a comparator for comparing the ratio to an adaptive threshold to detect if the ratio is greater than or equal to the value of the adaptive threshold; and
    a detector for detecting speech when the ratio is greater than or equal to the value of the adaptive threshold.
According to a third aspect of the present invention there is provided an adaptive differential microphone (ADM) forming a speech detector according to a second aspect of the present invention.
Because the constructed microphone response of the ADM comprises at least one directional null, by means of embodiments of the present invention it is possible to substantially suppress a target sound source, such as target speech by directing the null to the source of the target speech. If the directional null is directed in this way, the one or more outputs of the ADM will be small since the target speech will be substantially suppressed. This means that the ratio formed between a parameter of either a first signal component or a constructed microphone response to the parameter of an output of the ADM will be large. When the ratio is greater than or equal to the adaptive threshold value then speech will be detected.
If, on the other hand, the null is directed towards background, or interference sound, then the influence of the null will be less, and as a result, the ratio formed between a parameter of either a first signal component or a constructed microphone response to the parameter of an output of the ADM will be much smaller than for the target speech. This in turn means the ratio will be less than the value of the adaptive threshold resulting in no speech being detected.
This is because, if a user is in the near-field, then sound emanating from his mouth is more direct and usually has a higher power than other sound sources in the environment of the adaptive differential microphone. Therefore, if a null is steered in the direction of the user's mouth, the ADM can suppress a large part of the signal. This means that the ADM signal will be much smaller than the signal component or the constructed microphone response.
For diffuse noise and point interference(s), the ratio will be below the threshold, and no speech will be detected.
The method according to the first aspect of the invention may comprise a further step of estimating a value of an adaptive factor β.
The adaptive threshold is determined by an adaptive factor β as will be explained in more detail hereinbelow. The adaptive factor β also determines the orientation of the directional null as also explained hereinbelow. The orientation of the directional null and the value of the adaptive threshold are thus both determined by the adaptive factor β.
Because the orientation of the directional null and the adaptive threshold are both dependent upon the value of β, the threshold is in effect tailored to the current value of β, which determines the response of the ADM.
The method according to the first aspect of the present invention may comprise the following further steps:
    (viii) adapting the value of the adaptive factor β;
    (ix) recomputing the ratio;
    (x) comparing the recomputed ratio to an adapted threshold value;
    (xi) detecting speech if the ratio is greater than the adapted threshold value.
By adapting the value of the adaptive factor β as appropriate, the directional null may be appropriately steered towards a target speech source. This will result in the target speech source being substantially suppressed by the ADM and will result in the ratio being greater than or equal to the adaptive threshold value, thus resulting in speech being detected.
Due to the adaptive nature of embodiments of the invention, the value of β may be varied as appropriate in order to ensure that the directional null is appropriately oriented.
In embodiments of the invention the ratio may be formed by comparing the power of either a signal component or a constructed microphone response to the power of an output of the ADM.
In other embodiments of the invention, the ratio may be formed by comparing other parameters such as the absolute values of either a signal component or a constructed microphone response to the absolute value of an output of the ADM. If such a ratio is used, the adaptive threshold will need to be modified accordingly.
The output of the ADM may comprise a first output yb produced in response to sound detected in the back plane, and a second output yf produced in response to sound detected in the front plane. In such embodiments, a ratio may be calculated in respect of each of the outputs of the ADM separately. Depending on the value of the two ratios, a decision can be made as to whether a speech source is positioned in the forward or backward plane.
For a speech detector that is part of a hand set such as a mobile phone, the near-field effects of propagating waves are predominant. Far field effects, which are usually valid for hands free scenarios, are commonly assumed for the analysis of small microphone arrays. In particular, assumptions of planar wave fronts and equal microphone levels facilitate the construction of so called eigenbeams for closely-spaced microphones.
Using two microphones, these eigenbeams correspond to a monopole and a dipole. Combinations of these eigenbeams can produce various first-order differential responses.
In one embodiment of the invention, two signal components are constructed from the first and normalised second signals. However, in other embodiments, more than two signal components may be constructed.
In some embodiments of the invention the first signal component comprises a monopole signal.
In such embodiments, or in other embodiments, the second signal component may comprise a dipole signal.
The constructed microphone response may take any particular form as long as it comprises a null. A null is defined as part of a signal where the response is zero.
Preferably, the constructed microphone response comprises a first response and a second response.
In embodiments of the invention, the first response comprises a forward facing cardioid signal, and the second response comprises a backward facing cardioid signal.
In such an embodiment, the forward and backward cardioids are used to adaptively construct a microphone response containing a null in the direction of a strong point source, particularly a source of speech. However, these forward and backward cardioids are themselves constructed from the aforementioned eigenbeams (the monopole and dipole), and as such the fundamental shapes which can produce all other first-order shapes are the monopole and dipole.
Such an embodiment of the invention offers a natural and more general extension to the backward-forward cardioids detector.
In other embodiments of the invention the first and second responses may comprise oppositely facing first-order response signals, for example.
The first and second microphones produce a first and a second signal respectively in response to sound emanating from one or more sound sources, which sound is detected by one or both of the microphones.
The second signal is then normalised relative to the first signal by applying a gain to the second signal. The gain may be either positive or negative.
By means of embodiments of the invention, it is thus not necessary to calibrate the first and second microphones since the second signal is normalised relative to the first signal before speech is detected.
The first and second microphones may be any desired type of microphone, and in some embodiments of the invention they each comprise an omnidirectional microphone.
In order to further understand the invention, the nature of first-order differential microphones will now be considered with respect to an embodiment of the invention in which the constructed microphone response comprises forward and backward facing cardioids, and the first and second signal components comprise a monopole and dipole signal respectively.
A forward and backward facing cardioid can be constructed assuming that the microphones are closely-spaced (this equates to the condition kd<<π, where k=w/c is the wave number, d is the distance between the microphones, c is the speed of sound and w is the angular frequency of the sound).
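As a rough numerical illustration of the closely-spaced condition kd<<π, the following minimal sketch evaluates kd for one case. The spacing of 2 cm, the frequency of 1 kHz and the speed of sound of 343 m/s are illustrative assumptions and are not values taken from this description.

```python
import math


def kd(frequency_hz, spacing_m, speed_of_sound=343.0):
    """Return k*d, where k = w/c is the wave number and w = 2*pi*f.

    The default speed of sound (343 m/s) is an assumed room-temperature value.
    """
    w = 2.0 * math.pi * frequency_hz  # angular frequency in rad/s
    return (w / speed_of_sound) * spacing_m


# Hypothetical example: microphones 2 cm apart, sound at 1 kHz.
example = kd(1000.0, 0.02)
print(round(example, 3), "compared with pi =", round(math.pi, 3))
```

For these assumed values kd is roughly a tenth of π; whether that is small enough for a given application is a design judgement.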
The general form for oppositely-facing first-order super-directional responses is:

$$V_f = \alpha V_m + (1-\alpha)\,\bar{V}_d \qquad (1)$$

$$V_b = \alpha V_m - (1-\alpha)\,\bar{V}_d \qquad (2)$$

where α determines the resulting first-order response. Specifically, for 0<α≦0.5, the directional response contains at least one null; α therefore controls the location of the null (or nulls) in the first-order microphone response. Here Vm is the monopole response, and the normalized dipole response $\bar{V}_d$ is given by

$$\bar{V}_d = \frac{1}{jw}\,\frac{c}{d}\,V_d \qquad (3)$$

where Vd is the dipole response. The term 1/(jw) is the (ideal) integrator response, and c/d is a normalization factor. Ideally, (1) and (2) simplify to

$$V_f = 0.5\,(1+\cos\theta), \qquad V_b = 0.5\,(1-\cos\theta) \qquad (4)$$

for forward- and backward-facing cardioids (α=0.5), where θ is the azimuthal angle defining the location of the sound source, and the response is frequency-independent for small microphone spacings.
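The relationship between (1), (2) and (4) can be illustrated with a minimal sketch. It assumes idealised far-field responses, namely a monopole response of 1 in all directions and a normalized dipole response of cos θ; the function name is hypothetical.

```python
import math


def first_order_pair(alpha, theta):
    """Return (Vf, Vb) per equations (1) and (2) for ideal eigenbeams.

    Assumption: the monopole response is 1 and the normalized dipole
    response is cos(theta), which is the idealised closely-spaced case.
    """
    vm = 1.0                    # ideal monopole response
    vd_norm = math.cos(theta)   # ideal normalized dipole response
    vf = alpha * vm + (1.0 - alpha) * vd_norm
    vb = alpha * vm - (1.0 - alpha) * vd_norm
    return vf, vb


# With alpha = 0.5 the pair reduces to the cardioids of equation (4).
for deg in (0, 90, 180):
    vf, vb = first_order_pair(0.5, math.radians(deg))
    print(deg, round(vf, 6), round(vb, 6))
```

With α=0.5 the forward response is unity at θ=0 and has its null at θ=180°, and the backward response is the mirror image.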
As mentioned hereinabove, the fundamental building blocks of the forward and backward cardioids are combinations of the monopole and dipole signal which are dependent on the α factor. The values of α will be different for other first-order microphone responses. In other words, the shape of the first-order response depends on the value of α.
The subscripts f and b refer to the forward plane and the backward plane respectively, and θ is the angle of incidence for the sound source. These variables are illustrated in FIGS. 1 and 2, where M1 denotes a first microphone, M2 denotes a second microphone, r1 is the distance of the sound source from the first microphone, r2 is the distance of the sound source from the second microphone, and r is the distance of the sound source from the centre of the array.
The directivity factor (Q) for a first-order (normalized) differential microphone can be expressed in terms of α with

$$Q(\alpha) = \frac{3}{4\alpha^{2} - 2\alpha + 1} \qquad (5)$$

where 10 log [Q(α)] is the directivity index. Q is defined as the gain of a microphone array in a noise field over that of an omnidirectional microphone.
As can be seen from equation 5, when a null is steered towards a desired speech source by varying α, the directivity factor Q, which depends on α, is altered as well.
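Equation 5 can be evaluated directly, as in the following minimal sketch. The function names are hypothetical, and the logarithm in the directivity index is taken to be base 10.

```python
import math


def directivity_factor(alpha):
    """Directivity factor Q(alpha) of a normalized first-order response, eq. (5)."""
    return 3.0 / (4.0 * alpha ** 2 - 2.0 * alpha + 1.0)


def directivity_index_db(alpha):
    """Directivity index 10*log10(Q(alpha)) in dB."""
    return 10.0 * math.log10(directivity_factor(alpha))


# alpha = 1 is a pure monopole, alpha = 0.5 is a cardioid.
for alpha in (1.0, 0.5, 0.25):
    print(alpha, directivity_factor(alpha), round(directivity_index_db(alpha), 2))
```

A pure monopole (α=1) gives Q=1, consistent with Q being defined relative to an omnidirectional microphone, and a cardioid (α=0.5) gives Q=3.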
The power in the second microphone M2 is normalised relative to the power of the first microphone M1 in order to mitigate near-field effects when constructing the forward and backward cardioid signals.
This is achieved by applying a gain G to the second microphone M2.
This operation may be given by

$$G(m) = \varepsilon\,\frac{\sum_{n=1}^{M} x_1^{2}(n)}{\sum_{n=1}^{M} x_2^{2}(n)} + (1-\varepsilon)\,G(m-1) \qquad (6)$$

where x1 and x2 are the signals fed to the beamformer, M is the block length, and ε is a smoothing parameter. This step makes the speech detector independent of microphone mismatch by scaling x2 by G.
A very small constant can also be added to the denominator of the first term in (6) to prevent division-by-zero.
A speech detector according to an embodiment of the invention may be used to detect speech from a point source positioned in either the front plane or the back plane. If the speech to be detected is in the front plane, then the output of the ADM is yf. Similarly, if the speech to be detected emanates from a point source in the back plane, then the output of the ADM is yb.
Depending on the location, one or both of the signals can be used for the detection process.
Let cf(n) and cb(n) denote the forward and backward cardioid signals, respectively, with sample index n. An ADM is constructed by finding the optimum βb that minimizes the mean-square error (MSE) of

$$y_b(n) = c_f(n) - \beta_b\,c_b(n) \qquad (7)$$

where β is an adaptive factor used to control the resulting adaptive differential microphone response. Different values of β produce different responses with nulls in specific locations.
It can be shown that the MSE is a quadratic function of βb and therefore displays a unique minimum at:
$$\beta_b = \frac{R_{fb}}{R_{bb}}, \qquad (8)$$

with Rfb=E{cf(n)cb(n)} the cross-correlation between the forward and backward cardioid signals, and Rbb=E{|cb(n)|²} the power of the backward cardioid signal. For an interference located in the rear-half plane, the range of values for βb is [0,1]. Methods for estimating/adapting βb include a normalised least mean square (NLMS) form given by

$$\beta_b(n+1) = \beta_b(n) + 2\mu\,\frac{y(n)\,c_b(n)}{|c_b(n)|^{2}}, \qquad (9)$$

where μ is the adaptation step-size, or a block-based approach that estimates the cross- and auto-correlation terms in (8) to estimate βb. β can thus be estimated using either equation 8 or equation 9. Rfb and Rbb may be estimated using equations 10 and 11 below.
$$\hat{R}_{fb}(m) = \frac{\xi}{M}\sum_{n=1}^{M} c_f(n)\,c_b(n) + (1-\xi)\,\hat{R}_{fb}(m-1) \qquad (10)$$

$$\hat{R}_{bb}(m) = \frac{\xi}{M}\sum_{n=1}^{M} c_b^{2}(n) + (1-\xi)\,\hat{R}_{bb}(m-1), \qquad (11)$$

where m is the block index, $\hat{R}_{fb}$ is an estimate of Rfb, $\hat{R}_{bb}$ is an estimate of Rbb, M is the block length, and ξ is a smoothing parameter (0<ξ<1).
Equations 10 and 11 should therefore be used in conjunction with equation 8 if equation 8 is used to estimate β.
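The two ways of estimating βb described above may be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function names are hypothetical, and the small constant `tiny` that guards the divisions is an added assumption not stated for equations 8 and 9.

```python
def smoothed_correlations(cf, cb, r_fb_prev, r_bb_prev, xi):
    """Recursive block estimates of Rfb and Rbb per equations (10) and (11)."""
    m = len(cf)  # block length M
    r_fb = (xi / m) * sum(f * b for f, b in zip(cf, cb)) + (1.0 - xi) * r_fb_prev
    r_bb = (xi / m) * sum(b * b for b in cb) + (1.0 - xi) * r_bb_prev
    return r_fb, r_bb


def beta_from_correlations(r_fb, r_bb, tiny=1e-12):
    """Block-based estimate of beta_b per equation (8); `tiny` is an assumed guard."""
    return r_fb / (r_bb + tiny)


def nlms_step(beta, cf_n, cb_n, mu, tiny=1e-12):
    """One sample-by-sample update of beta_b per equation (9)."""
    y = cf_n - beta * cb_n  # ADM output of equation (7)
    return beta + 2.0 * mu * y * cb_n / (cb_n * cb_n + tiny)


# Hypothetical block in which cf is exactly twice cb, so beta_b should be 2.
cf = [1.0, 2.0, 3.0]
cb = [0.5, 1.0, 1.5]
r_fb, r_bb = smoothed_correlations(cf, cb, 0.0, 0.0, xi=1.0)
print(beta_from_correlations(r_fb, r_bb))
```

In the example, ξ is set to 1 only so that the previous estimates are ignored and the result is easy to check by hand; in use, ξ would lie strictly between 0 and 1 as stated above.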
The above analysis assumes that the location of the desired speaker to be suppressed is in the rear-half plane, which spans the azimuthal range π/2≦θ≦3π/2. This analysis can also be repeated for a point source in the front-half plane (−π/2≦θ≦π/2) using

$$y_f(n) = c_b(n) - \beta_f\,c_f(n) \qquad (12)$$
Using (4) and (7), the effective response of a resulting ADM can be written in terms of βb as
$$V_b = \left(\frac{1-\beta_b}{2}\right) + \left(\frac{1+\beta_b}{2}\right)\cos\theta \qquad (13)$$

which, for 0<βb<1, is a first-order differential response normalized to 1 in the forward direction (i.e. θ=0) with

$$\alpha = \left(\frac{1-\beta_b}{2}\right) \qquad (14)$$
Note the similarity to equation (4). The directional null of this response can be written in terms of βb by setting Vb in (13) to zero,
$$\theta_b = \arccos\left(\frac{\beta_b - 1}{1 + \beta_b}\right). \qquad (15)$$
The forward counterpart of the directional null in (15) can also be derived by assuming that the interference is in the front-half plane as in (12), and is given by
$$\theta_f = \arccos\left(\frac{1 - \beta_f}{1 + \beta_f}\right). \qquad (16)$$
Here, the value θf is defined for βf≧0.
Thus by means of embodiments of the invention the directional null of the ADM response may be steered by appropriately varying β, the adaptive factor. When varying β, equation 8 or 9 above may be used.
In (15), as βb→∞, θ→0° i.e. the null is placed in the front-half plane. In fact, for βb>1, the direction of the steered null moves into the front half-plane. This means that even if a desired point source is not strictly located in the rear-half plane, it can still be detected.
In (16), as βf→∞, θ→180°, i.e. the null is placed in the rear-half plane. The condition relating βf and βb when θb=θf can be found by equating (15) and (16):

$$\beta_b\,\beta_f = 1. \qquad (17)$$

Placing a null at 0° requires a very large value for βb, while placing a null at 180° requires a very large value for βf. For a source in broadside, both βb and βf equal one, and the condition in (17) is satisfied.
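The relationship between the adaptive factor and the null direction in (15) and (16), and the condition in (17), can be checked numerically with the following minimal sketch. The function names are hypothetical, and the angles are returned in degrees for readability.

```python
import math


def null_angle_back(beta_b):
    """Null direction in degrees per equation (15), for beta_b >= 0."""
    return math.degrees(math.acos((beta_b - 1.0) / (1.0 + beta_b)))


def null_angle_front(beta_f):
    """Null direction in degrees per equation (16), for beta_f >= 0."""
    return math.degrees(math.acos((1.0 - beta_f) / (1.0 + beta_f)))


def adm_response_back(beta_b, theta_deg):
    """Effective ADM response per equation (13)."""
    theta = math.radians(theta_deg)
    return (1.0 - beta_b) / 2.0 + (1.0 + beta_b) / 2.0 * math.cos(theta)


# Broadside source: both factors equal one and the null sits at 90 degrees.
print(null_angle_back(1.0), null_angle_front(1.0))

# beta_b = 2 and beta_f = 1/beta_b = 0.5 satisfy (17) and share one null.
print(round(null_angle_back(2.0), 2), round(null_angle_front(0.5), 2))
```

For βb=2 the null lies at about 70.5°, in the front-half plane, and the response in (13) evaluated at 180° has magnitude βb.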
FIG. 6 illustrates the directional response of an ADM according to an embodiment of the invention for various values of β.
If βb>1, then the null is placed in the front-half plane at the cost of an absolute response of βb at 180°. In such situations, the relation in (17) also provides a method for calculating a value for βf that leads to a normalized first-order differential response. The value of βf=1/βb together with (12) gives a normalized response at 0° with a null in the same direction in the front-half plane. This effect can be clearly seen in FIG. 4 where two directional responses exhibit the same null at approximately 71°, but one has a lower directivity factor (shown as a dashed line).
Speech may be detected using a ratio formed from yb(n) and another component of the processed signal, in particular either an omnidirectional, monopole, or forward facing cardioid component of the processed signal. Desired speech is detected if

$$\Lambda = \frac{|z(n)|^{2}}{|y(n)|^{2}} > \delta, \qquad (18)$$

where δ is a positive threshold, and z(n) is one of the aforementioned signals. The value of y(n) can be yb(n) and/or yf(n). In the following embodiment, z(n) is assumed to be the monopole signal.
In the absence of a desired speaker, and assuming a spherically isotropic noise field, the ratio in (18) is related to the directivity factor of a first-order response dependent on βb. For a first-order response, (5) can be rewritten in terms of β (which applies to both βb and βf) using (14) and (5),
$$Q(\beta) = \frac{3}{\beta^{2} - \beta + 1}, \qquad 0 \le \beta \le 1. \tag{19}$$
The use of Q(β) as a threshold to compare to Λ is justified for kd<<π, since only then can the directivity factor (in diffuse noise) of a monopole be shown to be unity. This is important because it makes the comparison of the ratio calculated in (18) with the adaptive threshold in (19) valid. In other words, the (theoretical) adaptive threshold in (19) assumes that the directivity of a monopole is unity in all directions. Furthermore, a monopole derived by summing the two omnidirectional microphone signals has a unity response only for kd<<π.
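As a quick numerical check of (19), the following minimal Python sketch evaluates Q(β) at a few points of the interval [0,1]. The function name `directivity_factor` and the sample values of β are illustrative assumptions and do not appear in the specification.

```python
def directivity_factor(beta: float) -> float:
    """Directivity factor Q(beta) = 3 / (beta^2 - beta + 1), as in (19).

    The denominator is never smaller than 0.75, so no division by zero occurs.
    """
    return 3.0 / (beta * beta - beta + 1.0)


# Sample points are hypothetical; they only illustrate the shape of Q(beta).
for beta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"beta={beta:.2f}  Q={directivity_factor(beta):.3f}")
```

Under this formula Q equals 3 at both ends of the interval and reaches its maximum of 4 at β=0.5, so a threshold built on Q(β) varies between 3 and 4 before any over-compensation is applied.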
The value of δ can be set to
$$\delta = \sigma Q(\beta), \tag{20}$$
where σ≥1 is an over-compensation factor.
It can be shown that the over-compensation factor σ is related to Q and the signal-to-noise ratio (SNR). In fact, the ratio of monopole to ADM power can be shown to equal the product of Q and a term that depends on the SNR,
$$\Lambda = \left(\frac{\sigma_S^{2}}{\rho^{2}} + 1\right) Q(\beta), \tag{21}$$
where σ_S² is the power of the desired signal and ρ² is the power of the noise signal. This means that for an SNR of 0 dB (σ_S²=ρ²), σ=2−ε (where ε is a small constant) is an appropriate value for over-compensating the threshold. The value of σ can be adjusted to the working conditions, i.e. to the required sensitivity of the detector: for large values of σ the detector is less sensitive, while for lower values such as σ=2−ε the detector can be more sensitive.
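The relation between the over-compensation factor and the SNR can be illustrated with a short Python sketch that evaluates (20) and (21) side by side. The function names, the choice β=0.5 and the value ε=0.1 are assumptions made only for this example.

```python
def directivity_factor(beta: float) -> float:
    """Q(beta) = 3 / (beta^2 - beta + 1), as in (19)."""
    return 3.0 / (beta * beta - beta + 1.0)


def expected_ratio(snr_linear: float, beta: float) -> float:
    """Monopole-to-ADM power ratio from (21): (SNR + 1) * Q(beta)."""
    return (snr_linear + 1.0) * directivity_factor(beta)


def threshold(sigma: float, beta: float) -> float:
    """Adaptive threshold from (20): delta = sigma * Q(beta)."""
    return sigma * directivity_factor(beta)


beta = 0.5            # hypothetical null-steering parameter
sigma = 2.0 - 0.1     # sigma = 2 - epsilon, with epsilon = 0.1 assumed
delta = threshold(sigma, beta)

noise_only = expected_ratio(0.0, beta)   # no desired speech: ratio equals Q
speech_0db = expected_ratio(1.0, beta)   # SNR of 0 dB: ratio equals 2 * Q

print(noise_only > delta)  # False: diffuse noise alone stays below the threshold
print(speech_0db > delta)  # True: speech at 0 dB SNR exceeds the threshold
```

With σ slightly below 2, the threshold sits just under the ratio expected at 0 dB SNR, which is why speech at that level is still detected while noise alone is not.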
Thus it can be seen that the adaptive threshold is also dependent on the value of β. This means that when the value of β is changed in order to steer the null, the value of the adaptive threshold will also be modified. In other words, different values of β result in different locations of the null(s), and hence in a different directivity pattern of the adaptive differential microphone (ADM). This in turn means a different directivity factor Q. As such, the threshold should be adapted to obtain a ‘fair’ comparison. For example, if the null is steered so as to produce a hyper-cardioid response for the ADM, while the threshold uses a value of β from a cardioid response, then speech would be detected even in diffuse noise conditions. Therefore, the threshold is tailored to the current value of β, which determines the response of the ADM.
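The hyper-cardioid versus cardioid example can be checked numerically. The sketch below assumes that β=0.5 yields the hyper-cardioid response (Q=4) and β=0 the cardioid response (Q=3), which is inferred from the values of (19) rather than stated here, and it uses a hypothetical over-compensation factor σ=1.2.

```python
def directivity_factor(beta: float) -> float:
    """Q(beta) = 3 / (beta^2 - beta + 1), as in (19)."""
    return 3.0 / (beta * beta - beta + 1.0)


sigma = 1.2        # hypothetical over-compensation factor
beta_adm = 0.5     # assumed hyper-cardioid setting of the ADM
beta_stale = 0.0   # assumed cardioid setting wrongly used for the threshold

# In diffuse noise without speech, the measured ratio equals Q of the ADM.
ratio_in_diffuse_noise = directivity_factor(beta_adm)

mismatched_delta = sigma * directivity_factor(beta_stale)
matched_delta = sigma * directivity_factor(beta_adm)

print(ratio_in_diffuse_noise > mismatched_delta)  # True: false detection
print(ratio_in_diffuse_noise > matched_delta)     # False: noise is rejected
```

In this sketch the mismatched threshold (3.6) lies below the noise-only ratio (4), so noise is flagged as speech, whereas the threshold matched to the current β (4.8) rejects it.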
In addition to increasing σ, a lower bound can be set for the value of Q(β) in case the value of β is not bounded between 0 and 1. A suitable value for this lower bound is 3, which corresponds to the minimum directivity factor for βb∈[0,1], i.e.
$$\delta = \sigma_b \max\left(3, \frac{3}{\beta_b^{2} - \beta_b + 1}\right). \tag{22}$$
If the value of βb is greater than 1 (because a point source is in the front-half plane), for example, then with a lower bound a quasi-penalty is applied to this source, making it more difficult to detect as speech. The greater the value of βb (and consequently the closer the directional null is to 0°), the higher the penalty incurred (in the form of a reduced directivity), because the value of Λ decreases while the minimum threshold value remains the same. The threshold values depend on β as long as the resulting directivity factor in (22) is larger than 3 for this embodiment of the adaptive threshold. In equation (19) the threshold is automatically bounded below by 3, since β is assumed to lie in [0,1]. However, in the embodiment of (22) it is only required that β>0. Since β can therefore exceed 1, the threshold should be bounded below.
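A minimal Python sketch of the bounded threshold in (22) is given below. The function name `bounded_threshold`, the parameter names, and the sample values of βb are assumptions chosen for illustration.

```python
def bounded_threshold(beta_b: float, sigma: float) -> float:
    """Threshold from (22): sigma * max(3, 3 / (beta_b^2 - beta_b + 1))."""
    q = 3.0 / (beta_b * beta_b - beta_b + 1.0)
    return sigma * max(3.0, q)


sigma = 1.0  # hypothetical over-compensation factor
for beta_b in (0.5, 1.0, 2.0, 5.0):
    print(f"beta_b={beta_b:.1f}  delta={bounded_threshold(beta_b, sigma):.3f}")
```

For βb within [0,1] the threshold follows Q(βb), while for βb>1 the unbounded directivity factor would fall below 3 and the floor takes over, so the threshold no longer drops as the null moves towards 0°.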
Restricting the value of β to a subinterval of [0,1] can be used when the possible location of a desired speaker is known to lie within a specific azimuthal range. In this case, (15) and (16) can be solved for βb and βf to derive the desired bounds.