1. Field of the Invention
The present invention relates to a speaker direction detection circuit and to a speaker direction detection method used in the circuit. More particularly, it relates to a detection circuit and method for a device that uses speech signals to detect the direction of a speaker as seen from the device, and that is used to control the imaging angle of the video camera in a television conference device incorporating a video camera for image input and a microphone for speech input.
2. Description of the Related Art
As described in "7.2 Techniques for estimating the direction of arrival and power of sound" from "Sound Systems and Digital Processing" (Oga, Yamazaki, Kaneda, Institute of Electronics, Information and Communication Engineers Conference, Mar. 25, 1995, p. 197), the cross-correlation function is typically used as a means for detecting the difference in arrival times (delay) at the two microphones in devices for detecting the direction of a speaker.
This cross-correlation function is calculated, and as shown in FIG. 1 or FIG. 2, the difference in arrival times (delay) can be detected from the maximum value of the cross-correlation function. It is generally known that the direction of arrival of sound waves can be estimated from this time difference (delay). In other words, the direction of arrival of sound waves and the time difference between signals received at each of a plurality of microphones are in a one-to-one relation, and if these time differences can be estimated, the direction of arrival of the sound waves can also be estimated.
FIG. 1 and FIG. 2 are explanatory views of a typical detection method for detecting both the horizontal direction position and vertical direction position of arriving sound waves. Angle $\theta$ in the horizontal direction of sound waves that arrive at microphones M1 and M2 in FIG. 1 is detected by the equations:
$$L \sin\theta = \gamma T_h$$
$$\theta = \sin^{-1}(\gamma T_h / L)$$
In this case, the time difference (delay) $T_h$ can be found from: sampling period (seconds) × difference (number of samples). The vertical angles of sound waves that arrive at microphones M2 and M3 in FIG. 2 can also be estimated from similar equations.
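The conversion from a delay measured in samples to an arrival angle can be sketched as follows. This is a minimal illustration, not the circuit of the invention. It assumes that $\gamma$ denotes the propagation speed of sound (the text above does not define it) and uses 343 m/s as an illustrative value. The function name, parameter names, and the clamping step are hypothetical additions.

```python
import math


def angle_from_sample_delay(delay_samples, sampling_period_s, mic_spacing_m,
                            wave_speed_m_s=343.0):
    """Return the arrival angle theta in degrees for a delay given in samples.

    wave_speed_m_s stands in for gamma; treating gamma as the speed of
    sound is an assumption of this sketch.
    """
    # Th = sampling period (seconds) x difference (number of samples)
    t_h = sampling_period_s * delay_samples
    # gamma * Th / L, from L sin(theta) = gamma * Th
    ratio = wave_speed_m_s * t_h / mic_spacing_m
    # Clamp to [-1, 1]: a noisy delay estimate can leave the domain of asin.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


# Example: 4 samples of delay at 16 kHz with microphones 0.2 m apart.
theta = angle_from_sample_delay(delay_samples=4,
                                sampling_period_s=1 / 16000,
                                mic_spacing_m=0.2)
print(round(theta, 1))
```

Because the delay is an integer number of samples, the set of detectable angles is discrete. A shorter sampling period or a wider microphone spacing L gives finer angular resolution.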
As shown in FIG. 3, sound waves that arrive from direction $\theta_s$ are assumed to be plane waves, and it is assumed that these plane waves are received at two microphones M1 and M2 that are installed separated by a distance d from each other. At this time, the received signals $\chi_1(t)$ and $\chi_2(t)$ of microphones M1 and M2 are in the relation:
$$\chi_2(t) = \chi_1(t - \tau_s)$$
$$\tau_s = \frac{d \times \sin(\theta_s)}{c}$$
where c is the speed of sound.
Conversely, if the time difference $\tau_s$ between signals $\chi_1(t)$ and $\chi_2(t)$ is known, the arrival direction $\theta_s$ of the sound waves can be found from the following equation:
$$\theta_s = \sin^{-1}(c \cdot \tau_s / d)$$
The time difference $\tau_s$ is obtained from the cross-correlation function $\varphi_{12}(\tau)$ of $\chi_1(t)$ and $\chi_2(t)$, which is:
$$\varphi_{12}(\tau) = E[\chi_1(t) \cdot \chi_2(t + \tau)] = E[\chi_1(t) \cdot \chi_1(t + \tau - \tau_s)] = \varphi_{11}(\tau - \tau_s)$$
where $E[\cdot]$ represents the expected value, and $\varphi_{11}(\tau)$ represents the autocorrelation of $\chi_1(t)$.
Since it is known that autocorrelation function $\varphi_{11}(\tau)$ reaches a maximum at $\tau = 0$, $\varphi_{12}(\tau)$ attains a maximum at $\tau = \tau_s$. From this, $\tau_s$ is obtained if cross-correlation function $\varphi_{12}(\tau)$ is calculated and the $\tau$ that gives the maximum value is found, and the direction of the sound waves can be estimated if this value is substituted into the equation for finding the arrival direction $\theta_s$. Accordingly, the arrival delay time is found based on this estimation result, and the operation of converting to and outputting the speaker's direction is then carried out.
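The peak search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the expected value is replaced by a sample average over the overlapping samples, the test signal is synthetic white noise rather than speech, and the function names `cross_correlation` and `estimate_delay` are hypothetical.

```python
import random


def cross_correlation(x1, x2, lag):
    """Sample estimate of phi_12(lag) = E[x1(t) * x2(t + lag)]."""
    n = len(x1)
    total = 0.0
    count = 0
    for t in range(n):
        u = t + lag
        if 0 <= u < n:  # use only samples where both signals exist
            total += x1[t] * x2[u]
            count += 1
    return total / count if count else 0.0


def estimate_delay(x1, x2, max_lag):
    """Return the lag in [-max_lag, max_lag] that maximizes phi_12."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: cross_correlation(x1, x2, lag))


random.seed(0)
x1 = [random.gauss(0.0, 1.0) for _ in range(2000)]
true_delay = 5
# x2(t) = x1(t - true_delay), zero-padded at the start
x2 = [0.0] * true_delay + x1[:-true_delay]

delay = estimate_delay(x1, x2, max_lag=20)
print(delay)
```

The estimated lag, multiplied by the sampling period, gives $\tau_s$, which can then be substituted into the arrival direction equation.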
It is already known that cross-correlation function $\varphi_{12}(\tau)$ will have a relatively sharp peak if the frequency bandwidth is broad. Thus, $\tau_s$ can be accurately estimated despite the addition of noise if the peak is sharp. However, because the sharpness of the peak is influenced by the frequency bandwidth of the sound wave signal and because there is also influence from noise, some method must be used to eliminate the influence of error.
In the method disclosed in Japanese Patent Laid-open No. 123311/1995, for the purpose of controlling the image pickup angle of the camera of an image pickup device, a unidirectional microphone and a bidirectional microphone are used as the audio signal input sources and as a means for receiving the voice signal of the subject and detecting direction; and a means is used for synchronizing the output signals of these two microphones, calculating the phase difference by way of a sensitivity adjustment means of the microphones, and detecting the direction of the subject.
In the speaker direction detection method of the prior art described above, there is the problem that, when the output signals of the speaker direction detection device are used to control the shooting angle of the video camera, errors in detecting the speaker's direction cause the video camera to be directed away from the speaker, causing great inconvenience for the users of the television conference device. Erroneous operation is particularly frequent because the results of the cross-correlation function are used without modification, and direction detection control cannot be realized without adopting some countermeasure.
In the prior art disclosed in Japanese Patent Laid-open No. 123311/1995, moreover, it is believed that erroneous detection may occur due to the influence of variations in the characteristics of the microphones when phase difference is calculated through the sensitivity adjustment means of the microphones.
It is therefore an object of the present invention to provide a speaker direction detection circuit and a speaker direction detection method that is used in the circuit that solve the above-described problems, that can reduce erroneous detection of the speaker's direction even when signals that arrive from directions other than that of the speaker combine with the speaker's speech signals, and that can increase stability.
The speaker direction detection circuit according to the present invention is provided with: an evaluation function means that uses added values of a cross-correlation function for each time difference to estimate arrival time differences that arise from differences in the distance for speech signals to reach two microphones; and a detection means that detects the maximum value of said added values of the cross-correlation function to detect the direction of a speaker.
Another speaker direction detection circuit according to the present invention is provided with: an evaluation function means that uses an evaluation function according to a relational formula between an autocorrelation function and a cross-correlation function to estimate arrival time differences that arise from differences in the distance for speech signals to reach two microphones; and a detection means that detects the maximum value of said evaluation function to detect the direction of a speaker.
The speaker direction detection method according to the present invention includes steps of: using added values for every time difference of a cross-correlation function to estimate arrival time differences that arise from differences in distance for speech signals to reach two microphones; and detecting the maximum value of said added values of the cross-correlation function to detect the direction of a speaker.
Another speaker direction detection method according to the present invention includes steps of: using an evaluation function according to a relational formula between an autocorrelation function and a cross-correlation function to estimate arrival time differences that arise from differences in distance for speech signals to reach two microphones; and detecting the maximum value of said evaluation function to detect the direction of a speaker.
The speaker direction detection circuit of the present invention is provided with a means for excluding errors by performing a time statistical process on values of the above-described cross-correlation function in the portion for estimating the arrival delay time that is present between two speech signals applied from an omnidirectional microphone (hereinbelow referred to as "a microphone").
After performing this statistical process, which is, for example, adding cross-correlation function values for a particular time interval, a search for the maximum value is executed. Carrying out the maximum value search after performing the statistical process obtains the advantage of suppressing the occurrence of search errors to a minimum.
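The idea of adding cross-correlation values before searching for the maximum can be sketched as follows. This is a minimal illustration, not the circuit of the invention. The frame length, number of frames, noise level, and the names `frame_cross_correlation` and `accumulated_delay` are all assumptions of this sketch, and the input is synthetic noise rather than speech.

```python
import random


def frame_cross_correlation(x1, x2, start, frame_len, lag):
    """Cross-correlation sum of one frame: sum of x1(n) * x2(n + lag)."""
    total = 0.0
    for n in range(start, start + frame_len):
        total += x1[n] * x2[n + lag]
    return total


def accumulated_delay(x1, x2, frame_len, num_frames, max_lag):
    """Add per-frame cross-correlation values for each lag, then search."""
    lags = range(-max_lag, max_lag + 1)
    added = {lag: 0.0 for lag in lags}
    for f in range(num_frames):
        start = max_lag + f * frame_len  # keeps n + lag inside the signal
        for lag in lags:
            added[lag] += frame_cross_correlation(x1, x2, start, frame_len, lag)
    # The maximum is searched only after the values have been added.
    return max(added, key=added.get), added


random.seed(1)
length, true_delay = 1400, 3
s = [random.gauss(0.0, 1.0) for _ in range(length)]
# Each microphone receives the source plus independent noise.
x1 = [v + random.gauss(0.0, 1.0) for v in s]
x2 = [v + random.gauss(0.0, 1.0) for v in [0.0] * true_delay + s[:-true_delay]]

est, added = accumulated_delay(x1, x2, frame_len=64, num_frames=20, max_lag=10)
print(est)
```

Adding over several frames lets the correlated component grow in proportion to the number of frames, while uncorrelated noise contributions grow more slowly, so isolated per-frame peaks at wrong lags are less likely to win the search.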
In addition, because the values of the cross-correlation function depend on the signal power of speech that has reached the microphones, the present invention is characterized by using [the square of cross-correlation]/[the autocorrelation of time differences (delay)] in the evaluation function in searching for maximum values.
There is the further advantage that the effect of microphone sensitivity need not be considered. The evaluation function is derived from a theoretical formula that minimizes the square error of $\chi_1(t)$ and $\chi_2(t - \tau)$. Gain G that minimizes the square error and time difference can be found if it is assumed that the square error of $\chi_1(t)$ and $\chi_2(t - \tau)$ is minimized every particular frame interval (N samples).
If t is replaced by the sample index n and the error is $e(n) = \chi_1(n) - \chi_2(n) = \chi_1(n) - \chi_1(n - \tau)$, then, with gain G applied to the delayed term, the square error E can be found from the following equation:
$$E = \sum e(n)^2 = \sum [\chi_1(n) - G \cdot \chi_1(n - \tau)]^2 = \sum [\chi_1(n)]^2 - 2G \sum [\chi_1(n)\,\chi_1(n - \tau)] + G^2 \sum [\chi_1(n - \tau)]^2$$
where $\sum$ denotes the sum over $n = 0$ to $N - 1$.
In order to find G that minimizes E, the derivative of E with respect to G is set to zero:
$$\frac{\partial E}{\partial G} = -2 \sum [\chi_1(n)\,\chi_1(n - \tau)] + 2G \sum [\chi_1(n - \tau)]^2 = 0$$
$$\therefore\; G = \frac{\sum [\chi_1(n) \cdot \chi_1(n - \tau)]}{\sum [\chi_1(n - \tau)]^2}$$
If this is substituted into the formula for E:
$$E = \sum [\chi_1(n)]^2 - \frac{\left\{\sum [\chi_1(n)\,\chi_1(n - \tau)]\right\}^2}{\sum [\chi_1(n - \tau)]^2}$$
The square error is minimized if the $\tau$ that maximizes the second term on the right side above is found. This essentially represents [square of the cross-correlation] divided by [the autocorrelation of time difference $\tau_s$].
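A search that maximizes this second term can be sketched as follows. This is a minimal illustration under stated assumptions: the two microphone signals are used in place of $\chi_1(n)$ and its delayed copy, the lag sign follows the $\chi_2(t + \tau)$ convention of the cross-correlation function, and the names `evaluation` and `best_lag`, the frame length, and the synthetic signals are hypothetical.

```python
import random


def evaluation(x1, x2, start, frame_len, lag):
    """[square of cross-correlation] / [energy of the shifted signal]."""
    cross = 0.0
    energy = 0.0
    for n in range(start, start + frame_len):
        cross += x1[n] * x2[n + lag]
        energy += x2[n + lag] ** 2
    return cross * cross / energy if energy > 0.0 else 0.0


def best_lag(x1, x2, start, frame_len, max_lag):
    """Return the lag in [-max_lag, max_lag] with the largest evaluation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: evaluation(x1, x2, start, frame_len, lag))


random.seed(2)
s = [random.gauss(0.0, 1.0) for _ in range(600)]
true_delay = 4
x1 = s

# Repeat the search with different sensitivities for the second microphone.
results = []
for gain in (0.1, 1.0, 10.0):
    x2 = [0.0] * true_delay + [gain * v for v in s[:-true_delay]]
    results.append(best_lag(x1, x2, start=20, frame_len=512, max_lag=10))
print(results)
```

Scaling the second signal by a gain g multiplies the numerator and the denominator by the same factor g squared, so the evaluation value and the selected lag do not depend on that microphone's sensitivity.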
The above objects, features, and advantages of the present invention will become apparent from the following description based on the accompanying drawings which illustrate examples of preferred embodiments of the present invention.