1. Field of the Invention
The present invention relates to a sound source direction detecting apparatus, a sound source direction detecting method, and a sound source direction detecting camera for detecting the direction of a speaking person by analyzing the sound vocalized by that person illustratively during a conference.
2. Description of the Related Art
There exist video conferencing systems for linking speaking persons illustratively at remote locations during what is known as a video conference. With this type of system in operation, the remarks and gestures of the people participating in the video conference are exchanged in real time between the linked remote locations. One such video conferencing system is typically made up of microphones for collecting sounds emanating from the ongoing conference, a camera for imaging the participants, sound source detecting microphones incorporated in the camera so as to collect ambient sounds, and a sound source direction detection section for detecting the direction of the sound source (i.e., speaking person) based on the ambient sounds collected by the sound source detecting microphones. The video conferencing system also includes a drive part that points the camera in the speaking person's direction detected by the sound source direction detection section, and arrangements that convert the video frames imaged by the camera and the audio frames collected by the detecting microphones into a suitable transmission format before sending the converted data to another video conferencing system set up in the opposite remote location.
The sound source direction detection section detects the direction of the speaking person relative to the camera by analyzing his or her voice. When the speaking person's direction is detected, the drive part points the camera to the speaking person accordingly and starts imaging that person. Diverse methods have been proposed and utilized for determining the speaking person's direction (called the sound source direction hereunder). Outlined below in reference to FIGS. 12A through 12C is how the sound source direction is usually determined using two microphones.
FIG. 12A shows how two microphones are positioned relative to the sound source. Two microphones are generally used to detect the sound source direction. A first microphone 101a is separated from a second microphone 102a by a distance D. When a perpendicular line is drawn to the midpoint of a line segment linking the first microphone 101a with the second microphone 102a, an angle θ is formed between the perpendicular on the one hand and the arrows 101b and 102b on the other hand. The arrows at the angle θ denote the direction of a sound source 100. It is assumed that the distance from the first microphone 101a or the second microphone 102a to the sound source 100 is sufficiently longer than the distance D between the first microphone 101a and the second microphone 102a. Thus the arrows 101b and 102b, indicating the direction of the sound coming from the sound source 100 and input to the first and the second microphones 101a and 102a, are considered parallel to each other.
In this case, a perpendicular drawn from the second microphone 102a to the arrow 101b meets that arrow at an intersection point, and there is a distance L between the first microphone 101a and this intersection point. The distance L corresponds to a difference in time between the two microphones when they receive sound waves coming from the sound source 100. In other words, dividing the distance L [m] by the sonic velocity [m/s] provides the difference between two points in time, i.e., between the time when a wave surface in phase with the sound wave generated by the sound source reaches the second microphone 102a, and the time when the wave surface reaches the first microphone 101a. The value of sin θ is then obtained from the distance D between the two microphones and from the distance L calculated from the time difference, as sin θ = L/D. With the value of sin θ calculated, the camera is pointed in the sound source direction A accordingly.
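The geometric relation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the sonic velocity of 343 m/s (air at about 20 °C) is an assumed value that the text does not specify.

```python
import math

# Assumed sonic velocity in air at about 20 degrees C; not given in the text.
SPEED_OF_SOUND_M_S = 343.0


def sin_theta_from_delay(delay_s, mic_distance_m, speed=SPEED_OF_SOUND_M_S):
    """Return sin(theta) for a far-field source (hypothetical helper).

    delay_s: arrival-time difference between the two microphones, in seconds.
    mic_distance_m: distance D between the microphones, in meters.
    """
    path_difference_m = delay_s * speed        # the distance L
    return path_difference_m / mic_distance_m  # sin(theta) = L / D


def angle_degrees(sin_theta):
    """Convert sin(theta) to an angle in degrees, or None when |sin(theta)| > 1."""
    if abs(sin_theta) > 1.0:
        return None
    return math.degrees(math.asin(sin_theta))
```

For example, with microphones 0.1 m apart and a path difference L of 0.05 m, `sin_theta_from_delay(0.05 / 343.0, 0.1)` gives a value of about 0.5, which corresponds to an angle of about 30 degrees.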
FIG. 12B shows on a complex plane the sounds detected by the first and the second microphones 101a and 102a. As indicated in FIG. 12B, there is a phase difference φ between two vectors: vector B representing the sound detected by the first microphone 101a, and vector C denoting the sound detected by the second microphone 102a. The phase difference φ is attributable to the fact that the distance between the first microphone 101a and the sound source 100 is different from the distance between the second microphone 102a and the sound source 100 while the sound waves come from the same sound source. Taking the effects of the phase difference φ into consideration makes it possible to acquire the difference between two points in time, i.e., between the times when sound waves of a given frequency component reach the first microphone 101a and when sound waves of the same frequency component reach the second microphone 102a. The time difference thus obtained in turn allows the value of sin θ to be calculated, whereby the sound source direction is detected.
Sounds are first collected at intervals of a predetermined unit time and decomposed illustratively through fast Fourier transform (FFT) into frequency components making up the vectors for estimating the sound source direction. The phase difference φ between the first microphone 101a and the second microphone 102a is thus obtained. The lengths of the vectors found on the complex plane denote the sound power levels of the frequency components involved. Ideally, the sound source direction detected by the first microphone 101a should coincide with the sound source direction detected by the second microphone 102a, the direction being that of the vector B shown in FIG. 12B. Illustratively, the phase difference is zero if the sound source is located in the front (i.e., when the distance from the first microphone 101a to the sound source 100 is equal to the distance from the second microphone 102a to the sound source 100). A phase difference occurs if the sound source is located diagonally in front (i.e., when the distance from the first microphone 101a to the sound source 100 is different from the distance from the second microphone 102a to the sound source 100). That is, a plurality of vectors on the complex plane reveal the existence of a phase difference.
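As a rough illustration of the decomposition described above, the following Python sketch computes one frequency component of a frame from each microphone and compares the two resulting complex-plane vectors. It uses a direct discrete Fourier sum for a single bin in place of an FFT for brevity. The function names, the sign convention (phase of vector C relative to vector B), and the choice of reporting the power of the first microphone's component are all assumptions made for the example.

```python
import cmath
import math


def dft_bin(samples, k):
    """Return the k-th DFT coefficient of a real-valued frame (direct sum, not an FFT)."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))


def phase_difference_and_power(frame_mic1, frame_mic2, k):
    """Compare one frequency component of two equally long frames.

    Returns (phi, power): phi is the phase of the second microphone's
    component relative to the first, in radians within [-pi, pi];
    power is the squared magnitude of the first microphone's component.
    """
    b = dft_bin(frame_mic1, k)  # vector B on the complex plane
    c = dft_bin(frame_mic2, k)  # vector C on the complex plane
    phi = cmath.phase(c * b.conjugate())  # angle of C measured from B
    power = abs(b) ** 2
    return phi, power
```

When both frames are identical, as for a source located directly in front, the returned phase difference is zero.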
FIG. 12C shows a typical histogram acquired through analysis in ±90-degree directions relative to the front facing the first and the second microphones 101a and 102a (i.e., the direction of a perpendicular to the line segment linking the first microphone 101a with the second microphone 102a). In the histogram of FIG. 12C, the horizontal axis stands for values of sin θ and the vertical axis for accumulated power levels. Because the human voice contains various frequencies, a power level is calculated for each of the frequencies involved and added to the histogram at the angle estimated for that frequency. The resulting peak points to the angle of the sound source direction.
The values of sin θ on the horizontal axis cover not only the range |sin θ| ≤ 1 but also the range |sin θ| > 1, for the reasons explained below. Ordinarily, the following expression (1) is used to find the value of sin θ:
$$\sin\theta = \frac{\text{time difference}\left(\dfrac{1}{f}\times\dfrac{\varphi}{2\pi}\right)\times\text{sound velocity}}{\text{distance between microphones}} \qquad (1)$$
where f [Hz] stands for the frequency and φ for the phase difference.
If the value of sin θ is determined on the basis of time difference, sonic velocity, and distance between microphones and if the second microphone 102a is reached by sound waves earlier than the first microphone 101a, then the time difference takes on a positive value. If the second microphone 102a is reached by sound waves later than the first microphone 101a, then the time difference becomes a negative value. Thus the value of sin θ can be positive or negative. If in the expression (1) above the absolute value of the numerator is larger than the denominator, then the value of sin θ can be smaller than −1 or larger than 1. The values that occur when |sin θ|>1 stem from errors or sound wave diffraction. For these reasons, the portions of the histogram where |sin θ|>1 also need to be taken into consideration.
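Expression (1) can be evaluated as in the minimal Python sketch below. The function names and the default sonic velocity of 343 m/s are assumptions introduced for illustration. The sign of the result simply follows the sign of the phase difference passed in, and values with |sin θ| > 1 are flagged rather than discarded.

```python
import math


def sin_theta_from_phase(phi_rad, freq_hz, mic_distance_m, speed=343.0):
    """Evaluate expression (1): sin(theta) = (1/f * phi/(2*pi) * velocity) / distance.

    speed defaults to an assumed 343 m/s; the text does not give a value.
    """
    time_difference_s = (1.0 / freq_hz) * (phi_rad / (2.0 * math.pi))
    return time_difference_s * speed / mic_distance_m


def classify(sin_theta):
    """Label an estimate as 'in range' or 'out of range' (|sin(theta)| > 1)."""
    return "in range" if abs(sin_theta) <= 1.0 else "out of range"
```

With an assumed microphone spacing of 0.1 m and a 1000 Hz component, a phase difference of π/2 yields sin θ of about 0.86, whereas a phase difference of π yields about 1.7, which lies outside the range of a physical angle.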
Where sound is collected by a plurality of microphones, the estimated angle for each of the frequencies involved is added to the histogram as described. The angle at which the power level is highest is then detected as the sound source direction.
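The accumulation and peak selection described above might look like the following Python sketch. The bin range of −1.5 to 1.5, the bin count, and the function names are illustrative assumptions. The range deliberately extends past ±1 so that estimates with |sin θ| > 1 are retained.

```python
def accumulate_histogram(estimates, lo=-1.5, hi=1.5, bins=30):
    """Add each component's power level to the bin containing its sin(theta).

    estimates: iterable of (sin_theta, power) pairs, one per frequency component.
    The limits lo and hi and the bin count are assumed values for illustration.
    """
    width = (hi - lo) / bins
    hist = [0.0] * bins
    for sin_theta, power in estimates:
        if lo <= sin_theta < hi:
            # Clamp the index to guard against floating-point rounding at the upper edge.
            index = min(int((sin_theta - lo) / width), bins - 1)
            hist[index] += power
    return hist, width


def peak_sin_theta(hist, width, lo=-1.5):
    """Return the center of the bin with the highest accumulated power."""
    best = max(range(len(hist)), key=hist.__getitem__)
    return lo + (best + 0.5) * width
```

Given the pairs `[(0.52, 3.0), (0.55, 2.0), (-0.3, 1.0), (1.2, 0.5)]`, the two estimates near 0.5 fall into the same bin and together outweigh the others, so that bin's center is reported as the detected value of sin θ.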
Japanese Patent Laid-Open No. Hei 7-336790 discloses a microphone system that collects a plurality of sound signals and finds a time lag therebetween as well as the highest power level of the collected signals. The time lag and the highest power level are used to switch from one sound signal to another in collecting the sound.
Japanese Patent Laid-Open No. 2004-12151 discloses a sound source direction estimating apparatus with arrangements for preventing degradation of the accuracy in estimating where the sound source is located amid reflected sounds and noises which have been input concurrently.
Japanese Patent Laid-Open No. 2006-194700 discloses techniques for minimizing those errors in the sound source direction which are attributable to reverberations.