The present technique relates to a sound signal processing apparatus, a sound signal processing method, and a program, and more particularly, to a sound signal processing apparatus, a sound signal processing method, and a program capable of executing a speech section detection process accompanied with sound source direction estimation.
Hereinafter, as techniques according to the related art, speech section detection will be first described on the whole and a method of processing speech section detection using sound source direction estimation will be subsequently described.
The speech section detection (SD: Speech Detection) refers to a process of cutting out a section in which a person is speaking from a sound signal input via a microphone included in, for example, a sound signal processing apparatus. The speech section detection (SD) is also referred to as VAD (Voice Activity Detection).
In the specification, a process of cutting out the utterance section of a person from a sound signal will be described as “speech section detection” or simply “section detection.” Further, a “section” is not limited to a section for speech, but may indicate a section in which a given sound source continues to be active (continues to generate sound).
The speech section detection is sometimes used along with speech recognition, sound source extraction, or the like. However, in either case, high accuracy is necessary in the section detection.
For example, since processing such as matching is executed on the section cut out by the section detection in many speech recognition devices, the accuracy of the section detection has a great influence on the accuracy of speech recognition. That is, when there is a difference between the actually uttered section and the section detected by a section detector, the difference may cause erroneous recognition.
On the other hand, in the sound source extraction, the section detection is sometimes used. For example, when a clear voice is desired to be extracted from a signal in which voice and noise are mixed with each other or when the voice of one person is desired to be extracted in an environment in which two or more persons utter simultaneously, it is necessary to divide an input signal into a section, in which only the noise is generated, and a section, in which both the voice and noise are generated, in accordance with a method of extracting sound sources. Therefore, in order to divide the input signal into these sections, section detection is used.
The section detection may sometimes be used in order to reduce the calculation amount or to prevent adaptation to a silent section, by extracting the sound source only when a target voice is present. In the speech section detection used along with the sound source extraction, it is necessary to operate with high accuracy even on an input signal in which voice and noise are mixed with each other or in which voices are mixed with each other.
In order to meet the above-mentioned uses, various suggestions have been made to improve the accuracy of the speech section detection. Here, focusing on the number of microphones to be used, the suggestions are classified into the following two methods.
(1) Method of Using Single Microphone
This method is a method of extracting a feature indicating “voice likeness” from the input signal and executing the section detection based on the value.
This process is disclosed in, for example, Japanese Patent No. 4182444.
(2) Method of Using Plurality of Microphones
This method is a method of executing the section detection using the directions of sound sources.
This process is disclosed in, for example, Japanese Patent No. 4282704 and Japanese Unexamined Patent Application Publication No. 2010-121975.
The technique disclosed in the present specification uses method (2) above, that is, the method of using the plurality of microphones. Therefore, hereinafter, an overview of method (2), which uses the sound source direction, will be described.
The fundamental idea of the speech section detection based on the sound source direction is as follows.
Sounds generated from the same sound source arrive from the same direction as viewed from the microphones. Therefore, the direction of arrival (DOA) of the sound source is estimated at a predetermined time interval, a section in which sounds from the same direction continue to be generated is calculated, and the section is determined as a section in which the sound source is active (the sound is generated from the sound source). When this process is executed on the utterance of a human being, a speech section is detected.
Hereinafter, the direction of arrival (DOA) from the sound source is also simply referred to as a “sound source direction.”
When the method of estimating the sound source direction is applied to each of the plurality of sound sources, a section can be calculated for each sound source even when the plurality of sound sources are simultaneously active (for example, even when the voices of a plurality of persons overlap).
For example, in the case where immediately before the end of the utterance from a person, another person starts to utter, a long region in which both the utterances are connected to each other is detected as one section in the method of using the “voice likeness”, whereas respective sections of the utterances can be distinguished from each other and can be detected in the method of estimating the direction.
The overview of the method of detecting the speech section using the sound source direction estimation will be described with reference to FIGS. 1A to 1D.
FIG. 1A is a diagram illustrating an image of an input signal (also referred to as an “observation signal”). Two persons utter “Hello” and “Good-by”, respectively.
As shown in FIG. 1B, the input signal is divided into blocks with a predetermined length.
A block 11 shown in FIG. 1B indicates one of the divided blocks. The length of the block has a sufficiently short value in comparison to the length of a normal utterance. For example, the length is set to 1/10 second or 1/8 second.
The estimation of the sound source direction is executed on each block.
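The block division described above can be sketched as follows. This is a minimal illustration only: the function name split_into_blocks, the 16 kHz sampling rate, and the choice to drop an incomplete trailing block are assumptions made for the example and are not taken from the cited techniques.

```python
def split_into_blocks(samples, sample_rate, block_seconds=0.1):
    """Split a sequence of samples into consecutive fixed-length blocks.

    Trailing samples that do not fill a whole block are dropped
    (an assumption made for this sketch).
    """
    block_len = int(round(sample_rate * block_seconds))
    if block_len <= 0:
        raise ValueError("block length must be at least one sample")
    n_blocks = len(samples) // block_len
    return [samples[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]


# One second of dummy samples at an assumed 16 kHz rate,
# divided into blocks of 1/10 second each.
signal = [0.0] * 16000
blocks = split_into_blocks(signal, sample_rate=16000, block_seconds=0.1)
print(len(blocks), len(blocks[0]))
```

In this sketch, the sound source direction estimation would then be executed once per element of blocks.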
FIG. 1C shows the estimation result. The horizontal axis represents a time and the vertical axis represents a direction. The direction refers to an angle (see FIG. 2) of the sound source direction with respect to a microphone into which voice is input.
The points shown in FIG. 1C are direction points 12. The direction points indicate the sound source directions calculated inside each block.
Hereinafter, a point corresponding to the sound source direction is referred to as a “direction point.” When a direction estimation method for a plurality of sound sources is used, each block can have a plurality of direction points.
Next, the direction points in nearly identical directions are connected between the blocks. This process is referred to as tracking.
FIG. 1D shows the tracking result, that is, the connected direction points.
Lines 15 and 16 shown in FIG. 1D indicate a section in which each sound source is active, that is, a section of voice utterance.
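The connection of direction points between blocks can be sketched as follows. This is a simplified greedy procedure written for illustration, not the procedure of the cited publications; the function name track_direction_points, the tolerance max_angle_diff, the minimum length min_blocks, and the example direction values are all hypothetical.

```python
def track_direction_points(points_per_block, max_angle_diff=10.0, min_blocks=3):
    """Connect nearby direction points across consecutive blocks.

    points_per_block[b] is the list of directions (in degrees) found in block b.
    Returns a sorted list of (first_block, last_block, mean_direction) tuples.
    """
    active = []    # sections still being extended: {"start": int, "dirs": [...]}
    finished = []  # sections that have ended
    for block_index, points in enumerate(points_per_block):
        extended = set()   # indices of active sections extended in this block
        new_tracks = []
        for direction in points:
            # Find the closest active section not yet extended in this block.
            best, best_diff = None, None
            for i, track in enumerate(active):
                diff = abs(track["dirs"][-1] - direction)
                if i in extended or diff > max_angle_diff:
                    continue
                if best is None or diff < best_diff:
                    best, best_diff = i, diff
            if best is None:
                new_tracks.append({"start": block_index, "dirs": [direction]})
            else:
                active[best]["dirs"].append(direction)
                extended.add(best)
        # A section that received no direction point in this block has ended.
        finished += [t for i, t in enumerate(active) if i not in extended]
        active = [t for i, t in enumerate(active) if i in extended] + new_tracks
    finished += active
    sections = []
    for t in finished:
        if len(t["dirs"]) >= min_blocks:   # discard very short sections
            last = t["start"] + len(t["dirs"]) - 1
            sections.append((t["start"], last, sum(t["dirs"]) / len(t["dirs"])))
    return sorted(sections)


# Hypothetical direction points: a second speaker starts before the first ends.
points = [[], [-24.0], [-25.0], [-23.0, 12.0], [-24.0, 13.0], [11.0], [12.0], []]
sections = track_direction_points(points)
print(sections)
```

With these hypothetical values, the two overlapping utterances are returned as two separate sections (blocks 1 to 4 near −24° and blocks 3 to 6 near +12°) rather than as one long section. A practical tracker would typically also tolerate short gaps in which no direction point is found, which this sketch omits.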
As a method of calculating the sound source direction in each block, for example, Japanese Patent No. 4282704 described above discloses a process of using a “beamformer suppressing a signal arriving from an object sound source.”
Further, Japanese Unexamined Patent Application Publication No. 2010-121975 described above discloses a process of using a MUSIC method.
In each process, basically, a spatial filter in which a null beam is directed in the direction of a sound source is generated and the direction of the null beam is set as the sound source direction. Hereinafter, the MUSIC method will be described.
MUSIC is an abbreviation of MUltiple SIgnal Classification. The MUSIC method can be explained as the following two steps (S1) and (S2) from the viewpoint of spatial filtering (a process of transmitting or suppressing a sound in a specific direction). The details of the MUSIC method are described in Japanese Unexamined Patent Application Publication No. 2008-175733 or the like.
(S1) A spatial filter is generated such that a null beam is directed in the directions of all the sound sources generating voices within a given section (block).
(S2) A directivity characteristic (relationship between a direction and a gain) is investigated for the filter and the direction in which the null beam is formed is calculated.
Of steps (S1) and (S2) described above, the method of generating the spatial filter in step (S1) will be described later. First, the process of step (S2) will be described below.
FIG. 2 is a diagram illustrating a recording environment of the observation signals used to generate the spatial filter (FIG. 3) in which a null beam is directed in the sound source direction. Four microphones 22 and two sound sources (both human voices) are present. Further, the sound source direction is the direction of arrival viewed from a center 21 of the array of the microphones 22. When 0° is set in a vertical direction 24 with respect to an array direction 23 parallel to the array of the microphones, the counterclockwise direction is a positive (+) direction and the clockwise direction is a negative (−) direction.
FIG. 3 is a diagram illustrating the directivity characteristic of the spatial filter in which the null beam is directed in the sound source direction, that is, a plotted relationship between a direction (horizontal axis) and a gain (vertical axis). The vertical axis is on a logarithmic scale. A method of generating the directivity characteristic plot will be described later. Hereinafter, the spatial filter in which the null beam is directed toward the sound source is referred to as a “null beam forming filter” and the plot of the directivity characteristic of this filter is referred to as a “null beam forming pattern.”
A portion in which the gain sharply falls in the null beam forming pattern 31 shown in FIG. 3 expresses a direction in which sensitivity is low, that is, a null beam. In the drawing, a deep “valley” is present in a vicinity 32 of a direction=−24° and a vicinity 33 of a direction=+12°. The valleys indicate the null beams corresponding to a sound source 1, 25 and a sound source 2, 26 in FIG. 2.
That is, a direction θ1 of the sound source 1 is about −24° and a direction θ2 of the sound source 2 is about +12°. In other words, the block corresponding to the null beam forming pattern has two direction points, at −24° and +12°.
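Step (S2), in which the directivity characteristic of a filter is scanned and the directions of the null beams are read off, can be sketched as follows. The sketch rests on several assumptions that are not part of the description: a uniform linear array of four microphones with 0.04 m spacing, a single frequency of 2000 Hz, a sound speed of 340 m/s, a plane-wave phase model, and a 1° scanning grid. The filter itself is built by simple zero placement as a stand-in; it is not generated by the MUSIC method.

```python
import cmath
import math

SOUND_SPEED = 340.0   # m/s (assumed)
MIC_SPACING = 0.04    # m between adjacent microphones (assumed)
FREQUENCY = 2000.0    # Hz, a single frequency (assumed)


def phase_factor(theta_deg):
    """Phase factor between adjacent microphones for a plane wave from theta_deg.

    0 degrees is perpendicular to the array direction (assumed sign convention).
    """
    delay = MIC_SPACING * math.sin(math.radians(theta_deg)) / SOUND_SPEED
    return cmath.exp(1j * 2.0 * math.pi * FREQUENCY * delay)


def null_filter(null_dirs_deg, extra_zero=-0.5):
    """Stand-in null beam forming filter built by zero placement (not MUSIC).

    The weights are the coefficients of the product of (z - z_k), lowest power
    first. extra_zero lies off the unit circle, so it forms no additional null.
    Two null directions plus the extra zero give four weights (four microphones).
    """
    zeros = [phase_factor(d) for d in null_dirs_deg] + [extra_zero]
    coeffs = [1.0 + 0j]
    for z0 in zeros:
        nxt = [0j] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):   # multiply the polynomial by (z - z0)
            nxt[k] += -z0 * c
            nxt[k + 1] += c
        coeffs = nxt
    return coeffs


def log_gain(weights, theta_deg, floor=1e-12):
    """Logarithm of the filter gain for direction theta_deg (floored to avoid log 0)."""
    z = phase_factor(theta_deg)
    response = sum(w * z ** m for m, w in enumerate(weights))
    return math.log10(max(abs(response), floor))


def find_null_directions(weights, threshold=-6.0):
    """Grid directions that are local minima of the log gain and lie below threshold."""
    grid = list(range(-90, 91))
    gains = [log_gain(weights, t) for t in grid]
    return [grid[i] for i in range(1, len(grid) - 1)
            if gains[i] < gains[i - 1] and gains[i] < gains[i + 1]
            and gains[i] < threshold]


weights = null_filter([-24.0, 12.0])
print(find_null_directions(weights))
```

Under these assumptions, the scan recovers the two valleys at −24° and +12°, which become the direction points of the block. The depth threshold is a hypothetical way of ignoring shallow dips; how a valley is judged to be a null beam in the cited techniques is not stated here.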
In the MUSIC method, the reciprocal of the gain may be used instead of the logarithm of the gain. For example, the reciprocal is used in Japanese Unexamined Patent Application Publication No. 2008-175733 described above. In this case, the null beam is expressed as a sharp “mountain” on a graph. Here, the method of using the logarithm of the gain is described for comparison with the present technique.
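The relationship between the two representations can be checked with a small sketch. The gain values below are hypothetical and chosen only for illustration. Because the logarithm is monotonically increasing and the reciprocal (inverse number) is monotonically decreasing for positive gains, a “valley” of the logarithm and a “mountain” of the reciprocal fall at the same directions.

```python
import math

# Hypothetical linear gains of a null beam forming filter on a coarse grid.
directions = [-36, -24, -12, 0, 12, 24]
gains = [0.8, 0.001, 0.5, 0.6, 0.002, 0.9]

log_gains = [math.log10(g) for g in gains]
reciprocals = [1.0 / g for g in gains]


def local_minima(values):
    """Indices of interior points smaller than both neighbors."""
    return [i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


def local_maxima(values):
    """Indices of interior points larger than both neighbors."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > values[i - 1] and values[i] > values[i + 1]]


valleys = [directions[i] for i in local_minima(log_gains)]
mountains = [directions[i] for i in local_maxima(reciprocals)]
print(valleys, mountains)
```

Both lists contain the same directions, so either representation yields the same direction points; only the shape of the plot differs.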
When the direction points of each block are calculated in this way, the direction points having similar values are connected to each other between the blocks. For example, when the direction points having values close to the direction=−24° are connected to each other in the environment shown in FIG. 2, the human utterance section corresponding to the sound source 1, 25 shown in FIG. 2 is calculated. When the direction points having values close to the direction=+12° are connected to each other, the human utterance section corresponding to the sound source 2, 26 is calculated.