The present disclosure relates to a sound signal processing device, method, and program. More specifically, it relates to a sound signal processing device, method, and program for performing sound source extraction processing.
The sound source extraction processing is used to extract one target source signal from signals (hereinafter referred to as “observation signals” or “mixed signals”) in which a plurality of source signals are mixed to be observed with one or more microphones. Hereinafter, the target source signal (that is, the signal desired to be extracted) is referred to as a “target sound” and the other source signals are referred to as “interference sounds”.
One of the problems to be solved by the sound signal processing device is to accurately extract a target sound in an environment containing a plurality of sound sources when the sound source direction and segment of the target sound are known to some extent.
In other words, it is to leave only a target sound by removing interference sounds from observation signals in which the target sound and the interference sounds are mixed, by using information of a sound source direction and a segment.
The sound source direction as referred to here means a direction of arrival (DOA) as viewed from the microphone, and the segment means a pair of a sound starting time (when the source becomes active) and a sound ending time (when it stops being active), together with the signal contained in that interval.
For example, the following conventional technologies are available, which disclose processing to estimate the directions and detect the segments of a plurality of sound sources.
(Conventional Approach 1) Approach Using an Image, in Particular, a Position of the Face and Movement of the Lips
This approach is disclosed in, for example, Patent Document 1 (Japanese Patent Application Laid-Open No. 10-51889). Specifically, by this approach, a direction in which the face exists is judged as the sound source direction and the segment during which the lips are moving is regarded as an utterance segment.
(Conventional Approach 2) Detection of Speech Segment Based on Estimated Sound Source Direction Accommodating a Plurality of Sound Sources
This approach is disclosed in, for example, Patent Document 2 (Japanese Patent Application Laid-Open No. 2010-121975). Specifically, by this approach, an observation signal is subdivided into blocks each of which has a predetermined length, and the directions of a plurality of sound sources are estimated for each of the blocks. Next, the directions of the sound sources are tracked by connecting, from block to block, the directions that are close to each other.
The following will describe the above problems, that is, to “accurately extract a target sound if its sound source direction and segment are known to some extent in an environment in which there are a plurality of sound sources”.
The problem will be described in order of the following items:
A. Details of the problem
B. Specific example of problem solving processing to which the conventional technologies are applied
C. Problems of the conventional technologies
[A. Details of the Problem]
A description will be given in detail of the problem of the technology of the present disclosure with reference to FIG. 1.
It is assumed that there are a plurality of sound sources (signal generation sources) in an environment. One of the sound sources is a “sound source of a target sound 11” which generates the target sound and the others are “sound sources of interference sounds 14” which generate the interference sounds.
It is assumed that the number of the target sound sources 11 is one and that the number of the interference sound sources is at least one. Although FIG. 1 shows one “sound source of the interference sound 14”, additional interference sounds may exist.
The direction of arrival of the target sound is assumed to be known and expressed by variable θ. In FIG. 1, the sound source direction θ is denoted by numeral 12. The reference direction (line denoting direction=0) may be set arbitrarily. In FIG. 1 it is set as a reference direction 13.
If the sound source direction of the sound source of a target sound 11 is a value estimated by one of the above approaches, that is, either of:
(conventional approach 1) using an image, in particular, a position of the face and movement of the lips, and
(conventional approach 2) detection of speech segment based on estimated sound source direction accommodating a plurality of sound sources, there is a possibility that θ contains an error. For example, even if θ is estimated as π/6 radians (=30°), the true sound source direction may be a different value (for example, 35°).
The direction of the interference sound is assumed to be unknown or, even if it is known, to contain an error. The same holds for the segment. For example, even in an environment in which the interference sound is active, there is a possibility that only part of its segment is detected or that no segment of it is detected at all.
As shown in FIG. 1, n microphones are prepared. They are the microphones 1 to n, denoted by numerals 15 to 17 respectively. Further, the relative positions among the microphones are known.
Next, a description will be given of the variables used in the sound source extraction processing with reference to the following Equations [1.1] to [1.3].
In the specification, A_b denotes an expression in which subscript suffix b is set to A, and A^b denotes an expression in which superscript suffix b is set to A.
$$X(\omega, t) = \begin{bmatrix} X_1(\omega, t) \\ \vdots \\ X_n(\omega, t) \end{bmatrix} \qquad [1.1]$$
$$Y(\omega, t) = W(\omega)\, X(\omega, t) \qquad [1.2]$$
$$W(\omega) = \left[ W_1(\omega), \ldots, W_n(\omega) \right] \qquad [1.3]$$
Let x_k(τ) be a signal observed with the k-th microphone, where τ is time.
By performing short-time Fourier transform (STFT) on the signal (which is detailed later), an observation signal X_k(ω, t) in the time-frequency domain is obtained, where
ω is a frequency bin number, and
t is a frame number.
Let X(ω, t) be a column vector of X_1(ω, t) to X_n(ω, t), which is an observation signal with each microphone (Equation [1.1]).
In the sound source extraction according to the present disclosure, an extraction result Y(ω, t) is basically obtained by multiplying the observation signal X(ω, t) by an extracting filter W(ω) (Equation [1.2]), where the extracting filter W(ω) is a row vector including n elements and is denoted as Equation [1.3].
The various approaches for extracting sound sources can basically be classified by the method used to calculate the extracting filter W(ω).
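As an illustration of Equation [1.2], the following is a minimal sketch of applying an extracting filter in every frequency bin. The array layout (bins, frames, microphones), the function name apply_extracting_filter, and the trivial filter that merely selects microphone 1 are assumptions made for this example only; they are not taken from the disclosure.

```python
import numpy as np

def apply_extracting_filter(W, X):
    """Apply Y(w, t) = W(w) X(w, t) (Equation [1.2]) in every frequency bin.

    W: complex array, shape (n_bins, n_mics), one row vector W(w) per bin
    X: complex array, shape (n_bins, n_frames, n_mics), observation X(w, t)
    Returns Y with shape (n_bins, n_frames).
    """
    # Sum over the microphone index k for each bin f and frame t.
    return np.einsum("fk,ftk->ft", W, X)

# Toy data (assumed sizes): 4 bins, 5 frames, 3 microphones.
rng = np.random.default_rng(0)
n_bins, n_frames, n_mics = 4, 5, 3
X = (rng.standard_normal((n_bins, n_frames, n_mics))
     + 1j * rng.standard_normal((n_bins, n_frames, n_mics)))

# Trivial illustrative filter: pass microphone 1 unchanged, ignore the rest.
W = np.zeros((n_bins, n_mics), dtype=complex)
W[:, 0] = 1.0

Y = apply_extracting_filter(W, X)
```

The approaches outlined in the following sections differ only in how the rows of W are computed; the application step stays the same.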
[B. Specific Example of Problem Solving Processing to which Conventional Technologies are Applied]
The approaches for realizing processing to extract a target sound from mixed signals from a plurality of sound sources are roughly classified into the following two approaches:
B1. sound source extraction approach and
B2. sound source separation approach.
The following will describe conventional technologies to which those approaches are applied.
(B1. Sound Source Extraction Approach)
As the sound source extraction approach for extracting sound sources by using known sound source direction and segment, the following are known, for example:
B1-1: Delay-and-sum array;
B1-2: Minimum variance beamformer;
B1-3: Maximum SNR beamformer;
B1-4: Approach based on target sound removal and subtraction; and
B1-5: Time-frequency masking based on phase difference.
Those approaches all use a microphone array (in which a plurality of microphones are disposed to the different positions). For their details, see Patent Document 3 (Japanese Patent Application Laid-Open No. 2006-72163).
The following will outline those approaches.
(B1-1. Delay-and-sum Array)
If different time delays are given to the signals observed with the different microphones and those observation signals are summed in a condition where the phases of the signals from the direction of a target sound are aligned, the target sound is emphasized because its phases are aligned, and sounds from other directions are attenuated because they are shifted in phase.
Specifically, letting S(ω,θ) be a steering vector corresponding to a direction θ (which is a vector giving the difference in phase between the microphones for a sound arriving from that direction and will be detailed later), an extraction result is obtained by using the following Equation [2.1].
$$Y(\omega, t) = S(\omega, \theta)^{H} X(\omega, t) \qquad [2.1]$$
$$Y(\omega, t) = M(\omega, t)\, X_k(\omega, t) \qquad [2.2]$$
$$\mathrm{angle}\!\left( \frac{X_2(\omega, t)}{X_1(\omega, t)} \right) \qquad [2.3]$$
$$N(\omega) = \left[ S(\omega, \theta_1) \;\; \cdots \;\; S(\omega, \theta_m) \right] \qquad [2.4]$$
$$Z(\omega, t) = N(\omega)^{\#} X(\omega, t) \qquad [2.5]$$
In Equation [2.1], superscript “H” denotes Hermitian transpose, by which a vector or matrix is transposed and its elements are replaced by their complex conjugates.
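Equation [2.1] can be sketched as follows. Because the steering vector is only detailed later in the disclosure, this example assumes a far-field plane wave, a linear microphone array, a sound speed of 343 m/s, and a steering vector normalized by the square root of n; the names steering_vector and delay_and_sum and all numeric values are hypothetical.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s (assumed)

def steering_vector(freq_hz, theta, mic_pos):
    """Assumed far-field steering vector for a linear array.

    mic_pos: microphone positions along the array axis, in metres
    theta:   direction in radians, measured from broadside
    """
    delays = mic_pos * np.sin(theta) / SOUND_SPEED
    return np.exp(-2j * np.pi * freq_hz * delays) / np.sqrt(len(mic_pos))

def delay_and_sum(S, X):
    """Equation [2.1] for one frequency bin: Y = S^H X.

    S: (n_mics,), X: (n_frames, n_mics) -> Y: (n_frames,)
    """
    return X @ S.conj()

mic_pos = np.array([0.0, 0.05, 0.10, 0.15])  # 4 microphones, 5 cm apart
freq_hz = 2000.0
S_target = steering_vector(freq_hz, np.pi / 6, mic_pos)   # target at 30 degrees
S_interf = steering_vector(freq_hz, -np.pi / 3, mic_pos)  # interference at -60 degrees

# Two observation frames: a unit-norm wave from each direction.
X = np.stack([S_target, S_interf])
gain_target, gain_interf = np.abs(delay_and_sum(S_target, X))
```

With these assumed values the wave from the target direction passes with unit gain, while the wave from the other direction is attenuated but not cancelled.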
(B1-2. Minimum Variance Beamformer)
By this approach, only a target sound is extracted by forming a filter which has a gain of 1 (which means neither emphasis nor attenuation) in the direction of a target sound and a null beam (that is, a direction of low sensitivity) in the direction of an interference sound.
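The text above gives no formula for this filter. The sketch below uses the standard textbook closed form of the minimum variance beamformer, W = S^H R^{-1} / (S^H R^{-1} S), where R is the covariance matrix of the observation; this form, the far-field steering vector model, and all numeric values are assumptions made for illustration and are not quoted from the disclosure.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s (assumed)

def steering_vector(freq_hz, theta, mic_pos):
    """Assumed far-field steering vector for a linear array (unit-modulus elements)."""
    delays = mic_pos * np.sin(theta) / SOUND_SPEED
    return np.exp(-2j * np.pi * freq_hz * delays)

def minimum_variance_filter(R, S):
    """Textbook form W = S^H R^-1 / (S^H R^-1 S), returned as a 1-D row vector."""
    r = np.linalg.solve(R, S)        # r = R^-1 S
    return r.conj() / (r.conj() @ S)  # R is Hermitian, so r^H = S^H R^-1

mic_pos = np.array([0.0, 0.05, 0.10])
freq_hz = 2000.0
S_target = steering_vector(freq_hz, np.pi / 6, mic_pos)
S_interf = steering_vector(freq_hz, -np.pi / 3, mic_pos)

# Covariance of a strong interference sound plus a small noise floor.
R = 10.0 * np.outer(S_interf, S_interf.conj()) + 0.01 * np.eye(len(mic_pos))

W = minimum_variance_filter(R, S_target)
gain_target = abs(W @ S_target)  # constrained to 1
gain_interf = abs(W @ S_interf)  # close to 0: a null beam toward the interference
```

The unit-gain constraint is tied to the steering vector passed in, so a steering vector computed from an erroneous target direction places the constraint in the wrong direction.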
(B1-3. Maximum SNR Beamformer)
By this approach, a filter W(ω) is obtained which maximizes V_s(ω)/V_n(ω), which is a ratio between the following a) and b):
a) V_s(ω): Variance of a result obtained by applying an extracting filter W(ω) to a segment where only the target sound is active
b) V_n(ω): Variance of a result obtained by applying the extracting filter W(ω) to a segment where only the interference sound is active
By this approach, the direction of the target sound is unnecessary if the respective segments can be detected.
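Maximizing the ratio V_s(ω)/V_n(ω) can be posed as a generalized eigenvalue problem on the covariance matrices of the two segments. The following is a minimal sketch under that formulation for a single frequency bin; the synthetic segments, the random transfer vectors a_t and a_i, and the function names are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def covariance(X):
    """Sample covariance E[X X^H]; X has shape (n_frames, n_mics)."""
    return X.T @ X.conj() / X.shape[0]

def max_snr_filter(R_s, R_n):
    """Row vector W maximizing (W R_s W^H) / (W R_n W^H)."""
    # The maximizer is the principal eigenvector of R_n^-1 R_s.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(R_n, R_s))
    w = eigvecs[:, np.argmax(eigvals.real)]
    return w.conj()

def variance_ratio(W, R_s, R_n):
    """V_s / V_n for a filter W given as a 1-D row vector."""
    return (W @ R_s @ W.conj()).real / (W @ R_n @ W.conj()).real

rng = np.random.default_rng(0)
n_mics, n_frames = 3, 200

def cnormal(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

a_t, a_i = cnormal(n_mics), cnormal(n_mics)  # hypothetical transfer vectors
# Segment where only the target is active, and segment where only the interference is.
X_s = np.outer(cnormal(n_frames), a_t) + 0.01 * cnormal(n_frames, n_mics)
X_n = np.outer(cnormal(n_frames), a_i) + 0.01 * cnormal(n_frames, n_mics)

R_s, R_n = covariance(X_s), covariance(X_n)
W = max_snr_filter(R_s, R_n)
best = variance_ratio(W, R_s, R_n)
```

No steering vector appears anywhere in this computation, which matches the statement that the target direction is unnecessary when both segments are available.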
(B1-4. Approach Based on Removal and Subtraction of Target Sound)
A signal (target sound-removed signal) obtained by removing the target sound from the observation signals is formed once and then this target sound-removed signal is subtracted from the observation signal (or a signal in which the target sound is emphasized by a delay-and-sum array etc.), thereby giving only the target sound.
By the Griffiths-Jim beamformer, which is one of these approaches, ordinary subtraction is used as the subtraction method. There are also other approaches, such as spectral subtraction, in which nonlinear subtraction is used.
(B1-5. Time-frequency Masking Based on Phase Difference)
By the frequency masking approach, the different frequencies are multiplied by the different coefficients to mask (suppress) the frequency components dominant in the interference sound while leaving the frequency components dominant in the target sound, thereby extracting the target sound.
By the time-frequency masking approach, the masking coefficient is not fixed but changes over time, so that letting M(ω, t) be the masking coefficient, extraction can be denoted by Equation [2.2]. In place of X_k(ω, t), the second factor on the right-hand side, an extraction result obtained by any other approach may be used. For example, the extraction result by use of the delay-and-sum array (Equation [2.1]) may be multiplied by the mask M(ω, t).
Generally, the sound signal is sparse both in the frequency direction and in the time direction, so that even if the target sound and the interference sound become active simultaneously, there are many times and frequencies at which the target sound is dominant. Some methods for finding such times and frequencies use the difference in phase between the microphones.
For time-frequency masking by use of phase difference, see, for example, “Variant 1. Frequency Masking” described in Patent Document 4 (Japanese Patent Application Laid-Open No. 2010-20294). Although this example calculates the masking coefficient from a sound source direction and a phase difference which are obtained by independent component analysis (ICA), the phase difference obtained by any other approach can be applied. The following will describe the frequency masking from a viewpoint of sound source extraction.
For simplification, it is assumed that two microphones are used. That is, in FIG. 1, the number of the microphones (n) is two (n=2).
If there are no interference sounds, a plot of the inter-microphone phase difference against frequency follows almost a straight line. For example, if there is only one sound source of the target sound 11 in FIG. 1, a sound from the sound source arrives at the microphone 1 (denoted by numeral 15) first and, after a constant lapse of time, arrives at the microphone 2 (denoted by numeral 16).
By comparing signals observed by those two microphones:
signal observed by the microphone 1 (denoted by 15): X_1(ω, t), and
signal observed by the microphone 2 (denoted by 16): X_2(ω, t), it is found that X_2(ω, t) is delayed in phase.
Therefore, by calculating the phase difference between the two by using Equation [2.3] and plotting a relationship between the phase difference and the frequency bin number ω, a correspondence relationship shown in FIG. 2 can be obtained.
Phase difference dots 22 are on a straight line 21. A difference in arrival time depends on the sound source direction θ, so that the gradient of the straight line 21 also depends on the sound source direction θ. Here, angle(x) is a function that returns the argument (phase angle) of a complex number x, as follows: angle(A exp(jα)) = α
If there are interference sounds, the phase of the observation signal is affected by the interference sounds, so that the phase difference plot deviates from the straight line. The magnitude of the deviation is largely dependent on the influence of the interference sounds. In other words, if the dot of the phase difference at a frequency and at a time exists near the straight line, the interference sounds have small components at this frequency and at this time. Therefore, by generating and applying a mask that leaves the components at such a frequency and at such a time while suppressing the others, it is possible to leave only the components of a target sound.
FIG. 3 is an example where almost the same plot as FIG. 2 is provided in an environment where there are interference sounds. A straight line 31 is similar to the straight line 21 shown in FIG. 2, but some phase-difference dots deviate from the straight line owing to the influence of the interference sounds. For example, a dot 33 is one of them. A frequency bin whose dot deviates largely from the straight line 31 contains a large interference component, so the component of such a frequency bin is attenuated. For example, the shift between the phase difference dot and the straight line, that is, a shift 32 shown in FIG. 3, is calculated. The larger this value is, the nearer to 0 the M(ω, t) in Equation [2.2] is set; conversely, the nearer the phase difference dot is to the straight line, the nearer to 1 the M(ω, t) is set.
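The masking procedure described above can be sketched for two microphones as follows. The text does not specify how the shift from the straight line is mapped to a coefficient, so the raised-cosine mapping used here is a hypothetical choice; the expected phase differences and the toy observation are likewise assumed values.

```python
import numpy as np

def phase_difference(X1, X2):
    """Equation [2.3]: angle(X2 / X1), computed without an explicit division."""
    return np.angle(X2 * np.conj(X1))

def phase_mask(X1, X2, expected):
    """Hypothetical soft mask M(w, t) in [0, 1].

    expected: phase difference on the straight line for each bin, shape (n_bins,)
    The mask is 1 when the observed dot lies on the line and 0 when it is
    shifted from the line by pi.
    """
    deviation = phase_difference(X1, X2) - expected[:, None]
    return 0.5 * (1.0 + np.cos(deviation))

# Toy observation: 3 frequency bins, 2 frames (assumed values).
expected = np.array([0.0, -0.4, -0.8])       # points on the straight line
X1 = np.ones((3, 2), dtype=complex)
X2 = np.exp(1j * expected)[:, None] * X1     # every dot lies on the line ...
X2[2, 1] *= -1.0                             # ... except one, shifted by pi

M = phase_mask(X1, X2, expected)
Y = M * X1                                   # Equation [2.2] with k = 1
```

Only the single time-frequency point whose dot was moved off the line is suppressed; all other points pass unchanged.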
Time-frequency masking has an advantage in that it involves a smaller computational cost than the minimum variance beamformer and the ICA and can also remove non-directional interference sounds (environmental noise and other sounds whose sound source directions are unclear). On the other hand, it has a problem in that it produces discontinuous portions in the spectrum and, therefore, is prone to musical noise when the spectrum is converted back to a waveform.
(B2. Sound Source Separation Approach)
Although the conventional sound source extraction approaches have been described above, a variety of sound source separation approaches can be applied in some cases. That is, after a plurality of simultaneously active sound sources are separated by the sound source separation approach, one target signal is selected by using information such as a sound source direction.
The following may be enumerated as the sound source separation approach.
B2-1. Independent component analysis (ICA)
B2-2. Null beamformer
B2-3. Geometric constrained source separation (GSS)
The following will outline those approaches.
(B2-1. Independent Component Analysis: ICA)
A separation matrix W(ω) is obtained so that each of the components of Y(ω), which is a result of applying W(ω), may be independent statistically. For details, see Japanese Patent Application Laid-Open No. 2006-238409. Further, for a method for obtaining a sound source direction from results of separation by use of ICA, see the above Patent Document 4 (Japanese Patent Application Laid-Open No. 2010-20294).
Besides the ordinary ICA approach, which generates as many separation results as there are microphones, a method referred to as a deflation method is available for extracting source signals one by one and is used in the analysis of signals such as those of magneto-encephalography (MEG). However, if the deflation method is applied simply to a signal in the time-frequency domain, which of the source signals is extracted first varies with the frequency bin. Therefore, the deflation method is not used in extraction of time-frequency signals.
(B2-2. Null Beamformer)
A matrix is generated in which steering vectors (whose generation method is described later) corresponding to sound source directions respectively are arranged horizontally, to obtain its (pseudo) inverse matrix, thereby separating an observation signal into the respective sound sources.
Specifically, letting θ_1 be the sound source direction of a target sound and θ_2 to θ_m be the sound source directions of interference sounds, a matrix N(ω) is generated in which steering vectors corresponding to the sound source directions respectively are arranged horizontally (Equation [2.4]). By multiplying the pseudo inverse matrix of N(ω) and the observation signal vector X(ω, t), a vector Z(ω, t) is obtained which has the separation results as its elements (Equation [2.5]). (In the equation, the superscript # denotes the pseudo inverse matrix.)
Since the direction of the target sound is θ_1, the target sound is the top element in the Z(ω, t).
Further, the first row of N(ω)^# provides a filter in which a null beam is formed in the directions of all of the sound sources other than the target sound.
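Equations [2.4] and [2.5] can be sketched for one frequency bin as follows. The far-field steering vector model, the array geometry, and the two source directions are assumptions made for this example; the disclosure describes its own steering vector generation later.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s (assumed)

def steering_vector(freq_hz, theta, mic_pos):
    """Assumed far-field steering vector for a linear array."""
    delays = mic_pos * np.sin(theta) / SOUND_SPEED
    return np.exp(-2j * np.pi * freq_hz * delays)

mic_pos = np.array([0.0, 0.05, 0.10])
freq_hz = 2000.0
thetas = [np.pi / 6, -np.pi / 3]   # theta_1: target, theta_2: interference

# Equation [2.4]: steering vectors arranged horizontally (as columns).
N = np.column_stack([steering_vector(freq_hz, th, mic_pos) for th in thetas])
N_pinv = np.linalg.pinv(N)         # N(w)^#

# Synthetic observation X = N s for 6 frames of two source signals.
rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 6)) + 1j * rng.standard_normal((2, 6))
X = N @ sources

Z = N_pinv @ X                      # Equation [2.5]
target_estimate = Z[0]              # top element corresponds to theta_1

# The first row of N^# has a null toward the interference direction.
null_gain = abs(N_pinv[0] @ N[:, 1])
```

The recovery is exact here only because the directions used to build N are exactly the directions that generated X.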
(B2-3. Geometric Constrained Source Separation (GSS))
By obtaining a matrix W(ω) that satisfies the following two conditions, a separation filter can be obtained which is more accurate than the null beamformer.
a) W(ω) is a (pseudo) inverse matrix of N(ω).
b) The elements of the result Z(ω, t) of applying W(ω) are statistically uncorrelated with each other.
[C. Problems of Conventional Technologies]
Next, a description will be given of problems of the conventional technologies described above.
Although the above example has assumed that the target sound's direction and segment are known, they cannot always be obtained accurately. That is, there are the following problems.
1) The target sound's direction may be inaccurate (contain an error) in some cases.
2) The interference sound's segment cannot always be detected.
For example, by the method using an image, there is a possibility that a misalignment between the camera and the microphone array may give a disagreement between a sound source direction calculated from the face position and a sound source direction with respect to the microphone array. Further, the segment may not be detected for the sound source not related to the face position or the sound source out of the camera angle of field.
By the approach based on sound source direction estimation, there is a trade-off between the accuracy of the directions and the computational cost. For example, if the MUSIC method is used for sound source direction estimation, decreasing the angle steps in which the null beam is scanned improves the accuracy but increases the computational cost.
MUSIC stands for MUltiple SIgnal Classification. From the viewpoint of spatial filtering by which a sound in a specific direction is permitted to pass or suppressed, the MUSIC method may be described as processing including the following two steps (S1 and S2). For details of the MUSIC method, see Patent Document 5 (Japanese Patent Application Laid-Open No. 2008-175733) etc.
(S1) Generating a spatial filter whose null beams are directed to all of the sound sources which are active in a certain segment (block), and
(S2) Scanning the directivity pattern (relationship between the direction and the gain) of the filter, to obtain a direction in which the null beam appears.
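Steps S1 and S2 can be sketched for one frequency bin as follows, using the common noise-subspace formulation of MUSIC. The far-field steering vector model, the synthetic covariance matrix, and the 5-degree scanning step are assumptions made for this example and are not taken from Patent Document 5.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s (assumed)

def steering_vector(freq_hz, theta, mic_pos):
    """Assumed far-field steering vector for a linear array."""
    delays = mic_pos * np.sin(theta) / SOUND_SPEED
    return np.exp(-2j * np.pi * freq_hz * delays)

def music_gains(R, n_sources, freq_hz, mic_pos, scan_thetas):
    """Gain of the null-forming filters for each scanned direction."""
    # (S1) Eigenvectors with the smallest eigenvalues act as spatial filters
    # whose nulls point at every active source (eigh sorts in ascending order).
    _, eigvecs = np.linalg.eigh(R)
    noise = eigvecs[:, : len(mic_pos) - n_sources]
    # (S2) Scan the directivity pattern of those filters.
    return np.array([
        np.linalg.norm(noise.conj().T @ steering_vector(freq_hz, th, mic_pos))
        for th in scan_thetas
    ])

mic_pos = np.array([0.0, 0.04, 0.08, 0.12])
freq_hz = 2000.0
a = steering_vector(freq_hz, np.deg2rad(30.0), mic_pos)      # one source at 30 degrees
R = np.outer(a, a.conj()) + 0.01 * np.eye(len(mic_pos))      # synthetic covariance

scan_thetas = np.deg2rad(np.arange(-90, 91, 5))              # 5-degree steps
gains = music_gains(R, 1, freq_hz, mic_pos, scan_thetas)
estimated_deg = np.rad2deg(scan_thetas[np.argmin(gains)])    # direction of the null
```

The number of steering vectors evaluated grows in proportion to the number of scanned angles, which is the trade-off between angular resolution and computational cost noted above.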
The sound source direction optimal to extraction varies with the frequency bin. Therefore, if only one sound source direction is obtained from all of the frequencies, a mismatch occurs between the optimal value and some of the frequency bins.
If the target sound direction is inaccurate or the interference sound cannot be detected in such a manner, the accuracy of extraction (or separation) by some of the conventional methods deteriorates.
In the case of using sound source extraction as previous processing of any other processing (speech recognition or recording), the following requirements should preferably be satisfied:
low delay (only a short time elapses from the end of a segment to the generation of extraction results (or separation results)); and
followability (high extraction accuracy is kept from the start of the segment).
However, none of the conventional methods has satisfied all of those requirements. The following will describe problems of the above approaches.
(C1. Problems of Delay-and-sum Array (B1-1))
Even with inaccurate directions, the influence on the extraction result is limited to some extent.
However, if a small number of (for example, three to five) microphones are used, the interference sounds are not attenuated so much. That is, this approach has only an effect of emphasizing the target sound to a small extent.
(C2. Problems of Minimum Variance Beamformer (B1-2))
If there is an error in the direction of a target sound, extraction accuracy decreases rapidly. This is because if a direction in which the gain is fixed to 1 disagrees with a true direction of the target sound, a null beam is formed also in the direction of the target sound to deteriorate the target sound also. That is, a ratio between the target sound and the interference sound (SNR) will not increase.
To address this problem, a method is available for learning an extracting filter by using an observation signal in a segment where the target sound is not active. However, in this case, all of the sound sources other than the target sound need to be active in this segment. In other words, the interference sound, if present only in the segment in which the target sound is active, may not be removed.
(C3. Problems of Maximum SNR Beamformer (B1-3))
It does not use a sound source direction and, therefore, is not affected even by inaccurate direction of the target sound.
However, it needs to be given both of:
a) a segment in which only the target sound is active, and
b) a segment in which all of the sound sources other than the target sound are active, and, therefore, cannot be applied if either of them cannot be obtained. For example, if any one of the interference sounds is active almost at all times, a) cannot be obtained. Further, if there is an interference sound active only in a segment in which the target sound is active, b) cannot be obtained.
(C4. Problems of Approach Based on Removal and Subtraction of Target Sound (B1-4))
If there is an error in the direction of a target sound, extraction accuracy decreases rapidly. This is because if the direction of the target sound is inaccurate, the target sound is not completely removed, so that if the signal is subtracted from an observation signal, the target sound is also removed to some extent.
That is, the ratio between the target sound and the interference sound does not increase.
(C5. Problems of Time-frequency Masking Based on Phase Difference (B1-5))
This approach is affected by inaccurate directions, but only to a limited extent.
However, originally, there are not so large differences in phase between the microphones at low frequencies, so that accurate extraction is difficult.
Further, a discontinuous portion is liable to occur in a spectrum, so that there is a case where musical noise may occur at the time of recovery to waveforms.
There is another problem in that the spectrum resulting from time-frequency masking differs from the spectrum of natural speech, so that if speech recognition etc. is used at the latter stage, extraction is possible (interference sounds can be removed) but the accuracy of speech recognition may not be improved in some cases.
Moreover, if the degree of overlapping between the target sound and the interference sound increases, the masked portions increase, so that there is a possibility that the sound volume of the extraction result may decrease or the degree of musical noise may increase.
(C6. Problems of Independent Component Analysis (ICA) (B2-1))
This approach does not use a sound source direction, so that no influence is given on separation even with inaccurate directions.
However, this approach involves larger computational cost than the other approaches and suffers from a large delay in batch processing (which uses observation signals all over the segments). Moreover, in the case of a single target sound, even though only one of n number of (n: number of microphones) separated signals is employed, the same computational cost and the same memory usage are necessary as those in a case where n number of them are used. Besides, this approach needs processing to select the signal and, therefore, involves the correspondingly increased computational cost and develops a possibility that a signal different from the target sound may be selected, which is referred to as selection error.
By providing real-time processing through applying shift or on-line algorithms described in Patent Document 6 (Japanese Patent Application Laid-Open No. 2008-147920), the latency can be reduced but tracking lag occurs. That is, a phenomenon occurs that a sound source which becomes active first has low extraction accuracy near the start of a segment and, as it gets nearer the end of the segment, the extraction accuracy increases.
(C7. Problems of Null Beamformer (B2-2))
If the direction of an interference sound is inaccurate, the separation accuracy decreases rapidly. This is because a null beam is formed in a direction different from the true direction of the interference sound and, therefore, the interference sound is not removed.
Further, the directions of all the sound sources in the segment including the interference sounds need to be known. The undetected sound sources are not removed.
(C8. Problems of Geometric Constrained Source Separation (GSS) (B2-3))
This approach is affected by inaccurate directions, but only to a limited extent.
In this approach also, the directions of all the sound sources in the segment including the interference sounds need to be known.
The above discussion may be summarized as follows: there has been no approach satisfying all of the following requirements.
1) Even with the inaccurate direction of a target sound, its influence is small.
2) Even if the segment and the direction of an interference sound are unknown, the target sound can be extracted.
3) Small latency and high tracking capability.
For those technologies, see, for example, Japanese Patent Application Laid-Open No. 10-51889 (Document 1), Japanese Patent Application Laid-Open No. 2010-121975 (Document 2), Japanese Patent Application Laid-Open No. 2006-72163 (Document 3), Japanese Patent Application Laid-Open No. 2010-20294 (Document 4), Japanese Patent Application Laid-Open No. 2008-175733 (Document 5), and Japanese Patent Application Laid-Open No. 2008-147920 (Document 6).