1. Field of the Invention
The present invention relates to a sound source separation apparatus and a sound source separation method.
2. Description of the Related Art
When a plurality of sound sources and a plurality of microphones (equivalent to sound input units) are present in a predetermined sound space, each of the plural microphones yields a sound signal (hereinafter referred to as mixed sound signal) in which the individual sound signals (hereinafter referred to as sound source signals) from the plural sound sources are superimposed on one another. A sound source separation method that separates (identifies) the respective sound source signals only on the basis of the thus obtained (input) plural mixed sound signals is called a blind source separation method, which will be hereinafter referred to as the BSS method. An example of a sound source separation process based on the BSS method is a sound source separation process based on a method for an independent component analysis (hereinafter referred to as the ICA method).
The plural mixed sound signals (time-series (time-domain) sound signals) input through the plurality of microphones originate from sound source signals that are statistically independent from each other. The sound source separation process based on the ICA method includes a process for optimizing a predetermined separating matrix (inverse of the mixing matrix) through a learning computation on the basis of the input plural mixed sound signals, on the premise that the sound source signals are statistically independent from each other. Furthermore, the sound source separation process based on the ICA method includes performing a filter process (matrix operation) on the plural input mixed sound signals with use of the separating matrix optimized through the learning computation, thus identifying the sound source signals (sound source separation).
Here, the optimization of the separating matrix based on the ICA method is performed through the learning computation, in which a calculation of a separation signal (identified signal), obtained by performing the filter process (matrix operation) on a mixed sound signal of a predetermined time length with use of the separating matrix, and an update of the separating matrix through an inverse matrix operation or the like with use of that separation signal are successively repeated.
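The alternation described above, filtering with the current separating matrix and then updating the matrix from the separation result, can be sketched as follows. This is a minimal illustration, not the method of any particular reference: the tanh score function, the learning rate, and the natural-gradient update rule are common choices in the ICA literature and are assumed here, and a real FDICA implementation would additionally operate on complex-valued frequency-domain signals.

```python
import numpy as np

def ica_learning_step(W, X, mu=0.1):
    # One learning iteration: filter the mixed signals with the current
    # separating matrix, then update the matrix from the separation result.
    Y = W @ X                                    # filter process (matrix operation)
    phi = np.tanh(Y)                             # score function (assumed choice)
    G = np.eye(W.shape[0]) - (phi @ Y.T) / X.shape[1]
    return W + mu * (G @ W)                      # natural-gradient update (assumed)

# Two statistically independent source signals mixed by an unknown matrix A
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = A @ S                                        # mixed sound signals

W = np.eye(2)                                    # initial separating matrix
for _ in range(200):                             # learning computation
    W = ica_learning_step(W, X)
```

After the loop, W @ A is close to a scaled permutation matrix, meaning W @ X recovers the source signals up to scale and ordering, which is the inherent ambiguity of blind separation.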
The ICA method used for performing the sound source separation process based on the BSS method is roughly divided into an ICA method in Time-Domain (hereinafter referred to as the TDICA method) and an ICA method in Frequency-Domain (hereinafter referred to as FDICA method).
The TDICA method is a method which evaluates the independence of the respective sound source signals over a wide frequency band in general. In the learning computation of the separating matrix, the convergence in the vicinity of the optimal point is high. For this reason, according to the TDICA method, it is possible to obtain a separating matrix with a high optimization level, and the sound source signals can be separated from each other at a high precision (high separation performance). However, the TDICA method requires an extremely complicated (high operational load) process for the learning computation of the separating matrix (a process for a convolutive mixture) and is therefore not suitable for a real time process.
On the other hand, the FDICA method, disclosed for example in Japanese Unexamined Patent Application Publication No. 2003-271168, is a method for performing the learning computation of the separating matrix after changing the problem of a convolutive mixture into a problem of an instantaneous mixture for each of the frequency bins, which are frequency bands divided into plural pieces (the sub bands in Japanese Unexamined Patent Application Publication No. 2003-271168), through a Fourier transform process for converting the mixed sound signal from a time-domain signal into a frequency-domain signal. According to this FDICA method, the optimization (learning computation) of the separating matrix (the matrix to be used for the separation filter process) can be performed stably and also at a high speed. Therefore, the FDICA method is suitable for a real time sound source separation process.
Incidentally, according to the FDICA method, the number of the frequency bins (the number of the sub bands in Japanese Unexamined Patent Application Publication No. 2003-271168) in the frequency-domain mixed sound signal used for the learning computation of the separating matrix (hereinafter referred to as learning input signal) significantly affects the separation performance in a case where the filter process is performed with use of the separating matrix obtained through that learning computation. Here, since in the Fourier transform process the number of the frequency bins of the output signal (the frequency-domain signal) is ½ times as many as the number of the samples of the input signal (the time-domain signal), it may also be said that the number of the samples of the mixed sound signal (the digital signal) that is the input of the Fourier transform process significantly affects the separation performance. Also, since the sampling cycle at the time of A/D conversion of the mixed sound signal is constant, it may also be said that the time length of the mixed sound signal that is the input of the Fourier transform process significantly affects the separation performance.
For example, in a case where the sampling frequency of the mixed sound signal is 8 kHz, if the length (the frame length) of the input signal (the time-domain signal) of the Fourier transform process is set to about 1024 samples (128 ms in terms of time), that is, if the number of the frequency bins (the number of the sub bands) in the output signal (the frequency-domain signal) of the Fourier transform process is set to about 512, high separation performance can be obtained (a separating matrix with high separation performance can be obtained).
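These figures follow directly from the stated sampling frequency and frame length; restated as a small calculation for illustration:

```python
fs = 8000              # sampling frequency [Hz]
frame_len = 1024       # frame length of the Fourier transform input [samples]

num_bins = frame_len // 2             # frequency bins: 1/2 the input samples
frame_ms = frame_len * 1000 // fs     # frame length in milliseconds
```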
Next, while referring to FIG. 8, a description will be given of a conventional process procedure in a case of executing the sound source separation process based on the FDICA method in real time. FIG. 8 is a block diagram illustrating a conventional flow of a sound source separation process based on the FDICA method.
In an example illustrated in FIG. 8, the sound source separation process based on the FDICA method is executed by a learning computation unit 34, a second FFT processing unit 42′, a separation filter processing unit 44′, an IFFT processing unit 46′, and a synthesis process unit 48′. The learning computation unit 34, the second FFT processing unit 42′, the separation filter processing unit 44′, the IFFT processing unit 46′, and the synthesis process unit 48′ are composed, for example, of a computation processor such as a DSP (Digital Signal Processor), a storage unit such as a ROM that stores a program to be executed by the processor, and other peripheral devices such as a RAM.
Also, for the convenience of description, the respective buffers illustrated in FIG. 8 (a first input buffer 31, a first intermediate buffer 33, a second input buffer 41′, a second intermediate buffer 43′, a third intermediate buffer 45′, a fourth intermediate buffer 47′, and an output buffer 49′) are described as if the buffers could accumulate an extremely large amount of data. However, in actuality, data that is no longer necessary among the stored data is sequentially deleted in the respective buffers, and the free space thus obtained is reused. Accordingly, the storage capacity of the respective buffers is set to a necessary and sufficient amount.
The mixed sound signal (the sound signal) of each channel digitalized at a constant sampling cycle is input (transmitted) to the first input buffer 31 and the second input buffer 41′ by N samples each. For example, in a case where the sampling frequency of the mixed sound signal is 8 kHz, N is set to about 512. In this case, the time length of the mixed sound signal by the N samples is 64 ms.
Then, each time a new mixed sound signal by the N samples is input to the first input buffer 31, a first FFT processing unit 32 executes the Fourier transform process on the latest mixed sound signal by the 2N samples including those N samples (hereinafter referred to as first time-domain signal S0), and a frequency-domain signal that is the resultant of the process (hereinafter referred to as first frequency-domain signal Sf0) is temporarily stored in the first intermediate buffer 33. Here, in a case where the number of the signal samples accumulated in the first input buffer 31 does not reach 2N (an initial stage after the process start), the Fourier transform process is executed on a signal in which the value 0 is replenished for the deficient number of samples. The number of the frequency bins of the first frequency-domain signal Sf0 obtained by performing the Fourier transform process once in the first FFT processing unit 32 is ½ times as many as the number of the samples of the first time-domain signal S0 (= N).
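The zero replenishment and the half-count of frequency bins can be illustrated as follows, as a sketch using numpy. The FFT of a real-valued 2N-sample signal is conjugate-symmetric, which is why only N bins carry independent information; the description's half-count convention is followed here, and the buffer contents are placeholder random samples.

```python
import numpy as np

N = 512
buffer_2N = np.zeros(2 * N)        # value 0 replenishes any deficiency

# Suppose only 700 samples have accumulated so far (initial stage after start):
rng = np.random.default_rng(1)
accumulated = rng.standard_normal(700)
buffer_2N[:accumulated.size] = accumulated

spectrum = np.fft.fft(buffer_2N)   # Fourier transform of the 2N-sample frame
# For a real input the spectrum is conjugate-symmetric, so only the first
# half carries independent information: N frequency bins.
Sf0 = spectrum[:N]
```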
Then, each time the first intermediate buffer 33 records the first frequency-domain signal Sf0 by a predetermined time length T [sec], the learning computation unit 34 performs, on the basis of the signal Sf0 by T [sec], the learning computation of a separating matrix W(f), that is, of the filter coefficients (matrix components) constituting the separating matrix W(f). Furthermore, the learning computation unit 34 updates, at a predetermined timing, the separating matrix used in the separation filter processing unit 44′ into the separating matrix after the learning (that is, the values of the filter coefficients of the separating matrix are updated to the values after the learning). In a normal case, the learning computation unit 34 updates the separating matrix immediately after the filter process of the separation filter processing unit 44′ is ended for the first time following the completion of the learning computation.
On the other hand, each time a new mixed sound signal by the N samples is input to the second input buffer 41′, the second FFT processing unit 42′ also executes the Fourier transform process on the latest mixed sound signal by the 2N samples including the N samples (hereinafter referred to as second time-domain signal S1), and a frequency-domain signal that is the process result (hereinafter referred to as second frequency-domain signal Sf1) is temporarily stored in the second intermediate buffer 43′. In this manner, the second FFT processing unit 42′ executes the Fourier transform process on second time-domain signals S1 (the mixed sound signal) whose time slots are overlapped one another by the N samples in sequence. Here, in a case where the number of the signal samples accumulated in the second input buffer 41′ does not reach 2N (an initial stage after the process start), the Fourier transform process is executed on a signal in which the value 0 is replenished for the deficient number of samples. It should be noted that the number of the frequency bins of this second frequency-domain signal Sf1 is also ½ times as many as the number of the samples of the second time-domain signal S1 (= N).
Then, each time the second intermediate buffer 43′ records the new second frequency-domain signal Sf1, the separation filter processing unit 44′ performs a filter process (matrix operation) with use of the separating matrix on the new second frequency-domain signal Sf1, and a signal obtained through the process (hereinafter referred to as third frequency-domain signal Sf2) is temporarily stored in the third intermediate buffer 45′. The separating matrix used in this filter process is to be updated by the above-described learning computation unit 34. It should be noted that until the separating matrix is updated for the first time by the learning computation unit 34, the separation filter processing unit 44′ performs the filter process with use of the separating matrix (initial matrix) in which a predetermined initial value is set. Here, it is needless to mention that the second frequency-domain signal Sf1 and the third frequency-domain signal Sf2 have the same number of the frequency bins.
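The filter process of the separation filter processing unit 44′ is, for each frequency bin, a plain matrix operation on the channels. A minimal sketch, with illustrative shapes and variable names (numpy assumed):

```python
import numpy as np

def separation_filter(W, Sf1):
    # W   : (F, M, M) separating matrix, one M x M matrix per frequency bin
    # Sf1 : (F, M)    second frequency-domain signal (F bins, M channels)
    # Returns the third frequency-domain signal Sf2 (same number of bins).
    return np.einsum('fij,fj->fi', W, Sf1)

F, M = 512, 2
rng = np.random.default_rng(2)
Sf1 = rng.standard_normal((F, M)) + 1j * rng.standard_normal((F, M))

# Until the first update, a predetermined initial matrix is used; with an
# identity initial matrix the filter simply passes the signal through.
W_init = np.broadcast_to(np.eye(M), (F, M, M)).copy()
Sf2 = separation_filter(W_init, Sf1)
```

Because the matrix is applied bin by bin, Sf1 and Sf2 necessarily have the same number of frequency bins, as stated above.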
Also, each time the third intermediate buffer 45′ records the new third frequency-domain signal Sf2, the IFFT processing unit 46′ executes an inverse Fourier transform process on the new third frequency-domain signal Sf2, and a time-domain signal that is the resultant of the process (hereinafter referred to as third time-domain signal S2) is temporarily stored in the fourth intermediate buffer 47′. The number of the samples of this third time-domain signal S2 (= 2N) is 2 times as many as the number of the frequency bins (= N) of the third frequency-domain signal Sf2. As described above, since the second FFT processing unit 42′ executes the Fourier transform process on second time-domain signals S1 (the mixed sound signal) whose time slots are overlapped one another by the N samples, the time slots are mutually overlapped by the N samples in the two continuous third time-domain signals S2 recorded in the fourth intermediate buffer 47′.
Furthermore, each time the fourth intermediate buffer 47′ records the new third time-domain signal S2, the synthesis process unit 48′ executes a synthesis process to be illustrated below to generate a new separation signal S3, which is temporarily recorded in the output buffer 49′.
Here, the above-described synthesis process is a process for synthesizing, through addition by a crossfade weighting for example, the two signals at the part (a signal by N samples each) where the time slots overlap one another in the new third time-domain signal S2 obtained in the IFFT processing unit 46′ and the third time-domain signal S2 obtained one time before. As a result, the smoothed separation signal S3 is obtained.
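Assuming linear crossfade weights (the weighting itself is not specified above), the synthesis can be sketched as follows. When the overlapped halves are identical, that is, when no filtering has altered the frames, the crossfade reproduces the original samples exactly; after the separation filter it smooths discontinuities at frame boundaries.

```python
import numpy as np

N = 4
w = np.arange(N) / N              # crossfade weights, assumed linear (0 -> 1)

def synthesize(prev_S2, new_S2):
    # Crossfade the overlapped parts (N samples each) of two consecutive
    # third time-domain signals S2 to obtain the next N output samples of S3.
    return (1.0 - w) * prev_S2[N:] + w * new_S2[:N]

# Frames of 2N samples whose time slots are overlapped one another by N samples:
x = np.arange(4 * N, dtype=float)
frames = [x[i:i + 2 * N] for i in range(0, 2 * N + 1, N)]

out = np.concatenate([synthesize(frames[k], frames[k + 1])
                      for k in range(len(frames) - 1)])
```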
By way of the above-described process, although some delay (time delay) is caused with respect to the mixed sound signal, the separation signal S3 corresponding to the sound source is recorded in the output buffer 49′ in real time.
Also, the separating matrix used in the filter process is appropriately updated so as to be adapted to a change in acoustic environment by the learning computation unit 34.
Next, while referring to FIGS. 9A to 9E, the output delay caused by the conventional sound source separation process illustrated in FIG. 8 will be described. FIGS. 9A to 9E are block diagrams illustrating a state transition of the signal input and output in a conventional sound source separation process based on the FDICA method.
Here, the output delay refers to a delay from a time point when the mixed sound signal is generated to a time point when a separation signal separated and generated from the mixed sound signal is output.
Hereinafter, a buffer for temporarily storing the mixed sound signal (the digital signal) obtained through an A/D conversion process is denoted by an input buffer 23. From this input buffer 23, the mixed sound signal by the N samples is transferred to the first input buffer 31 and the second input buffer 41′. Also, in FIGS. 9A to 9E, an input point Pt1 represents a signal write position with respect to the input buffer 23 (an instruction position of a write pointer), and an output point Pt2 represents a signal read position from the output buffer 49′ (an instruction position of a read pointer). The input point Pt1 and the output point Pt2 are sequentially moved in synchronism with the same cycle as the sampling cycle of the mixed sound signal. Also, the input point Pt1 and the output point Pt2 are cyclically moved in each of the input buffer 23 and the output buffer 49′ having a storage capacity of 2N samples.
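The cyclic movement of the pointers over a buffer with a storage capacity of 2N samples can be modeled as follows. This is an illustrative structure, not the actual implementation; the `step` method advances the pointer by one position per sampling cycle, as the input point Pt1 and the output point Pt2 do.

```python
class CyclicPointerBuffer:
    # A buffer of 2N samples whose pointer moves cyclically, advancing
    # one position per sampling cycle (as Pt1 and Pt2 in FIGS. 9A to 9E).
    def __init__(self, n):
        self.buf = [0.0] * (2 * n)   # initially filled with the value 0
        self.pos = 0                 # instruction position of the pointer

    def step(self, sample):
        old = self.buf[self.pos]
        self.buf[self.pos] = sample
        self.pos = (self.pos + 1) % len(self.buf)   # cyclic movement
        return old

N = 4
inp = CyclicPointerBuffer(N)
for s in range(2 * N + 3):           # write more samples than the capacity
    inp.step(float(s))
```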
FIG. 9A represents a state at the time of the process start. No signals are accumulated in both the input buffer 23 and the output buffer 49′ (for example, a state where value 0 is embedded).
FIG. 9B represents a state after the state of FIG. 9A, in which new signals are written in the input buffer 23 in accordance with the movement of the input point Pt1 in sequence and the signal by the N samples is accumulated. At this time, the signal by the N samples (the signal denoted by input (1) in the drawing) is transferred to a unit for performing the sound source separation process (hereinafter referred to as sound source separation process unit A), and the sound source separation process is executed.
To be more specific, the signal by the N samples is transferred to (recorded in) the first input buffer 31 and the second input buffer 41′, and the sound source separation process described on the basis of FIG. 8 is executed. Also, in the input buffer 23, the signal after the transfer to the sound source separation process unit A is ended is deleted.
FIG. 9C represents a state after the state of FIG. 9B, in which the sound source separation process unit A generates a separation signal by the N samples (the signal denoted by output (1) in the drawing), and the separation signal is written in the output buffer 49′. This separation signal (the output (1)) is equivalent to the separation signal S3 in FIG. 8.
In this state of FIG. 9C, the output point Pt2 is at a position where the separation signal is not written, and therefore the separation signal (the output (1)) is not output yet.
FIG. 9D represents a state after the state of FIG. 9C, in which a further new signal is written in the input buffer 23, and the next signal by the N samples (the signal denoted by input (2) in the drawing) is accumulated. At this time, the next signal by the N samples (the input (2)) is transferred to the sound source separation process unit A, and the sound source separation process is executed.
In this state of FIG. 9D, as the output point Pt2 is at the write position of the previous separation signal (the output (1)), the output of the separation signal (the output (1)) is started.
FIG. 9E represents a state after the state of FIG. 9D, in which a new separation signal by the N samples is generated by the sound source separation process unit A (the signal denoted by output (2) in the drawing), and the separation signal is written in the output buffer 49′. Between the time point of FIG. 9D and the time point of FIG. 9E, in accordance with the movement of the output point Pt2, the previous separation signal (the output (1)) is sequentially output by 1 sample each. Also, the signal after the output is ended is deleted in the output buffer 49′.
As is apparent from FIGS. 9A to 9E, in the conventional sound source separation process, an output delay equivalent to the time length of the signal by 2N samples is caused between the time point of FIG. 9A and the time point of FIG. 9D with respect to the signal delivery and receipt in the prior stage and the subsequent stage of the sound source separation process unit A. Furthermore, in the sound source separation process unit A as well, through the above-described synthesis process performed by the synthesis process unit 48′, an output delay equivalent to the time length of the signal by N samples is caused. Therefore, in the conventional sound source separation process, there is a problem in that an output delay equivalent to the time length of the signal by 3N samples is caused in total.
For example, when the sampling frequency of the signal is 8 kHz, if 1 frame is set as the signal of 1024 samples (that is, N = 512) so that a separating matrix with high separation performance can be obtained through the FDICA method, an output delay of 192 [msec] is caused.
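The 192 [msec] figure follows directly from the sampling frequency and N:

```python
fs = 8000            # sampling frequency [Hz]
N = 512              # half the frame length (1 frame = 2N = 1024 samples)

tN_ms = N * 1000 / fs            # time length of the signal by N samples [ms]
output_delay_ms = 3 * tN_ms      # 2N (buffering) + N (synthesis) = 3N samples
```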
This output delay of 192 [msec] is hardly acceptable in an apparatus that operates in real time. For example, a delay time in communication in a digital mobile phone is, in general, equal to or smaller than 50 [msec]. When the sound source separation based on the conventional FDICA method is applied to such a digital mobile phone, the total delay time becomes 242 [msec], which is impractical. Similarly, when the sound source separation based on the conventional FDICA method is applied to a hearing aid, the time deviation between an image viewed by the eyes of the user and a sound heard through the hearing aid is too large, which is impractical.
Here, by setting the positional relation between the input point Pt1 and the output point Pt2 in advance to be different from the positional relation illustrated in FIGS. 9A to 9E, the output delay can be made smaller than the time length of the signal by 3N samples. However, in that case too, the output delay is merely shortened to a time obtained by adding the time required to perform the sound source separation process to the time length of the signal by 2N samples. That is, according to the sound source separation process based on the FDICA method, the time of the output delay becomes more than 2 times and up to about 3 times as long as the execution cycle of the Fourier transform process (the process of the second FFT processing unit 42′) for obtaining the frequency-domain signal Sf1 used as the input signal of the filter process (the time length tN of the signal by N samples).
On the other hand, the time of the output delay can be shortened by setting the length of 1 frame short (setting the number of samples small). However, shortening the length of 1 frame causes a problem in that the sound source separation performance is deteriorated.