Technical Field
The present disclosure relates to a sound pickup device, program recorded medium, and method, and is applicable to, for example, a sound pickup device, program recorded medium, or method that emphasizes sound in a specific area and suppresses sound outside of that area.
Related Art
A beamformer (BF hereafter) employing a microphone array is conventional technology that selectively picks up only sound from a specific direction (also referred to as a “target direction” below) in an environment in which plural sources of sound are present (see the following document: Asano Futoshi, “Acoustical Technology Series 16: Array Signal Processing for Acoustics—Localization, Tracking, and Separation of Sound Sources”, The Acoustical Society of Japan, published Feb. 25, 2011 by Corona Publishing). A BF is technology for forming directionality using time differences in signals arriving at respective microphones.
Conventional BFs can be broadly divided into two categories: addition-types and subtraction-types. Subtraction-type BFs in particular have the advantage of being able to give directionality using a small number of microphones compared to addition-type BFs. The device described by Japanese Patent Application Laid-open (JP-A) No. 2014-72708 is a device that applies a conventional subtraction-type BF.
Explanation is given below regarding an example of a configuration for a conventional subtraction-type BF.
FIG. 18 is an explanatory diagram illustrating a configuration example of a sound pickup device PS applying a conventional subtraction-type BF.
The sound pickup device PS illustrated in FIG. 18 extracts target sound (sound from a target direction) from output of a microphone array MA configured using two microphones M1, M2.
FIG. 18 illustrates the sound signals captured by the microphones M1 and M2 as x1 (t) and x2 (t), respectively. Moreover, the sound pickup device PS illustrated in FIG. 18 includes a delay device DEL and a subtraction device SUB.
The delay device DEL aligns the phase of the target sound by computing the time difference τL between the signals x1 (t) and x2 (t) arriving at the respective microphones M1, M2, and adding a corresponding delay. Hereafter, the signal given by adding a delay of τL to x1 (t) is denoted x1 (t−τL).
The delay device DEL computes the time difference τL using Equation (1) below. In Equation (1) below, d denotes the distance between the microphones M1 and M2, c denotes the speed of sound, and τL denotes the amount of delay. Moreover, in Equation (1) below, θL denotes the angle formed between the target direction and a direction orthogonal to the straight line connecting the microphones M1, M2 together.
τL=(d sin θL)/c  (1)
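As a rough illustration of Equation (1), the following minimal sketch computes the delay for two steering angles. The function name, the 3 cm microphone spacing, and the value of 343 m/s for the speed of sound are assumptions made for this example and are not taken from the disclosure.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: air at roughly room temperature


def delay_seconds(mic_spacing_m, theta_l_rad, c=SPEED_OF_SOUND_M_S):
    """Equation (1): tau_L = d * sin(theta_L) / c."""
    return mic_spacing_m * math.sin(theta_l_rad) / c


# Hypothetical 3 cm spacing between the microphones M1 and M2.
tau_endfire = delay_seconds(0.03, math.pi / 2)  # theta_L = pi/2: largest delay
tau_broadside = delay_seconds(0.03, 0.0)        # theta_L = 0: no delay
print(tau_endfire, tau_broadside)
```

Under these assumptions the delay is at most d/c, which is below 0.1 ms for a spacing of a few centimeters.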
Here, the delay processing is performed on the input signal x1 (t) of the microphone M1 when the blind spot lies in the direction of the microphone M1 as viewed from the center (central point) between the microphones M1, M2. The subtraction device SUB, for example, performs processing that subtracts x1 (t−τL) from x2 (t) using Equation (2) below.
α(t)=x2(t)−x1(t−τL)  (2)
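The time-domain subtraction of Equation (2) can be sketched as follows. This is a simplified illustration that assumes the delay τL is a whole number of samples; the helper names and the sample values are hypothetical, and a practical device would generally need fractional-delay handling.

```python
def delay_samples(x, n_delay):
    """Delay a sequence by n_delay whole samples, zero-padding the start."""
    if n_delay < 0:
        raise ValueError("n_delay must be non-negative")
    return ([0.0] * n_delay + list(x))[:len(x)]


def subtract_bf(x1, x2, n_delay):
    """Equation (2): x2(t) - x1(t - tau_L), with tau_L given in samples."""
    x1_delayed = delay_samples(x1, n_delay)
    return [b - a for a, b in zip(x1_delayed, x2)]


x1 = [1.0, 2.0, 3.0, 0.0, 0.0]

# Sound from the blind-spot direction: M2 receives x1 two samples later.
x2_blind = delay_samples(x1, 2)
out_blind = subtract_bf(x1, x2_blind, 2)

# Sound reaching both microphones simultaneously is not cancelled.
out_other = subtract_bf(x1, x1, 2)
print(out_blind, out_other)
```

In this sketch the signal whose inter-microphone delay matches τL cancels completely, while a signal with a different delay leaves a residual.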
The subtraction device SUB can also perform subtraction processing in the frequency domain. In such cases, Equation (2) above can be represented by Equation (3) below.
A(ω)=X2(ω)−e^(−jωτL)X1(ω)  (3)
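Equation (3) can be illustrated for a single frequency bin as in the sketch below. The function name, the 1 kHz frequency, the spectrum value chosen for X1, and the delay value are assumptions for this example only.

```python
import cmath
import math


def subtract_bf_bin(x1_bin, x2_bin, omega, tau_l):
    """Equation (3): A(w) = X2(w) - exp(-j*w*tau_L) * X1(w) for one bin."""
    return x2_bin - cmath.exp(-1j * omega * tau_l) * x1_bin


omega = 2.0 * math.pi * 1000.0  # hypothetical 1 kHz bin
tau_l = 0.03 / 343.0            # hypothetical delay (3 cm spacing, theta_L = pi/2)
x1_bin = 0.5 + 0.25j            # arbitrary complex spectrum value at M1

# Blind-spot direction: X2 equals X1 delayed by exactly tau_L, so A cancels.
a_blind = subtract_bf_bin(x1_bin, cmath.exp(-1j * omega * tau_l) * x1_bin, omega, tau_l)

# Simultaneous arrival at both microphones: X2 equals X1, so a residual remains.
a_other = subtract_bf_bin(x1_bin, x1_bin, omega, tau_l)
print(abs(a_blind), abs(a_other))
```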
Here, when θL=±π/2, the directionality formed by the microphone array MA is like that illustrated in FIG. 19A, forming unidirectionality with the form of a cardioid. On the other hand, when θL=0, π, the directionality formed by the microphone array MA is bidirectional in a figure-eight like that illustrated in FIG. 19B. Hereafter, filters that give unidirectionality from an input signal are referred to as unidirectional filters, and filters that give bidirectionality are referred to as bidirectional filters. Moreover, in the subtraction device SUB, strong directionality can also be formed at the blind spot of bidirectionality using spectral subtraction (also referred to as simply “SS” hereafter) processing.
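The two directivity patterns can be checked numerically with the following sketch, which evaluates the magnitude of Equation (3) for a unit-amplitude plane wave. It assumes a sign convention in which a wave from angle θ reaches the microphone M2 later than the microphone M1 by (d sin θ)/c; this convention, the function name, the 3 cm spacing, and the 1 kHz frequency are assumptions for illustration.

```python
import cmath
import math


def bf_gain(theta_rad, theta_l_rad, freq_hz, d=0.03, c=343.0):
    """|A(w)| from Equation (3) for a unit plane wave arriving from theta_rad."""
    omega = 2.0 * math.pi * freq_hz
    tau_arrival = d * math.sin(theta_rad) / c  # assumed delay of X2 relative to X1
    tau_l = d * math.sin(theta_l_rad) / c      # Equation (1)
    return abs(cmath.exp(-1j * omega * tau_arrival) - cmath.exp(-1j * omega * tau_l))


# theta_L = pi/2: a single blind spot at theta = pi/2 (cardioid-like pattern).
cardioid_null = bf_gain(math.pi / 2, math.pi / 2, 1000.0)
cardioid_back = bf_gain(-math.pi / 2, math.pi / 2, 1000.0)

# theta_L = 0: blind spots at theta = 0 and pi, equal lobes at +/- pi/2.
eight_nulls = (bf_gain(0.0, 0.0, 1000.0), bf_gain(math.pi, 0.0, 1000.0))
eight_lobes = (bf_gain(math.pi / 2, 0.0, 1000.0), bf_gain(-math.pi / 2, 0.0, 1000.0))
print(cardioid_null, cardioid_back, eight_nulls, eight_lobes)
```

Under the assumed convention, the single blind spot for θL=π/2 and the pair of opposed blind spots for θL=0 correspond to the unidirectional and bidirectional patterns described above.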
The subtraction device SUB can perform subtraction processing using Equation (4) below when directionality is formed using SS. Although the input signal X1 of the microphone M1 is employed in Equation (4) below, similar effects can also be obtained for the input signal X2 of the microphone M2. In Equation (4) below, β is a coefficient for adjusting the strength of the SS. The subtraction device SUB may perform processing to substitute in 0 or a value reduced from the original value (flooring processing) when the result value from performing the subtraction processing employing Equation (4) below is negative. In the subtraction device SUB, by performing subtraction processing using the SS method, target area sound can be emphasized by extracting sound present in directions other than that of the target area, and subtracting the amplitude spectrum of the extracted sounds (sounds present in directions other than that of the target area) from the amplitude spectrum of the input signal.
|Y(ω)|=|X1(ω)|−β|A(ω)|  (4)
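The spectral subtraction of Equation (4), together with the flooring processing, might be sketched as follows. The function name and sample spectra are hypothetical, and interpreting "a value reduced from the original value" as a fixed fraction of |X1(ω)| is an assumption made for this example.

```python
def spectral_subtract(x_mag, a_mag, beta=1.0, floor_ratio=0.0):
    """Equation (4) per bin: |Y| = |X1| - beta * |A|, with flooring.

    A negative result is replaced by floor_ratio * |X1| (0 by default).
    The fractional floor is an assumed form of the flooring processing.
    """
    out = []
    for x, a in zip(x_mag, a_mag):
        y = x - beta * a
        if y < 0.0:
            y = floor_ratio * x
        out.append(y)
    return out


x_mag = [1.0, 0.5, 0.2]  # |X1(w)| for three hypothetical bins
a_mag = [0.2, 0.5, 0.4]  # |A(w)| from the bidirectional filter

y_zero_floor = spectral_subtract(x_mag, a_mag)
y_frac_floor = spectral_subtract(x_mag, a_mag, floor_ratio=0.1)
print(y_zero_floor, y_frac_floor)
```

In the third bin the subtraction would give a negative amplitude, so the floor is applied instead; larger values of β subtract more strongly and trigger the floor more often.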
In conventional sound pickup devices, when it is desired to pick up only sound present within a specific area (referred to as “target area sound” hereafter), using a subtraction-type BF alone leaves the possibility that sound from sources present in the surroundings of the target area (referred to as “non-target area sound” hereafter) will also be picked up.
Thus, for example, JP-A No. 2014-72708 proposes processing that picks up target area sound (referred to as “target area sound pickup processing” hereafter) by using plural microphone arrays to cause directionalities to face toward the target area from separate individual directions, and to cause the directionalities to intersect at the target area as illustrated in FIG. 20. In this method, first, a power ratio is estimated for target area sound included in the BF output of the respective microphone arrays, to give a correction coefficient.
FIG. 20 illustrates an example of conventional technology in which target area sound is picked up using two microphone arrays MA1, MA2. When the two microphone arrays MA1, MA2 are employed to pick up target area sound, the correction coefficients for the target area sound power are, for example, computed by Equations (5) and (6), or by Equations (7) and (8) below.
$$\alpha_1(n) = \operatorname{mode}\left(\frac{Y_{2k}(n)}{Y_{1k}(n)}\right), \quad k = 1, 2, \ldots, N \tag{5}$$
$$\alpha_2(n) = \operatorname{mode}\left(\frac{Y_{1k}(n)}{Y_{2k}(n)}\right), \quad k = 1, 2, \ldots, N \tag{6}$$
$$\alpha_1(n) = \operatorname{median}\left(\frac{Y_{2k}(n)}{Y_{1k}(n)}\right), \quad k = 1, 2, \ldots, N \tag{7}$$
$$\alpha_2(n) = \operatorname{median}\left(\frac{Y_{1k}(n)}{Y_{2k}(n)}\right), \quad k = 1, 2, \ldots, N \tag{8}$$
In Equations (5) to (8) above, Y1k (n) and Y2k (n) represent the BF output amplitude spectra of the microphone arrays MA1 and MA2, N represents the total number of frequency bins, k represents the frequency bin index, and α1 (n) and α2 (n) represent power correction coefficients for the respective BF outputs. In Equations (5) to (8) above, mode represents the most frequent value, and median represents the central value. Next, the respective BF outputs are corrected using the correction coefficients, and non-target area sound present in the target direction can be extracted by performing SS. Target area sound can also be extracted by performing SS of the extracted non-target area sound from the respective BF outputs. In the extraction of a non-target area sound N1 (n) present in the target direction as viewed from the microphone array MA1, the product of the power correction coefficient α2 (n) and the BF output Y2 (n) of the microphone array MA2 is subtracted from the BF output Y1 (n) of the microphone array MA1 by SS, as indicated by Equation (9) below. Similarly, non-target area sound N2 (n) present in the target direction as viewed from the microphone array MA2 is extracted according to Equation (10) below.
N1(n)=Y1(n)−α2(n)Y2(n)  (9)
N2(n)=Y2(n)−α1(n)Y1(n)  (10)
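The computation of the correction coefficients and of the non-target area sound can be sketched for one frame as follows. This is an illustrative sketch only: the function names, the sample spectra, the small constant guarding against division by zero, the quantization of ratios into bins before taking the mode, and the clamping of negative results to zero are all assumptions not stated in the description above.

```python
import statistics

EPS = 1e-12  # assumed guard against division by zero in silent bins


def ratios(num, den):
    """Per-bin amplitude ratios num[k] / den[k]."""
    return [n / max(d, EPS) for n, d in zip(num, den)]


def binned_mode(values, bin_width=0.05):
    """Most frequent value after quantizing to bins (bin width is assumed)."""
    bins = [round(v / bin_width) for v in values]
    return statistics.mode(bins) * bin_width


def correction_coefficients(y1, y2, use_median=True):
    """Equations (7), (8) if use_median is True, otherwise Equations (5), (6)."""
    pick = statistics.median if use_median else binned_mode
    alpha1 = pick(ratios(y2, y1))
    alpha2 = pick(ratios(y1, y2))
    return alpha1, alpha2


def non_target(y1, y2, alpha1, alpha2):
    """Equations (9), (10), with negative results clamped to zero (assumed)."""
    n1 = [max(a - alpha2 * b, 0.0) for a, b in zip(y1, y2)]
    n2 = [max(b - alpha1 * a, 0.0) for a, b in zip(y1, y2)]
    return n1, n2


# Hypothetical frame: MA2 hears the target area sound at twice the amplitude of
# MA1, and the bin at index 2 also contains non-target sound seen only by MA1.
y1 = [1.0, 2.0, 4.0, 3.0, 1.0]
y2 = [2.0, 4.0, 2.0, 6.0, 2.0]
alpha1, alpha2 = correction_coefficients(y1, y2)
n1, n2 = non_target(y1, y2, alpha1, alpha2)
print(alpha1, alpha2, n1, n2)
```

Because the median (or mode) ignores the minority of bins dominated by non-target area sound, the coefficients in this example reflect only the level difference of the target area sound, and the excess in the bin at index 2 is isolated in N1.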
Next, the target area sound pickup signals Z1 (n), Z2 (n) are extracted by SS of non-target area sound from the respective BF outputs Y1 (n), Y2 (n), according to Equations (11) and (12). Note that in Equations (11) and (12) below, γ1 (n), γ2 (n) are coefficients for changing the strength of the SS.
Z1(n)=Y1(n)−γ1(n)N1(n)  (11)
Z2(n)=Y2(n)−γ2(n)N2(n)  (12)
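Equations (11) and (12) share one form, which can be sketched per frame as follows. The function name and sample values are hypothetical, and clamping negative results to zero is an assumed form of flooring that is not specified for these equations.

```python
def extract_target(y, n, gamma=1.0):
    """Equation (11) or (12): Z = Y - gamma * N per bin, clamped at zero (assumed)."""
    return [max(a - gamma * b, 0.0) for a, b in zip(y, n)]


# Hypothetical BF output of one array and its estimated non-target area sound.
y1 = [1.0, 2.0, 4.0, 3.0, 1.0]
n1 = [0.0, 0.0, 3.0, 0.0, 0.0]

z1 = extract_target(y1, n1)                    # gamma1 = 1
z1_strong = extract_target(y1, n1, gamma=2.0)  # stronger SS floors the bin at index 2
print(z1, z1_strong)
```

A larger γ removes more of the estimated non-target area sound at the cost of also removing target area sound in the affected bins.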
As described above, when the technology described by JP-A No. 2014-72708 is employed, sound pickup processing can be performed for target area sound even when non-target area sound is present in the surroundings of the area that is the target.
However, even when the technology described by JP-A No. 2014-72708 is employed, when background noise is strong (for example, when the target area is a place where there are many people such as an event venue, or a place where music is playing in the surroundings), noise that cannot be fully eliminated by the target area sound pickup processing gives rise to unpleasant abnormal sounds, such as musical noise. In conventional sound pickup devices, although these abnormal sounds are masked to some extent by target area sound, they may annoy the listener when target area sound is not present, since only the abnormal sounds will then be audible.
Thus a sound pickup device, program recorded medium, and method are desired that suppress pickup of background noise components even when strong background noise is present in the surroundings of a sound source of target sound.