The present disclosure relates to a sound signal processing apparatus, sound signal processing method, and program. More particularly, the present disclosure relates to a sound signal processing apparatus, sound signal processing method, and program for executing a sound source extraction process to isolate a specific sound from mixtures of multiple source signals, for example.
Sound source extraction is a process of extracting a single target source signal from a signal in which multiple source signals are mixed and which is observed with microphones (hereinafter referred to as an observed signal or mixed signal). In the following description, the source signal that is the target of extraction will be referred to as the target sound, and the other source signals will be referred to as interfering sounds.
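As an illustration of the mixture model described above, the observed signal can be viewed as a sample-wise sum of the individual source signals, and extraction aims to recover one summand. The following minimal sketch is purely illustrative (the sinusoidal "sources" and all variable names are our own, not from the disclosure):

```python
import numpy as np

# Two illustrative source signals: a "target" and an "interfering" sound.
fs = 16000                                            # sampling rate in Hz
t = np.arange(fs) / fs                                # one second of samples
target = np.sin(2 * np.pi * 440.0 * t)                # target sound (440 Hz tone)
interference = 0.5 * np.sin(2 * np.pi * 1000.0 * t)   # interfering sound

# The microphone observes only the mixture of the two sources.
observed = target + interference

# Sound source extraction aims to recover `target` from `observed`;
# note that the mixture itself equals neither source on its own.
```

The task is nontrivial precisely because only `observed` is available to the system, while `target` must be recovered from it.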
It is desirable to accurately extract the target sound when the sound source direction and segment of the target sound are known to some degree in an environment where multiple sound sources are present.
In other words, it is desirable to eliminate interfering sounds from observed signals in which the target sound and interfering sounds are mixed and leave only the target sound by use of information on sound source direction and/or segment.
Sound source direction used herein means the direction of arrival (DOA) of a sound source as seen from a microphone. A segment refers to a pair consisting of the start time of a sound (when it starts being emitted) and its end time (when it stops being emitted), together with the signals falling in the time interval between them.
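The direction and segment defined above can be represented directly as data. In the following sketch (the class and field names are our own choices, not terminology from the disclosure), a DOA angle is paired with a start/end time, and the pair is used to slice the corresponding samples out of an observed signal:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SourceInfo:
    doa_deg: float      # direction of arrival as seen from the microphone, in degrees
    start_sec: float    # time the sound starts being emitted
    end_sec: float      # time the sound stops being emitted

    def slice_signal(self, signal: np.ndarray, fs: int) -> np.ndarray:
        """Return the samples of `signal` falling inside this segment."""
        begin = int(self.start_sec * fs)
        end = int(self.end_sec * fs)
        return signal[begin:end]

fs = 16000
observed = np.zeros(2 * fs)                    # two seconds of (silent) observation
info = SourceInfo(doa_deg=30.0, start_sec=0.5, end_sec=1.25)
segment = info.slice_signal(observed, fs)      # 0.75 s of samples at 16 kHz
```

Direction and segment information of this form is what the extraction schemes discussed below take as input.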
For direction estimation and segment detection in the case of multiple sound sources, a number of schemes have already been proposed. Some specific examples of related art are listed below.
(Related-Art Scheme 1) A Scheme Using Images, Especially Face Position and/or Lip Movement
A scheme of this type is disclosed in Japanese Unexamined Patent Application Publication No. 10-51889, for instance. Specifically, this scheme assumes that the direction in which the face is positioned is the sound source direction and the segment during which the lips are moving represents an utterance segment.
(Related-Art Scheme 2) Speech Segment Detection Based on Sound Source Direction Estimation Designed for Multiple Sound Sources
Disclosures of this scheme include Japanese Unexamined Patent Application Publication No. 2012-150237 and Japanese Unexamined Patent Application Publication No. 2010-121975, for instance. In this scheme, an observed signal is divided into blocks of a certain length, and direction estimation designed for multiple sound sources is performed for each block. Temporal tracking is then conducted on the sound source directions: direction points in adjacent blocks that lie within certain intervals on the time axis are connected across blocks.
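The block-wise processing in this scheme can be sketched as follows. The per-block direction estimator is left as a caller-supplied function (a real system would estimate directions from multi-microphone input); the tracking step connects direction points in adjacent blocks whose angles differ by at most a threshold, which is the essence of the temporal tracking described above. The function names and the threshold value are our own assumptions, not from the cited publications:

```python
from typing import Callable, List
import numpy as np

def track_directions(
    signal: np.ndarray,
    block_len: int,
    estimate_dirs: Callable[[np.ndarray], List[float]],
    max_gap_deg: float = 10.0,
) -> List[List[float]]:
    """Divide `signal` into fixed-length blocks, estimate candidate source
    directions in each block, then connect direction points across adjacent
    blocks when they differ by at most `max_gap_deg` degrees."""
    blocks = [signal[i:i + block_len]
              for i in range(0, len(signal) - block_len + 1, block_len)]
    per_block = [estimate_dirs(b) for b in blocks]   # candidate directions per block

    tracks: List[List[float]] = []       # finished tracks
    open_tracks: List[List[float]] = []  # tracks still being extended
    for dirs in per_block:
        next_open: List[List[float]] = []
        used = [False] * len(dirs)
        for tr in open_tracks:
            # Extend the track with the nearest unused direction, if close enough.
            best, best_d = None, max_gap_deg
            for j, d in enumerate(dirs):
                if not used[j] and abs(d - tr[-1]) <= best_d:
                    best, best_d = j, abs(d - tr[-1])
            if best is not None:
                used[best] = True
                tr.append(dirs[best])
                next_open.append(tr)
            else:
                tracks.append(tr)                    # the track ends at this block
        for j, d in enumerate(dirs):
            if not used[j]:
                next_open.append([d])                # an unmatched point starts a new track
        open_tracks = next_open
    tracks.extend(open_tracks)
    return tracks

# Stub estimator: pretend each block contains one source drifting from 20 to 26 degrees.
fake_dirs = iter([[20.0], [22.0], [24.0], [26.0]])
tracks = track_directions(np.zeros(4000), 1000, lambda b: next(fake_dirs))
print(tracks)  # [[20.0, 22.0, 24.0, 26.0]] -- one connected track across four blocks
```

Each resulting track is one sound source's direction trajectory over time; its first and last blocks give an estimate of the source's segment.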
Further related arts that disclose a sound source extraction process for extracting a particular sound source by making use of known sound source direction and speech segment include Japanese Unexamined Patent Application Publication No. 2012-234150 and Japanese Unexamined Patent Application Publication No. 2006-72163, for example.
Examples of specific processing with these techniques will be described later.
However, the proposed related art is not capable of detecting the directions of the target sound and interfering sounds, or their segments, with high accuracy, so sound source extraction is inevitably performed using sound source direction or speech segment information of low accuracy. Related-art sound source extraction processes are therefore problematic: the accuracy of extraction results obtained from low-accuracy direction or segment information is itself very low.