In recent years, hearing aids have been configured to form a directivity of sensitivity from the input signals provided by a plurality of microphone units (for example, see Patent Literature 1). The sound source that a wearer mainly wants to hear using the hearing aid is the voice of the person with whom the wearer is conversing. Therefore, the hearing aid is desired to perform control in synchronization with a function for detecting conversation, in order to use the directivity processing effectively.
Conventionally, methods for sensing a conversation situation include a method using a camera and a microphone (for example, see Patent Literature 2). The information processing apparatus described in Patent Literature 2 processes video provided by a camera and estimates the eye gaze direction of a person. When a conversation is held, the conversing partner is considered to tend to be located in the eye gaze direction. However, this approach requires adding an image capturing device and is therefore inappropriate for the purpose of a hearing aid.
On the other hand, the direction from which a voice arrives can be estimated with a plurality of microphones (a microphone array), and a conversing person can be extracted from this estimation result at, for example, a conference. However, speech has the property of spreading spatially. For this reason, in a situation where a plurality of conversation groups exist, such as conversations in a coffee shop, it is difficult to distinguish words spoken to the wearer from words spoken to persons other than the wearer by determining the arrival direction alone. Moreover, the arrival direction of a voice perceived by the listener does not indicate the direction in which the speaker's face is oriented. In this respect, sound input differs from video input, which allows direct estimation of the face and eye gaze directions, and the detection of a conversing person based on sound input is therefore difficult.
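The arrival-direction estimation mentioned above can be illustrated with a minimal two-microphone sketch based on the time difference of arrival (TDOA) found by cross-correlation. The sampling rate, microphone spacing, and signals below are illustrative assumptions, not values from the cited literature:

```python
import numpy as np

def estimate_doa(sig_a, sig_b, fs=16000, mic_dist=0.15, c=343.0):
    """Estimate a direction of arrival from the TDOA between two
    microphone signals, found by cross-correlation.
    Returns (lag_in_samples, angle_in_degrees); a positive lag
    means sig_a arrives later than sig_b.
    fs, mic_dist, and c (speed of sound) are illustrative values."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)
    tdoa = lag / fs
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(tdoa * c / mic_dist, -1.0, 1.0)
    return lag, float(np.degrees(np.arcsin(sin_theta)))

# Synthetic check: a noise source reaching microphone A 5 samples
# later than microphone B.
rng = np.random.default_rng(0)
src = rng.standard_normal(8000)
delay = 5
mic_a = src[:-delay]   # mic_a[n] = src[n]
mic_b = src[delay:]    # mic_b[n] = src[n + delay], i.e. heard earlier
lag, angle = estimate_doa(mic_a, mic_b)
```

With a single microphone pair the estimate is only a bearing angle; it says nothing about whether the talker is facing the wearer, which is exactly the limitation noted above.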
An example of a conventional conversing-person detection apparatus based on sound input that takes the presence of interference sound into account is the speech signal processing apparatus described in Patent Literature 3. The speech signal processing apparatus described in Patent Literature 3 determines whether a conversation is held by separating sound sources through processing of the input signals from a microphone array and calculating the degree of establishment of conversation between each pair of sound sources.
The speech signal processing apparatus described in Patent Literature 3 extracts an effective speech in which a conversation is established, under an environment where speech signals from a plurality of sound sources are input in a mixed manner. This speech signal processing apparatus converts the time series of speech into a numerical value, exploiting the property that holding a conversation is like "playing catch": while one party speaks, the other tends to listen, and the two parties alternate.
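The "playing catch" property can be turned into a number from frame-wise speech/non-speech flags alone. The following is an illustrative scoring rule, not the exact formula of Patent Literature 3: frames in which exactly one party speaks raise the score, while overlapping speech and mutual silence lower it.

```python
def conversation_establishment_degree(vad_a, vad_b):
    """Score two frame-wise speech-activity sequences (1 = speech,
    0 = silence).  Alternating turns, as in "playing catch", score
    close to +1; overlap or mutual silence scores close to -1.
    Illustrative rule only, not the formula of Patent Literature 3."""
    assert len(vad_a) == len(vad_b) and len(vad_a) > 0
    score = sum(1 if a != b else -1 for a, b in zip(vad_a, vad_b))
    return score / len(vad_a)   # normalized to the range [-1, 1]

turn_taking = conversation_establishment_degree([1, 1, 0, 0], [0, 0, 1, 1])
unrelated   = conversation_establishment_degree([1, 1, 0, 0], [1, 1, 0, 0])
```

Two talkers belonging to different conversation groups tend to speak simultaneously or fall silent independently, producing many same-flag frames and hence a low degree, which is what allows such a pair to be rejected.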
FIG. 1 is a figure illustrating a configuration of a speech signal processing apparatus described in Patent Literature 3.
As shown in FIG. 1, speech signal processing apparatus 10 includes microphone array 11, sound source separation section 12, speech detection sections 13, 14, and 15 provided for the respective sound sources, conversation establishment degree calculation sections 16, 17, and 18 each provided for a pair of sound sources, and effective speech extraction section 19.
Sound source separation section 12 separates the plurality of sound sources from the signals input from microphone array 11.
Speech detection sections 13, 14, and 15 determine the presence or absence of speech in each sound source.
Conversation establishment degree calculation sections 16, 17, and 18 each calculate a conversation establishment degree for a pair of sound sources.
Effective speech extraction section 19 extracts, as the effective speech, the speech of the pair having the highest conversation establishment degree among the conversation establishment degrees calculated for the pairs of sound sources.
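Assuming the separated per-source signals have already been reduced to frame-wise speech/non-speech flags, the stages of FIG. 1 downstream of sound source separation can be sketched as follows. The function names, the turn-taking scoring rule, and the data are illustrative assumptions, not the implementation of Patent Literature 3:

```python
from itertools import combinations

def establishment_degree(vad_a, vad_b):
    # Illustrative turn-taking score: +1 for frames where exactly one
    # party speaks, -1 for overlap or mutual silence, normalized.
    return sum(1 if a != b else -1 for a, b in zip(vad_a, vad_b)) / len(vad_a)

def extract_effective_pair(vads):
    """vads: dict mapping source id -> list of 0/1 speech flags.
    Returns the pair of sources with the highest conversation
    establishment degree, mirroring effective speech extraction
    section 19 in FIG. 1."""
    return max(combinations(vads, 2),
               key=lambda p: establishment_degree(vads[p[0]], vads[p[1]]))

vads = {
    "src1": [1, 1, 0, 0, 1, 0],   # alternates with src2
    "src2": [0, 0, 1, 1, 0, 1],
    "src3": [1, 1, 1, 0, 0, 0],   # unrelated background talker
}
best = extract_effective_pair(vads)
```

Here the pair ("src1", "src2") alternates in every frame and therefore wins over any pair involving the background talker.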
Known methods for separating sound sources include a method using ICA (Independent Component Analysis) and a method using ABF (Adaptive Beamformer). The operating principles of the two are known to be similar (for example, see Non-Patent Literature 1).
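As an illustration of the ICA-based separation mentioned above, the following sketch unmixes two synthetic signals with scikit-learn's FastICA. The sources and the mixing matrix are made up for the demonstration; ABF would instead steer a spatial beam using the array geometry:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 4000)
# Two synthetic "sources": a tone and a square-wave-like signal.
s1 = np.sin(2 * np.pi * 7 * t)
s2 = np.sign(np.sin(2 * np.pi * 11 * t))
S = np.c_[s1, s2]

# Each microphone observes a different linear mixture of the sources.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])   # made-up mixing matrix
X = S @ A.T

# FastICA recovers the sources up to permutation, sign, and scale,
# using only the statistical independence of the sources.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
```

Because ICA leaves the order and sign of the recovered sources undetermined, any downstream stage (such as the per-source speech detection above) must not rely on a fixed source ordering.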