A voice recognition process is performed by analyzing the content of human speech in sound acquired by, for example, a microphone. Voice recognition is available in a variety of apparatuses. For example, by providing a voice recognition unit in an information processing apparatus such as a mobile terminal or a television, and analyzing the words spoken by a user (the user's speech) on the apparatus, it is possible to execute a process based on the speech on the information processing apparatus.
The sound acquired by the microphone includes not only the user's speech to be recognized (called a target sound), but also undesired sounds (called noise, environmental sounds, disturbing sounds, etc.). It is difficult to extract the target sound, i.e., a specific user's speech, from a mixed signal that includes undesired sounds from a variety of sound sources. In an environment with many undesired sounds, the voice recognition accuracy is lowered. The greater the distance from the microphone to the user's mouth, the more easily disturbing sounds are mixed in, and the more difficult the problem becomes.
In addition, as the distance from the microphone to the user increases, it becomes difficult to provide a button for explicitly designating the beginning and the end of the sound input. Therefore, other means are necessary to detect the beginning and the end of the speech.
In order to improve the voice recognition accuracy in such an environment, it is effective to apply, for example, the following processes, which have been suggested in the related art:
(a) A voice segment detection process for specifying the segment to be subjected to the voice recognition process.
(b) A sound source separation process or a sound source extraction process for extracting only the target sound from a sound signal in which sounds generated from a plurality of sound sources are mixed.
These processes are performed before the voice recognition process, so that the sound signal subjected to voice recognition is selected temporally and spatially, which improves the recognition accuracy of the target sound.
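As an illustration of the temporal selection in process (a), the following is a minimal sketch of a voice segment detector based on a simple frame-energy threshold. It is not the method of Patent Document 1 or Patent Document 2. The function names (frame_energies, detect_segments, synthetic_signal), the frame length, the threshold, and the minimum run length are all hypothetical values chosen for illustration only.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_segments(samples, frame_len=160, threshold=0.01, min_frames=3):
    """Return (start_sample, end_sample) pairs whose frame energy exceeds
    the threshold for at least min_frames consecutive frames.

    All default values are assumptions made for this sketch.
    """
    segments = []
    run_start = None
    energies = frame_energies(samples, frame_len)
    # A trailing 0.0 sentinel closes a run that lasts to the end of the input.
    for i, energy in enumerate(energies + [0.0]):
        if energy > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                segments.append((run_start * frame_len, i * frame_len))
            run_start = None
    return segments


def synthetic_signal(rate=16000):
    """0.5 s of near silence, 0.5 s of a 440 Hz tone, 0.5 s of near silence."""
    half = rate // 2
    quiet = [0.001 * math.sin(2 * math.pi * 50 * n / rate) for n in range(half)]
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(half)]
    return quiet + tone + quiet


if __name__ == "__main__":
    print(detect_segments(synthetic_signal()))
```

A fixed energy threshold of this kind responds to any loud sound, not only to speech, and therefore fails in exactly the noisy, far-microphone conditions described above. This is one reason why it is combined with the spatial selection of process (b).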
The voice segment detection process is described, for example, in Patent Document 1 (Japanese Patent Application Laid-open No. 2012-150237) and Patent Document 2 (Japanese Patent No. 4182444).
The sound source separation process and the sound source extraction process are described in Patent Document 3 (Japanese Patent Application Laid-open No. 2011-107602).
Related art that discloses the voice recognition process includes, for example, Patent Document 4 (Japanese Patent Application Laid-open No. 2001-242883), Patent Document 5 (Japanese Patent Application Laid-open No. 2006-053203), and Patent Document 6 (Japanese Patent Application Laid-open No. 2011-033680).
Patent Document 1: Japanese Patent Application Laid-open No. 2012-150237
Patent Document 2: Japanese Patent No. 4182444
Patent Document 3: Japanese Patent Application Laid-open No. 2011-107602
Patent Document 4: Japanese Patent Application Laid-open No. 2001-242883
Patent Document 5: Japanese Patent Application Laid-open No. 2006-053203
Patent Document 6: Japanese Patent Application Laid-open No. 2011-033680