Technical Field
The present invention relates to a speech recognition model construction method, speech recognition method, computer system, speech recognition apparatus, program, and recording medium, and, more particularly, to a construction method, computer system, program, and recording medium for constructing a speech recognition model for mixed speech of mixed speakers, as well as to a speech recognition method, speech recognition apparatus, and program which use the constructed speech recognition model.
Description of Related Art
Conventionally, in call recording at call centers and the like, although stereo recording is also used to record outgoing speech and incoming speech on separate channels, monaural recording is often used to record outgoing speech and incoming speech in a mixed state from the viewpoint of data volume.
However, it is known that speech recognition of monaural sound is inferior in accuracy to that of stereo sound. One cause is that speech recognition fails in and around a portion where utterances of plural speakers overlap (a simultaneous speech segment). Some speech recognition techniques, such as large vocabulary continuous speech recognition, improve recognition accuracy by utilizing connections among words. However, if, for example, misrecognition occurs in a portion where utterances overlap, the recognition results of the preceding and following portions will also be affected, which could result in a burst error.
Additionally, a manual transcript of utterance content in stereo sound is relatively usable. However, in the case of monaural sound, the manual labor cost of labeling simultaneous speech segments and utterance content is very high, and thus it is difficult to obtain a sufficient quantity of correctly labeled data for training.
Against this background, there is demand for development of a technique which can efficiently train a speech recognition model capable of discriminating simultaneous speech segments in monaural sound with high accuracy.
A number of speech recognition techniques are already known. For example, WO2004/075168A1 (Patent Literature 1) discloses a speech recognition technique which uses a garbage acoustic model, which is an acoustic model trained from a set of unnecessary words. JP2011-154099A (Patent Literature 2) discloses a configuration in which a language model is created by modeling unexpected sentences, a dictionary is created with unexpected words registered therein, and unexpected utterances are discarded to reduce the rate of malfunctions. Non-Patent Literature 1 discloses a speech recognition technique which models extraneous sounds, such as breath and coughs, as garbage models. Additionally, a meeting diarization technique is known which divides speech into non-speech segments, speech segments, and overlapped speech segments (Non-Patent Literature 2).
However, none of the conventional speech recognition techniques in Patent Literature 1, Patent Literature 2, and Non-Patent Literature 1 takes simultaneous speech segments into consideration. Although the conventional technique in Non-Patent Literature 2 involves dividing speech into non-speech segments, speech segments, and overlapped speech segments, the technique concerns meeting diarization, and it is difficult to apply the technique directly to speech recognition. Therefore, there is still demand for development of a technique which can efficiently train a speech recognition model capable of discriminating, with high accuracy, simultaneous speech segments in mixed speakers' speech in monaural sound and the like.
Patent Literature 1: WO2004/075168A1.
Patent Literature 2: JP2011-154099A.
Non-Patent Literature 1: G. Sarosi, et al., “On Modeling Non-word Events in Large Vocabulary Continuous Speech Recognition”, Cognitive Infocommunications (CogInfoCom) 2012 IEEE 3rd International Conference, 649-653, Dec. 2-5, 2012, Kosice, Slovakia.
Non-Patent Literature 2: K. Boakye, et al., “Overlapped speech detection for improved speaker diarization in multiparty meetings”, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2008), 4353-4356, Mar. 31-Apr. 4, 2008, Las Vegas, Nev., USA.