The present invention relates to a method of and a system for recording voices made by a plurality of speakers and specifying each of the speakers based on the recorded voices.
As voice recognition technologies have advanced and become more accurate, their fields of application have steadily widened. Voice recognition technology has started to be used for creating business documents by dictation, recording medical observations, creating legal documents, creating closed captions for television broadcasting, and the like. Moreover, for trials, meetings, and the like, the introduction of technology that converts speech into text by voice recognition has been considered, so that records and minutes can be created by recording the proceedings and transcribing them as text.
Where such voice recognition technology is used, it may be necessary not only to recognize the recorded voices but also to specify the speaker of each individual voice among the voices made by a plurality of speakers. Various methods for specifying speakers have heretofore been proposed, such as a technology that specifies speakers based on the direction from which voices arrive, using directional characteristics obtained by a microphone array or the like (for example, see Patent Document 1), and a technology that converts voices individually recorded for each of the speakers into data and adds identification information for specifying the speakers (for example, see Patent Document 2).
[Patent Document 1] Japanese Patent Laid-Open Publication No. 2003-114699
[Patent Document 2] Japanese Patent Laid-Open Publication No. Hei 10 (1998)-215331
As described above, voice recognition technology may be required to specify the speaker of each individual voice from the recorded voices of a plurality of speakers, and various methods have heretofore been proposed for this purpose. However, a method that specifies each speaker by means of directional microphones, such as a microphone array, cannot achieve sufficient accuracy under some voice recording environments and other conditions, for example when a plurality of speakers are located in similar directions as seen from the microphones.
Moreover, a method of recording voices individually for each speaker requires a recorder to be prepared for each of the speakers. The scale of the system therefore increases, and with it the cost and effort of introducing and maintaining the system.
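The limitation of direction-based speaker specification noted above can be illustrated with a minimal sketch. It is not taken from Patent Document 1; the function name `assign_speaker_by_direction`, the `resolution_deg` parameter, and the nearest-direction rule are all hypothetical and assumed only for illustration. The sketch assigns a voice to the speaker whose known direction is closest to the estimated arrival direction, and declines to decide when two speakers are closer together than the assumed angular resolution of the array.

```python
from typing import Dict, Optional


def angular_gap(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def assign_speaker_by_direction(
    arrival_deg: float,
    speaker_directions: Dict[str, float],
    resolution_deg: float,
) -> Optional[str]:
    """Hypothetical nearest-direction rule (an assumption for illustration).

    Returns the name of the speaker whose registered direction is nearest
    to the estimated arrival direction, or None when the two nearest
    candidates cannot be separated within the assumed resolution.
    """
    if not speaker_directions:
        return None

    # Rank speakers by how far their direction is from the arrival direction.
    ranked = sorted(
        speaker_directions.items(),
        key=lambda item: angular_gap(arrival_deg, item[1]),
    )
    best_name, best_dir = ranked[0]

    if len(ranked) > 1:
        best_gap = angular_gap(arrival_deg, best_dir)
        second_gap = angular_gap(arrival_deg, ranked[1][1])
        # Too close to call: the speakers lie in similar directions.
        if second_gap - best_gap < resolution_deg:
            return None

    return best_name


# Speakers well separated in direction: the assignment is unambiguous.
separated = {"judge": 0.0, "witness": 90.0}
print(assign_speaker_by_direction(85.0, separated, resolution_deg=10.0))

# Speakers in similar directions: the rule cannot decide.
adjacent = {"attorney": 40.0, "witness": 45.0}
print(assign_speaker_by_direction(43.0, adjacent, resolution_deg=10.0))
```

In the second call the two candidates differ by only a few degrees, so under the assumed resolution the sketch returns no speaker, which corresponds to the case in which sufficient accuracy cannot be achieved.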
Incidentally, speech in trials and meetings has the following characteristics.
Questions and answers make up a large part of the dialogue, and a questioner rarely questions a plurality of respondents at the same time.
Except for unexpected remarks such as jeers, only one person speaks at a time, and voices rarely overlap.
In such a special recording environment, it is conceivable to exploit the above characteristics of the recording environment in order to specify the speaker of each individual voice from the voices made by the plurality of speakers.