In some scenarios, such as a multiparty teleconference, it is desirable to automatically identify who is participating in the conference and which participant or participants are currently talking. Such identification can facilitate communication among the participants, especially when visual information about the talkers is unavailable. Speaker identification can also give the system valuable information for operations that improve the user experience, such as speaker-dependent quality enhancement, and it has been an important tool in meeting transcription.
Generally, automatically identifying which participant or participants are currently talking is not a problem if each speaker has his or her own telephone endpoint, i.e., if no two participants share the same endpoint. In such a scenario, the telephony system can use the respective identifiers of the endpoints connected to a conference as identifiers of the participants, and voice activity detection (VAD) can be used to identify who is currently talking. For example, if “Adam” is using Endpoint A to participate in a conference, the telephony system can detect voice activity in the uplink stream received from Endpoint A and then recognize that “Adam” is currently talking.
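The one-participant-per-endpoint case can be illustrated with a minimal sketch. It assumes a simple energy-threshold VAD, which stands in for whatever detector a real telephony system would use; the threshold value, the function names, and the second participant "Beth" are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical mapping from endpoint identifier to participant name.
# "Adam" on Endpoint A follows the example in the text; "Beth" is invented.
ENDPOINT_TO_PARTICIPANT = {"A": "Adam", "B": "Beth"}


def frame_energy_db(frame):
    """Return the mean-square energy of one audio frame in dB.

    `frame` is a sequence of float samples, nominally in [-1.0, 1.0].
    An empty or all-zero frame yields negative infinity.
    """
    if not frame:
        return float("-inf")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square <= 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def is_voice_active(frame, threshold_db=-40.0):
    """Crude energy-based VAD: active if frame energy exceeds the threshold.

    The -40 dB default is an assumed value, not one taken from the text.
    """
    return frame_energy_db(frame) > threshold_db


def active_talkers(uplink_frames, threshold_db=-40.0):
    """Return the sorted names of participants whose uplink shows voice activity.

    `uplink_frames` maps an endpoint identifier to the current frame
    received from that endpoint.
    """
    return sorted(
        ENDPOINT_TO_PARTICIPANT[endpoint]
        for endpoint, frame in uplink_frames.items()
        if endpoint in ENDPOINT_TO_PARTICIPANT
        and is_voice_active(frame, threshold_db)
    )
```

For example, passing a frame containing a tone for Endpoint A and a silent frame for Endpoint B reports only the participant mapped to Endpoint A. The sketch works only because each endpoint identifier stands for exactly one person.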
However, identifying who is participating in the conference and which participant or participants are currently talking is not straightforward if multiple participants join the conference via the same endpoint, for example, via a conference phone in a meeting room. In such a scenario, one approach to automatically identifying the speakers is to use speech audio processing to identify the respective voices of the different participants.
Traditional speaker identification methods, also referred to as monaural speaker modeling methods, generally relate to monaural telephony systems. With such methods, all input audio signals, even signals from an endpoint with multiple channels, may be pre-converted into a monaural audio signal for the subsequent identification process. As a result, the mono-channel-based methods do not perform well in a scenario where multiple participants join a conference via the same endpoint with multiple channels. For example, the identification of the respective speakers tends to be less accurate than desirable, or the associated computational burden tends to be too high. Those methods also suffer from various robustness issues, especially when overlapped speech involves two or more speakers or when speech comes from a moving speaker.
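One way to see what a monaural pre-conversion can discard is the following sketch. It assumes an equal-weight average as the down-mix, which the text does not specify, and a hypothetical two-channel endpoint in which a talker's position shows up only as a level difference between the channels. All names and values are illustrative.

```python
import math


def downmix_to_mono(channels):
    """Average equal-length channels sample by sample (assumed down-mix rule)."""
    if not channels:
        raise ValueError("at least one channel is required")
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same length")
    return [sum(samples) / len(channels) for samples in zip(*channels)]


def channel_energy(channel):
    """Return the sum of squared samples of one channel."""
    return sum(s * s for s in channel)


def level_difference_db(left, right):
    """Return the inter-channel level difference in dB (positive: left is louder).

    Both channels are assumed to carry nonzero energy.
    """
    return 10.0 * math.log10(channel_energy(left) / channel_energy(right))


# The same hypothetical source signal captured from two different seats.
source = [0.0, 0.5, -0.5, 0.25]
attenuated = [0.25 * s for s in source]

talker_near_left = (source, attenuated)    # louder in the left channel
talker_near_right = (attenuated, source)   # louder in the right channel

mono_left = downmix_to_mono(talker_near_left)
mono_right = downmix_to_mono(talker_near_right)
```

Under these assumptions, `mono_left` and `mono_right` are identical, while `level_difference_db` has opposite signs for the two seats. The channel-level cue that separates the two talker positions is present in the multichannel signal and absent from the down-mixed one.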