Many devices include microphones, which can be used to detect ambient sounds. In many situations, the ambient sounds include the speech of one or more nearby speakers. Audio signals generated by the microphones can be used in many ways. For example, audio signals representing speech can be used as the input to a speaker recognition system, allowing a user to control a device or system using spoken commands.
In a speaker recognition system, a user enrols by providing a sample of their speech, and this is used to form a model of the speech, also known as a voice print. Then, during subsequent speaker recognition attempts, samples of speech are compared with the model.
Alternatively, a user may enrol in a speaker recognition system by providing a plurality of samples of their speech, and these samples may then be used to form a model of the speech. For example, a plurality of samples of a user's speech may be received from multiple different sessions where the user has provided speech to the system. In some examples, a provided plurality of samples may be “stitched” or concatenated together to form a composite sample of the user's speech, and the composite sample may then be used to form a model of the speech.
The processes of “stitching” and concatenating a plurality of samples together to form a composite sample of a user's speech may also be used, for example, for a plurality of samples that have been received from a speaker diarisation process. The composite sample of the user's speech that is formed may then be used in a speaker verification process.
However, “stitching” or concatenating separate audio samples together may introduce audio artefacts (for example, “pops” or “clicks”) into the resulting composite sample. For example, during a concatenation process, an audio sample may be “cut” in such a manner that an artificially fast edge, that is, an abrupt step in amplitude, is created within the audio sample.
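The effect of such a cut can be illustrated with a minimal sketch. It is not taken from any particular system: the 16 kHz sample rate, the 200 Hz test tones and the helper names `sine_segment`, `concatenate` and `max_step` are all assumptions chosen purely for illustration. Two smooth segments are joined end to end, and the sample-to-sample jump at the join is compared with the largest jump found inside either segment.

```python
import math

SAMPLE_RATE = 16000  # Hz; assumed for illustration only


def sine_segment(freq_hz, n_samples, phase=0.0, amplitude=0.5):
    """Return a smooth test tone as a list of float samples (hypothetical helper)."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE + phase)
        for n in range(n_samples)
    ]


def concatenate(segments):
    """Naively join segments end to end, with no cross-fade or alignment."""
    composite = []
    for segment in segments:
        composite.extend(segment)
    return composite


def max_step(signal):
    """Largest absolute difference between neighbouring samples."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


# Two segments that are each smooth, but whose end and start values do not match.
first = sine_segment(200.0, 400)                        # ends close to zero
second = sine_segment(200.0, 400, phase=math.pi / 2)    # starts at its peak

composite = concatenate([first, second])

# Largest jump inside either original segment.
step_within = max(max_step(first), max_step(second))

# Jump across the join: last sample of `first` to first sample of `second`.
join = len(first)
step_at_join = abs(composite[join] - composite[join - 1])

print(f"largest step within a segment: {step_within:.4f}")
print(f"step at the join:              {step_at_join:.4f}")
```

In this sketch the step at the join is more than ten times larger than any step within the individual segments, which corresponds to the kind of fast edge that may be heard as a “click”.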
Additionally, audio artefacts may be introduced into an audio signal as a result of switching a microphone either on or off, or audio artefacts may be introduced into an audio signal when that audio signal is truncated.
This introduction of audio artefacts in a composite audio signal may result in problems in speaker recognition and other voice biometric systems. For example, in an automatic speaker recognition system, the presence of an audio artefact introduced into a composite sample of a user's speech may result in the misfiring of a voice activity detector. Additionally or alternatively, during a speaker enrolment process in a voice biometrics system, the presence of an audio artefact introduced into a composite sample of a user's speech may result in the voice biometrics system “learning” the audio artefact as a discriminative part of a user's speech. In other words, the audio artefact may be introduced into a model of the user's speech.
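As a hedged illustration of the first of these problems, the sketch below shows how a single-sample “click” can trigger a very simple energy-based voice activity detector. The detector, the frame length, the threshold and the names `frame_energies` and `naive_vad` are hypothetical and chosen only for this example; practical voice activity detectors are typically more sophisticated, and this is not a description of any specific one.

```python
FRAME_LEN = 160          # samples per frame (10 ms at an assumed 16 kHz rate)
ENERGY_THRESHOLD = 1e-3  # hypothetical decision threshold on mean-square energy


def frame_energies(signal, frame_len=FRAME_LEN):
    """Mean-square energy of each complete, non-overlapping frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def naive_vad(signal, threshold=ENERGY_THRESHOLD):
    """Flag a frame as 'speech' when its energy exceeds the threshold."""
    return [energy > threshold for energy in frame_energies(signal)]


# Five frames of digital silence containing one isolated artefact sample.
silence_with_click = [0.0] * (FRAME_LEN * 5)
silence_with_click[FRAME_LEN * 2 + 7] = 0.8  # the "click", inside the third frame

decisions = naive_vad(silence_with_click)
print(decisions)
```

Here the third frame contains no speech at all, yet the detector flags it as active because the artefact alone lifts the frame energy above the threshold.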