In speech recognition processing, speaker verification refers to the task of determining whether a speech sample of an unknown voice corresponds to a particular enrolled speaker. Prior art speaker verification systems typically generate a score for the speech sample using both a speaker-specific acoustic model and a separate, “generic” or “speaker independent” acoustic model. If the speaker-specific acoustic model outscores the generic acoustic model sufficiently, the speech sample is deemed to be from the enrolled speaker under consideration.
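For illustration only, the two-model verification decision described above can be sketched as follows. This sketch is not taken from any particular prior art system: the single diagonal-Gaussian scorer standing in for an acoustic model, the function names `avg_log_likelihood` and `verify`, and the default `threshold` value are all assumptions made for the example.

```python
import math

def avg_log_likelihood(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal Gaussian.

    The Gaussian is a hypothetical stand-in for an acoustic model; it is
    used only to keep the sketch self-contained.
    """
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def verify(frames, speaker_model, generic_model, threshold=0.5):
    """Accept the claimed speaker only if the speaker-specific model
    outscores the generic model by more than `threshold` (assumed value)."""
    speaker_score = avg_log_likelihood(frames, *speaker_model)
    # The sample is scored a second time, against the generic model.
    generic_score = avg_log_likelihood(frames, *generic_model)
    return (speaker_score - generic_score) > threshold

# Each model is a (mean, variance) pair over two illustrative feature dimensions.
speaker_model = ([1.0, 1.0], [1.0, 1.0])
generic_model = ([0.0, 0.0], [4.0, 4.0])

print(verify([[1.0, 1.0], [1.1, 0.9]], speaker_model, generic_model))
print(verify([[-1.0, -1.0], [-0.9, -1.1]], speaker_model, generic_model))
```

Note that the sample is scored twice, once per model, and that the decision depends on the difference between two separately generated scores.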
Speaker identification is a related task that involves associating a speech sample of an unknown voice with a speaker in a set of enrolled speakers. Prior art speaker identification systems are similar to prior art speaker verification systems, but score the speech sample using all available speaker-specific acoustic models. The speech sample is deemed to be from the enrolled speaker whose acoustic model produces the highest score. In certain cases, the speech sample is also scored against a generic acoustic model, such that the speech sample is deemed to be from an “imposter” (i.e., someone not in the set of enrolled speakers) if the highest scoring speaker-specific acoustic model does not outscore the generic acoustic model sufficiently.
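For illustration only, the identification decision described above, including the optional imposter check, can be sketched as follows. The function name `identify`, the `margin` parameter, and the example speaker labels and scores are hypothetical; the scores are assumed to have been generated already by scoring the speech sample against each acoustic model.

```python
def identify(speaker_scores, generic_score=None, margin=0.5):
    """Return the enrolled speaker with the highest score, or None if the
    sample is deemed to be from an imposter.

    speaker_scores: dict mapping an enrolled speaker's label to the score
        produced by that speaker's acoustic model (assumed precomputed).
    generic_score: optional score from the generic acoustic model.
    margin: assumed amount by which the best speaker-specific score must
        exceed the generic score.
    """
    best_speaker = max(speaker_scores, key=speaker_scores.get)
    if generic_score is not None:
        if speaker_scores[best_speaker] - generic_score <= margin:
            return None  # best model did not outscore the generic model sufficiently
    return best_speaker

# Illustrative scores only; higher means a better match.
print(identify({"alice": -1.8, "bob": -4.2}, generic_score=-3.5))
print(identify({"alice": -3.4, "bob": -4.2}, generic_score=-3.5))
```

In this sketch, omitting `generic_score` always yields the highest-scoring enrolled speaker, which corresponds to the case in which no imposter check is performed.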
One issue with scoring a speech sample against both one or more speaker-specific acoustic models and a separate, generic acoustic model, as described above, is that the scoring process must be repeated for the generic acoustic model, which increases the overall processing time for the speaker verification/identification task. This can be problematic if the task is being performed on a device with limited compute resources. Further, in the speaker verification scenario, an appropriate decision threshold must be set with respect to the score generated via the speaker-specific acoustic model and the score generated via the generic acoustic model in order to determine whether to accept or reject the speech sample as being from the claimed speaker. Tuning this decision threshold is difficult because the threshold must account for potential variability in both scores.
Accordingly, it would be desirable to have improved techniques for speaker verification and identification that address the foregoing and other issues.