In general, for speaker recognition, speech processing aims to accentuate the effect that differences between speakers have on the spoken word, whereas for speech recognition, in which a particular word (or, sometimes, a phrase, a phoneme, or other spoken matter) is recognised, speech processing aims to reduce the effect of those differences.
It is common to input speech data, typically in digital form, to a front-end processor, which derives from the stream of input speech data more compact, more perceptually significant data referred to as input feature vectors (or sometimes as front-end feature vectors). Where the speaker speaks a predetermined word, known to the recognition apparatus and to the speaker (e.g. a personal identification number in banking), the technique is known as ‘text-dependent’. In some applications of speaker recognition a technique is used which does not require the content of the speech to be predetermined; such techniques are known as ‘text-independent’ techniques.
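As a rough illustration of what a front-end processor does, the following sketch splits a stream of samples into overlapping frames and reduces each frame to a short feature vector. The function name `frame_features`, the frame length, the hop size, and the choice of log energy and zero-crossing rate as features are all assumptions made for this example; the text above does not specify which features the front end computes.

```python
import math

def frame_features(samples, frame_len=256, hop=128):
    """Reduce a sample stream to one small feature vector per frame.

    Illustrative only: the two features used here (log energy and
    zero-crossing rate) are assumed for the sketch, not taken from the text.
    """
    vectors = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Mean energy of the frame, floored to avoid log(0) on silence.
        energy = sum(x * x for x in frame) / frame_len
        log_energy = math.log(energy + 1e-10)
        # Fraction of adjacent sample pairs whose sign differs.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        vectors.append([log_energy, crossings / (frame_len - 1)])
    return vectors
```

With these assumed settings, 1024 input samples are reduced to 7 two-element vectors, which shows the compaction the front end is meant to provide.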
In text-dependent techniques a stored representation of the word, known as a template or model, is previously derived from a speaker known to be genuine. The input feature vectors derived from the speaker to be recognised are compared with the template and a measure of similarity between the two is compared with a threshold for an acceptance decision. Comparison may be done by means of Dynamic Time Warping as described in “On the evaluation of Speech Recognisers and Data Bases using a Reference System”, Chollet & Gagnoulet, 1982 IEEE, International Conference on Acoustics, Speech and Signal Processing, pp 2026-2029. Other means of comparison include Hidden Markov Model processing and Neural Networks. These techniques are described in British Telecom Technology Journal, Vol. 6, No. 2 Apr. 1988, “Hidden Markov Models for Automatic Speech Recognition : Theory And Application”, SJ Cox pages 105-115, “Multi-layer perceptrons applied to speech technology”, McCullogh et al, pages 131-139 and “Neural arrays for speech recognition”, Tattershall et al pages 140-163.
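The comparison of input feature vectors with a template, followed by a threshold test, can be sketched as below. This is a minimal textbook form of Dynamic Time Warping, not the specific procedure of the cited reference; the names `dtw_distance` and `accept`, the Euclidean local distance, the length normalisation, and the threshold value are assumptions made for illustration.

```python
import math

def dtw_distance(template, features):
    """Length-normalised DTW distance between two sequences of feature vectors."""
    n, m = len(template), len(features)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i template frames
    # with the first j input frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(template[i - 1], features[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # template frame repeated
                                     cost[i][j - 1],      # input frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)

def accept(template, features, threshold):
    """Accept the speaker when the warped distance does not exceed the threshold."""
    return dtw_distance(template, features) <= threshold
```

Here a smaller distance means greater similarity, so acceptance corresponds to the distance falling at or below the threshold. In practice the threshold would be tuned to balance false acceptances against false rejections.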
Various types of features have been used or proposed for speech processing. The types of features used for speech recognition are intended to distinguish one word from another without sensitivity to the speaker, whereas those used for speaker recognition are intended to distinguish between speakers for a known word or words. In general, therefore, a type of feature suitable for one type of recognition may be unsuitable for the other. Some types of feature suitable for speaker recognition are described in “Automatic Recognition of Speakers from their voices”, Atal, Proc IEEE vol 64 pp 460-475, April 1976.