1. Field of the Invention
The present invention relates to speech processing and in particular to processing for speaker recognition.
2. Related Art
Recognition processing includes speaker recognition, in which the identity of the speaker is detected or verified, and speech recognition, in which a particular word (or, sometimes, a phrase or a phoneme, or other spoken matter) is detected. Speech recognition includes so-called speaker-independent recognition, in which speech data derived from multiple speakers is used in recognition processing, and so-called speaker-dependent recognition, in which speech data derived from a single speaker is used in recognition processing. In general, in speech recognition, the processing aims to reduce the effects on the spoken word of different speakers, whereas in speaker recognition the reverse is true.
It is common in recognition processing to input speech data, typically in digital form, to a so-called front-end processor, which derives from the stream of input speech data a more compact, more perceptually significant set of data referred to as a front-end feature set or vector. For example, speech is typically input via a microphone, sampled (e.g. at 8 kHz), digitized, segmented into frames of length 10-20 ms and, for each frame, a set of K coefficients (typically 5-25) is calculated. Since there are N frames (e.g. 25-100) per word, there are N×K (on the order of 1,000) coefficients in a feature vector. In speaker recognition the speaker to be recognized is generally assumed to be speaking a predetermined word, known to the recognition apparatus and to the speaker (e.g. a PIN in banking). A stored representation of the word, known as a template, comprises a reference feature matrix of that word previously derived from a speaker known to be genuine. The input feature matrix derived from the speaker to be recognized is compared with the template and a measure of similarity between the two is compared with a threshold for an acceptance decision.
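As a rough illustration of the framing arithmetic described above, the following Python sketch splits a sampled signal into fixed-length frames and counts the resulting coefficients. The function name `split_into_frames`, the non-overlapping framing, and the choice of K = 20 are assumptions made for illustration only; they are not taken from any particular front-end processor.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a sample sequence into non-overlapping fixed-length frames.

    Non-overlapping frames are an assumption of this sketch; practical
    front ends often overlap adjacent frames.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    n_frames = len(samples) // frame_len           # trailing partial frame is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of (silent) speech sampled at 8 kHz, cut into 20 ms frames.
one_second = [0.0] * 8000
frames = split_into_frames(one_second)

K = 20                      # assumed number of coefficients per frame
N = len(frames)             # number of frames in the word
print(N, len(frames[0]), N * K)  # prints: 50 160 1000
```

With these assumed values a one-second word yields N = 50 frames of 160 samples each, and N×K = 1,000 coefficients, which matches the order of magnitude quoted above.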
A problem arises from the tendency of speakers to vary the rate at which words are spoken, so that an input speech matrix corresponding to a given word may be longer (i.e. consist of more frames) or shorter than the template for that word. It is therefore necessary for the recognition apparatus to time-align the two matrices before a comparison can be made, and one well-known method of time-alignment and comparison is the Dynamic Time Warp (DTW) method described, for example, in "Speaker Independent Recognition of words using Clustering Techniques", Rabiner et al, IEEE Trans. on ASSP, vol 24, no. 4, August, 1979.
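The time-alignment idea can be sketched in Python as follows. This is a minimal textbook form of dynamic time warping, assuming a Euclidean distance between frames and the simplest step pattern (match, insertion, deletion); it does not reproduce the path constraints or normalization of the cited method. The names `frame_distance` and `dtw_distance`, the toy one-coefficient frames, and the threshold value are all hypothetical.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two frames of K coefficients (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(input_frames, template_frames):
    """Accumulated distance along the cheapest time alignment of two matrices."""
    n, m = len(input_frames), len(template_frames)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i input frames
    # with the first j template frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(input_frames[i - 1], template_frames[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # input frame repeated
                                 cost[i][j - 1],      # template frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# The same "word" spoken more slowly (5 frames) than the template (3 frames).
template = [[0.0], [1.0], [2.0]]
slow_input = [[0.0], [0.0], [1.0], [1.0], [2.0]]

threshold = 0.5  # assumed acceptance threshold, for illustration only
accepted = dtw_distance(slow_input, template) <= threshold
```

In this toy case the slower utterance aligns with the template at zero accumulated distance, so it would be accepted despite containing more frames.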
Various features have been used or proposed for recognition processing. In general, since the features used for speech recognition are intended to distinguish one word from another without sensitivity to the speaker whereas those for speaker recognition are intended to distinguish speakers for a known word or words, a feature suitable for one type of recognition may be unsuitable for the other. Some features for speaker recognition are described in "Automatic Recognition of Speakers from their voices", Atal, Proc IEEE vol 64 pp 460-475, April, 1976.
One known type of feature coefficient is the cepstrum. Cepstra are formed by performing a spectral decomposition (e.g. a spectral transform such as the Fourier Transform), taking the logarithm of the transform coefficients, and performing an inverse spectral decomposition.
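The three steps just listed (transform, logarithm, inverse transform) can be sketched as below. This is an illustrative real cepstrum computed with a naive discrete Fourier transform; the function names `dft` and `real_cepstrum`, the use of the magnitude of each transform coefficient, and the small `floor` that guards against taking the logarithm of zero are assumptions of this sketch.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Inverse transform of the log magnitude spectrum of one frame."""
    spectrum = dft(frame)                                        # spectral decomposition
    log_mag = [math.log(max(abs(s), floor)) for s in spectrum]   # logarithm
    return [c.real for c in dft(log_mag, inverse=True)]          # inverse decomposition


# A scaled unit impulse has a flat spectrum of magnitude e, so its log
# spectrum is 1 everywhere and its cepstrum is concentrated at index 0.
cepstrum = real_cepstrum([math.e, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

A practical implementation would use a fast transform rather than the quadratic-time loop shown here; the naive form is used only to keep the sketch self-contained.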
In speaker recognition, the LPC (Linear Prediction Coefficient) cepstrum and FFT (Fast Fourier Transform) cepstrum features are known, the former being more widely used.
In speech recognition, a known feature is the mel-frequency cepstrum coefficient (MFCC). A description of an algorithm for calculating MFCC's, and calculating a distance measure between an MFCC feature vector and a word template using Dynamic Time Warping, is given in "On the evaluation of Speech Recognisers and Data Bases using a Reference System", Chollet & Gagnoulet, 1982 IEEE, International Conference on Acoustics, Speech and Signal Processing, pp 2026-2029, incorporated herein in its entirety (including its references).
An MFCC feature vector in general is derived by performing a spectral transform (e.g. an FFT) on each frame of a speech signal, to derive a signal spectrum; integrating the terms of the spectrum into a series of broad bands, which are distributed on an uneven, so-called 'mel-frequency' scale along the frequency axis; taking the logarithm of the magnitude in each band; and then performing a further transform (e.g. a Discrete Cosine Transform (DCT)) to generate the MFCC coefficient set for the frame. It is found that the useful information is generally confined to the lower order coefficients. The mel-frequency scale may, for example, be frequency bands evenly spaced on a linear frequency scale between 0-1 kHz, and evenly spaced on a logarithmic frequency scale above 1 kHz.
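The derivation described above can be sketched in Python as follows. This is a hedged illustration rather than the algorithm of the cited reference: the function names, the rectangular non-overlapping bands, the band counts (10 linear, 9 logarithmic), the number of retained coefficients, and the `floor` guarding the logarithm are all assumptions. Many practical implementations use overlapping triangular bands instead.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins from 0 Hz up to the Nyquist frequency."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def band_edges_hz(n_linear=10, n_log=9, f_max=4000.0):
    """Band edges: linear spacing up to 1 kHz, logarithmic spacing above."""
    linear = [1000.0 * i / n_linear for i in range(n_linear + 1)]
    log = [1000.0 * (f_max / 1000.0) ** (j / n_log) for j in range(1, n_log + 1)]
    return linear + log


def mfcc(frame, sample_rate_hz=8000, n_coeffs=8, floor=1e-12):
    """Log band magnitudes followed by a DCT-II, keeping the low-order terms."""
    n = len(frame)
    mags = magnitude_spectrum(frame)
    edges = band_edges_hz(f_max=sample_rate_hz / 2)
    log_bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Integrate the spectrum over the half-open band [lo, hi); the
        # Nyquist bin itself therefore falls outside the last band.
        total = sum(m for k, m in enumerate(mags)
                    if lo <= k * sample_rate_hz / n < hi)
        log_bands.append(math.log(max(total, floor)))
    b = len(log_bands)
    return [sum(e * math.cos(math.pi * i * (j + 0.5) / b)
                for j, e in enumerate(log_bands))
            for i in range(n_coeffs)]


# One 20 ms frame (160 samples at 8 kHz) of a 440 Hz tone.
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(160)]
coeffs = mfcc(frame)
louder = mfcc([2.0 * x for x in frame])
```

In this sketch, doubling the amplitude of the frame adds the same constant to every log band value, so only coefficient 0 changes while the higher-order coefficients are unaffected.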
MFCC's eliminate pitch information. This is useful for speech recognition, since pitch varies between speakers, but undesirable for speaker recognition. MFCC's have accordingly not been preferred for speaker recognition.