FIG. 6 is a schematic view illustrating an overview of a general speaker recognition technique. In general, as illustrated in FIG. 6, speaker recognition can be roughly divided into speaker identification and speaker verification. In speaker identification, speech is input (received), the preregistered speaker who uttered the input speech is recognized, and an ID (Identification) of that speaker is output. The ID is an identifier that uniquely specifies the speaker and is assigned to the speaker upon registration. In speaker verification, by contrast, speech and an ID are input, it is determined whether the input speech was uttered by the speaker of the input ID, in other words, whether the speaker is authentic, and either Accept or Reject is output.
NPTL 1 describes an example of a general speaker identification device. FIG. 7 is a block diagram illustrating a schematic structure of a general speaker identification device. As illustrated in FIG. 7, a general speaker identification device includes a registration unit 10 and an identification unit 20. The registration unit 10 includes a feature extraction unit 101 and a learning unit 102.
The feature extraction unit 101 computes, from an input speech, feature amounts that are necessary for speaker identification. Mel-Frequency Cepstrum Coefficients (MFCC) described in NPTL 2 are used as the feature amounts.
The learning unit 102 creates speaker models from the computed feature amounts. The speaker model is a probability model that expresses the features of speech of a speaker. A known Gaussian Mixture Model (GMM) is used for the speaker model. The speaker model is stored in association with an ID of a registered speaker.
The identification unit 20 includes a feature extraction unit 201 and a score computing unit 202. The feature extraction unit 201 has the same function as the feature extraction unit 101 of the registration unit 10 and computes feature amounts necessary for speaker identification from the input speech. The score computing unit 202 compares the computed feature amounts with the speaker models of the preregistered speakers, and outputs the speaker ID associated with the speaker model with the highest score as the identification result. The score is the likelihood of a model with respect to the feature amounts; the higher the likelihood, the more similar the input speech is to the speech of the registered speaker.
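The scoring step described above can be sketched as follows. This is a minimal illustration, not the implementation of NPTL 1: it assumes diagonal-covariance GMMs, uses the average per-frame log-likelihood as the score, and the names gmm_log_likelihood, identify, spk_A and spk_B, as well as all numeric values, are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the frames under one GMM.

    gmm is a list of (weight, mean, var) tuples, one per mixture component.
    """
    total = 0.0
    for x in frames:
        # Log-sum-exp over the components for numerical stability.
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)

def identify(frames, speaker_models):
    """Return the ID whose speaker model scores highest, plus all scores."""
    scores = {
        speaker_id: gmm_log_likelihood(frames, gmm)
        for speaker_id, gmm in speaker_models.items()
    }
    return max(scores, key=scores.get), scores

# Toy 2-dimensional speaker models (assumed values, not trained on real speech).
speaker_models = {
    "spk_A": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "spk_B": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}

# Feature amounts of the input speech (one vector per frame, assumed values).
frames = [[0.2, -0.1], [0.9, 1.2], [0.5, 0.4]]

best_id, scores = identify(frames, speaker_models)
print(best_id, scores)
```

In this toy setting the frames lie close to the component means of spk_A, so that model receives the higher likelihood and its ID is returned as the identification result.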
NPTL 3 describes an example of a general speaker verification device. FIG. 8 is a block diagram illustrating a schematic structure of a general speaker verification device. As illustrated in FIG. 8, a general speaker verification device includes a registration unit 30 and a verification unit 40.
The registration unit 30 includes a feature extraction unit 301, a feature extraction unit 302, and a learning unit 303. The feature extraction unit 301 and the feature extraction unit 302 have the same function, and compute feature amounts necessary for speaker verification from the input speech. The feature extraction unit 301 receives speech of a speaker to be registered and outputs a speech feature amount of the speaker to be registered. In contrast, the feature extraction unit 302 receives speech of a plurality of speakers other than the speaker to be registered and outputs speech feature amounts of the plurality of speakers other than the speaker to be registered. GMM Supervectors (GSV) are used as the feature amounts. As described in NPTL 3, the GSV is a supervector that is obtained by extracting only the average (mean) vectors of a speaker model expressed as a GMM and concatenating them. In other words, a speaker model must first be created from the speech in order to calculate a GSV.
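The construction of a GSV from a speaker model can be sketched as follows. This is a minimal illustration under stated assumptions: the GMM is represented as a list of (weight, mean, variance) tuples, the function name gmm_supervector is hypothetical, and any scaling or normalization that a particular system may apply to the mean vectors is omitted.

```python
def gmm_supervector(gmm):
    """Concatenate the mean vectors of all mixture components, in order.

    The weights and variances of the GMM are deliberately ignored.
    """
    supervector = []
    for _weight, mean, _var in gmm:
        supervector.extend(mean)
    return supervector

# Toy speaker model with two 2-dimensional components (assumed values).
gmm = [
    (0.3, [0.0, 1.0], [1.0, 1.0]),
    (0.7, [2.0, 3.0], [0.5, 0.5]),
]

gsv = gmm_supervector(gmm)
print(gsv)  # [0.0, 1.0, 2.0, 3.0]
```

A GMM with M components of dimension D thus yields a single feature vector of dimension M x D for each utterance.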
The learning unit 303 learns classifiers by treating the feature amount of the speaker to be registered as a positive instance and the feature amounts of the plurality of other speakers as negative instances. Known Support Vector Machines (SVM) are used for learning the classifiers. The SVM is a method of acquiring a plane (a classification plane) that separates the feature points of the positive instances from the feature points of the negative instances. The shortest distance between the classification plane and the feature points is referred to as a margin, and the parameters of the classification plane are learned so as to maximize this margin. NPTL 4 describes a margin maximization criterion of an SVM.
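The notion of a margin can be illustrated with the following sketch. It does not train an SVM; it only computes the margin of two candidate classification planes for a toy data set, assuming a linear plane of the form w . x + b = 0. The function name geometric_margin and all numeric values are hypothetical.

```python
import math

def geometric_margin(w, b, points, labels):
    """Shortest signed distance from the plane w . x + b = 0 to the points.

    labels are +1 (positive instance) or -1 (negative instance). The result
    is positive only if the plane separates the two classes correctly.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy feature points: two positive and two negative instances (assumed values).
points = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
labels = [1, 1, -1, -1]

plane_a = ([1.0, 1.0], -2.0)  # the plane x1 + x2 = 2
plane_b = ([1.0, 0.0], -1.5)  # the plane x1 = 1.5

margin_a = geometric_margin(*plane_a, points, labels)
margin_b = geometric_margin(*plane_b, points, labels)
print(margin_a, margin_b)
```

Both planes separate the toy data, but plane_a leaves a larger margin (about 1.41 versus 0.5), so a margin maximization criterion would prefer it over plane_b.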
The verification unit 40 includes a feature extraction unit 401 and a score computing unit 402. The feature extraction unit 401 has the same function as the feature extraction unit 301 and the feature extraction unit 302 of the registration unit 30 and computes a GSV as a feature amount from the input speech. The score computing unit 402 outputs a binary score (1 or −1) as the verification result, using the computed feature amount and the classifiers relating to the input ID. In this case, a score of 1 means that the input speech and the input ID belong to the same speaker (the principal), while a score of −1 means that the input speech and the input ID belong to different speakers (an impostor).
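The verification decision described above can be sketched as follows. This is a minimal illustration that assumes a linear classifier (w, b) stored per registered ID; the source does not specify the kernel, and the names verification_score, verify and spk_A, the handling of a decision value of exactly zero, and all numeric values are assumptions.

```python
def verification_score(feature, classifier):
    """Return 1 (same speaker) or -1 (different speaker).

    classifier is a (w, b) pair describing the plane w . x + b = 0.
    A decision value of exactly zero is treated as 1 (an assumption).
    """
    w, b = classifier
    decision = sum(wi * xi for wi, xi in zip(w, feature)) + b
    return 1 if decision >= 0 else -1

# Hypothetical classifiers, keyed by the ID assigned at registration.
classifiers = {"spk_A": ([1.0, 1.0], -2.0)}

def verify(feature, claimed_id):
    """Map the binary score for the claimed ID to Accept or Reject."""
    score = verification_score(feature, classifiers[claimed_id])
    return "Accept" if score == 1 else "Reject"

print(verify([2.0, 2.0], "spk_A"))  # Accept
print(verify([0.0, 0.0], "spk_A"))  # Reject
```

Only the classifier associated with the input ID is evaluated, which is what distinguishes verification from identification, where the models of all registered speakers are compared.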
The method of modelling the speech of a speaker using the GMM described in NPTL 1 can be used not only for speaker identification but also for speaker verification. NPTL 3 compares the verification precision of a method based on the GMM with that of a method based on the above-described SVM, and the latter shows higher precision. However, as there is no effective method of using the SVM for speaker identification, methods based on the GMM are mainly used for speaker identification.