This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 11-299745, filed Oct. 21, 1999, the entire contents of which are incorporated herein by reference.
The present invention relates to a speech collating apparatus and a speech collating method for identifying a person by means of speech data.
Generally, to identify a speaker from speech, a speech signal to be collated is converted to an acoustic parameter such as the frequency spectrum before collation, since it is not efficient to compare the speech signal directly with registered speech signals. Other acoustic parameters available for this purpose include the fundamental frequency (pitch frequency), speech energy, formant frequency, the number of zero crossings, and the like.
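By way of illustration only, two of the acoustic parameters named above, speech energy and the number of zero crossings, can be computed per frame as in the following minimal sketch. The function names, the 200-sample frame length, the 8 kHz sampling rate, and the synthetic 200 Hz tone are assumptions made for this example and are not taken from the source.

```python
import math

def frame_signal(samples, frame_len):
    """Split samples into non-overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(x * x for x in frame)

def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples in one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

# Synthetic stand-in for a speech signal (assumption): a 200 Hz tone at 8 kHz.
# The small phase offset keeps samples away from exact zeros.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate + 0.3) for n in range(800)]

frames = frame_signal(tone, 200)
energies = [short_time_energy(f) for f in frames]
crossings = [zero_crossing_count(f) for f in frames]
```

Each frame here spans five periods of the tone, so the zero-crossing count per frame tracks the tone frequency while the energy tracks its amplitude.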
Since these acoustic parameters carry primarily phonetic information and only secondarily personal information, a new characteristic amount unique to a speaker must be created from the acoustic parameters for comparison in order to improve the hit rate of speaker identification.
A conventional speaker identification is performed in the following manner.
FIG. 14 is a flow chart illustrating a procedure of a speaker identification by means of a conventional speech collating apparatus.
(1) An input speech signal uttered for a word is divided into frames of a predetermined unit time, and the frequency spectrum is calculated for each of the frames to derive a time series distribution of the frequency spectra (hereinafter referred to as the “sound spectrogram”) (step C1).
(2) A speech section is detected from the sound spectrogram (step C2).
(3) It is determined whether each part of the speech section is a spoken, a non-spoken, or a silent section in order to extract the spoken sections from the speech section. The speech section is then divided into blocks, each of which corresponds to one of the spoken sections (step C3).
(4) As a characteristic amount unique to the speaker, an additive average of the sound spectrogram in the time direction (hereinafter referred to as the “average spectrum”) is calculated for the blocks (step C4).
(5) It is determined whether the processing is for registration or for collation, and the average spectrum for the blocks is registered as a characteristic amount of a registered speaker when registration is intended (steps C5→C6).
(6) It is determined whether the processing is for registration or for collation, and, when collation is intended, the similarity with respect to the characteristic amount of the registered speaker is calculated with the average spectrum of the blocks used as a characteristic amount of an unknown speaker (steps C5→C7).
(7) The similarity of the unknown speaker to the registered speaker is compared with a previously set threshold value to determine the identity of the registered speaker with the unknown speaker (step C8).
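The conventional procedure of steps C1 to C8 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular apparatus: the spoken/non-spoken/silent determination is replaced by a plain energy floor, all spoken frames are treated as a single block, cosine similarity stands in for the unspecified similarity measure, and the function names, frame length, threshold, and synthetic tones are all hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short example frames)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_len):
    """Step C1: one frequency spectrum per fixed-length frame."""
    return [magnitude_spectrum(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def spoken_frames(spec, energy_floor):
    """Steps C2-C3, simplified (assumption): keep frames above an energy floor."""
    return [f for f in spec if sum(m * m for m in f) > energy_floor]

def average_spectrum(frames):
    """Step C4: additive average of the spectra in the time direction."""
    if not frames:
        raise ValueError("no spoken frames detected")
    return [sum(col) / len(frames) for col in zip(*frames)]

def characteristic(samples, frame_len=64, energy_floor=1.0):
    """Characteristic amount registered (C6) or collated (C7)."""
    return average_spectrum(spoken_frames(spectrogram(samples, frame_len), energy_floor))

def cosine_similarity(a, b):
    """Assumed similarity measure for step C7."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def is_same_speaker(registered, unknown, threshold=0.9):
    """Step C8: compare the similarity with a previously set threshold."""
    return cosine_similarity(registered, unknown) >= threshold

def tone(freq, n=512, rate=8000):
    """Synthetic utterance (assumption): a tone padded with silent frames."""
    silence = [0.0] * 64
    return silence + [math.sin(2 * math.pi * freq * t / rate) for t in range(n)] + silence

registered = characteristic(tone(500))
same = characteristic(tone(500))
other = characteristic(tone(1500))
accepted = is_same_speaker(registered, same)
rejected = not is_same_speaker(registered, other)
```

Because the average spectrum collapses the time axis, two utterances with the same long-term spectrum but different temporal order would yield the same characteristic amount in this sketch.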
As described above, the speaker identification procedure performed by the conventional speech collating apparatus collates a speech signal input by a registered speaker (hereinafter referred to as the “registered speech signal”) with a speech signal input by an unknown speaker for collation (hereinafter referred to as the “unknown speech signal”) by (1) converting the speech signal to the sound spectrogram; (2) detecting a speech section from the sound spectrogram; (3) extracting a spoken section from the detected speech section based on a determination of whether the speech section is a spoken, a non-spoken, or a silent section; and (4) deriving a characteristic amount for each of the blocks into which the extracted spoken section is divided. Thus, calculating the characteristic amount used in the collation processing, which actually determines the identity of the registered speech signal with the unknown speech signal, involves at least four preprocessing stages, so that a large number of processing steps are required for the overall speaker identification processing.
Also, although the conventional speaker identification procedure, which uses the additive average of the sound spectrogram of a block in the time direction as a characteristic amount unique to a speaker, is advantageous in its relatively simple processing, the creation of a stable characteristic amount requires speech signal data for a relatively long period of time. In addition, since information in the time axis direction is compressed, this procedure is not suitable for text-dependent speaker identification. Moreover, since the conventional speaker identification procedure averages the personal information superimposed on the phonetic information together with the phonetic information itself, a sufficient characteristic amount is not obtained. For this reason, an extra characteristic amount must be added to improve the hit rate, which requires an extremely large number of preprocessing steps.
Improving the hit rate therefore entails the problem of an extremely large number of preprocessing steps.
Accordingly, it is an object of the present invention to provide a speech collating apparatus and a speech collating method which are capable of identifying a speaker at a high hit rate without the need for a large number of preprocessing steps.
According to the present invention, there is provided a speech data collating apparatus comprising: data converting means for converting two speech signals subjected to a comparison into two two-dimensional data indicative of speech characteristics of the two speech signals; template placing means for placing a plurality of templates for defining a plurality of areas on one of the two-dimensional data; correlated area detecting means for detecting areas on the other of the two-dimensional data which have a maximum correlation with the plurality of areas on the one of the two-dimensional data corresponding to the plurality of templates; and collation determining means for comparing a mutual positional relationship of the plurality of templates on the one of the two-dimensional data with a mutual positional relationship of the plurality of areas on the other of the two-dimensional data detected by the correlated area detecting means to determine identity between the two speech signals.
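The template placing, correlated area detecting, and collation determining means described above can be illustrated by the following minimal sketch. It is one possible reading under stated assumptions, not the claimed implementation: the two-dimensional data are small synthetic arrays rather than real sound spectrograms, the correlation is a normalized correlation coefficient found by exhaustive search, the positional relationship is compared as offsets relative to the first template with a one-cell tolerance, and all names, sizes, and positions are hypothetical.

```python
import math
import random

def crop(data, top, left, height, width):
    """Cut the rectangular area defined by a template out of 2-D data."""
    return [row[left:left + width] for row in data[top:top + height]]

def correlation(a, b):
    """Normalized correlation coefficient of two equally sized 2-D areas."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0

def best_match(template, data):
    """Return (top, left) of the area of data with maximum correlation."""
    height, width = len(template), len(template[0])
    candidates = [(top, left)
                  for top in range(len(data) - height + 1)
                  for left in range(len(data[0]) - width + 1)]
    return max(candidates,
               key=lambda p: correlation(template, crop(data, p[0], p[1], height, width)))

def collate(one, other, positions, size, tolerance=1):
    """Compare template layout on `one` with the layout of matched areas on `other`."""
    height, width = size
    found = [best_match(crop(one, top, left, height, width), other)
             for top, left in positions]
    ref_top, ref_left = positions[0]
    hit_top, hit_left = found[0]
    for (top, left), (m_top, m_left) in zip(positions[1:], found[1:]):
        if (abs((top - ref_top) - (m_top - hit_top)) > tolerance
                or abs((left - ref_left) - (m_left - hit_left)) > tolerance):
            return False
    return True

def shift_right(row, k):
    """Cyclically shift one row by k cells."""
    return row[-k:] + row[:-k]

# Synthetic 12x12 data standing in for a registered spectrogram (assumption).
rng = random.Random(0)
registered = [[rng.random() for _ in range(12)] for _ in range(12)]
# Same pattern, uniformly displaced: the mutual layout of the areas is preserved.
shifted = [shift_right(row, 2) for row in registered]
# Lower half displaced differently from upper half: the layout is distorted.
warped = [row if r < 6 else shift_right(row, 5) for r, row in enumerate(registered)]

positions = [(1, 1), (1, 7), (8, 3)]   # top-left corners of three 3x3 templates
same = collate(registered, shifted, positions, (3, 3))
different = collate(registered, warped, positions, (3, 3))
```

In this sketch a uniform displacement of the whole pattern leaves the relative positions of the matched areas unchanged, whereas a distortion of one region changes them, which is what the positional comparison is intended to detect.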
Additional objects and advantages of the present invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the present invention.
The objects and advantages of the present invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out hereinafter.