1. Field of the Invention
The present invention relates to an apparatus, method and computer program product for feature extraction that calculate a difference between pitch frequencies from an input speech signal.
2. Description of the Related Art
A differential pitch frequency per unit time is one element of speech prosody information. From the differential pitch frequency information, one can obtain information on accents and intonation, and on whether the speech is voiced or unvoiced. The differential pitch frequency information is therefore adopted in speech recognition devices, voice activity detectors, pitch frequency estimation devices, speaker recognition devices, and the like. A method of obtaining the differential pitch frequency information is described, for example, in Sadaoki Furui, “Dijitaru Onsei Shori (Digital Speech Processing)”, Tokai University Press, pp. 57-59 (1985). According to the method in this document, pitch frequencies are first estimated, and then the amount of change in the pitch frequencies with time is calculated to obtain the differential pitch frequency information.
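The two-step approach described above (estimate a pitch frequency for each frame, then take the change over time) can be illustrated with a minimal Python sketch. It is not the procedure of the cited document: the autocorrelation-peak estimator, the search range `f_min`/`f_max`, the frame shift, and all function names are assumptions chosen only for illustration.

```python
import math

def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def estimate_pitch_hz(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Step 1: pick the autocorrelation peak inside an assumed search range."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    acf = autocorrelation(frame, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: acf[lag])
    return sample_rate / best_lag

def differential_pitch(pitches_hz, frame_shift_s):
    """Step 2: change in pitch frequency per unit time between frames."""
    return [(b - a) / frame_shift_s
            for a, b in zip(pitches_hz, pitches_hz[1:])]

def sine_frame(freq_hz, sample_rate, length):
    """Hypothetical test signal: a pure tone standing in for voiced speech."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(length)]
```

Any error made in step 1 passes directly into step 2, because the difference is taken between the estimated values themselves.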
With the method according to the above document, however, erroneous pitch frequencies may be estimated, and the differential pitch frequency information obtained from these pitch frequencies may therefore also be erroneous. Recently, methods of obtaining differential pitch frequency information that is less influenced by errors in the pitch frequency estimation have been suggested. One such method is described in JP-A 2940835 (KOKAI). According to this document, a cross-correlation function is calculated between the autocorrelation functions of the predictive residuals of speech at times (frames) t and s. Then, a peak of this cross-correlation function is extracted. In this way, differential pitch frequency information can be obtained in which the influence of errors in the pitch frequency estimation is reduced while multiple pitch frequency candidates are taken into consideration.
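The general idea of cross-correlating two autocorrelation functions and extracting the peak can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of JP-A 2940835: the linear-prediction step is omitted, so the autocorrelation is taken over the raw frame rather than the predictive residual, and the lag range, the maximum shift, and all function names are hypothetical. The result is the shift, in samples, between the peak lags of the two frames, which reflects the change in pitch period.

```python
def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def period_shift(frame_t, frame_s, lag_min, lag_max, max_shift):
    """Shift d (in samples) maximizing sum over lag of acf_t[lag] * acf_s[lag + d].

    Assumption: raw frames are used here instead of predictive residuals.
    """
    if max_shift > lag_min:
        raise ValueError("max_shift must not exceed lag_min")
    acf_t = autocorrelation(frame_t, lag_max)
    acf_s = autocorrelation(frame_s, lag_max + max_shift)

    def score(d):
        # Cross-correlation of the two autocorrelation functions at shift d.
        return sum(acf_t[lag] * acf_s[lag + d]
                   for lag in range(lag_min, lag_max + 1))

    return max(range(-max_shift, max_shift + 1), key=score)

def pulse_train(period, length):
    """Hypothetical test signal: one unit pulse every `period` samples."""
    return [1.0 if i % period == 0 else 0.0 for i in range(length)]

# Pitch period goes from 80 samples (frame t) to 64 samples (frame s).
shift = period_shift(pulse_train(80, 640), pulse_train(64, 640),
                     lag_min=20, lag_max=133, max_shift=20)
print(shift)  # -16
```

No single pitch value is selected for either frame; the whole autocorrelation function within the lag range contributes to the score.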
With the method according to JP-A 2940835 (KOKAI), however, the pitch frequency estimation is based on the predictive residuals of the speech. This means that, when the peak of the cross-correlation function is extracted, a peak other than the one corresponding to the differential pitch frequency may be selected under the influence of background noise, which makes it difficult to obtain accurate differential pitch frequency information. Furthermore, in the autocorrelation function of the predictive residuals, multiple peaks appear at integral multiples of the pitch period. If peaks at integral multiples are incorporated, the resulting differential is also multiplied by the same integer. For this reason, to obtain accurate differential pitch frequency information, the range of the autocorrelation function of the predictive residuals that is used to obtain the cross-correlation function needs to be narrowed down to the vicinity of the accurate pitch frequency. This in turn requires that the pitch frequency be calculated in advance and that the range of the pitch frequency be suitably determined in accordance with the voice pitch of the speaker. It is technically difficult, however, to suitably determine the range of the pitch frequency. For this reason, a technology has been sought for obtaining differential pitch frequency information in which the influence of background noise is reduced, without narrowing down the range of the pitch frequency.
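The effect of peaks at integral multiples can be checked with simple arithmetic. The sketch below assumes a sample rate of 8000 Hz and a pitch that moves from 100 Hz to 125 Hz between two frames; these values and the function name are illustrative only.

```python
def lag_shift_at_multiple(period_t, period_s, k):
    """Shift between the k-th autocorrelation peaks of two frames, in samples."""
    return k * period_s - k * period_t

sample_rate = 8000
period_t = sample_rate // 100  # 80 samples at 100 Hz
period_s = sample_rate // 125  # 64 samples at 125 Hz

shifts = [lag_shift_at_multiple(period_t, period_s, k) for k in (1, 2, 3)]
print(shifts)  # [-16, -32, -48]
```

Only the shift for k = 1 corresponds to the actual change in pitch period; matching the second or third peaks yields two or three times that value.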