The present invention, in some embodiments thereof, relates to speech processing and, more specifically, but not exclusively, to determining pitch marks for speech processing.
In speech processing, a continuous speech signal, for example one recorded by a digital microphone, is analyzed to determine the parameters of the signal before further processing of the signal, the speech, and the like. One of the basic parameters is the speech signal's pitch, which is the perceived audible frequency of the speech sound. The pitch comprises a frequency, such as the fundamental frequency of the speech signal, and pitch marks, which are associated with glottal closure instants (GCIs) produced by the vocal cords. As used herein, a pitch mark means a temporal value, such as a time value, and may be relative to a recent event, or an absolute temporal value. A pitch epoch is a window of the speech signal surrounding the GCIs and/or pitch marks. The pitch period may be parameterized in addition to or instead of the pitch frequency, where the pitch frequency is in units of cycles per second, such as Hertz, and the pitch period is in units of seconds, number of samples, and the like. For each pitch epoch, a speech signal section is produced and repeated at the pitch frequency, with possible overlap between the individual speech signal sections. The speech processing may rely on the speech signal, pitch, pitch marks, and/or the like, such as in Time Domain Pitch Synchronous Overlap and Add (TD-PSOLA) processing.
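The relationship between pitch frequency, pitch period, and pitch marks described above may be illustrated by the following minimal sketch. The function names and the example sample rate are assumptions introduced only for illustration and do not appear in the description.

```python
def pitch_period_seconds(pitch_hz):
    """Pitch period in seconds for a pitch frequency given in Hertz."""
    return 1.0 / pitch_hz


def pitch_period_samples(pitch_hz, sample_rate_hz):
    """Pitch period expressed as a (possibly fractional) number of samples."""
    return sample_rate_hz / pitch_hz


def local_pitch_periods(pitch_marks_s):
    """Time between consecutive pitch marks, with marks given in seconds."""
    return [later - earlier
            for earlier, later in zip(pitch_marks_s, pitch_marks_s[1:])]


# Example values (hypothetical): a 100 Hz voice sampled at 16 kHz.
period_s = pitch_period_seconds(100.0)
period_n = pitch_period_samples(100.0, 16000.0)
periods = local_pitch_periods([0.000, 0.010, 0.020, 0.031])
```

In this sketch the pitch marks are absolute time values; marks expressed relative to a recent event would yield the same differences between consecutive marks.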
The quality of synthesized speech, such as text-to-speech (TTS), and/or recorded speech, undergoing prosody and/or other modifications via TD-PSOLA processing, depends on accurate determination of pitch marks. For example, to perform prosody modification with high audible quality for a TTS signal, the consistency of pitch marks should be maintained both between adjacent epochs and over a large number of epochs, such as in avoiding pitch drift, pitch lag, and the like. Reference is now made to FIG. 1, which is a schematic diagram of TD-PSOLA pitch modification of a voiced speech segment. For example, a continuous speech signal 121 is processed to determine pitch values, pitch mark temporal values 120C, such as along a time axis 124, and pitch epochs 120B. Modifying the speech signal 121 of each pitch epoch 120B to decrease the pitch period, such as by decreasing the time between the pitch marks 120C, produces an increase in the pitch of the speech signal 122, and the speech may be heard as having a higher frequency. Modifying the speech signal 121 of each pitch epoch 120B to increase the pitch period, such as by increasing the time between the pitch marks 120C, produces a decrease in the pitch of the speech signal 123, and the speech may be heard as having a lower frequency. As used herein, the term local pitch consistency means the pitch consistency between temporally adjacent pitch epochs. As used herein, the term global pitch consistency means the pitch consistency across a large number of pitch epochs.
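The pitch modification of FIG. 1 may be illustrated by the following simplified sketch of TD-PSOLA-style overlap and add. It is not the method of the embodiments: the function names, the Hann window, the nearest-mark epoch selection, and the normalization by the summed window weights are all assumptions chosen to keep the example short. Pitch marks are given as sorted sample indices, and a factor greater than 1 moves the marks closer together, raising the pitch.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def psola_pitch_shift(signal, marks, factor):
    """Re-space pitch epochs by `factor`; returns (output, new_marks).

    signal: list of float samples.
    marks:  sorted integer sample indices of pitch marks (at least two).
    factor: > 1 shortens the pitch period (higher pitch), < 1 lengthens it.
    """
    out = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    new_marks = []
    k = 0                      # index of the analysis mark nearest to t
    t = float(marks[0])        # position of the next synthesis mark
    while t <= marks[-1]:
        # Advance to the analysis mark closest to the synthesis position.
        while k + 1 < len(marks) and abs(marks[k + 1] - t) <= abs(marks[k] - t):
            k += 1
        # Local pitch period in samples, taken from neighboring marks.
        if k + 1 < len(marks):
            period = marks[k + 1] - marks[k]
        else:
            period = marks[k] - marks[k - 1]
        center = int(round(t))
        new_marks.append(center)
        window = hann(2 * period + 1)
        # Copy a windowed two-period epoch to its new position and add it.
        for j in range(-period, period + 1):
            src, dst = marks[k] + j, center + j
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                w = window[j + period]
                out[dst] += w * signal[src]
                weight[dst] += w
        t += period / factor
    # Normalize by the summed window weights to limit amplitude ripple.
    out = [o / w if w > 1e-8 else 0.0 for o, w in zip(out, weight)]
    return out, new_marks


# Hypothetical test signal: 100-sample period, pitch marks every 100 samples.
signal = [math.sin(2.0 * math.pi * i / 100.0) for i in range(1000)]
marks = list(range(100, 1000, 100))
higher, higher_marks = psola_pitch_shift(signal, marks, 2.0)
lower, lower_marks = psola_pitch_shift(signal, marks, 0.5)
same, same_marks = psola_pitch_shift(signal, marks, 1.0)
```

Because each output epoch is positioned relative to the input pitch marks, an error in a single mark shifts the corresponding epoch, and errors that accumulate across many marks shift whole stretches of the output, which is why both local and global pitch consistency matter.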
The importance of pitch marking in speech processing has resulted in many pitch marking methods being developed. For example, Dikshit et al. describe several of these algorithms in the work titled “An Algorithm for Locating Fundamental Frequency Markers in Speech Signals” published in the Proceedings of Acoustics, Speech, and Signal Processing, 2005 (ICASSP '05) pages 233 to 236, incorporated herein by reference in its entirety. Other algorithms are described by Höge et al. in “Evaluation of Pitch Marking Algorithms” published in the Proceedings of the ITG, Kiel, Germany, 2006, incorporated herein by reference in its entirety.