1. Field of the Invention
The present invention relates to a method for nearly instantaneous detection of human speech pitch pulses, for use with pitch tracking processes, an important part of speech coding.
2. Discussion of the Related Art
Speech coding is used in a number of areas of voice signal processing and has many applications. In one important application, analog waveforms of spoken language are sampled, digitized, and processed using speech bandwidth compression algorithms to produce compressed digital versions of the waveforms for subsequent storage or transmission; such processing is called voice coding (or vocoding). Voice signal analysis and bandwidth compression processes find application in digital transmission, such as that required for telephonic communication over a low bandwidth data channel such as the Internet, and in instruments used by the hearing impaired.
There is a class of sensory aids having tactile sense stimulators to be worn on the body (e.g., on the wrist), for use by deaf persons; the sensory aids are designed to provide deaf persons with access, via the sense of touch, to the acoustic waveform of speech. Intonation patterns in speech, i.e., the patterns and variation in the fundamental frequency of the voice over time, play several roles. For example, the intonation patterns help define where sentences begin and end, they mark the more important words in a sentence, and they sometimes serve to differentiate questions from statements. A wearable tactile sensory aid allows a lip-reading deaf individual to lip read with greater accuracy and improves the quality and intelligibility of self-generated speech responses. As an example, U.S. Pat. No. 4,581,491, issued to Arthur Boothroyd, discloses a wearable tactile sensory aid for providing information on voice pitch and intonation patterns; the entire disclosure of U.S. Pat. No. 4,581,491 is incorporated herein by reference.
One problem encountered in use of the wearable tactile sensory aids of the prior art is a time lag associated with analyzing and encoding the voice pitch and intonation pattern information (within the sensory aid) and communicating voice pitch and intonation information to the wearer through an output stimulator/transducer. More particularly, there is an excessive time lag between the time an input transducer converts the spoken voice signal into an analog electrical waveform and the time at which the output transducer communicates the voice pitch and intonation pattern information to the wearer. The excessive time lag confuses the deaf wearer because some memory of what the wearer has just seen (while lip reading) must be maintained over the duration of the time lag. The tactile sensory aid (or vibrotactile aid) transmits, via an output transducer, an acoustical or vibratory signal having selected characteristics. Vibrotactile vocoders have also been used and include a bank of bandpass filters having outputs to modulate a carrier pulse transmitted using the output transducers. Perception of vibrotactile patterns is an ongoing area of research and, unfortunately, the vocoder concept requires perception of differential amplitude levels of individual stimulators in an output transducer array, but array spacing presents problems which have yet to be solved.
Turning to the more general problem, in speech analyzing systems, information must be extracted from spoken language by deriving the frequency of energy in a speech formant, i.e., the frequency of a formant arising in response to a larynx excitation. Each time the larynx excites the vocal tract, the tract produces a set of exponentially damped sinusoidal waves. The exponentially damped sinusoidal waveform occurs for voiced utterances and includes frequency components generally in three formant ranges. The ranges for the average male are 200 to 1000 hertz, 800 to 2300 hertz, and 2300 to 3800 hertz. Each time the larynx is re-excited, the previous set of sinusoidal waves is usually completely damped because the Q of the previously existing resonant cavity drops virtually to 0 in response to opening of the glottis. Thus, there is virtually no phase interference between waves deriving from adjacent larynx excitations, and the damped sinusoids are easily identified by filters segmenting the frequency ranges occupied by the formants. The periods of formants have thus been an area of interest in speech analyzing systems. For example, U.S. Pat. No. 3,335,225 to Campanella and Coulter, the entire disclosure of which is incorporated herein by reference, discloses a circuit and method for tracking formant periods. By measuring the period of the damped sinusoid following each larynx excitation in the formant of interest, formant frequencies are ascertained. The period is inversely proportional to the formant frequency and can be measured as a function of the time it takes a predetermined number of half cycles of the damped sinusoid to be completed. The length of each half cycle is measured as a function of the time duration between adjacent zero reference crossings.
Thus, in order to accurately measure formant period from the waveform, the first peak of the decaying exponential sinusoid must be accurately detected, and so a pitch pulse (i.e., a pulse indicating the beginning of a new waveform period) must be detected.
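By way of illustration only, the zero-crossing measurement described above may be sketched in Python as follows. The sketch is not taken from U.S. Pat. No. 3,335,225; the function names, the 10 kHz sample rate, the 700 Hz test formant, the decay constant, and the linear interpolation between samples are all assumptions chosen for the example. It times a predetermined number of half cycles of a synthetic damped sinusoid and inverts the resulting period to obtain a frequency estimate.

```python
import math


def damped_sinusoid(freq_hz, decay_per_s, sample_rate_hz, n_samples):
    """Synthetic stand-in for one formant's response to a single larynx excitation."""
    return [
        math.exp(-decay_per_s * n / sample_rate_hz)
        * math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


def zero_crossing_times(samples, sample_rate_hz):
    """Times (in seconds) of sign changes, linearly interpolated between samples."""
    times = []
    for n in range(1, len(samples)):
        a, b = samples[n - 1], samples[n]
        if (a < 0 <= b) or (a > 0 >= b):
            frac = a / (a - b)  # fractional position of the crossing within the interval
            times.append((n - 1 + frac) / sample_rate_hz)
    return times


def formant_frequency(samples, sample_rate_hz, half_cycles=4):
    """Estimate frequency from the time taken to complete `half_cycles` half cycles."""
    times = zero_crossing_times(samples, sample_rate_hz)
    if len(times) < half_cycles + 1:
        raise ValueError("not enough zero crossings in the supplied samples")
    elapsed = times[half_cycles] - times[0]  # duration of `half_cycles` half cycles
    period = 2.0 * elapsed / half_cycles     # one full period is two half cycles
    return 1.0 / period


# Assumed example values: 700 Hz formant, 10 kHz sampling, 10 ms of signal.
samples = damped_sinusoid(700.0, 300.0, 10_000.0, 100)
estimate_hz = formant_frequency(samples, 10_000.0, half_cycles=4)
print(f"estimated formant frequency: {estimate_hz:.1f} Hz")
```

The sketch assumes that the start of the damped sinusoid is already known, which is exactly the information a pitch pulse supplies; if the samples were to straddle a new larynx excitation, the measured half cycles would mix two separate waveforms.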
Prior art methods for detecting the pitch pulse have required excessive time. Acoustical signal processing is usually performed in the digital domain, wherein an analog voice waveform is periodically sampled at a rate high enough to capture the spectrum of interest (e.g., 10 kHz), the sample values are quantized or converted to digital values, and a digital representation of the voice waveform over a selected time interval is stored for later analysis and pitch pulse detection. Digital signal processing algorithms are used in processing the stored or buffered digital representation (for detecting pitch pulses and completing the speech waveform analysis) and may take a significant amount of time to complete, usually many pitch periods, thereby generating the unacceptable time lag discussed above. Many uses of speech coding are hampered, in current practice, by requiring future data samples or a large buffer of data, and by producing only an average indication of pitch rate (thereby discarding information that is useful if natural-sounding reconstruction is desired). Additionally, requiring a large amount of data to be available to track pitch means that buffer based algorithms cannot function in real time without considerable delay in producing an output. For many years, an oft-repeated lament in the field of speech signal analysis has been that if one could only track formants, one could track pitch, or vice versa. The reason for this is that formants can change significantly from one pitch period to the next, and any technique that attempts to track them by analyzing several pitch periods as a group incurs two unpleasant problems. One is time-smearing of the actual formant information, which was changing during the analysis interval. The other is called "pitch ripple," where components of the pitch period and its harmonics pollute the formant information.
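To give a sense of scale for the buffering delay discussed above, the following Python sketch computes how long a buffer-based method must wait before its analysis can even begin. The 100 Hz pitch and the five-period buffer are hypothetical values chosen for the example and are not taken from any cited reference; only the 10 kHz sample rate appears in the discussion above.

```python
def buffering_delay_ms(pitch_hz, periods_buffered):
    """Time, in milliseconds, needed to collect the buffered pitch periods."""
    return 1000.0 * periods_buffered / pitch_hz


def buffer_length_samples(sample_rate_hz, pitch_hz, periods_buffered):
    """Number of samples that must be stored to hold the buffered pitch periods."""
    return round(sample_rate_hz * periods_buffered / pitch_hz)


# Hypothetical case: 100 Hz voice pitch, five pitch periods buffered, 10 kHz sampling.
delay_ms = buffering_delay_ms(100.0, 5)
n_samples = buffer_length_samples(10_000.0, 100.0, 5)
print(f"{n_samples} samples buffered, at least {delay_ms:.0f} ms of delay")
```

The figure is a lower bound under these assumptions, since the time needed to process the buffer is added on top of the time needed to fill it.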
Accordingly, there has been a long-felt need for a method for detecting human speech pitch pulses on a nearly instantaneous basis. To be practicable and economically feasible, the desired method should require a minimum of computational resources and allow the subsequent speech coding and decoding processes to be accomplished efficiently.