Digital coding of human speech for the purposes of compact storage and conservation of transmission bandwidth has been practiced for many years. A principal object of speech coding is to minimize the number of bits per second that must be stored to reproduce speech of acceptable quality in voice answerback and announcement applications. Analog speech samples are customarily partitioned into frames or segments of discrete length, on the order of 20 milliseconds in duration, within which the signal is assumed to be stationary. Sampling is typically performed at a rate of six to ten kilohertz (kHz), and each sample is coded into a multibit digital number. Successive coded samples are further processed in a linear prediction coder (LPC), whose function is to determine the predictor parameters that can be used to estimate the present value of each signal sample efficiently on the basis of the weighted sum of a preselected number of prior sample values. The parameters representing the LPC weights applied to prior sample values are related, as is well known, to the formant structure of the vocal tract transfer function.
The speech signal is regarded analytically as being composed of an excitation signal and a formant transfer function. The excitation component arises in the larynx or voice box, and the formant component results from the operation of the remainder of the vocal tract on the excitation component. The excitation component is further classed as voiced or unvoiced, depending upon whether or not there is a fundamental frequency imparted to the air stream by the vocal cords.
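The framing and prediction steps described above can be illustrated with a minimal sketch. The function names, the 8 kHz sampling rate (one value within the six to ten kHz range mentioned), and the non-overlapping frame layout are assumptions made for illustration only and are not taken from any cited reference.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Partition samples into non-overlapping frames of frame_ms milliseconds.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # 160 samples at 8 kHz, 20 ms
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def predict_sample(history, weights):
    """Estimate the present sample as a weighted sum of prior samples.

    history[-1] is the most recent prior sample; weights[0] applies to it,
    weights[1] to the sample before that, and so on.
    """
    return sum(w * history[-(k + 1)] for k, w in enumerate(weights))


# One second of (silent) signal at 8 kHz yields fifty 20 ms frames.
frames = split_into_frames([0.0] * 8000)
estimate = predict_sample([1.0, 2.0, 3.0], [0.5, 0.25])  # 0.5*3.0 + 0.25*2.0
```

In this sketch the number of entries in `weights` plays the role of the preselected number of prior sample values.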
The LPC coefficients are adapted so as to minimize the mean-square value of the difference between the predicted value and the actual value at each sampling instant. The result is that the coefficient values vary slowly from one speech frame to another. These weighting coefficients, together with a gain factor that accounts for the average speech energy level, constitute the LPC parameters that must be stored and made available to a speech synthesizer. The remaining information required by a speech synthesizer comprises the mode of excitation, i.e., voiced or unvoiced, and the pitch, or fundamental, period of voiced sounds.
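One common way to obtain predictor weights that minimize the mean-square prediction error over a frame is the autocorrelation method solved by the Levinson-Durbin recursion. The sketch below is offered only as an illustration of that general technique; it is not asserted to be the procedure of any cited patent, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the short-time autocorrelation of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for the given autocorrelation values.

    Returns (weights, error). weights[k-1] multiplies the sample k steps
    in the past; error is the remaining mean-square prediction error
    (in the same units as r[0]).
    """
    a = [0.0] * (order + 1)  # a[1..order] hold the predictor weights
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


# Autocorrelation of an idealized first-order process with correlation 0.9.
weights, err = levinson_durbin([1.0, 0.9, 0.81], order=2)
```

For this idealized input the recursion assigns essentially all of the prediction to the most recent sample. The residual error returned alongside the weights is one quantity from which a gain factor could be derived, although that choice is an assumption here.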
Adaptive predictive coding of speech signals is taught by B. S. Atal in his U.S. Pat. No. 3,631,520 granted on Dec. 28, 1971.
It is further known from U.S. Pat. No. 3,740,476 issued June 19, 1973 to B. S. Atal that an adaptive LPC network models the envelope of the speech signal spectrum and can therefore be employed as an inverse filter to subtract the formant structure from the raw speech signal. The resultant residual wave accounts for the fine spectral structure of the speech waveform and approximates the excitation function of the vocal tract. In effect, the speech spectrum is flattened to emphasize the glottal pulses when the excitation is voiced.
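The inverse filtering operation described above can be sketched as follows, under stated assumptions: the vocal tract is modeled as a simple all-pole recursion, the predictor weights are known exactly, and the excitation is an idealized pulse train. The helper names `synthesize` and `inverse_filter` are hypothetical and serve only to illustrate how subtracting the predicted value leaves a residual that approximates the excitation.

```python
def synthesize(excitation, weights):
    """Toy all-pole model: each output is the excitation plus a weighted
    sum of prior outputs (weights[0] applies to the most recent output)."""
    out = []
    for n, e in enumerate(excitation):
        fed_back = sum(w * out[n - k - 1]
                       for k, w in enumerate(weights) if n - k - 1 >= 0)
        out.append(e + fed_back)
    return out


def inverse_filter(samples, weights):
    """Subtract the linear prediction from each sample, leaving the residual.

    Samples before the start of the signal are treated as zero.
    """
    residual = []
    for n, s in enumerate(samples):
        predicted = sum(w * samples[n - k - 1]
                        for k, w in enumerate(weights) if n - k - 1 >= 0)
        residual.append(s - predicted)
    return residual


# Idealized voiced excitation: one pulse every 10 samples.
pulses = [1.0 if n % 10 == 0 else 0.0 for n in range(40)]
speech = synthesize(pulses, [0.9])
residual = inverse_filter(speech, [0.9])
```

In this idealized case the residual reproduces the pulse train to within rounding error, which is the sense in which the pulses are emphasized once the formant shaping is removed. With weights estimated from real speech, the residual would only approximate the excitation.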
It is an object of this invention to provide an improved pitch detector which operates on the residual wave remaining after removal of the vocal tract shaping function from the speech signal.