I. Field of the Invention
The present invention pertains generally to the field of speech processing, and more specifically to a method and apparatus for synthesis of speech from pitch prototype waveforms by time-synchronous waveform interpolation (TSWI).
II. Background
Transmission of voice by digital techniques has become widespread, particularly in long-distance and digital radio telephone applications. This, in turn, has created interest in determining the least amount of information that can be sent over a channel while maintaining the perceived quality of the reconstructed speech. If speech is transmitted by simply sampling and digitizing, a data rate on the order of sixty-four kilobits per second (kbps) is required to achieve the speech quality of a conventional analog telephone. However, through the use of speech analysis, followed by the appropriate coding, transmission, and resynthesis at the receiver, a significant reduction in the data rate can be achieved.
Devices that employ techniques to compress speech by extracting parameters that relate to a model of human speech generation are called speech coders. A speech coder divides the incoming speech signal into blocks of time, or analysis frames. Speech coders typically comprise an encoder and a decoder, or a codec. The encoder analyzes the incoming speech frame to extract certain relevant parameters, and then quantizes the parameters into binary representation, i.e., to a set of bits or a binary data packet. The data packets are transmitted over the communication channel to a receiver and a decoder. The decoder processes the data packets, unquantizes them to produce the parameters, and then resynthesizes the speech frames using the unquantized parameters.
The function of the speech coder is to compress the digitized speech signal into a low-bit-rate signal by removing all of the natural redundancies inherent in speech. The digital compression is achieved by representing the input speech frame with a set of parameters and employing quantization to represent the parameters with a set of bits. If the input speech frame has a number of bits Ni and the data packet produced by the speech coder has a number of bits No, the compression factor achieved by the speech coder is Cr=Ni/No. The challenge is to retain high voice quality of the decoded speech while achieving the target compression factor. The performance of a speech coder depends on (1) how well the speech model, or the combination of the analysis and synthesis process described above, performs, and (2) how well the parameter quantization process is performed at the target bit rate of No bits per frame. The goal of the speech model is thus to capture the essence of the speech signal, or the target voice quality, with a small set of parameters for each frame.
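By way of illustration only, the compression factor Cr=Ni/No can be worked through for one frame. The frame length (20 ms), the input rate (64 kbps), and the coded rate (4 kbps) in the following sketch are assumed figures chosen for the arithmetic, and the function name is hypothetical.

```python
def compression_factor(bits_in: int, bits_out: int) -> float:
    """Return Cr = Ni / No for a single frame."""
    if bits_in <= 0 or bits_out <= 0:
        raise ValueError("bit counts must be positive")
    return bits_in / bits_out

# Assumed figures (not taken from the text): 20 ms frames,
# 64 kbps digitized input, 4 kbps coded output.
FRAME_SECONDS = 0.020
bits_in = round(64_000 * FRAME_SECONDS)    # Ni = 1280 bits per frame
bits_out = round(4_000 * FRAME_SECONDS)    # No = 80 bits per frame
cr = compression_factor(bits_in, bits_out)  # Cr = 16.0
```

Under these assumptions the coder must represent each frame with one sixteenth of the bits of the digitized input.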
A speech coder is called a time-domain coder if its model is a time-domain model. A well-known example is the Code Excited Linear Predictive (CELP) coder described in L. B. Rabiner and R. W. Schafer, Digital Processing of Speech Signals 396-453 (1978), which is fully incorporated herein by reference. In a CELP coder, the short term correlations, or redundancies, in the speech signal are removed by a linear prediction (LP) analysis, which finds the coefficients of a short-term formant filter. Applying the short-term prediction filter to the incoming speech frame generates an LP residue signal, which is further modeled and quantized with long-term prediction filter parameters and a subsequent stochastic codebook. Thus, CELP coding divides the task of encoding the time-domain speech waveform into the separate tasks of encoding the LP short-term filter coefficients and encoding the LP residue. The goal is to produce a synthesized output speech waveform that closely resembles the input speech waveform. To accurately preserve the time-domain waveform, the CELP coder further divides the residue frame into smaller blocks, or sub-frames, and continues the analysis-by-synthesis method for each sub-frame. This requires a high number of bits No per frame because there are many parameters to quantize for each sub-frame. CELP coders typically deliver excellent quality when the available number of bits No per frame is large enough for coding bit rates of 8 kbps and above.
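The LP analysis step described above can be sketched as follows. This is a minimal illustration, assuming the autocorrelation method with the Levinson-Durbin recursion, a prediction order of 2, and a synthetic test frame; none of these choices is specified in the text, and all function names are hypothetical. Long-term prediction and the stochastic codebook are not shown.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n - k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i - k] for i in range(k, n)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, err) where A(z) = sum_k a[k] z^-k and a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a, err

def lp_residue(x, a):
    """e[n] = sum_k a[k] * x[n - k]; samples before the frame are taken as zero."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

# Synthetic stand-in for a voiced frame (an assumption for this example):
# an impulse train with a 40-sample period driving a two-pole resonator.
frame = []
for n in range(160):
    excitation = 1.0 if n % 40 == 0 else 0.0
    y1 = frame[n - 1] if n >= 1 else 0.0
    y2 = frame[n - 2] if n >= 2 else 0.0
    frame.append(excitation + 1.3 * y1 - 0.8 * y2)

coeffs, _ = levinson_durbin(autocorrelation(frame, 2), 2)
residue = lp_residue(frame, coeffs)
```

For this frame the residue carries far less energy than the input, which is the sense in which the short-term filter removes redundancy before the residue is modeled further.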
Waveform interpolation (WI) is an emerging speech coding technique in which for each frame of speech a number M of prototype waveforms is extracted and encoded with the available bits. Output speech is synthesized from the decoded prototype waveforms by any conventional waveform-interpolation technique. Various WI techniques are described in W. Bastiaan Kleijn and Jesper Haagen, Speech Coding and Synthesis 176-205 (1995), which is fully incorporated herein by reference. Conventional WI techniques are also described in U.S. Pat. No. 5,517,595, which is fully incorporated by reference herein. In such conventional WI techniques, however, it is necessary to extract more than one prototype waveform per frame in order to deliver accurate results. Additionally, no mechanism exists to provide time synchrony of the reconstructed waveform. For this reason the synthesized output WI waveform is not guaranteed to be aligned with the original input waveform.
There is presently a surge of research interest and a strong commercial need to develop a high-quality speech coder operating at medium to low bit rates (i.e., in the range of 2.4 to 4 kbps and below). The application areas include wireless telephony, satellite communications, Internet telephony, various multimedia and voice-streaming applications, voice mail, and other voice storage systems. The driving forces are the need for high capacity and the demand for robust performance under packet loss situations. Various recent speech coding standardization efforts are another direct driving force propelling research and development of low-rate speech coding algorithms. A low-rate speech coder creates more channels, or users, per allowable application bandwidth, and a low-rate speech coder coupled with an additional layer of suitable channel coding can fit the overall bit-budget of coder specifications and deliver a robust performance under channel error conditions.
However, at low bit rates (4 kbps and below), time-domain coders such as the CELP coder fail to retain high quality and robust performance due to the limited number of available bits. At low bit rates, the limited codebook space clips the waveform-matching capability of conventional time-domain coders, which are so successfully deployed in higher-rate commercial applications.
One effective technique to encode speech efficiently at low bit rate is multimode coding. A multimode coder applies different modes, or encoding-decoding algorithms, to different types of input speech frames. Each mode, or encoding-decoding process, is customized to represent a certain type of speech segment (e.g., voiced, unvoiced, or background noise) in the most efficient manner. An external mode decision mechanism examines the input speech frame and makes a decision regarding which mode to apply to the frame. Typically, the mode decision is done in an open-loop fashion by extracting a number of parameters out of the input frame and evaluating them to make a decision as to which mode to apply. Thus, the mode decision is made without knowing in advance the exact condition of the output speech, i.e., how similar the output speech will be to the input speech in terms of voice-quality or any other performance measure. An exemplary open-loop mode decision for a speech codec is described in U.S. Pat. No. 5,414,796, which is assigned to the assignee of the present invention and fully incorporated herein by reference.
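An open-loop mode decision of the kind described above might be sketched as follows. The two parameters (frame energy and zero-crossing rate), the thresholds, and the mode names are assumptions made for illustration; they are not the decision rule of the referenced patent.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def choose_mode(frame, energy_floor=1e-4, zcr_voiced_max=0.25):
    """Pick a mode from input-frame parameters only (open loop).

    Both thresholds are hypothetical values chosen for this example.
    """
    if frame_energy(frame) < energy_floor:
        return "background"
    if zero_crossing_rate(frame) <= zcr_voiced_max:
        return "voiced"
    return "unvoiced"

# Illustrative frames of 160 samples each.
voiced_like = [math.sin(2 * math.pi * n / 80) for n in range(160)]
unvoiced_like = [0.1 if n % 2 == 0 else -0.1 for n in range(160)]
quiet = [0.001] * 160
modes = [choose_mode(f) for f in (voiced_like, unvoiced_like, quiet)]
```

The decision uses only parameters of the input frame; nothing is synthesized and compared against the input, which is what distinguishes it from a closed-loop decision.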
Multimode coding can be fixed-rate, using the same number of bits No for each frame, or variable-rate, in which different bit rates are used for different modes. The goal in variable-rate coding is to use only the number of bits needed to encode the codec parameters to a level adequate to obtain the target quality. As a result, the same target voice quality as that of a fixed-rate, higher-rate coder can be obtained at a significantly lower average rate using variable-bit-rate (VBR) techniques. An exemplary variable rate speech coder is described in U.S. Pat. No. 5,414,796, assigned to the assignee of the present invention and previously fully incorporated herein by reference.
Voiced speech segments are termed quasi-periodic in that such segments can be broken into pitch prototypes, or small segments whose length L(n) varies with time as the pitch or fundamental frequency of periodicity varies with time. Such segments, or pitch prototypes, have a strong degree of correlation, i.e., they are extremely similar to each other. This is especially true of neighboring pitch prototypes. It is advantageous in designing an efficient multimode VBR coder that delivers high voice quality at low average rate to represent the quasi-periodic voiced speech segments with a low-rate mode.
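The similarity of neighboring pitch prototypes can be illustrated with a normalized correlation. The sketch below assumes, for simplicity, a constant pitch lag of 40 samples and a synthetic quasi-periodic signal with a slowly decaying amplitude; in real speech the length L(n) varies, and the prototypes would first have to be brought to a common length. The function names are hypothetical.

```python
import math

def split_into_prototypes(signal, lag):
    """Cut the signal into consecutive segments of one pitch period each."""
    return [signal[i:i + lag] for i in range(0, len(signal) - lag + 1, lag)]

def normalized_correlation(a, b):
    """Correlation of two equal-length segments, scaled to [-1, 1]."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

LAG = 40  # assumed constant pitch lag, in samples
signal = [(1.0 - n / 800.0)
          * (math.sin(2 * math.pi * n / LAG) + 0.5 * math.sin(4 * math.pi * n / LAG))
          for n in range(160)]
prototypes = split_into_prototypes(signal, LAG)
neighbor_corr = [normalized_correlation(a, b)
                 for a, b in zip(prototypes, prototypes[1:])]
```

Each value in neighbor_corr is close to one for this signal, reflecting the slow evolution of the waveform shape from one pitch period to the next.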
It would be desirable to provide a speech model, or analysis-synthesis method, that represents quasi-periodic voiced segments of speech. It would further be advantageous to design a model that provides a high quality synthesis, thereby creating speech with high voice quality. It would still further be desirable for the model to have a small set of parameters so as to be amenable to encoding with a small set of bits. Thus, there is a need for a method of time-synchronous waveform interpolation for voiced speech segments that requires a minimal number of bits for encoding and yields a high quality speech synthesis.
The present invention is directed to a method of time-synchronous waveform interpolation for voiced speech segments that requires a minimal number of bits for encoding and yields a high quality speech synthesis. Accordingly, in one aspect of the invention, a method of synthesizing speech from pitch prototype waveforms by time-synchronous waveform interpolation advantageously includes the steps of extracting at least one pitch prototype per frame from a signal; applying a phase shift to the extracted pitch prototype relative to a previously extracted pitch prototype; upsampling the pitch prototype for each sample point within the frame; constructing a two-dimensional prototype-evolving surface; and re-sampling the two-dimensional surface to create a one-dimensional synthesized signal frame, the re-sampling points being defined by piecewise continuous cubic phase contour functions, the phase contour functions being computed from pitch lags and alignment phase shifts added to the extracted pitch prototype.
In another aspect of the invention, a device for synthesizing speech from pitch prototype waveforms by time-synchronous waveform interpolation advantageously includes means for extracting at least one pitch prototype per frame from a signal; means for applying a phase shift to the extracted pitch prototype relative to a previously extracted pitch prototype; means for upsampling the pitch prototype for each sample point within the frame; means for constructing a two-dimensional prototype-evolving surface; and means for re-sampling the two-dimensional surface to create a one-dimensional synthesized signal frame, the re-sampling points being defined by piecewise continuous cubic phase contour functions, the phase contour functions being computed from pitch lags and alignment phase shifts added to the extracted pitch prototype.
In another aspect of the invention, a device for synthesizing speech from pitch prototype waveforms by time-synchronous waveform interpolation advantageously includes a module configured to extract at least one pitch prototype per frame from a signal; a module configured to apply a phase shift to the extracted pitch prototype relative to a previously extracted pitch prototype; a module configured to upsample the pitch prototype for each sample point within the frame; a module configured to construct a two-dimensional prototype-evolving surface; and a module configured to re-sample the two-dimensional surface to create a one-dimensional synthesized signal frame, the re-sampling points being defined by piecewise continuous cubic phase contour functions, the phase contour functions being computed from pitch lags and alignment phase shifts added to the extracted pitch prototype.
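The synthesis steps recited in the foregoing aspects can be sketched, under simplifying assumptions, as follows. This is an illustrative reading and not the claimed implementation: phase is measured in cycles rather than radians, prototypes are evaluated by linear interpolation, the alignment shift is found by a coarse correlation search, the surface is a linear cross-fade between the previous prototype and the aligned current prototype, and the cubic phase contour is fitted to the pitch lags at both frame edges and to an end phase set by the alignment shift. All names and the 64-step search grid are hypothetical.

```python
import math

def sample_cyclic(proto, phase):
    """Value of one pitch cycle at a fractional phase (in cycles)."""
    pos = (phase % 1.0) * len(proto)
    i = int(pos)
    frac = pos - i
    return (1.0 - frac) * proto[i % len(proto)] + frac * proto[(i + 1) % len(proto)]

def alignment_shift(prev, cur, steps=64):
    """Phase shift (cycles) applied to cur that best matches prev."""
    def score(shift):
        return sum(sample_cyclic(prev, k / steps) * sample_cyclic(cur, k / steps + shift)
                   for k in range(steps))
    return max((s / steps for s in range(steps)), key=score)

def cubic_phase_contour(phase0, lag_prev, lag_cur, shift, n):
    """Cubic phase(t) matching start phase, both edge slopes, and end phase."""
    f0, f1 = 1.0 / lag_prev, 1.0 / lag_cur       # cycles per sample at the edges
    target = (-shift) % 1.0                      # end phase implied by the alignment
    nominal = phase0 + n * (f0 + f1) / 2.0       # advance at the average pitch
    phase_end = target + round(nominal - target)  # add whole cycles nearest nominal
    delta = phase_end - phase0 - f0 * n
    g = f1 - f0
    c = (3.0 * delta - g * n) / n ** 2
    d = (g * n - 2.0 * delta) / n ** 3
    return (lambda t: phase0 + f0 * t + c * t * t + d * t ** 3), phase_end

def synthesize_frame(prev, cur, phase0, n):
    """Re-sample the prototype-evolving surface along the phase contour."""
    shift = alignment_shift(prev, cur)
    contour, _ = cubic_phase_contour(phase0, len(prev), len(cur), shift, n)
    out = []
    for t in range(n):
        alpha = t / n                            # position along the time axis
        phi = contour(t)
        out.append((1.0 - alpha) * sample_cyclic(prev, phi)
                   + alpha * sample_cyclic(cur, phi + shift))
    return out

# Example with a 40-sample prototype and a copy rotated by a quarter cycle.
cycle = [math.sin(2 * math.pi * m / 40) for m in range(40)]
rotated = [math.sin(2 * math.pi * (m + 10) / 40) for m in range(40)]
steady = synthesize_frame(cycle, cycle, 0.0, 160)
shift_found = alignment_shift(cycle, rotated)
```

With identical prototypes the contour is linear and the frame reproduces the periodic input. With the rotated prototype the search finds a shift of three quarters of a cycle, and the contour then ends a quarter cycle later than in the steady case, which is how the alignment shift fixes the timing of the synthesized frame in this sketch.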