In the United States, it will become mandatory to take visually impaired persons into consideration when designing mobile phones. Manufacturers of mobile phones must offer phones with a user interface suitable for a visually impaired user. In practice, this means that the menus are “spoken aloud” in addition to being displayed on the screen. It is obviously beneficial to store these audible messages in as little memory as possible. Typically, text-to-speech (TTS) algorithms have been considered for this application. However, to achieve reasonable quality TTS output, enormous databases are needed and, therefore, TTS is not a convenient solution for mobile terminals. With low memory usage, the quality provided by current TTS algorithms is not acceptable.
As an alternative to TTS, a speech coder can be utilized to compress pre-recorded messages. The compressed information is stored in the mobile terminal and decoded there to produce the output speech. For minimum memory consumption, very low bit rate coders are desirable. To generate the input speech signal to the coding system, either human speakers or high-quality (and high-complexity) TTS algorithms can be used.
In a typical speech coder, the input speech signal is processed in fixed-length segments called frames. In current speech coders the frame length is usually 10-30 ms, and a lookahead segment of around 5-15 ms from the subsequent frame may also be available. The frame may further be divided into a number of subframes. For every frame, the encoder determines a parametric representation of the input signal. The parameters are quantized, and transmitted through a communication channel or stored in a storage medium. At the receiving end, the decoder constructs a synthesized signal based on the received parameters, as shown in FIG. 1.
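The frame-based processing described above can be illustrated with a minimal Python sketch. It is not taken from any particular coder: the function name `split_into_frames`, the default 20 ms frame and 10 ms lookahead, and the choice to drop a trailing partial frame are assumptions made only for illustration.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20.0, lookahead_ms=10.0):
    """Split a signal into fixed-length frames, each paired with a lookahead.

    Illustrative only: the 20 ms / 10 ms defaults are assumed values that lie
    inside the 10-30 ms and 5-15 ms ranges mentioned in the text.
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    lookahead_len = int(round(sample_rate_hz * lookahead_ms / 1000.0))
    frames = []
    # Step through the signal one full frame at a time; a trailing partial
    # frame is dropped (an assumption, real coders may pad instead).
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        end = start + frame_len
        frame = samples[start:end]
        # The lookahead is taken from the beginning of the subsequent frame.
        # It is shorter (or empty) near the end of the signal.
        lookahead = samples[end:end + lookahead_len]
        frames.append((frame, lookahead))
    return frames


def coder_delay_ms(frame_ms, lookahead_ms):
    """Algorithmic delay as frame size plus lookahead."""
    return frame_ms + lookahead_ms
```

With 8 kHz sampling, a 20 ms frame holds 160 samples and a 10 ms lookahead holds 80 samples, giving an algorithmic delay of 30 ms under these assumed settings.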
While one underlying goal of speech coding is to achieve the best possible quality at a given coding rate, other performance aspects also have to be considered in developing a speech coder for a particular application. In addition to speech quality and bit rate, the main attributes described in more detail below include coder delay (defined mainly by the frame size plus a possible lookahead), complexity and memory requirements of the coder, sensitivity to channel errors, robustness to acoustic background noise, and the bandwidth of the coded speech. Also, a speech coder should be able to efficiently reproduce input signals with different energy levels and frequency characteristics.
Quantization of the pitch contour is a task that is required in almost all practical speech coders. The pitch parameter is related to the fundamental frequency of speech: during voiced speech, the pitch corresponds to the fundamental frequency and can be perceived as the pitch of speech. During purely unvoiced speech, there is no fundamental frequency in a physical sense and the concept of pitch is vague. In most speech coders, however, the “pitch information” is also needed during unvoiced speech. For example, in coders based on the well-known code excited linear prediction (CELP) approach, the long term prediction lag (roughly corresponding to pitch) is also transmitted during unvoiced portions of speech.
In a typical speech coder, the pitch parameter is estimated from the signal at regular intervals. The pitch estimators used in speech coders can roughly be divided into the following categories: (i) pitch estimators utilizing the time domain properties of speech, (ii) pitch estimators utilizing the frequency domain properties of speech, (iii) pitch estimators utilizing both the time and frequency domain properties of speech.
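As an illustration of category (i), the following Python sketch estimates the pitch of a single frame from its normalized autocorrelation in the time domain. It is a simplified example rather than the estimator of any specific coder: the function name `estimate_pitch_autocorr` and the 60-400 Hz search range are assumptions, and no voicing decision or protection against pitch doubling or halving is included.

```python
import math


def estimate_pitch_autocorr(frame, sample_rate_hz, f_min_hz=60.0, f_max_hz=400.0):
    """Return a pitch estimate in Hz, or None if the frame is silent or too short.

    The search range (60-400 Hz) is an assumed, illustrative choice.
    """
    min_lag = int(sample_rate_hz / f_max_hz)
    max_lag = min(int(sample_rate_hz / f_min_hz), len(frame) - 1)
    if max_lag < min_lag or not any(frame):
        return None

    best_lag = None
    best_score = -math.inf
    for lag in range(min_lag, max_lag + 1):
        n_terms = len(frame) - lag
        # Cross-correlation between the frame and a copy delayed by `lag`.
        corr = sum(frame[n] * frame[n + lag] for n in range(n_terms))
        # Energies of the two overlapping segments, used for normalization.
        e_head = sum(frame[n] ** 2 for n in range(n_terms))
        e_tail = sum(frame[n + lag] ** 2 for n in range(n_terms))
        denom = math.sqrt(e_head * e_tail)
        score = corr / denom if denom > 0.0 else 0.0
        if score > best_score:
            best_score = score
            best_lag = lag
    # The lag with the strongest self-similarity is taken as the pitch period.
    return sample_rate_hz / best_lag
```

Calling the function once per frame, at regular intervals, yields the kind of pitch contour whose quantization is discussed in this section.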
The most common prior-art solution to the quantization of the pitch contour (pitch values estimated at regular intervals) is to use scalar quantization. Typically, a single quantizer is used for all pitch values and the transmission rate is held fixed. Alternative solutions have also been proposed. For example, every second pitch value can be quantized using a scalar quantizer and the values between these can be coded with a differential quantizer. In some existing encoders, the quantizer has two modes, a memoryless mode and a predictive mode. These techniques offer some advantages over the basic approach, but the redundancies are only partially exploited.
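The alternating scheme mentioned above can be sketched in Python as follows. All numeric settings are hypothetical: the 7-bit absolute quantizer over 50-400 Hz, the 4-bit differential quantizer over ±20 Hz, and the choice to code each in-between value relative to the preceding reconstructed value are assumptions made for illustration, not details taken from the prior-art coders.

```python
def make_uniform_quantizer(lo, hi, bits):
    """Build encode/decode functions for a uniform scalar quantizer on [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)

    def encode(value):
        clipped = min(max(value, lo), hi)  # values outside the range saturate
        return int(round((clipped - lo) / step))

    def decode(index):
        return lo + index * step

    return encode, decode


def code_pitch_contour(pitch_hz, abs_bits=7, diff_bits=4,
                       pitch_range=(50.0, 400.0), diff_range=(-20.0, 20.0)):
    """Code even-indexed values absolutely and odd-indexed values differentially.

    Bit allocations and ranges are assumed, illustrative values.
    """
    abs_enc, abs_dec = make_uniform_quantizer(pitch_range[0], pitch_range[1], abs_bits)
    diff_enc, diff_dec = make_uniform_quantizer(diff_range[0], diff_range[1], diff_bits)
    indices, decoded, total_bits = [], [], 0
    for i, pitch in enumerate(pitch_hz):
        if i % 2 == 0:
            # Every second value: plain scalar quantization.
            index = abs_enc(pitch)
            reconstructed = abs_dec(index)
            total_bits += abs_bits
        else:
            # In-between value: quantize the difference from the previous
            # reconstructed value, so that the decoder can track the encoder.
            index = diff_enc(pitch - decoded[-1])
            reconstructed = decoded[-1] + diff_dec(index)
            total_bits += diff_bits
        indices.append(index)
        decoded.append(reconstructed)
    return indices, decoded, total_bits
```

Under these assumed settings, six pitch values cost 3 × 7 + 3 × 4 = 33 bits, compared with 6 × 7 = 42 bits when a single 7-bit scalar quantizer is used for every value. The update rate itself remains fixed in both cases.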
The main drawback of the prior art is that conventional quantization techniques with fixed update rates are inherently inefficient because the transmitted pitch values contain considerable redundancy. The fixed update rate used in the quantization of the pitch parameter is usually rather high (about 50 to 100 Hz) in order to be able to handle cases in which the pitch changes rapidly. However, rapid variations in the pitch contour are relatively rare. Consequently, a much lower update rate could be used most of the time.
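The redundancy argument can be made concrete with a small Python sketch. It is only a rough illustration of the inefficiency, not a quantization method from this document: the helper names, the 7-bit word length, the 3 Hz tolerance, and the synthetic contour are all assumptions.

```python
def pitch_bit_rate(update_rate_hz, bits_per_value):
    """Bit rate in bit/s spent on pitch when every value is sent at a fixed rate."""
    return update_rate_hz * bits_per_value


def count_needed_updates(pitch_hz, tolerance_hz):
    """Count the values that differ from the last transmitted value by more
    than `tolerance_hz`, i.e. the updates a hold-last-value decoder would need.

    The tolerance is an assumed, illustrative threshold.
    """
    if not pitch_hz:
        return 0
    last_sent = pitch_hz[0]
    updates = 1  # the first value always has to be transmitted
    for pitch in pitch_hz[1:]:
        if abs(pitch - last_sent) > tolerance_hz:
            last_sent = pitch
            updates += 1
    return updates


# Synthetic one-second contour sampled at 100 Hz: flat, a short rise, flat.
contour = [120.0] * 40 + [120.0 + 2.0 * i for i in range(1, 11)] + [140.0] * 50
fixed_rate_bits = pitch_bit_rate(100, 7)  # 100 updates/s at an assumed 7 bits
needed = count_needed_updates(contour, tolerance_hz=3.0)
```

For this synthetic contour, only a small fraction of the 100 values transmitted per second carries a change larger than the tolerance, and those values are concentrated in the short interval where the pitch actually moves.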