Human speech is produced by a vocal tract whose natural resonant modes of vibration (formants) depend largely on the exact positions of the articulators, such as the tongue, lips, jaw, and velum. These articulators move during continuous speech, changing the shapes of the pharynx, mouth, and nasal cavities to produce different sounds. Perceptually, roughly the first three formant frequencies determine which vowel is heard, but higher formant frequencies are needed to produce high-quality sound. Three primary modes are typically used to excite the vocal tract: for voiced sounds, broadband semi-periodic puffs of air pass through the glottis and set the vocal cords vibrating; for unvoiced sounds like s, the vocal tract is constricted to produce turbulent, semi-random air flow; and for unvoiced sounds like p, the vocal tract is constricted and the built-up air pressure is then rapidly released.
A simple digital model of speech production may use two sources of excitation: an impulse generator controlled by a pitch-period signal, and a random number generator. The impulse generator produces an impulse (analogous to a puff of air) once every M_o samples, where M_o is the pitch period in samples. The reciprocal of this period is the pitch frequency (the vocal cord oscillation rate). The output of the random number generator simulates the semi-random air turbulence and pressure buildup of unvoiced sources. Because each speech frame is driven by one source or the other, this is referred to as a simple binary model. An alternative excitation model that generally performs better than the simple binary model produces the excitation signal for the vocal tract system by passing a selected noise-like signal through a time-varying pitch synthesis filter. The parameters of the pitch synthesis filter control the degree of periodicity and the period of the excitation signal. This model does not require explicit classification of a speech frame as voiced or unvoiced.
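The two excitation models described above can be illustrated with a minimal Python sketch. The function names, the default pitch period of 80 samples, and the one-tap recursive form e[n] = x[n] + gain * e[n - P] for the pitch synthesis filter are assumptions made for illustration; the text does not specify the filter structure.

```python
import random


def binary_excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Simple binary source: impulse train if voiced, random noise otherwise.

    pitch_period plays the role of M_o (in samples); 80 is an assumed value.
    """
    if voiced:
        # One unit impulse every pitch_period samples.
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    # Gaussian noise stands in for the random number generator output.
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def pitch_filter_excitation(noise, pitch_period, gain):
    """Assumed one-tap pitch synthesis filter: e[n] = x[n] + gain * e[n - P].

    gain sets the degree of periodicity and pitch_period sets the period,
    so no voiced/unvoiced decision is needed.
    """
    out = []
    for n, x in enumerate(noise):
        past = out[n - pitch_period] if n >= pitch_period else 0.0
        out.append(x + gain * past)
    return out
```

With a gain near zero the filter passes the noise-like input almost unchanged (unvoiced-like), while a gain near one makes the output strongly periodic at the chosen period (voiced-like). In the binary model, at an assumed sampling rate of 8000 Hz, a pitch period of 80 samples would correspond to a pitch frequency of 100 Hz.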
With either the simple binary source model or the excitation model using the pitch filter, the excitation is typically applied to a linear, time-varying digital filter that simulates the vocal tract system. The filter coefficients thus specify the vocal tract as a function of time during continuous speech. For example, the filter coefficients may be updated, on average, once every 10 milliseconds to reflect a new vocal tract configuration. The coefficients are usually obtained through linear predictive analysis. Gain control may also be applied to provide a desired acoustic output level.
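As a rough illustration of such a time-varying filter, the following Python sketch applies an all-pole recursion whose coefficients and gain change from frame to frame. The all-pole form s[n] = G * e[n] + sum of a_k * s[n - k], the sign convention of the coefficients, and the frame length of 80 samples (10 milliseconds at an assumed 8000 Hz sampling rate) are illustrative assumptions; the coefficients themselves would come from linear predictive analysis, which is not shown.

```python
def synthesize(excitation, frame_coeffs, frame_gains, frame_len=80):
    """Run an excitation signal through an assumed all-pole vocal tract filter.

    frame_coeffs[i] holds the predictor coefficients a_1..a_p for frame i and
    frame_gains[i] holds its gain G. The last frame's values are reused if the
    excitation is longer than the supplied frames.
    """
    out = []
    for n, e in enumerate(excitation):
        frame = min(n // frame_len, len(frame_coeffs) - 1)
        s = frame_gains[frame] * e
        # Feedback from previously synthesized samples; filter memory is
        # carried across frame boundaries.
        for k, a_k in enumerate(frame_coeffs[frame], start=1):
            if n - k >= 0:
                s += a_k * out[n - k]
        out.append(s)
    return out
```

Carrying the filter memory across frame boundaries avoids a discontinuity each time the coefficients are updated.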
As computer engineering and digital signal processing technology have advanced, there has been an increasing demand for cost-efficient transmission of digital information through communication links. To meet this demand, high-speed packet-switched communication networks have been developed. In a packet-switched network, data, voice, and other informational traffic are separately packetized and then transmitted over the same communication channel. To send voice through a packet-switched network, an analog voice input signal is typically digitized and segmented into speech frames of fixed length. Each speech frame is analyzed and encoded (compressed) into a set of digital parameters. These sets of parameters are packetized and transmitted via the packet-switched network. At the receiving end of the network, the received packets are first de-packetized and then decoded into the parameters, which a speech synthesizer subsequently uses to reproduce an analog voice output.
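The framing and packetizing steps can be sketched in Python as follows. The packet layout (a 16-bit sequence number, a 16-bit parameter count, then 16-bit signed parameters in network byte order) is a hypothetical format chosen for illustration, and the speech encoder that would turn each frame into parameters is not shown.

```python
import struct


def segment(samples, frame_len):
    """Split digitized samples into fixed-length frames, zero-padding the last."""
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame += [0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


def packetize(seq, params):
    """Pack one frame's parameters using the hypothetical layout described above."""
    return struct.pack(f"!HH{len(params)}h", seq, len(params), *params)


def depacketize(packet):
    """Recover the sequence number and parameter list from a packet."""
    seq, count = struct.unpack_from("!HH", packet, 0)
    params = list(struct.unpack_from(f"!{count}h", packet, 4))
    return seq, params
```

At the receiving end, the recovered parameters would be passed to a decoder and speech synthesizer; the sequence number lets the receiver detect packets that were discarded in transit.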
A packet-switched communication network typically multiplexes different information sources onto a single communication channel to maximize bandwidth utilization. During peak transmission periods, however, the network can become congested. When the network is congested, packets are held in the queues of switching nodes, which delays their delivery. A widely used method for relieving network congestion is to discard voice packets. When the discarded voice packets contain perceptually important and/or hard-to-reconstruct speech frames, the reconstructed analog voice output loses clarity. Thus, there is a need for a method and device for prioritizing voice packets such that the voice packets containing perceptually important and/or hard-to-reconstruct speech frames are given a high priority.
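To make the role of such a priority concrete, the following Python sketch shows one way a switching node might discard packets when its queue overflows. The class name, the integer priority field (larger meaning more important), and the discard rule are hypothetical; how the priority of a speech frame would be determined is not covered here.

```python
class VoiceQueue:
    """Bounded queue that discards the lowest-priority packet on overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = []  # (priority, payload) tuples in arrival order

    def enqueue(self, priority, payload):
        """Add a packet; return the discarded (priority, payload), or None."""
        self.packets.append((priority, payload))
        if len(self.packets) <= self.capacity:
            return None
        # Pick the lowest priority; among ties, min() returns the oldest packet.
        victim = min(range(len(self.packets)), key=lambda i: self.packets[i][0])
        return self.packets.pop(victim)
```

Under this rule, a newly arrived low-priority packet can itself be the one discarded, so high-priority frames already in the queue are preserved during congestion.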