Even when two people are speaking the same language and have a good command of that language's vocabulary and grammar, differences between them in their manner of speaking (e.g., accent, pronunciation accuracy, prosody, speed, pitch, cadence, intonation, co-articulation, syllable emphasis, and syllable duration) can affect the ease with which they understand each other's speech.
In theory, it should be possible to process the speech from person A and manipulate it digitally so that the aspects of A's speech that make it hard for B to understand are reduced or eliminated. In practice, it is hard to envision being able to do this reliably for all of the above factors in anything close to real time. This is because appropriate automatic manipulation of most of the above factors cannot be achieved by a straightforward acoustic analysis, and would instead require a syntactic and semantic understanding of what is being said. One exception to this is syllable duration.
Nearly all modern speech-based computer and communication systems transmit, route, or store speech digitally. One obvious advantage of digital techniques over analog is the ability to provide superior audio quality (for example, compact discs versus phonograph records, or digital cellular telephones versus analog). Other advantages include the ability to send many more simultaneous transmissions over a single communications channel, route speech communication through computer-based switching systems, and store the speech on computer disks and in solid-state memory devices.
The following describes techniques that reduce the amount of data required to digitize speech.
Speech Digitization
The simplest way to encode speech digitally is to generate a sequence of numbers that, in essence, trace the ‘ups and downs’ of the original speech waveform. For example, if one wished to digitize a waveform in which all of the important acoustic information is below 4000 Hertz (4000 cycles per second), the basic steps of this analog-to-digital conversion would include the following:
(1) Filter from the original signal all information above 4000 Hertz.
(2) Divide the original signal into 8000 segments per second.
(3) Go through the segments in order, measuring and recording the average amplitude of the waveform within each segment.
The purpose of the first step is to prevent ‘aliasing’—the creation of false artifacts, caused by the undesired interaction of the sampling rate with the frequency of the observed events. The phenomenon in motion pictures, where the spokes of a rapidly rotating wheel may appear to be standing still or even moving backwards, is an example of aliasing.
The second step, sampling at twice the frequency of the highest-frequency sine wave, is necessary in order to capture both the peaks and the valleys of the wave.
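The interaction between the sampling rate and the frequency of the observed signal can be illustrated with a short calculation. The following is a minimal sketch, not part of the described system; the function name alias_frequency is hypothetical. It computes the frequency at which a pure tone appears to lie after sampling, assuming no anti-aliasing filter has been applied.

```python
def alias_frequency(tone_hz, sample_rate_hz):
    """Return the apparent frequency of a pure tone after sampling.

    Once sampled, a tone above half the sampling rate cannot be told apart
    from a lower tone "folded" back into the range 0..sample_rate_hz/2.
    """
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


# With 8000 samples per second, tones at or below 4000 Hertz are preserved.
print(alias_frequency(3000, 8000))  # 3000
# An unfiltered 5000 Hertz tone masquerades as a 3000 Hertz tone.
print(alias_frequency(5000, 8000))  # 3000
# A tone at exactly the sampling rate appears to stand still (0 Hertz),
# much like the wagon wheel that seems motionless on film.
print(alias_frequency(8000, 8000))  # 0
```

This is why the 4000 Hertz filter in the first step and the 8000-per-second rate in the second step must be chosen together.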
To envision the third step more easily, imagine that the original waveform is drawn on a sheet of paper. Within every segment, each of which represents 1/8000 of a second, the height of the waveform is measured with a ruler. The sequence of numbers obtained in this manner constitutes a digital representation of the original waveform.
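The 'ruler' measurement can be sketched in a few lines. The example below is illustrative only, and its names (digitize, tone) are hypothetical. As a simplifying assumption, it measures the waveform at the midpoint of each segment rather than averaging over the segment, and it assumes the input has already been filtered and scaled to the range -1.0 to +1.0.

```python
import math


def digitize(waveform, duration_s, sample_rate_hz=8000, bits=12):
    """Measure a waveform once per segment and round to a signed integer code.

    waveform: function of time (seconds) returning a value in -1.0..+1.0.
    Assumption: one midpoint measurement stands in for the segment average.
    """
    max_code = 2 ** (bits - 1) - 1          # 2047 for a twelve-bit ruler
    n_segments = round(duration_s * sample_rate_hz)
    codes = []
    for n in range(n_segments):
        t = (n + 0.5) / sample_rate_hz      # midpoint of segment n
        value = max(-1.0, min(1.0, waveform(t)))
        codes.append(round(value * max_code))
    return codes


def tone(t):
    """A 1000 Hertz sine wave used as a stand-in for speech."""
    return math.sin(2 * math.pi * 1000 * t)


codes = digitize(tone, duration_s=0.001)    # one millisecond = 8 segments
print(len(codes), codes)
```

Each entry in the resulting list corresponds to one 1/8000-second segment of the original waveform.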
Regarding the ‘ruler’ used to measure within-segment speech amplitudes, speech quality comparable to that of a modern telephone requires twelve bits per segment, 8000 segments per second. (As a point of comparison, audio compact discs use 16 bits per segment, with 44,100 segments per second.) The resulting data rate of 96,000 bits per second means that a typical 1.44 MB floppy diskette can hold only about two minutes of telephone-quality speech.
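The arithmetic behind these figures can be checked directly. In the sketch below, the function name is hypothetical, and the diskette capacity of 1,474,560 bytes (the usual formatted size of a '1.44 MB' diskette) is an assumption not stated in the text. The compact disc figure is per audio channel.

```python
def data_rate_bps(bits_per_segment, segments_per_second):
    """Uncompressed data rate, in bits per second, for one audio channel."""
    return bits_per_segment * segments_per_second


telephone_bps = data_rate_bps(12, 8000)      # 96,000 bits per second
cd_bps_per_channel = data_rate_bps(16, 44100)  # 705,600 bits per second

# Assumption: a '1.44 MB' diskette holds 1440 * 1024 bytes when formatted.
FLOPPY_BYTES = 1440 * 1024
floppy_seconds = FLOPPY_BYTES * 8 / telephone_bps

print(telephone_bps, cd_bps_per_channel)
print(round(floppy_seconds / 60, 2), "minutes of telephone-quality speech")
```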
Modest reductions in the data rate can be achieved by using logarithmic amplitude encoding schemes. These techniques, which represent small amplitudes with greater accuracy than large amplitudes, achieve voice quality equivalent to a standard twelve-bit system with as few as eight bits per segment. Examples include the μ-law (pronounced ‘myoo law’) coding found on many U.S. digital telephones, and the A-law coding commonly used in Europe.
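The effect of logarithmic encoding can be illustrated with the continuous μ-law companding curve, using the customary value μ = 255. The following is a sketch only: the function names are hypothetical, and the simple 8-bit rounding shown here is not the bit-exact segmented encoding used in telephone equipment.

```python
import math

MU = 255.0  # customary value of mu for 8-bit mu-law coding


def mu_law_compress(x):
    """Map an amplitude in -1..+1 onto a logarithmic scale in -1..+1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode_8bit(x):
    """Compress, then round to one of the integer codes -127..+127."""
    return round(mu_law_compress(x) * 127)


def decode_8bit(code):
    """Recover an approximate amplitude from an integer code."""
    return mu_law_expand(code / 127)


for x in (0.01, 0.9):
    error = abs(decode_8bit(encode_8bit(x)) - x)
    print(f"amplitude {x}: code {encode_8bit(x)}, error {error:.5f}")
```

In this sketch the small amplitude is reproduced with a much smaller absolute error than the large one, which is the property the text describes.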
For many applications in which the cost of transmission or the cost of storage is important, such as wireless telephony or voice mail systems, the data rate reductions achieved with simple μ-law and A-law encoding are inadequate. One way to achieve significant reductions in the data rate is to extract and digitize the frequency content of the waveform (rather than simply digitize the shape of the waveform).
Many coders that work in this manner have software components that map to physical components of the human vocal mechanism. They reduce the data rate by encoding only the parameters that control the changeable components of the speech production model—for example, the parameter that controls overall amplitude and the parameter that adjusts the fundamental pitch of the electronic ‘vocal cords.’
The Human Speech Production Mechanism
Given that many components in these coders have physiological counterparts, it is helpful to understand the human vocal mechanism prior to examining the coders.
The major physical components of the human speech mechanism include the lungs, the vocal cords, and the vocal cavity. When a person speaks, the lungs force air past the vocal cords and through the vocal cavity. The pressure with which the air is exhaled determines the final amplitude, or ‘loudness,’ of the speech. The action of the vocal cords on the breath stream determines whether the speech sound will be voiced or unvoiced.
Voiced speech sounds (for example, the ‘v’ sound in ‘voice’) are produced by tensing the vocal cords while exhaling. The tensed vocal cords briefly interrupt the flow of air, releasing it in short periodic bursts. The greater the frequency with which the bursts are released, the higher the pitch.
Unvoiced sounds (for example, the final ‘s’ sound in ‘voice’) are produced when air is forced past relaxed vocal cords. The relaxed cords do not interrupt the air flow; the sound is instead generated by audible turbulence in the vocal tract. A simple demonstration of the role of the vocal cords in producing voiced and unvoiced sounds can be had by placing one's fingers lightly on the larynx, or voice box, while slowly saying the word ‘voice’; the vocal cords will be felt to vibrate for the ‘v’ sound and for the double vowel (or diphthong) ‘oi’ but not for the final ‘s’ sound.
The mechanisms described above produce what is called the excitation signal for speech. Many properties of the excitation signal will differ when comparing one person to another. However, when examining a single individual, only three parameters in the excitation signal will vary as the person speaks: the amplitude of the sound, the proportion of the sound that is voiced or unvoiced, and the fundamental pitch. This can be demonstrated easily. If one were to hold one's mouth wide open, without any movement of the jaw, tongue, or lips, the only remaining changeable characteristics of sound generated by the vocal system are the above three parameters.
At any given time, excitation signals actually contain sounds at many different frequencies. A voiced excitation signal is periodic. The energy in its frequency spectrum lies at multiples of the fundamental pitch, which is equal to the frequency with which the vocal cords are vibrating. An unvoiced excitation signal contains a random mixture of frequencies similar to what is generally called white noise.
The vocal cavity ‘shapes’ the excitation signal into recognizable speech sounds by attenuating certain frequencies in the signal while amplifying others. The vocal cavity is able to accomplish this spectral shaping because it resonates at frequencies that vary depending on the positions of the jaw, tongue, and lips. Frequencies in the excitation signal are suppressed if they are not near a vocal cavity resonance. However, vocal cavity resonances tend to amplify, or make louder, sounds of the same frequency in the excitation signal. The resulting spectral peaks in the speech sounds are called formants. Typically, only the three or four lowest-frequency formants will be below 5000 Hertz. These are the formants most important for intelligibility.
(The upper frequency limit for many audio communication systems, including the public telephone system in the United States, is on the order of 3400 Hertz. This is why speech sounds that differ chiefly in their upper-frequency formant structure, such as ‘f’ and ‘s’, tend to be hard to distinguish on these systems.)
For spoken English, a simple classification of speech sounds according to manner of formation would include vowel, nasal, fricative, and plosive sounds. In the formation of vowels, such as the ‘ee’ sound in ‘speech’ and the diphthong ‘oi’ in ‘voice,’ the breath stream passes relatively unhindered through the pharynx and the open mouth. In nasal sounds, such as the ‘m’ and ‘n’ in ‘man,’ the breath stream passes through the nose. Fricative sounds are produced by forcing air from the lungs through a constriction in the vocal tract so that audible turbulence results. Examples of fricatives include the ‘s’ and ‘ch’ sounds in ‘speech.’ Plosive sounds are created by the sudden release of built-up air pressure in the vocal tract, following the complete closure of the tract with the lips or tongue. The word ‘talk’ contains the plosive sounds ‘t’ and ‘k’. Except when whispering, the vowel and nasal sounds of spoken English are voiced. Fricative and plosive sounds may be voiced (as in ‘vast’ or ‘den’) or unvoiced (as in ‘fast’ or ‘ten’).
Speech Compression
The parameters computed by coders that follow this vocal tract model fall into two categories: those that control the generation of the excitation signal, and those that control the filtering of the excitation signal.
Two different signal-generating mechanisms are required in order to produce a human-like excitation signal. One mechanism generates a periodic signal that simulates the sound produced by vibrating human vocal cords. The other produces a random signal, similar to white noise, that is suitable for modeling unvoiced sounds. Thus, when a voiced sound must be produced, such as the ‘ee’ in ‘speech,’ the output from the periodic signal generator is used; for the unvoiced ‘sp’ and ‘ch’ sounds in ‘speech,’ the random output from the other generator is used.
In some systems, a weighted combination of the random and periodic excitation is used. This can be helpful in modeling voiced fricative sounds, such as the ‘z’ sound in the word ‘zoo.’ However, many coders restrict the excitation so that it is modeled entirely by either the voiced or unvoiced excitation source. In these coders, selection of the excitation is controlled by a two-valued voicing parameter, typically referred to as the voiced/unvoiced decision.
In addition to the voiced/unvoiced decision, the excitation function is scaled by an amplitude parameter, which adjusts its loudness. Finally, if the system is to generate something other than a monotone, it is necessary for the period of the voiced excitation source to be variable. The parameter that controls this is called the pitch parameter. In summary, three parameters are sufficient to control a simple excitation model (i.e., a model that does not take into account vocal tract differences among people): an amplitude parameter; a voiced/unvoiced parameter; and, if voiced, a pitch parameter that specifies the fundamental periodicity of the speech signal.
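A simple three-parameter excitation model of this kind might be sketched as follows. The function name and defaults are hypothetical. As assumptions not taken from the text, the periodic source is modeled as a train of single-sample pulses and the random source as uniformly distributed noise.

```python
import random


def excitation(n_samples, amplitude, voiced, pitch_hz=100.0,
               sample_rate_hz=8000, seed=0):
    """Generate an excitation sequence from the three control parameters.

    amplitude: overall loudness scale factor
    voiced:    True selects the periodic source, False the random source
    pitch_hz:  fundamental frequency; only used when voiced is True
    """
    if voiced:
        # Assumption: one pulse per pitch period stands in for the
        # periodic bursts of air released by the vocal cords.
        period = round(sample_rate_hz / pitch_hz)
        return [amplitude if n % period == 0 else 0.0
                for n in range(n_samples)]
    # Assumption: uniform random values stand in for white noise.
    rng = random.Random(seed)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


voiced_signal = excitation(240, amplitude=0.8, voiced=True, pitch_hz=100.0)
unvoiced_signal = excitation(240, amplitude=0.8, voiced=False)
print([n for n, v in enumerate(voiced_signal) if v != 0.0])  # [0, 80, 160]
```

Raising pitch_hz shortens the spacing between pulses, which corresponds to a higher fundamental pitch.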
Various techniques have been used to simulate the manner in which the human vocal cavity imposes a particular spectral shape on the excitation signal. One of the first techniques developed uses a bank of bandpass filters, similar in many respects to the adjustable multi-band ‘graphic equalizers’ found on some high-end stereo systems. The center frequencies of these filters are fixed; an adjustment in the gain of each filter or channel allows the desired spectrum to be approximated, in much the same way that the spectral characteristics of a stereo system may be varied by adjusting the tone controls.
The chief drawback to this approach is the large number of filters it requires. The number of filters can be reduced if it is possible to control their center frequencies. Specifically, by matching the center frequencies of filters to the desired formant frequencies, one can encode speech with only three or four tunable bandpass filters. The important point here is that, even though the center frequencies of the filters must now be encoded along with the gains of the filters, the total number of parameters required for accurate shaping of the excitation signal is reduced greatly.
Although early speech synthesis systems relied on analog mechanisms to filter and shape the excitation signal, modern speech compression systems rely entirely on digital filtering techniques. With these systems, the decoded speech signal heard at the receiving end is the output of a digitally controlled filter that has as its input the appropriate excitation sequence. Digital control of the filter is accomplished through the use of a mathematical model—in essence, an equation with constants and variables, in which the desired spectral filtering is specified by setting the appropriate values for the variables. Great reductions in the data transmission rate are achievable with this approach because the same mathematical model is pre-loaded into both the encoder and the decoder. Therefore, the only data that must be transmitted are the relatively small number of variables that control the model.
A good example is the technique known as linear prediction, in which speech samples are generated as a weighted linear combination of previous output samples and the present value of the filter input. This yields the following expression for each output sample (S[i]) as a function of previous samples (S[i−1], S[i−2], . . . , S[i−n]), the prediction weights (A[1], A[2], . . . , A[n]) and the filter input (U[i]):
S[i] = A[1]S[i−1] + A[2]S[i−2] + . . . + A[n]S[i−n] + U[i]
The filter input in this equation (U[i]) is the product of the amplitude parameter and the excitation sequence. The total number of coefficients in the equation (n) determines how many spectral peaks, or formants, may be approximated.
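The linear prediction expression can be evaluated directly, one output sample at a time. The following is a minimal sketch with a hypothetical function name. It assumes that output samples before the start of the sequence are zero, and it uses zero-based lists, so weights[k - 1] holds A[k].

```python
def lpc_synthesize(weights, filter_input):
    """Compute S[i] = A[1]S[i-1] + ... + A[n]S[i-n] + U[i] for every i.

    weights:      prediction weights A[1]..A[n] (weights[k - 1] is A[k])
    filter_input: the sequence U[i] (amplitude times excitation)
    Assumption: output samples before the start of the sequence are zero.
    """
    n = len(weights)
    output = []
    for i, u in enumerate(filter_input):
        s = u
        for k in range(1, n + 1):
            if i - k >= 0:
                s += weights[k - 1] * output[i - k]
        output.append(s)
    return output


# A single pulse into a one-weight filter decays geometrically.
print(lpc_synthesize([0.5], [1.0, 0.0, 0.0, 0.0]))        # [1.0, 0.5, 0.25, 0.125]
# With two weights, each output depends on the two samples before it.
print(lpc_synthesize([1.0, -0.5], [1.0, 0.0, 0.0, 0.0]))  # [1.0, 1.0, 0.5, 0.0]
```

Only the weights and the parameters that describe the filter input need to be transmitted, because the decoder carries the same equation.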
Once the complete set of parameters (amplitude, voicing, pitch, and spectral parameters) has been specified, a speech decoder can produce a constant speech-like sound. In order to generate intelligible natural-sounding speech, the model parameters need to be updated as often as 40 to 50 times each second. To envision this process, it is helpful to recall how motion pictures work: apparent motion—in this case, a smoothly varying speech sound, rather than a smoothly varying image—is achieved by updating with sufficient frequency what are, in fact, still images. (Some systems that store speech in this format, such as Avaya's Intuity™ AUDIX® multimedia messaging system, allow users to adjust the playback rate without the shift in tone that would accompany, for example, playing a 33⅓ RPM phonograph record at 45. This is accomplished by adjusting how long each set of speech production parameters stays ‘in the gate’ before being updated, in much the same way that ‘slow motion’ is achieved with motion pictures.)
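The playback-rate adjustment described in the parenthetical can be sketched as a change in how many output samples each parameter set controls. The example below is an illustration only, with hypothetical names and an assumed update rate of 50 parameter sets per second. It is not a description of how any particular product implements the feature.

```python
def samples_per_frame(speed=1.0, frames_per_second=50, sample_rate_hz=8000):
    """Number of output samples generated from each set of parameters.

    speed: 1.0 is normal playback, 0.5 is half speed, 2.0 is double speed.
    The pitch parameter inside each frame is left unchanged, so the tone
    of the voice does not shift when the speed changes.
    """
    return round(sample_rate_hz / (frames_per_second * speed))


def playback_seconds(n_frames, speed=1.0, sample_rate_hz=8000):
    """Total playback time for a message of n_frames parameter sets."""
    return n_frames * samples_per_frame(speed) / sample_rate_hz


print(samples_per_frame())            # 160 samples per frame at normal speed
print(samples_per_frame(speed=0.5))   # 320: each frame stays 'in the gate' twice as long
print(playback_seconds(500, speed=2.0))  # 5.0 seconds instead of 10.0
```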
One of the first products to incorporate this style of speech compression was a children's learning aid introduced by Texas Instruments in 1978, the Speak & Spell®. It used ten-coefficient Linear Predictive Coding (LPC-10) to model speech. The data rate for this LPC-10 model was 2400 bits per second. (The actual data rate in the Speak & Spell is considerably less than 2400 bits per second because a one-bit repeat code was used when adjacent parameters were judged to be sufficiently similar.) This low data rate was achieved, in part, by ‘hard-wiring’ the excitation parameters that tend to vary from person to person. This meant that, if people's vocal tract characteristics differed from those that had been built into the speech production model, their voices could not be reproduced without distortion.
The ability to model a wide variety of voices accurately—as well as a variety of non-voice sounds, such as TTY/TDD tones—is achieved by systems in which the excitation function is not hard-wired, but is instead under software control. A good example is the Intuity AUDIX voice messaging system, which uses Code-Excited Linear Prediction (CELP) to model speech. The data rate for typical CELP-based systems ranges from 4800 bits per second to 16,000 bits per second. (The higher data rates are seen more frequently in systems where it is important to maximize the speech quality or reduce the computational complexity of the coder.) Compared with similar-quality uncompressed digitized speech, these techniques yield data rate reductions of at least six-to-one, and as high as twenty-to-one.
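The stated reduction ratios follow from the data rates given earlier. The sketch below uses a hypothetical function name and assumes that the uncompressed reference is the twelve-bit, 8000-segment-per-second telephone-quality rate of 96,000 bits per second.

```python
# Assumption: the reference is twelve bits per segment, 8000 segments per second.
UNCOMPRESSED_BPS = 12 * 8000


def reduction_ratio(coded_bps, reference_bps=UNCOMPRESSED_BPS):
    """How many times smaller the coded data rate is than the reference."""
    return reference_bps / coded_bps


print(reduction_ratio(16000))  # 6.0  (highest CELP rate cited)
print(reduction_ratio(4800))   # 20.0 (lowest CELP rate cited)
print(reduction_ratio(2400))   # 40.0 (the LPC-10 rate cited for comparison)
```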