A vocoder is a speech analyzer and synthesizer. The human voice consists of sounds generated by the opening and closing of the glottis by the vocal cords, which produces a periodic waveform. This basic sound is then modified by the nose and throat to produce differences in pitch in a controlled way, creating the wide variety of sounds used in speech. There is another set of sounds, known as the unvoiced and plosive sounds, which are not modified by the mouth in this fashion.
The vocoder examines speech by finding this basic frequency, the fundamental frequency, and measuring how it changes over time as someone speaks. This results in a series of numbers representing these modified frequencies at each moment of the recording. In doing so, the vocoder dramatically reduces the amount of information needed to store speech, from a complete recording to a series of numbers. To recreate speech, the vocoder simply reverses the process, creating the fundamental frequency in an oscillator, then passing it into a modifier that changes the frequency based on the originally recorded series of numbers.
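The analysis step described above can be sketched as a frame-by-frame pitch tracker. This is a minimal illustration rather than any particular vocoder's method: it swaps in a plain autocorrelation peak search, and the function names, the 60 to 400 Hz search range, and the 40 ms frame length are all assumptions made for the example.

```python
def estimate_fundamental_hz(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency of one frame from its autocorrelation peak.

    f_min and f_max bound the search range; the defaults are illustrative
    assumptions. Returns 0.0 when no positive peak is found (e.g. silence).
    """
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def track_fundamental(samples, sample_rate, frame_ms=40):
    """Split a recording into frames and return one pitch estimate per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [
        estimate_fundamental_hz(samples[start:start + frame_len], sample_rate)
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]
```

The list returned by `track_fundamental` plays the role of the "series of numbers" in the text: one value per frame instead of hundreds of raw samples. Real speech would need a more robust estimator, since a plain autocorrelation peak is prone to octave errors.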
Unfortunately, the actual qualities of speech cannot be reproduced so easily. In addition to a single fundamental frequency, the vocal system adds a number of resonant frequencies, known as formants, that give the voice its character and quality. Without capturing these additional frequencies and their corresponding qualities, the vocoder will not sound authentic.
To address this, most vocoder systems use what is effectively a bank of coders, each tuned to a different frequency by a band-pass filter. The filter outputs are stored not as raw values, which all derive from the original fundamental frequency, but as the series of modifications needed to turn that fundamental into the signal seen in each filter. During playback these settings are fed back into the filters and the outputs are added together, adjusted with the knowledge that speech typically varies between these frequencies in a fairly linear way. The result is recognizable speech, although somewhat “mechanical” sounding. Vocoders also often include a second system for generating unvoiced sounds, using a noise generator instead of the fundamental frequency.
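The analysis side of such a filter bank can be sketched as follows. This is only an approximation under stated assumptions: a naive discrete Fourier transform stands in for the band-pass filters, the function name `band_energies` is hypothetical, and the band edges in the usage note are chosen purely for illustration.

```python
import cmath
import math


def band_energies(frame, sample_rate, band_edges_hz):
    """Return the energy of one frame in each [low, high) frequency band.

    A naive DFT stands in for a bank of band-pass filters: each bin's
    squared magnitude is added to the band that contains its frequency.
    """
    n = len(frame)
    energies = [0.0] * (len(band_edges_hz) - 1)
    for k in range(1, n // 2):  # positive-frequency bins, DC skipped
        freq = k * sample_rate / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)
        )
        for b in range(len(energies)):
            if band_edges_hz[b] <= freq < band_edges_hz[b + 1]:
                energies[b] += abs(coeff) ** 2
                break
    return energies
```

Calling `band_energies(frame, 8000, [300, 800, 1500, 2500, 4000])` on successive frames yields four numbers per frame, one per band. A synthesizer would use those numbers to scale the output of matching filters driven by the oscillator or the noise generator.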
Standard speech recording systems capture a frequency band from about 300 Hz to 4 kHz, where most of the frequencies used in speech reside. This requires 64 kbit/s of bandwidth: the Nyquist criterion sets the sampling rate at twice the highest frequency, and standard telephone PCM uses 8 bits per sample. In digitizing operations, the sampling rate is the frequency with which samples are taken and converted into digital form. The Nyquist rate is the minimum sampling frequency, which is twice the highest analog frequency being captured. For example, the sampling rate for high-fidelity playback is 44.1 kHz, slightly more than double the 20 kHz upper limit of human hearing. The sampling rate for digitizing voice for a toll-quality conversation is 8,000 samples per second, or 8 kHz, twice the 4 kHz required for the full spectrum of the human voice. The higher the sampling rate, the more closely the analog signal is represented in digital form.
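The arithmetic behind these figures is short enough to write out. In the sketch below the function names are hypothetical, and the 8 bits per sample is an assumption taken from standard telephone PCM rather than a number given in the text.

```python
def nyquist_rate_hz(highest_frequency_hz):
    """Minimum sampling rate needed to capture the given highest frequency."""
    return 2 * highest_frequency_hz


def pcm_bit_rate_bps(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed PCM: samples per second times bits per sample."""
    return sample_rate_hz * bits_per_sample


voice_rate = nyquist_rate_hz(4000)             # 8000 samples per second
voice_bits = pcm_bit_rate_bps(voice_rate, 8)   # 64000 bit/s, assuming 8-bit samples
audio_minimum = nyquist_rate_hz(20000)         # 40000, just under the 44.1 kHz in use
```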
Conventional low bit rate vocoders (below 4800 bits per second) use a decision process to determine whether the excitation is voiced (produced by the vocal cords) or unvoiced (hiss or white noise) and, if voiced, to measure the vocal pitch. The short-term spectrum and the voiced-pitch or unvoiced indication are transmitted over a digital link, with a new frame approximately every 20 milliseconds. At the receiver, the reconstructed spectrum generator is excited by the pitch signal or by white noise, and speech is reproduced.
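A per-frame decision of this kind can be sketched with two simple measurements. This is not the decision process of any specific vocoder: the zero-crossing test, the thresholds, and the function names are assumptions made for illustration only.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame, zcr_threshold=0.25, energy_floor=1e-4):
    """Label one frame as 'silence', 'voiced', or 'unvoiced'.

    Noise-like (unvoiced) sounds change sign far more often than periodic
    (voiced) ones. Both thresholds are illustrative assumptions.
    """
    energy = sum(x * x for x in frame) / len(frame)
    if energy < energy_floor:
        return "silence"
    return "unvoiced" if zero_crossing_rate(frame) > zcr_threshold else "voiced"


def bits_per_frame(bit_rate_bps, frame_ms):
    """Bits available to describe one frame at a given link bit rate."""
    return bit_rate_bps * frame_ms // 1000
```

At 4800 bits per second with 20 ms frames, `bits_per_frame(4800, 20)` gives 96 bits to carry the spectrum, the voicing flag, and the pitch for each frame, which shows how tight the budget is. Fixed thresholds like these are also one reason such decisions can misfire on speech unlike the material they were tuned on.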
One of the disadvantages of conventional vocoders is their dependence on the voiced/unvoiced decision and on accurate pitch estimation. For English speakers, voice quality is usually acceptable because the algorithms were developed using English speakers, but for other languages these low bit rate vocoders do not sound natural. Higher bit rate voice-excited vocoders do not require any voiced/unvoiced decision or pitch tracking, and they preserve intelligibility and speaker identification. The principle of operation is to encode the first formant speech band and use it to provide the excitation input to the spectrum generator. A formant is any of several frequency regions of relatively great intensity in a sound spectrum, which together determine the characteristic quality of a vowel sound.
The vocal tract is characterized by a number of resonances, or formants, which shape the spectrum of the excitation function; typically three lie below 3000 Hz. The first formant band contains all excitation components, both periodic (voiced) and nonperiodic (unvoiced).
The first formant is encoded using pulse code modulation (PCM); the remainder of the speech spectrum is then analyzed, and the excitation and speech spectrum are transmitted every 20 to 25 milliseconds. The received first formant is decoded and used as excitation for the spectrum generator to produce natural-sounding speech. These vocoders typically need 8000 bits per second or more to achieve this quality.