There are approximately 890,000 distinct words in the English language, but in general only 10,000 are in the vocabulary of the common educated person. In addition, some words are used much more frequently than others. For example, the top twenty most frequently used words in spoken English are: the, and, I, to, of, a, you, that, in, it, is, yes, was, this, but, on, well, he, have, and for.
It has been estimated that a typical person speaks at a rate of approximate 4 words per second, and that the average word is made of 6.66 phonemes. This means that approximately either 4 words or 27 phonemes per second must be transmitted to accurately convey the information. Definitions vary, but spoken English can be represented by approximately 50 distinct phonemes. Therefore, each of the phonemes can be represented distinctly as a 6-bit number. If phonemes were transmitted as a representation of the speech, approximately 162 bits/second would be required.
As an alternative to transmitting symbols representing phonemes, symbols representing the actual words can be transmitted. Estimates vary, but an educated person has a vocabulary of 10,000 words. A single 15-bit number can be assigned to each of the commonly used words (and word forms) in the English dictionary. If a person speaks at 4 words/second, then 60 bits/second would be necessary to represent the speech using this approach. As a further enhancement to this technique, shorter bit strings may be used to represent the most commonly used words, and even the most commonly used groups of words (“and the” for example). This technique may reduce the required bit rate to as little as 30 bits/second.
The human vocal tract can be represented as a glottal pulse train convolved through a vocal tract convolutional filter (of approximately 10 coefficients). The glottal pulse train represents the pitch of the speech and the filter coefficients determine the other sound characteristics. The pitch and the filter coefficients change as one speaks so each glottal pulse is convolved through a slightly different filter as one speaks to generate the sounds we hear. In an artificial speech generator, changing or updating the coefficients and pitch about 30 times/second is sufficient to generate natural sounding speech. Certain sounds, such as “ssss” or “zzz” do not contain the glottal pulse (are unvoiced), and can be represented as a sound directly from the filter, or with a much higher pitch frequency. Any given person will speak with a certain range of filter coefficients and glottal pulse shapes and frequency, giving them their particular speech sound. As one speaks, this range can be modeled and passed to the speech regenerator to help reconstitute speech that sounds like the original speaker. By passing only the range of pitch and filter coefficients, but not the coefficients themselves, little bandwidth is required to mimic the original speaker.
Prior art patents relating to the present invention include the following patents: U.S. Pat. No. 7,124,082, “Phonetic speech-to-text-to-speech system and method”, Freedman, 2006; U.S. Pat. No. 6,035,273, “Speaker-specific speech-to-text/text-to-speech communication system with hypertext-indicated speech parameter changes”, Spies, 1996; U.S. Pat. No. 5,724,410, “Two-way voice messaging terminal having a speech to text converter”, Parvulescu, 1998.