Human speech is effected by the unique interaction of the lungs, trachea (windpipe), larynx, pharyngeal cavity (throat), oral cavity (mouth), and nasal cavity. The pharyngeal and oral cavities are known as the vocal tract. The vocal folds (cords), soft palate or velum, tongue, teeth, and lips move to different positions to produce various speech sounds and are known as articulators. Depending on the type of excitation by the larynx and lungs, two types of sounds can be produced, namely voiced and unvoiced sounds or utterances. As used herein, an “utterance” refers to any speech component that is uttered or audibly expressed by a person, including sentences, phrases, words, portions of words, and letters. Voiced speech sounds (for example, the “V” sound in “voice”) are produced by tensing the vocal cords while exhaling. The tensed vocal cords briefly interrupt the flow of air, releasing it in short periodic bursts. The greater the frequency with which the bursts are released, the higher the pitch.
Unvoiced sounds (for example, the final “S” sound in “voice”) are produced when air is forced past relaxed vocal cords. The relaxed cords do not interrupt the air flow; the sound is instead generated by audible turbulence in the vocal tract. A simple demonstration of the role of the vocal cords in producing voiced and unvoiced sounds can be had by placing one's fingers lightly on the larynx, or voice box, while slowly saying the word “voice.” The vocal cords will be felt to vibrate for the “V” sound and for the double vowel (or diphthong) “oi” but not for the final “S” sound.
Except when whispering, all vowel and nasal sounds in spoken English are voiced. Plosive sounds, also known as stops, may be voiced or unvoiced. Examples of voiced plosives include the sounds associated with “B” and “D.” Examples of unvoiced plosives include the sounds associated with “P” and “T.” Fricative sounds may also be voiced or unvoiced. Examples of voiced fricatives include the sounds associated with “V” and “Z.” Examples of unvoiced fricatives include the sounds associated with “F” and “S.”
The movement and location of the tongue, jaw, and lips are identical for the “B” and “P” sounds, the only difference being whether the sounds are voiced. The same is true of the “D” and “T” pair, the “V” and “F” pair, and the “Z” and “S” pair. For this reason, accurate detection of the presence or absence of voicing is essential in order to identify the sounds of spoken English correctly.
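By way of illustration only, the distinction drawn above between periodic bursts (voiced) and turbulence (unvoiced) suggests one simple way a voicing decision might be sketched in software: measure how strongly a short frame of audio repeats itself at a plausible pitch period. The following Python sketch uses normalized autocorrelation, which is a common textbook heuristic and is not a method stated in this description. The function names, the 50 to 400 Hz pitch range, and the 0.5 threshold are all assumptions chosen for the example.

```python
import math
import random


def voicing_score(frame, sample_rate, min_pitch_hz=50.0, max_pitch_hz=400.0):
    """Return the peak normalized autocorrelation over candidate pitch lags.

    Values near 1 suggest a periodic (voiced) frame; values near 0 suggest
    noise-like (unvoiced) content. The pitch range is an assumed default.
    """
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0  # silence: no evidence of voicing
    min_lag = int(sample_rate / max_pitch_hz)
    max_lag = min(int(sample_rate / min_pitch_hz), len(frame) - 1)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, corr / energy)
    return best


def is_voiced(frame, sample_rate, threshold=0.5):
    """Classify a frame as voiced if its score reaches the assumed threshold."""
    return voicing_score(frame, sample_rate) >= threshold


# Synthetic stand-ins (not real speech): a 100 Hz tone for a voiced frame
# and seeded uniform noise for an unvoiced, turbulence-like frame.
SAMPLE_RATE = 8000
FRAME_LEN = 400
tone = [math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(FRAME_LEN)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(FRAME_LEN)]

print("tone voiced: ", is_voiced(tone, SAMPLE_RATE))
print("noise voiced:", is_voiced(noise, SAMPLE_RATE))
```

A fixed threshold of this kind is exactly the sort of setting that may suit one speaker, or one point in a breath, better than another; real speech would also require framing, windowing, and silence handling that this sketch omits.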
People having severe injuries, particularly to their cervical or thoracic spine, and people having degenerative neuromuscular diseases, such as Amyotrophic Lateral Sclerosis (also known as Lou Gehrig's Disease), can have difficulty pronouncing voiced sounds, particularly at or near the end of a breath. Such people tend to pronounce words differently depending on where they are in the breath stream. This is because, immediately after they inhale fully and begin to speak, they tend to exhale more forcefully than is the case toward the end of the breath. Such people typically speak more softly and more rapidly toward the end of the breath stream. Consequently, though words tend to be pronounced accurately toward the beginning of the breath stream, plosives and fricatives that should be voiced are often pronounced unvoiced as the person reaches the end of the breath stream. By way of illustration, a “d” sound may be pronounced more like a “t”, a “v” sound more like an “f”, a “z” more like an “s”, and so on.
This tendency of people with certain disabilities to pronounce the same sound differently at different degrees of lung deflation can pose substantial obstacles to their use of Automatic Speech Recognition (ASR) systems. In ASR systems, it is common to compute “confidence levels” that are intended to indicate the likelihood that an utterance was understood correctly. When the computed confidence levels are low, it is common for systems to query the user regarding which of the “best guess” matches was correct. ASR techniques assume that a person's manner of speech remains fairly consistent from the beginning to the end of the breath stream. This assumption fails to consider that, under the same conditions, the voice characteristics of people with certain disabilities can be a moving target. Illustratively, assume that an ASR system's best guess about a spoken word is “pat” (the “p” and “t” both being unvoiced plosives). If the speaker is disabled in the manner described above and is close to the start of the exhaled breath stream, the likelihood that the intended word really is “pat” is high. This means that, in most cases at the start of the breath stream, the ASR system is correct in assuming that the utterance is “pat”. However, if the utterance occurs toward the end of the breath stream, the ASR system has a much lower likelihood of being correct in assuming that the utterance is “pat” rather than “bad”, “pad”, or “bat”. “B” and “d” are voiced plosive sounds that correspond to the unvoiced plosives “p” and “t”. The inability of ASR systems to self-adjust appropriately for these individuals tends to decrease the usability and usefulness of these systems.
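The “pat” illustration can be made concrete with a short Python sketch that enumerates every word obtained by swapping each consonant for its voiced or unvoiced counterpart, using only the pairs named above. This is a simplified illustration and not a description of any particular ASR system: letters stand in for speech sounds, and the names `voicing_variants` and `needs_confirmation`, as well as the 0.6 threshold, are hypothetical.

```python
from itertools import product

# Unvoiced -> voiced counterparts named in the text (letters stand in for sounds).
VOICING_PAIRS = {"p": "b", "t": "d", "f": "v", "s": "z"}

# Symmetric lookup so that each sound maps to its counterpart in either direction.
COUNTERPART = {**VOICING_PAIRS, **{v: u for u, v in VOICING_PAIRS.items()}}


def voicing_variants(word):
    """Return all spellings that differ from `word` only in consonant voicing."""
    options = [
        (ch, COUNTERPART[ch]) if ch in COUNTERPART else (ch,)
        for ch in word
    ]
    return sorted("".join(chars) for chars in product(*options))


def needs_confirmation(confidence, threshold=0.6):
    """Return True when a confidence level is low enough to query the user.

    The threshold is an arbitrary value chosen for illustration.
    """
    return confidence < threshold


print(voicing_variants("pat"))  # ['bad', 'bat', 'pad', 'pat']
```

The four results are the candidates that become difficult to tell apart when voicing is unreliable, which is the situation described above for utterances near the end of the breath stream.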