Transmission of information from humans to machines has traditionally been achieved through manually operated keyboards, which presupposes machines having dimensions at least as large as the comfortable finger-spread of two human hands. With the advent of electronic devices that require information input but are smaller than traditional personal computers, information input began to take other forms, such as pen pointing, touchpads, and voice commands. The information that can be transmitted by pen pointing and touchpads is limited by the display capabilities of devices such as personal digital assistants (PDAs) and mobile phones. Therefore, significant research effort has been devoted to speech recognition systems for electronic devices.
One approach to speech recognition by machine is for the machine to attempt to decode a speech signal waveform based on the observed acoustic features of the signal and the known relation between acoustic features and phonetic sounds. This acoustic-phonetic approach has been the subject of research for almost 50 years, but has not resulted in much success in practice (see Fundamentals of Speech Recognition, L. Rabiner & B. H. Juang, Prentice-Hall). Problems abound. For example, it is known in the speech recognition art that even in a speech waveform plot, "it is often difficult to distinguish a weak, unvoiced sound (like 'f' or 'th') from silence, or a weak, voiced sound (like 'v' or 'm') from unvoiced sounds or even silence," and there are large variations depending on the identity of the closely neighboring phonetic units, the so-called coarticulation of sounds (ibid.). After the decoding, the acoustic-phonetic approach attempts to determine the word by use of the so-called phoneme lattice, which represents a sequential set of phonemes that are likely matches to the spoken input.
The vertical position of a phoneme in the lattice is a measure of the goodness of the acoustic match to the phonetic unit ("lexical access"). But "the real problem with the acoustic-phonetic approach to speech recognition is the difficulty in getting a reliable phoneme lattice for the lexical access stage" (ibid.); that is, it is almost impossible to label an utterance accurately because of the large variations inherent in any language.
In the pattern-recognition approach, a knowledge base of versions of a given speech pattern is assembled ("training"), and recognition is achieved by comparing the input speech pattern with the speech patterns in the knowledge base to determine the best match. The paradigm has four steps: (1) feature extraction using spectral analysis, (2) pattern training to produce reference patterns for an utterance class, (3) pattern classification to compare unknown test patterns with the class reference pattern by measuring the spectral "distance" between two well-defined spectral vectors and aligning the time axes to compensate for the different rates of speaking of the two patterns (dynamic time warping, DTW), and (4) decision logic whereby similarity scores are utilized to select the best match. Pattern recognition requires heavy computation, particularly for steps (2) and (3), and for large numbers of sound classes it often becomes computationally prohibitive.
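Steps (3) and (4) of this paradigm can be illustrated with a minimal sketch. It is not taken from the cited reference; the function names (`dtw_distance`, `classify`), the Euclidean frame distance, and the toy one-dimensional "spectral" frames are all assumptions chosen for illustration. Feature extraction and training (steps (1) and (2)) are assumed to have already produced the frame sequences.

```python
import math


def euclidean(a, b):
    """Distance between two feature vectors (assumed stand-in for a spectral distance)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(test, ref, dist=euclidean):
    """Accumulated distance along the best time alignment of two frame sequences.

    Classic dynamic-programming DTW: each test frame may repeat, advance
    together with the reference, or wait while the reference advances.
    """
    n, m = len(test), len(ref)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning test[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(test[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # test frame repeats a ref frame
                                 cost[i][j - 1],      # ref frame repeats a test frame
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(test, references):
    """Decision logic: pick the reference label with the smallest DTW distance."""
    scores = {label: dtw_distance(test, ref) for label, ref in references.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical reference patterns, one per utterance class.
references = {
    "yes": [[0.0], [1.0], [2.0], [1.0], [0.0]],
    "no": [[2.0], [2.0], [0.0], [0.0], [2.0]],
}
# The same shape as "yes", spoken more slowly (each frame stretched in time).
slow_yes = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0], [0.0]]
label, scores = classify(slow_yes, references)
```

The nested loop visits every pair of test and reference frames, and the comparison is repeated for every reference pattern. That cost grows with utterance length and vocabulary size, which is consistent with the computational burden noted above.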
Therefore, systems relying on the human voice for information input, because of the inherent vagaries of speech (including homophones, word similarity, accent, sound level, syllabic emphasis, speech pattern, background noise, and so on), require considerable signal processing power and large look-up table databases in order to attain even minimal levels of accuracy. Mainframe computers and high-end workstations are beginning to approach acceptable levels of voice recognition, but even with the memory and computational power available in present personal computers (PCs), speech recognition for those machines is so far largely limited to given sets of specific voice commands. For devices with far less memory and processing power than PCs, such as PDAs, mobile phones, toys, and entertainment devices, accurate recognition of natural speech has hitherto been impossible. For example, a typical voice-activated cellular phone allows preprogramming by reciting a name and then entering an associated number. When the user subsequently recites the name, a microprocessor in the cell phone attempts to match the voice pattern of the recited name with the stored number. As anyone who has used present-day voice-activated cell phones knows, the match is sometimes inaccurate (due to inconsistent pronunciation, background noise, and inherent limitations due to lack of processing power), and only about 25 numbers can be stored. In PDA devices, device manufacturers must perform extensive redesign to achieve even very limited voice recognition (for example, present PDAs cannot search a database in response to voice input).
As for spelling words for voice input, there is the problem of the confusable sets: {A,J,K}, {B,C,D,E,G,P,T,V,Z}, {Q,U}, {I,Y}, and {F,S,X}. The members of each set can generally be discriminated only on the basis of a small, critical portion of the utterance. Conventional recognition relies on a simple distortion score accumulated over the entire duration of the utterance, yielding a binary "yes" or "no" decision. Such a score does not place sufficient emphasis on the critical portions, resulting in poor recognition accuracy. An obvious approach would be to weight the critical portions, but this method has not achieved high recognition accuracy and carries a heavy computational burden.
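The difference between an accumulated score and a weighted score can be illustrated with a toy sketch. Everything in it is hypothetical: the scalar "frames", the reference patterns for the letters B and D, the weights, and the function names are invented for illustration, and the frames are assumed to be already time-aligned. It shows only why uniform accumulation can be swamped by the long shared vowel, not a method that achieves high accuracy in practice.

```python
def accumulated_distortion(test, ref, weights=None):
    """Sum of per-frame distortions; uniform weights give the conventional score."""
    if len(test) != len(ref):
        raise ValueError("frames are assumed to be time-aligned and equal in number")
    if weights is None:
        weights = [1.0] * len(test)
    return sum(w * abs(t - r) for w, t, r in zip(weights, test, ref))


def best_match(test, references, weights=None):
    """Return the label whose reference has the smallest (weighted) distortion."""
    return min(
        references,
        key=lambda label: accumulated_distortion(test, references[label], weights),
    )


# Hypothetical patterns: two onset frames (the critical portion) followed by
# eight frames of the shared "ee" vowel, which differs only slightly.
references = {
    "B": [1.0, 1.0] + [5.0] * 8,
    "D": [2.0, 2.0] + [5.5] * 8,
}
# A spoken "B": the onset is close to the B reference, but the vowel happens
# to sit nearer the D reference.
spoken_b = [1.2, 1.2] + [5.4] * 8

# Uniform accumulation: the eight vowel frames outweigh the two onset frames.
uniform_choice = best_match(spoken_b, references)

# Assumed weights that emphasize the critical onset frames.
onset_weights = [1.0, 1.0] + [0.1] * 8
weighted_choice = best_match(spoken_b, references, onset_weights)
```

In this toy case the uniform score selects D, while the weighted score selects B. The sketch presumes the critical frames are known in advance; locating them for each confusable pair is part of the extra computation that weighting requires.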
In summary, the memory and computation necessary for accurate and fast voice recognition also require increased electrical power and complex operating systems, all of which carry increased cost. Thus, present voice recognition technology is not feasible for mobile communication devices because of the weight, electrical power requirements, complexity, and cost it entails.