Automated Speech Recognition (ASR) technology has evolved continuously over the past several decades. However, error rates in an ideal system that performs conversion of human speech to digital text, or that performs recognition of discrete human speech utterances, remain fundamentally dependent upon a number of performance factors, including the desired degree of speaker independence, the desired size of the vocabulary of words or phrases to be recognized, and the allowable time separation between spoken words.
In a non-ideal, physically realizable speech recognition system, further errors are introduced by the equipment and processes used to capture, transmit, and process human speech. These include audio noise, distortion, frequency and phase response errors, sampling, quantization, and digital signal processing errors, as well as transmission errors and latency introduced by a communication system used to convey speech information from the speaker to the conversion system.
The cost of speech recognition in realizable systems increases dramatically with desired performance, particularly when performance goals are high across multiple performance factors. Moreover, interactive applications requiring rapid speech recognition are more costly to implement than non-interactive applications. Realization costs are mitigated when processing time can be extended, as in non-interactive applications.
Single-speaker-dependent systems, in which training time is allowed both for phonetic recognition and for establishing personal vocabularies and word use patterns, i.e., grammar rules, can greatly reduce errors. Such systems are not, however, inherently lower in cost. To the contrary, personalization of the conversion process usually entails additional hardware and software requirements.
Traditional speech recognition applications have not been broadly accepted. Directory services and customer care call centers have implemented automated speech recognition (ASR) systems using limited, pre-defined vocabularies to automate information retrieval or perform context-sensitive interactive dialog with callers over the telephone. These systems seek to perform speech recognition for any caller over a telephony connection, and generally do not perform well due to the large variations in speech patterns among callers. Errors induced by telephony equipment and networks also contribute to the poor performance of these systems. Telephony applications of ASR are particularly challenging due to the limited sample rate and digitizing resolution of telephony CODECs (coder/decoders), and the high potential for transmission errors. Mobile telephony further introduces higher ambient noise from the use of mobile handsets.
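The effect of limited digitizing resolution can be illustrated with a minimal sketch, given here in Python. It assumes values typical of narrowband telephony (8 kHz sampling and 8-bit mu-law companding), none of which are stated above, and it uses the continuous mu-law formula rather than the bit-exact segment encoding of a real CODEC. The function names `mulaw_encode`, `mulaw_decode`, and `roundtrip_snr_db` are hypothetical.

```python
import math

MU = 255             # mu-law companding constant (assumed, typical for 8-bit telephony)
SAMPLE_RATE = 8000   # samples per second (assumed narrowband telephony rate)


def mulaw_encode(x, bits=8):
    """Compress a sample in [-1, 1] with continuous mu-law, then quantize it."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** bits - 1
    return round((y + 1.0) / 2.0 * levels)  # integer code in 0..levels


def mulaw_decode(code, bits=8):
    """Expand an integer code back to an approximate sample in [-1, 1]."""
    levels = 2 ** bits - 1
    y = code / levels * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def roundtrip_snr_db(samples, bits=8):
    """Signal-to-quantization-noise ratio, in dB, after encode and decode."""
    signal = sum(s * s for s in samples)
    noise = sum(
        (s - mulaw_decode(mulaw_encode(s, bits), bits)) ** 2 for s in samples
    )
    return 10.0 * math.log10(signal / noise)


# A 440 Hz test tone, 0.1 s long, sampled at the assumed telephony rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(800)]
```

Calling `roundtrip_snr_db(tone, bits=8)` and `roundtrip_snr_db(tone, bits=6)` shows the quantization noise growing as resolution falls. Independently of quantization, a sample rate of 8 kHz cannot represent speech energy above 4 kHz, which removes cues that help distinguish some consonants.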
Speaker-dependent speech recognition software is currently commercially available for free-standing computers such as desktop PCs, but such software has received only limited acceptance, mainly among handicapped individuals and in specialized hands-free industrial automation and customer service applications. For the vast majority of desktop computer users, simply typing narrative information as text using a standard keyboard is far superior to ASR. Graphical user interfaces and pointing devices make the navigation and selection of context-sensitive interactive applications straightforward as well, minimizing the need for speech recognition for selecting command options.
A rather successful, single-speaker-dependent, limited-vocabulary, high-word-separation ASR application is available in current generation mobile telephones. Speech commands can be recognized by mobile phones that compare a prior audible entry with a current spoken entry to retrieve a specific phone number. This application has the advantages of being single-speaker-dependent, using a small vocabulary, and permitting a relatively unnatural time separation between spoken words.
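The comparison of a prior audible entry with a current spoken entry can be sketched as template matching. The following Python sketch uses dynamic time warping (DTW), one common technique for comparing utterances spoken at different speeds; the text above does not say which comparison method such phones actually use, so DTW is an assumption made for illustration. The function names, the label strings, and the one-dimensional feature frames are all hypothetical.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two sequences of equal-dimension feature frames."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cumulative distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # frame of a repeated against b
                cost[i][j - 1],      # frame of b repeated against a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def match_entry(templates, utterance, reject_above=None):
    """Return the label of the closest stored template, or None if rejected."""
    best_label, best_dist = None, float("inf")
    for label, frames in templates.items():
        dist = dtw_distance(frames, utterance)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if reject_above is not None and best_dist > reject_above:
        return None
    return best_label


# Hypothetical stored entries: one recorded template per phone book label.
templates = {
    "home": [[0.0], [1.0], [2.0], [1.0]],
    "office": [[3.0], [3.0], [0.0], [0.0]],
}
# The same "home" pattern spoken more slowly (frames repeated).
spoken = [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]]
```

Here `match_entry(templates, spoken)` selects the `"home"` template because the warping absorbs the slower speaking rate. The sketch also suggests why the approach suits a small, single-speaker vocabulary: every stored template is compared against every utterance, and the templates capture only one speaker's voice.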
Extending the application found in current generation mobile telephones to the more general application of converting narrative natural speech to text, or of reliably detecting a wide variety of context-sensitive or context-insensitive commands, remains very difficult to accomplish, particularly within a handheld or portable device. Conversion of natural narrative speech having very large and usually very specialized vocabularies, with short and sometimes indiscernible spoken word intervals, remains a technical challenge for even the most advanced and powerful speech recognition systems. Today, and for the foreseeable future, such natural and flexible narrative speech conversion will require, even for a single-speaker-dependent system, powerful computing systems running complex software.
The pervasive availability of public wireless telephony and wireless information networks has increased the demand for, and the utility of, speech recognition functionality in handheld computing devices and mobile phones. Such devices are designed for real-time voice communication, and for viewing digital text and graphical information such as email and web pages. However, these devices remain particularly difficult to use for the entry of text information due to their small physical size and their context of use, which is usually mobile, such as use while walking, driving, or traveling.
Text entry is extremely difficult in these situations. Numeric keys (0 through 9) are tedious to use for alphabetic character entry. So-called QWERTY keyboards, which are available in these units, are relatively small and are suitable only for limited entry of text. Handwriting recognition software using touch-screen displays has received only modest acceptance due to its poor error rate and the unnatural means required for drawing recognizable characters. Reliable speech recognition would be of great benefit to these devices. However, today, and perhaps for the foreseeable future, such devices lack the computing power to implement narrative natural speech-to-text conversion.