Although practically unheard of, or regarded as science fiction, only a few years ago, automatic electronic voice recognition is now a reality. This technology, while complex, is becoming increasingly popular even in consumer devices.
Digital voice recognition is useful for several reasons. First, it offers a user the possibility of increased productivity at work, as a voice-operated device can be used hands-free. For example, a telephone "voice mail" system that uses voice recognition techniques to receive commands from the user can be operated via a user's voice while the user is looking at other things or performing other duties. Second, operating a device by voice commands is more natural for many people than entering cryptic command codes via a keyboard or keypad, such as one on a telephone. Operating a device by voice may seem slightly unnatural at first, as it is a new technology, but most people have been found to acclimate quickly. Finally, when a device is operated with spoken commands, and the user is addressed via a synthesized voice, there is a reduced need to memorize a complex set of commands. Voice commands can be set up using natural phrases, such as "retrieve messages" or "erase," and not sequences of numeric codes and "*" and "#" symbols, as would be necessary on a traditional telephone keypad.
The increase in the popularity of voice recognition systems has been facilitated by a number of technical advances, as well. Only recently has it become possible for a relatively cost-effective consumer-oriented device to perform a satisfactory level of voice recognition.
Over the last several years, there have been order-of-magnitude increases in computer performance. It is now possible for a relatively simple special-purpose digital computer to perform the kinds of mathematical calculations and signal processing operations necessary to accomplish voice recognition in real time. In the past, satisfactory voice recognition called for substantial amounts of processing time above and beyond that required to digitally capture the speech.
There have also been extremely significant decreases in price. Powerful special-purpose digital signal processing computer chips are now available at prices that make real-time voice recognition possible in low-priced consumer articles. The cost of other digital components, particularly memory, has also decreased drastically within the last several years.
Finally, there have also been great improvements and refinements in the signal processing algorithms used to accomplish voice recognition. Much research in this area has been undertaken within the last ten to fifteen years, and the refined algorithms now preferred for voice recognition have only recently been developed.
There are numerous types of voice recognition systems in development and use today. These types can be distinguished by several characteristics: vocabulary size, speaker dependency, and continuous versus discrete speech recognition.
Large vocabulary voice recognition systems are typically used for dictation and complex control applications. These systems still require a large amount of computing power. For example, large vocabulary recognition can only be performed on a computer system comparable to those typically used as high-end personal or office computers. Accordingly, large vocabulary recognition is still not well-suited for use in consumer products.
However, small-vocabulary voice recognition systems are still useful in a variety of applications. A relatively small number of command words or phrases can be used to operate a simple device, such as a telephone or a telephone answering machine. Traditionally, these devices have been operated via a small control panel. Accordingly, the functions performed by entering codes on the device's control panel can also be performed upon receiving an appropriate voice command. Because only a small number of words and phrases are understood by such a system, a reduced amount of computer processing capability is necessary to perform the mathematical operations required to identify any given spoken command. Thus, low-cost special-purpose digital signal processor chips can be used in consumer goods to implement such a small-vocabulary voice recognition system.
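The relationship between vocabulary size and processing load can be illustrated with a minimal sketch. It assumes, purely for illustration, that each command is represented by a single stored feature vector and that recognition picks the closest one; the names `recognize` and `TEMPLATES`, the three-element vectors, and the Euclidean distance measure are all hypothetical and are not taken from any particular system described here.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(features, templates):
    """Return (command, distance) for the stored template closest to `features`.

    One distance is computed per vocabulary entry, so the work grows
    linearly with the number of commands the system understands.
    """
    best_command = None
    best_distance = math.inf
    for command, template in templates.items():
        d = distance(features, template)
        if d < best_distance:
            best_command, best_distance = command, d
    return best_command, best_distance


# Hypothetical three-command vocabulary with made-up feature vectors.
TEMPLATES = {
    "retrieve messages": [0.9, 0.1, 0.4],
    "erase": [0.2, 0.8, 0.5],
    "stop": [0.5, 0.5, 0.9],
}

command, _ = recognize([0.25, 0.75, 0.5], TEMPLATES)
print(command)
```

With only a handful of templates, each spoken command costs a handful of distance computations, which is the kind of workload a low-cost signal processor can plausibly handle.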
Some voice recognition systems are known as "speaker-independent," while others are considered "speaker-dependent." Speaker-independent systems include generic models of the words and phrases that are to be recognized. Such systems need not be "trained" to understand a particular speaker's voice. However, because of this, a user's unusual accents or speech patterns may result in reduced recognition accuracy. On the other hand, speaker-dependent systems require some level of training. That is, the system requires a user to recite several words, or to speak for several minutes, so that the system can adapt its internal word models to match the user's particular manner of speaking. This approach usually results in improved recognition accuracy, but the necessary training before use can be tedious or inconvenient. Moreover, if multiple users will be using a speaker-dependent system, the device must provide for the storage of multiple user voice models, and each user must train the device separately.
Two final categories of voice recognition systems are those systems capable of recognizing continuous speech and those systems only capable of recognizing discrete speech. Continuous speech recognition is most often useful for natural language dictation systems. However, as continuous speech often "runs together" into a single long string of sounds, additional computing resources must be devoted to determining where individual words and phrases begin and end. This process typically requires more processing ability than would be present in a typical low-cost consumer product.
Discrete speech recognition systems require a short pause between each word or phrase to allow the system to determine where words begin and end. However, it should be noted that it is not necessary for each word to be pronounced separately; a small number of short command phrases can be treated as discrete speech for purposes of voice recognition.
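The role of the pause can be sketched as follows. This is a simplified illustration, assuming that a sample counts as "quiet" when its absolute amplitude falls below a fixed threshold and that a run of quiet samples of a given length marks the end of a word or phrase; the function name `split_on_pauses` and both parameter values are hypothetical. A practical system would more likely measure energy over short frames than inspect individual samples.

```python
def split_on_pauses(samples, threshold=0.1, min_pause=3):
    """Split `samples` into utterances separated by pauses.

    A pause is `min_pause` or more consecutive samples whose absolute
    value is below `threshold`. Shorter quiet gaps stay inside the
    utterance, so a short command phrase is kept as one unit.
    """
    utterances = []
    current = []    # samples of the utterance in progress
    quiet_run = 0   # consecutive quiet samples at the end of `current`
    for s in samples:
        if abs(s) >= threshold:
            current.append(s)
            quiet_run = 0
        elif current:
            current.append(s)
            quiet_run += 1
            if quiet_run >= min_pause:
                # Pause detected: emit the utterance without the trailing quiet.
                utterances.append(current[:len(current) - quiet_run])
                current = []
                quiet_run = 0
    if current:
        # Input ended mid-utterance: emit what we have, minus trailing quiet.
        utterances.append(current[:len(current) - quiet_run])
    return utterances


signal = [0, 0, 0.5, 0.6, 0.0, 0.4, 0, 0, 0, 0, 0.7, 0.8, 0, 0]
print(split_on_pauses(signal))
```

Note that the single quiet sample between 0.6 and 0.4 does not split the first utterance, which mirrors the point that a short phrase can be treated as one discrete unit.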
While there are advantages to large-vocabulary, speaker-independent, continuous speech recognition systems, it is observed that several compromises must be made to facilitate the use of voice recognition in low-cost consumer articles. Accordingly, it is recognized that small-vocabulary, speaker-dependent, discrete speech recognition systems and methods are still useful in a variety of applications, as discussed above. Even so, additional compromises are necessary to permit the efficiencies in manufacturing and use that would allow such systems to gain acceptance among consumers.
For example, most speech recognition systems use large amounts of memory for various purposes. Buffers are needed to store incoming sampled voice information, as well as to store intermediate versions of processed voice information before recognition is accomplished. These buffers are constantly written and rewritten during training and recognition processing to accommodate voice input, update voice models, alter internal variables, and serve other purposes. In most cases, static random-access memory ("static RAM") has traditionally been used in this application; it will be discussed in further detail below.
The traditional low-cost digital memory devices used in most digital voice storage and recognition applications have a significant disadvantage. When power is removed, the memory contents are permanently lost. For example, the least expensive type of digital memory usable for audio recording and processing is dynamic random-access memory ("dynamic RAM"). Audio-grade dynamic RAM, which may be partially defective (and thus not usable in data-storage applications), is known as ARAM. When power is disconnected from ARAM, the memory contents are lost. Moreover, ARAM must be periodically "refreshed" by electrically stimulating the memory cells. For these reasons, a battery backup must be provided to preserve ARAM contents when the device is removed from its primary power source. This is inconvenient for the user and adds bulk and expense to a device that uses ARAM. Moreover, additional circuitry may be needed to provide the required refresh signals to the ARAM.
Despite their disadvantages, ARAM devices are in relatively high demand because of their low price point. Accordingly, ARAM devices are sometimes in short supply, causing their price advantage to be nullified.
Static RAM is also a type of volatile digital memory. Static RAM typically provides very fast memory access, but is also power-consuming and expensive. No refresh signals are necessary, but like dynamic RAM, power must be continually supplied to the device, or memory contents will be permanently lost.
With both of the foregoing types of volatile digital memory, speaker-dependent training data and other vital system information can be lost in a power failure unless battery backup is provided. If speaker-dependent training data is lost, the system must be re-trained by each user before it can be used again. As discussed above, training can be inconvenient and tedious, and it may take at least a few minutes.
Several types of non-volatile memory, or memory that retains its contents when power is removed, are also available. EEPROM, or Electrically Erasable Programmable Read-Only Memory, is expensive in the quantities and densities necessary for audio storage and processing. So-called bubble memory is also available; it, too, is expensive, and is generally too slow for advantageous use in audio applications. Finally, flash memory is available. Traditionally, flash memory has been expensive, and very slow to erase and to write. In recent years, the time required to program flash memory has been reduced, and it is now usable for audio recording and processing systems. However, flash memory is subject to a burnout effect. After a limited number of re-writes to a portion of the storage device, that portion of the device will wear out and become unusable.
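The burnout effect can be made concrete with a small model. The sketch below assumes a hypothetical `FlashBlock` that tolerates a fixed number of rewrites before failing, together with a helper that estimates how long a block lasts at a given rewrite rate. The endurance and rate figures are illustrative assumptions only and are not drawn from any datasheet or from the discussion above.

```python
class WornOutError(Exception):
    """Raised when a block has exceeded its rewrite endurance."""


class FlashBlock:
    """Toy model of one flash region with a limited number of rewrites."""

    def __init__(self, endurance):
        self.endurance = endurance  # rewrites tolerated (assumed figure)
        self.rewrites = 0           # rewrites performed so far
        self.data = None

    def rewrite(self, data):
        """Store `data`, or fail if the block is already worn out."""
        if self.rewrites >= self.endurance:
            raise WornOutError("block is worn out")
        self.rewrites += 1
        self.data = data


def seconds_until_wearout(endurance, rewrites_per_second):
    """Time a block survives if it is rewritten at a constant rate."""
    return endurance / rewrites_per_second


# Illustrative numbers only: 100,000 rewrites at 50 rewrites per second.
print(seconds_until_wearout(100_000, 50))  # 2000.0 seconds
```

Under these assumed figures, a region rewritten as often as a working buffer would wear out in well under an hour of operation, which is why constantly rewritten buffers are a poor fit for flash memory unless rewrites are somehow limited.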
The problems inherent in using volatile digital memory can be solved by combining a quantity of non-volatile memory with the usual volatile memory. However, this solution is disadvantageous in that it increases the component count, and therefore increases manufacturing expenses. Separate volatile and non-volatile memory components would be necessary when such a solution is used.
In light of the disadvantages of the various volatile and non-volatile digital storage options for voice recognition processing, there is a recognized need for a low-cost voice recognition system that is capable of using low-cost non-volatile memory for substantially all of its storage requirements. Such a system should be able to accommodate a relatively small vocabulary of commands for the control of an electronic device, such as a telephone answering device. Such a system should also be durable and resistant to memory burnout effects.