Previously, electronic devices were controlled by manual input from a human user. More recently, voice-controlled devices have been introduced, including computers having voice control software that allows control by a human user's speech.
In voice-controlled devices, it is desirable to store phrases under voice control. As used herein, a phrase is a single word or a group of words treated as a unit. Phrases might be stored to set options or to create personalized settings. Once stored, a phrase can later be used as a command, or with a command as a data field or other object. A command is usually provided by a user to control a device. For example, in a voice-controlled telephone, it is desirable to store people's names and phone numbers under voice control in a personalized phone book. At a later time, this phone book can be used to call people by speaking their names (e.g., “Cellphone call John Smith” or “Cellphone call Mother”).
Prior art approaches to storing the phrase (“John Smith”) operate by storing the phrase in a compressed, uncompressed, or transformed manner that attempts to preserve the actual sound. Detection of the phrase in a command (i.e., detecting that John Smith is to be called in the example above) then relies on a sound-based comparison between the original stored speech sound and the spoken command. Sometimes the stored waveform is transformed into the frequency domain and/or is time adjusted to facilitate the match, but in any case the fundamental operation being performed is one that compares the actual sounds. Storing a sound representation and comparing against it for detection suffers from a number of disadvantages. If a speaker's voice changes, perhaps due to a cold, stress, fatigue, a noisy or distorting telephone connection, or other factors, the comparison typically is not successful and stored words are not recognized. Because the word or phrase is stored as a sound representation, there is no way to extract a text-based representation of the word or phrase; a phrase stored as sound is strictly sound based. Additionally, storing a sound representation results in a speaker dependent system. It is unlikely that another user could speak the same word or phrase in a command, using sufficiently similar sounds, and have it be correctly recognized. It would not be reliable, for example, for a secretary to store phonebook entries and a manager to make calls using those entries. It is desirable to provide a speaker independent storage means.
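The sound-based comparison with time adjustment described above can be illustrated with a minimal sketch. It assumes dynamic time warping (DTW) over a one-dimensional feature sequence as a stand-in for the comparison; the source does not name a specific algorithm, and the function names, feature values, and threshold below are hypothetical. The sketch shows that time adjustment absorbs a change in speaking rate, but a shift in the sound itself (as with a different speaker or a cold) can still defeat the match.

```python
def dtw_distance(stored, spoken):
    """Dynamic time warping distance between two sequences of floats.

    Assumption: each float stands in for one frame of a sound feature.
    """
    inf = float("inf")
    n, m = len(stored), len(spoken)
    # cost[i][j] = best cumulative cost aligning stored[:i] with spoken[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = abs(stored[i - 1] - spoken[j - 1])
            cost[i][j] = frame_cost + min(
                cost[i - 1][j],      # stored frame repeated in time
                cost[i][j - 1],      # spoken frame repeated in time
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def sound_match(stored, spoken, threshold=1.0):
    """Hypothetical accept/reject rule: match if the warped distance is small."""
    return dtw_distance(stored, spoken) <= threshold


# Hypothetical feature sequences for one stored phrase.
stored_phrase = [1.0, 2.0, 3.0, 2.0]
same_voice_slower = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0]  # same sounds, stretched in time
shifted_voice = [1.5, 2.5, 3.5, 2.5]                # same timing, different sounds
```

Under these assumptions the slower utterance still matches, while the shifted one is rejected even though the same phrase was spoken.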
Additionally, if the words or phrases are stored as sound representations, the stored phrases cannot be used in another speech recognition device unless the same waveform processing algorithms are used by both devices. Thus, transferring data associated with the stored sound phrases, such as phone numbers in a phonebook, between devices such as a cellphone and an electronic organizer is impractical unless the devices use the exact same speech recognition engine. It is desirable to recognize spoken words or phrases and store them in a representation such that, once stored, the phrases can be used for speaker independent recognition, can be used by multiple devices, and can be merged with the representations of other devices. Additionally, it is desirable to store information in text form and to use it later in voice commands. For example, a text-based phonebook from a personal computer or organizer might be loaded into a cellphone with the text-based representation of the name John Smith and his phone number. In this case, it is desirable for any arbitrary speaker to be able to place a call using voice control (e.g., “Cellphone call John Smith”).
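The text-based storage described above can be sketched as follows. The sketch assumes that a speaker independent recognizer (not shown) has already converted the spoken command to text; the phonebook contents, phone numbers, command prefix, and function name are all hypothetical and serve only to illustrate lookup and merging of text-based entries.

```python
from typing import Optional

# Hypothetical text-based phonebooks, e.g. one loaded from a personal
# computer or organizer and one already present in the cellphone.
organizer_phonebook = {"john smith": "555-0100"}
cellphone_phonebook = {"mother": "555-0101"}

# Text entries from different devices merge with an ordinary dictionary
# merge; no shared waveform processing is needed.
phonebook = {**organizer_phonebook, **cellphone_phonebook}

COMMAND_PREFIX = "cellphone call "  # assumed command wording


def lookup_call_target(recognized_text: str, entries: dict) -> Optional[str]:
    """Return the phone number named in a recognized command, or None.

    `recognized_text` is assumed to be the text output of a recognizer.
    """
    # Normalize case and whitespace so the match depends only on the words.
    text = " ".join(recognized_text.lower().split())
    if not text.startswith(COMMAND_PREFIX):
        return None
    name = text[len(COMMAND_PREFIX):]
    return entries.get(name)
```

Because the lookup operates on text rather than on stored sound, the result in this sketch does not depend on who spoke the command, only on what the recognizer produced.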