Speech recognition has become an important growth sector in the computer industry. The goal of speech recognition is to allow users to interact with computers using natural speech. A variety of disciplines are involved in designing a speech recognition application, including acoustics, signal processing, pattern recognition, phonetics, linguistics, and computer science. Therefore, in order to provide an acceptable level of performance, speech recognition applications are complex programs that typically are computationally intensive. Due to this complexity, early speech recognition software was usually implemented on dedicated speech recognition servers. A typical speech recognition server is a high-performance, rack-mounted computer system that shares the processing load among multiple processors. Thus, initial speech recognition applications were limited to expensive, high-performance computer systems.
More recently, advances in desktop computer performance have facilitated more widespread use of speech recognition technology. Specifically, advances in desktop CPU performance, memory size, and storage capacity enable real-time speech recognition capabilities on desktop-scale devices. Thus, users can dictate documents and navigate the computer desktop using their voice. However, handheld devices, such as cellular telephones and personal digital assistants, do not yet provide the levels of performance required by these types of applications.
Additionally, many users find the process of installing and adjusting speech recognition software burdensome. For example, in order to provide optimal performance, the speech recognition application typically needs to be “trained” to recognize the unique nuances of a particular user's voice. This can involve hours of training and correcting the system in order to attain acceptable speech recognition quality. Thus, many computer users regard speech recognition applications as being impractical either due to their hardware requirements or due to the effort associated with installing and training the software. Typically, a special software interface is required in order to control existing software applications using speech recognition. Therefore, until a software interface is created, users cannot use spoken commands to control their software.
Waveform analysis is a simplified implementation of speech recognition technology. Waveform analysis software is typically used on devices having limited hardware capabilities such as cellular phones and Personal Digital Assistants (PDAs) and is programmed by the user to recognize commands spoken by the user. Waveform analysis software attempts to match a user's spoken command with a previously recorded waveform of the command. If the waveform of the spoken command matches the recorded waveform, the software recognizes the waveform and may perform an action associated with the spoken command. Thus, waveform analysis compares waveforms as a whole rather than analyzing the sound components, known as phonemes, of the spoken command. Furthermore, waveform analysis is typically applied to a set of narrowly focused, simple actions such as voice dialing rather than the more general purpose speech recognition software described above.
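The whole-waveform comparison described above can be illustrated with a minimal sketch. It assumes that the spoken command and each recorded waveform are equal-length lists of samples and that a normalized correlation score with a fixed threshold decides whether two waveforms match; the function names, the 0.9 threshold, and the synthetic tones standing in for recorded commands are all hypothetical and are not taken from any particular device.

```python
import math


def normalized_correlation(a, b):
    """Return the cosine similarity of two equal-length sample sequences."""
    if len(a) != len(b):
        raise ValueError("waveforms must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a silent waveform matches nothing
    return dot / (norm_a * norm_b)


def match_command(spoken, templates, threshold=0.9):
    """Return the name of the best-matching recorded command, or None.

    The waveforms are compared as a whole; no phoneme analysis is done.
    The threshold value is an assumption chosen for illustration.
    """
    best_name, best_score = None, threshold
    for name, recorded in templates.items():
        score = normalized_correlation(spoken, recorded)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def tone(freq_hz, n=200, rate=8000):
    """Synthetic stand-in for a recorded command (hypothetical test data)."""
    return [math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


# Commands "programmed by the user": one recorded waveform per command.
TEMPLATES = {"call home": tone(440), "call office": tone(880)}

# A quieter repetition of the same waveform still matches its template.
quiet_repeat = [0.8 * s for s in tone(440)]
result = match_command(quiet_repeat, TEMPLATES)
```

Because the comparison operates on complete waveforms, a sketch of this kind only recognizes commands that closely resemble a previously recorded waveform, which is consistent with its use for narrowly focused actions such as voice dialing.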
Another application of speech recognition technology involves voice portal services. Voice portals provide users with automated services and information using a voice interface. Typically, a user calls a voice portal and hears a list of options that they can choose from using voice commands. Users calling a voice portal can choose from a variety of information options including stock quotes, weather and traffic reports, horoscopes, movie and television schedules, and airline arrival and departure information. Services provided by voice portals include restaurant reservations, airline ticket purchases, and Internet navigation.
A typical voice portal system utilizes vast arrays of hardware for receiving telephone calls and processing the voice stream in order to perform some action based upon the user's commands. FIG. 1 shows a typical prior art voice portal system. In FIG. 1, a user 110 calls a voice portal server 130 using phone system 120. Phone system 120 can be either a wireless cellular phone system or a Public Switched Telephone Network (PSTN). Upon calling voice portal server 130, a user 110 typically hears a menu of options. User 110 chooses one of the menu options by speaking a command. Voice portal server 130 interprets the spoken command and performs an action in response. This is similar to a previous technology in which a user hears a menu of options and indicates a preference by pressing the touch tone keys on their telephone.
In one embodiment, voice portal server 130 utilizes a Voice Extensible Markup Language (VoiceXML) interpreter comprising an Automated Speech Recognition (ASR) component 131, a Text to Speech (TTS) component 132, an audio play component 133, a Dual-Tone Multi-Frequency (DTMF) component 134, and a telephone network interface 135. ASR component 131 can be speech recognition software as described above. TTS component 132 is used to convert text into audio to be output to user 110. Audio play component 133 controls the playback of pre-recorded audio outputs from server 130, such as audio menus of options available to user 110. DTMF component 134 allows a user to input commands using the touch tone buttons on a telephone to choose from a menu of options. Telephone network interface 135 controls telephone connections between voice portal server 130 and phone system 120. A Voice Over Internet Protocol (VoIP) interface 136 provides an interface to voice applications connecting to voice portal server 130 over an Internet Protocol network connection.
In an exemplary communications session, user 110 utilizes voice portal server 130 as an intermediary to access content or services provided by an Application Web server 150. For example, user 110 initiates communications with voice portal server 130 using phone system 120 and sends a vocal command to server 130, which is interpreted by ASR component 131. A textual version of the vocal command is created by ASR component 131 and a request is sent to Application Web server 150 via Internet 140. Application Web server 150 then sends a reply to voice portal server 130. An audio equivalent of the reply from Application Web server 150 can then be generated by TTS component 132 and audio play component 133 and sent to user 110. Usually, the reply to user 110 is a menu of options from which user 110 indicates a preference in order to continue the session with Application Web server 150.
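The round trip of such a session can be sketched in a few lines. The sketch assumes three pluggable callables standing in for ASR component 131, the request to Application Web server 150, and TTS component 132; the class name, the function names, and the toy menu data are hypothetical, and no real speech recognition, network access, or audio synthesis takes place.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePortalServer:
    """Hypothetical model of the intermediary role of server 130."""
    recognize: Callable[[bytes], str]    # stand-in for ASR component 131
    fetch_reply: Callable[[str], str]    # stand-in for the request to server 150
    synthesize: Callable[[str], bytes]   # stand-in for TTS component 132

    def handle_utterance(self, audio: bytes) -> bytes:
        text = self.recognize(audio)      # vocal command -> textual version
        reply = self.fetch_reply(text)    # request to and reply from server 150
        return self.synthesize(reply)     # reply -> audio returned to user 110


def fake_asr(audio: bytes) -> str:
    """Pretend the audio bytes already spell the command (assumption)."""
    return audio.decode("utf-8").strip().lower()


def fake_app_server(command: str) -> str:
    """Return the next menu of options for a command (toy data)."""
    menus = {
        "weather": "Say today or tomorrow.",
        "stocks": "Say the name of a company.",
    }
    return menus.get(command, "Say weather or stocks.")


def fake_tts(text: str) -> bytes:
    """Pretend that encoding the text produces audio (assumption)."""
    return text.encode("utf-8")


server = VoicePortalServer(fake_asr, fake_app_server, fake_tts)
audio_reply = server.handle_utterance(b"Weather")
```

Each reply in this sketch is itself a menu of options, mirroring the way the user continues the session by choosing from the menu returned in the previous step.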
While voice portal systems provide speech recognition functionality to users, they are limited in the variety of services they can provide. Most voice portal services provide a fixed set of capabilities to their users that are defined by the available menu options. Thus, user options are defined by the service provider and are usually limited to general services and information, such as weather reports and stock quotes. More importantly, the user cannot control or access application programs resident upon their own computers using voice portal services. Another limitation of system 100 is that the output data is only in the form of speech, making it hard to present certain types of data, such as large lists, to the user.
Another emerging speech recognition technology utilizes special purpose speech recognition hardware processors that are typically installed within cellular phones and PDAs. These embedded chips allow a user to initiate actions upon their devices using voice commands. However, this necessitates installing the hardware chip in the user's device, which may not be an option for a device the user is already using due to, for example, a limitation of the device's hardware architecture. Additionally, the hardware chips typically provide a limited range of speech recognition capability and are thus limited to simple applications such as voice dialing.
In order to control the visual interface of an application running on a small form factor handheld electronic device (e.g., a cellular telephone or PDA), a user typically navigates the application using a keypad and/or special soft function keys. This can be tedious and time consuming for users due to the limited nature of the display and software control capabilities built into such devices. For example, a user may have to enter long sequences of key-presses to navigate from the main menu of a cellular telephone to the particular display desired by the user. Alternatively, or in conjunction with a keypad, PDAs can provide handwriting recognition pads to facilitate user input. However, with small form factor devices such as cellular telephones, there is not always enough space to include handwriting recognition pads.