1. Technical Field
A “Voice Search Message Service” provides various techniques for sending or responding to text messages based on a user speech input, and in particular, various techniques for selecting one or more pre-defined probabilistic responses that most closely match an arbitrary user speech input, and then automatically sending one of those responses as a text message.
2. Related Art
Voice search is a technology underlying many spoken dialog systems (SDSs) that provide users with the information they request via a spoken query. The information normally exists in a large database, and the spoken query is compared with a field in the database to obtain the relevant information. The contents of the field, such as business or product names, are often unstructured text. For example, directory assistance is a popular voice search application in which users issue a spoken query and an automated system returns the phone number and/or address information of a business or an individual. Other voice search applications include music/video management, business and product reviews, stock price quotes, and conference information systems.
In general, typical voice search systems operate by first attempting to recognize a user's utterance with an automatic speech recognizer (ASR) that utilizes an acoustic model, a pronunciation model, and a language model. The m-best results returned by the ASR are then passed to a search component to obtain the n-best semantic interpretations; i.e., a list of up to n entries in the database. The interpretations are then passed to a dialog manager that uses confidence measures, which indicate the certainty of the interpretations, to decide how to present the n-best results. If the system has high confidence in a few entries, it directly presents them to the user. Otherwise, these types of voice search systems generally interact with the user to understand what the user actually needs, or to correct any speech recognition errors.
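By way of illustration only, the flow described above (m-best ASR results, a search component producing n-best database entries, and a confidence-based dialog manager) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular voice search system: the names `Hypothesis`, `token_overlap`, `search`, and `dialog_manager` are hypothetical, the word-overlap score stands in for whatever matching a real search component performs, and the 0.5 threshold is an arbitrary illustrative value.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One entry of the m-best list returned by a (hypothetical) ASR."""
    text: str
    score: float  # assumed ASR confidence in [0, 1]


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def search(hypotheses, entries, n):
    """Return up to n (entry, confidence) pairs, highest confidence first.

    The confidence of an entry is the best product of ASR score and
    word overlap over all hypotheses (an assumption for illustration).
    """
    scored = []
    for entry in entries:
        conf = max(
            (h.score * token_overlap(h.text, entry) for h in hypotheses),
            default=0.0,
        )
        if conf > 0:
            scored.append((entry, conf))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


def dialog_manager(n_best, threshold=0.5):
    """Present confident entries directly; otherwise ask the user to clarify."""
    confident = [entry for entry, conf in n_best if conf >= threshold]
    if confident:
        return ("present", confident)
    return ("clarify", [entry for entry, _ in n_best])


# Example: two ASR hypotheses searched against a small directory.
m_best = [Hypothesis("sean jones", 0.9), Hypothesis("john jones", 0.6)]
directory = ["Sean Jones", "John Jones", "Mary Smith"]
n_best = search(m_best, directory, n=2)
action = dialog_manager(n_best)
```

In this sketch, raising the threshold moves the dialog manager from presenting results directly to interacting with the user, which mirrors the two branches described above.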
Unfortunately, one of the limitations of these types of speech enabled applications is the need for accurate speech recognition. For example, if the user speaks the name “Sean Jones” and the system recognizes that name as “John Jones”, the system will return incorrect information from the database. In other words, speech enabled applications generally require accurate speech recognition in order to provide accurate results or responses. Further, it is well known that speech recognition accuracy generally improves as the available computing power increases, and degrades as noise levels rise. Consequently, typical speech enabled applications are not well suited to devices where computing power is limited (such as a typical mobile phone or the like) or to noisy environments, such as a car being driven on the highway.
Some of the problems of accurate speech recognition can be alleviated in “voice command” type systems, where the user is limited to speaking only a set of predefined commands or words (e.g., “one”, “two”, “three”, “stop”, “start”, “skip”, etc.), by using a strict context-free grammar (CFG) based language model or the like. In this case, the system is less likely to return an error in speech recognition since the possible set of acceptable values (i.e., particular speech utterances) is severely constrained relative to natural spoken language. Unfortunately, in the case of voice command type applications, it is not practical for the user to remember exactly what to say, as current voice command technology demands, especially when the list of specific voice commands grows beyond a few simple entries. Consequently, the utility of such systems is limited with respect to applications such as text messaging, where the user may use arbitrary speech to respond to an arbitrary text message.
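The constraint described above can be illustrated with a deliberately simplified sketch. It assumes the recognizer has already produced text, and it reduces the grammar to a fixed set of single-word commands, which is far simpler than a real CFG based language model; the names `COMMANDS` and `recognize_command` are hypothetical.

```python
# Hypothetical command vocabulary, taken from the examples in the text.
COMMANDS = frozenset({"one", "two", "three", "stop", "start", "skip"})


def recognize_command(utterance: str):
    """Return the matching command, or None if the utterance is not in the set."""
    normalized = " ".join(utterance.lower().split())
    return normalized if normalized in COMMANDS else None


accepted = recognize_command("  Stop ")        # exact command, accepted
rejected = recognize_command("please stop")    # natural phrasing, rejected
```

The second call shows the limitation noted above: an utterance that departs even slightly from the predefined commands is rejected, so the user must remember exactly what to say.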
Text messages generally include short message service (SMS) type text messages or other text messages transmitted by the user from a mobile phone or other portable or mobile communications device. Sending or replying to text messages on mobile devices, especially while driving, is a challenging problem for a number of reasons. In fact, in many locations, such as California, it is illegal for a driver to type text messages while driving. Further, even with the help of speech recognition, there is no known practical yet safe user interface that can recover from speech recognition mistakes without dangerously distracting the driver. Consequently, speech recognition for use in dictating specific text messages is not generally practical in such environments.
For example, in the case of speech enabled applications that require accurate recognition of speech dictation by the user, typical dictation style speech correction user interfaces are simply too demanding of user attention, and thus too dangerous to be considered while the user is driving or piloting a vehicle. In particular, typical user interfaces for correction of speech recognition errors generally require the user either to repeat particular words or phrases until those words or phrases are eventually correctly recognized, or to interact with a user interface, such as a display device, to manually select or input text corresponding to the corrected speech dictation of the user. In either case, this type of speech correction user interface is simply too demanding of user attention to be considered safe when any distraction to the user could pose a danger to the user or to others.