Search engines are very robust in dealing with misspelled queries. Large search engines contain components dedicated to analyzing the history of user behavior in order to improve accuracy in handling misspelled query terms; in particular, previous queries can be used to handle spelling mistakes. For example, if a user misspells a query as “Sun Francisco,” a typical search engine would recognize “Sun Francisco” as “San Francisco” because the mistake has likely been made before by a number of past users. The search engine creates associations between query terms and the page that the user deems most relevant by recording which link the user clicked on. Alternatively, the search engine attempts to correct the user's mistake explicitly by providing suggestions in the “Did you mean:” format. Clicking on a “Did you mean: . . . ” link results in a new search with the suggested alternate spelling.
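The query-history behavior described above can be sketched as follows. This is only an illustration under stated assumptions, not a description of how any particular search engine works: the `QueryLog` class, its method names, and the `min_count` threshold are hypothetical, and the sketch assumes the log records which query a user issued right after a misspelled one.

```python
from collections import Counter, defaultdict


class QueryLog:
    """Hypothetical log of (typed query -> follow-up query) pairs."""

    def __init__(self):
        # Maps a lower-cased typed query to counts of the queries users ran next.
        self._rewrites = defaultdict(Counter)

    def record_rewrite(self, typed, followed_up):
        """Record that a user typed `typed` and then searched for `followed_up`."""
        self._rewrites[typed.lower()][followed_up] += 1

    def suggest(self, query, min_count=2):
        """Return the most common follow-up query, or None if evidence is thin."""
        candidates = self._rewrites.get(query.lower())
        if not candidates:
            return None
        best, count = candidates.most_common(1)[0]
        return best if count >= min_count else None


log = QueryLog()
for _ in range(3):
    log.record_rewrite("Sun Francisco", "San Francisco")
log.record_rewrite("Sun Francisco", "Sun Valley")

# Candidate text for a "Did you mean:" link.
suggestion = log.suggest("sun francisco")
```

The threshold reflects the idea that a correction is trusted only because a number of past users made the same mistake; a single follow-up query would not be enough in this sketch.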
Search engines are also capable of “disambiguating” or parsing queries to provide results of higher relevance. One indication that a result is relevant occurs when a user clicks on the result in a search page and then does not come back to the search page. Search engines are capable of providing relevant results by extracting logical meaning from queries. For example, an input query of the form “best city street restaurant cuisine” can actually be interpreted by the search engine as “What is the best cuisine restaurant in city on street.” More specifically, an unstructured query such as “best Indian restaurant Potrero Hill” results in the search engine constructing a complex logical statement. The query is interpreted as a search for “Indian restaurants” within the “Potrero Hill” neighborhood in San Francisco. The query is run against business listings within the “Potrero Hill” neighborhood and the results are returned to the user. Moreover, every word in the query may be misspelled, because the search engine is capable of recognizing and correcting the misspellings. The search engine does not use spell checking to correct the misspellings; instead, it looks at past user behavior, i.e., relevant links are obtained by looking at pages previously established as relevant for previous queries of similar types.
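As a rough illustration of turning an unstructured query into a logical statement, the sketch below maps the example query onto a structured filter. Everything in it is an assumption made for illustration: the lookup tables, the `interpret` function, and the field names are hypothetical, and a real search engine would rely on far richer data and on past user behavior rather than fixed word lists.

```python
# Hypothetical lookup tables; a real system would derive these from listings data.
CUISINES = {"indian", "thai", "italian"}
CATEGORIES = {"restaurant", "cafe"}
NEIGHBORHOODS = {"potrero hill": "San Francisco", "mission": "San Francisco"}
RANKING_WORDS = {"best", "top"}


def interpret(query):
    """Turn an unstructured query into a structured filter (toy rules only)."""
    tokens = query.lower().split()
    parsed = {"cuisine": None, "category": None,
              "neighborhood": None, "city": None, "rank_by_rating": False}
    i = 0
    while i < len(tokens):
        # Try a two-word neighborhood name first, e.g. "potrero hill".
        pair = " ".join(tokens[i:i + 2])
        if len(tokens) - i >= 2 and pair in NEIGHBORHOODS:
            parsed["neighborhood"] = pair
            parsed["city"] = NEIGHBORHOODS[pair]
            i += 2
            continue
        token = tokens[i]
        if token in NEIGHBORHOODS:
            parsed["neighborhood"] = token
            parsed["city"] = NEIGHBORHOODS[token]
        elif token in CUISINES:
            parsed["cuisine"] = token
        elif token in CATEGORIES:
            parsed["category"] = token
        elif token in RANKING_WORDS:
            parsed["rank_by_rating"] = True
        i += 1
    return parsed


structured = interpret("best Indian restaurant Potrero Hill")
```

The resulting dictionary plays the role of the logical statement: it could be used to filter business listings by neighborhood and cuisine and then sort them by rating.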
Voice recognition systems (VRS) are faced with the same set of challenges as search engines. At a high level, voice recognition systems attempt to map differing user word pronunciations to a “dictionary” or a “grammar.” Voice recognition accuracy is inversely related to the size of the grammar (i.e., the number of distinct utterances that the voice recognition system is supposed to recognize). Specifically, the more limited the grammar, the more accurate the voice recognition system is.
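The effect of grammar size on accuracy can be illustrated with a toy stand-in. The sketch below is not a speech recognizer: it assumes that string similarity (via Python's `difflib`) can loosely play the role of acoustic similarity, and the `recognize` function and both word lists are hypothetical.

```python
import difflib


def recognize(heard, grammar):
    """Return the grammar entry most similar to the (noisy) heard text."""
    matches = difflib.get_close_matches(heard, sorted(grammar), n=1, cutoff=0.0)
    return matches[0] if matches else None


SMALL_GRAMMAR = {"yes", "no"}
LARGE_GRAMMAR = {"yes", "no", "yams", "nose", "know"}

# Pretend the speaker said "yes" but the captured input was distorted.
heard = "yas"

small_result = recognize(heard, SMALL_GRAMMAR)
large_result = recognize(heard, LARGE_GRAMMAR)
```

With these toy lists, the two-word grammar resolves the distorted input to “yes”, while the larger grammar prefers “yams” because it offers a closer-looking competitor. More entries mean more chances for a wrong entry to score higher than the intended one.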
Typically, a voice recognition system would handle a query such as the one given in the previous example (searching for a restaurant) by limiting a grammar through the use of a decision tree. When using a decision tree, a VRS uses a hierarchy to reduce both grammar and vocabulary. At every level of the hierarchy, the grammar used to recognize speech is limited, and therefore the vocabulary is limited.
For example, a voice recognition system dialog may proceed as follows. The voice recognition system initially asks a user “What can I do for you?” The user may reply, “I want to find a restaurant.” The VRS may process this answer against a grammar that is limited to a selected set of “top-level” terms. In this example, the grammar against which the first response is processed may include the word “restaurant”. In response to recognizing the word “restaurant”, the VRS may respond: “I hear a restaurant”, and then ask “In what city?” The user may respond “San Francisco”. The VRS then attempts to identify “San Francisco” using a grammar that is limited to names of cities. The VRS may then ask: “In what neighborhood?” When the user responds, the VRS may then attempt to identify the neighborhood specified in the user's answer using a grammar that is limited to names of neighborhoods within San Francisco. This process may be repeated until the VRS has enough information about the question to finally provide an answer.
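The dialog above can be sketched as a walk down a decision tree in which each step uses a smaller grammar chosen by the previous answers. This is a minimal sketch under stated assumptions: the grammars, the `recognize` stand-in (which matches text rather than audio), and the `run_dialog` function are all hypothetical.

```python
# Hypothetical grammars for each level of the decision tree.
TOP_LEVEL = {"restaurant", "movie"}
CITIES = {"san francisco", "oakland"}
NEIGHBORHOODS = {
    "san francisco": {"potrero hill", "mission"},
    "oakland": {"rockridge"},
}


def recognize(utterance, grammar):
    """Stand-in for a recognizer: find a grammar phrase inside the utterance."""
    text = utterance.lower()
    # Prefer longer phrases so multi-word entries win over shorter ones.
    for phrase in sorted(grammar, key=len, reverse=True):
        if phrase in text:
            return phrase
    return None


def run_dialog(answers):
    """Walk the tree: topic, then city, then a neighborhood within that city."""
    answers = iter(answers)
    topic = recognize(next(answers), TOP_LEVEL)      # "What can I do for you?"
    if topic is None:
        return None
    city = recognize(next(answers), CITIES)          # "In what city?"
    if city is None:
        return None
    # The neighborhood grammar is limited to the city that was just recognized.
    neighborhood = recognize(next(answers), NEIGHBORHOODS[city])  # "In what neighborhood?"
    return {"topic": topic, "city": city, "neighborhood": neighborhood}


result = run_dialog([
    "I want to find a restaurant",
    "San Francisco",
    "Potrero Hill",
])
```

Each call to `recognize` sees only the grammar for its level, which keeps each recognition step small, but the user must supply three separate answers before any result can be produced.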
As illustrated by the preceding dialog, a VRS that uses a decision tree forces the user to answer multiple questions to traverse the decision tree until finally arriving at an answer. Such an interface is frustrating because users have to wait for the appropriate prompt in the dialog instead of speaking the query naturally. Users find it easier to give all of the pertinent information in one complex statement.
Another way a VRS may improve accuracy is by reducing the number of responses a user may give, effectively reducing the size of the dictionary. A VRS in which there are only three possible responses, “Yes”, “No”, and “I don't know”, is far more accurate than a VRS in which the number of options is unbounded. Limiting the number of responses a user may give naturally creates hierarchical application structures that continuously reduce the “grammars”. Such limitations are awkward for the user, because such rigid structures are not normally used in conversation and therefore seem inefficient.
Accurate voice recognition requires large computational resources, and therefore current voice recognition implementations are very costly. In phone-based systems, every step in traversing a decision tree may cost 1 to 1.5 cents. Computational resources are expended on comparing the captured waveform to previously recorded waveform samples. Further, the input samples of a VRS are more variable than, for example, the input of text-based search engines.
General purpose voice recognition systems are fairly accurate, but at times a VRS can make major errors. The errors stem from the fact that the VRS performs relevance analysis based on which words belong next to each other. For example, if one were to say “I'm better”, a typical VRS would look for words that typically follow, such as “than” or “with”. Thus, the voice recognition system attempts to simplify the recognition computation by using the vocabulary that is typically used around a specific keyword. On the other hand, search engines cannot rely on such logic because many queries are seemingly random. For example, consider the query “What did ayatollah do?” A voice recognition system most likely will not have the word “ayatollah” in its vocabulary, and therefore will not recognize the word. However, a search engine would be able to recall relevant pages even if the word “ayatollah” were misspelled.
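The word-adjacency logic described above can be sketched with a simple table of which words have been seen following each word. This is an illustration under stated assumptions, not the method of any particular VRS: the three-sentence corpus and the `build_followers` and `candidates_after` functions are hypothetical.

```python
from collections import defaultdict


def build_followers(corpus):
    """Map each word to the set of words observed immediately after it."""
    followers = defaultdict(set)
    for sentence in corpus:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            followers[previous].add(following)
    return followers


# Tiny hypothetical training corpus.
corpus = [
    "I'm better than before",
    "I'm better with practice",
    "better than nothing",
]
followers = build_followers(corpus)


def candidates_after(word):
    """Return the restricted vocabulary to consider after `word`."""
    return followers.get(word.lower(), set())


after_better = candidates_after("better")
after_did = candidates_after("did")
```

After “better”, the recognizer in this sketch only needs to consider “than” and “with”. For a word sequence that never appeared in the corpus, such as the one in “What did ayatollah do?”, the candidate set is empty, which mirrors the failure case described above.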
Previous attempts to improve the accuracy of a voice recognition system focused on a combination of approaches, such as changing the structure of the voice recognition application to reduce the vocabulary of the user, restricting the grammar of the user, or expanding the number of words the voice recognition system is able to recognize. Such approaches are limited because the voice recognition system, once compiled, is fixed and cannot update dynamically; at most, it can tailor itself to a particular speaker.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.