The present invention relates generally to the navigation of electronic data by means of spoken natural language requests, and to feedback mechanisms and methods for resolving the errors and ambiguities that may be associated with such requests.
As global electronic connectivity continues to grow, and the universe of electronic data potentially available to users continues to expand, there is a growing need for information navigation technology that allows relatively naive users to navigate and access desired data by means of natural language input. In many of the most important markets, including the home entertainment arena as well as mobile computing, spoken natural language input is highly desirable, if not ideal. As just one example, the proliferation of high-bandwidth communications infrastructure for the home entertainment market (cable, satellite, broadband) enables delivery of movies-on-demand and other interactive multimedia content to the consumer's home television set. For users to take full advantage of this content stream ultimately requires interactive navigation of content databases in a manner that is too complex for user-friendly selection by means of a traditional remote-control clicker. Allowing spoken natural language requests as the input modality for rapidly searching and accessing desired content is an important objective for a successful consumer entertainment product in a context offering a dizzying range of database content choices. As further examples, this same need to drive navigation of (and transaction with) relatively complex data warehouses using spoken natural language requests applies equally to surfing the Internet/Web or other networks for general information, multimedia content, or e-commerce transactions.
In general, the existing navigational systems for browsing electronic databases and data warehouses (search engines, menus, etc.) have been designed without navigation via spoken natural language as a specific goal. So today's world is full of existing electronic data navigation systems that do not assume browsing via natural spoken commands, but rather assume text and mouse-click inputs (or in the case of TV remote controls, even less). Simply recognizing voice commands within an extremely limited vocabulary and grammar, the spoken equivalent of button/click input (e.g., speaking "channel 5" selects TV channel 5), is really not sufficient by itself to satisfy the objectives described above. In order to deliver a true "win" for users, the voice-driven front-end must accept spoken natural language input in a manner that is intuitive to users. For example, the front-end should not require learning a highly specialized command language or format. More fundamentally, the front-end must allow users to speak directly in terms of what the user ultimately wants (e.g., "I'd like to see a Western film directed by Clint Eastwood"), as opposed to speaking in terms of arbitrary navigation structures (e.g., hierarchical layers of menus, commands, etc.) that are essentially artifacts reflecting constraints of the pre-existing text/click navigation system. At the same time, the front-end must recognize and accommodate the reality that a stream of naive spoken natural language input will, over time, typically present a variety of errors and/or ambiguities: e.g., garbled/unrecognized words (did the user say "Eastwood" or "Easter"?) and under-constrained requests ("Show me the Clint Eastwood movie"). An approach is needed for handling and resolving such errors and ambiguities in a rapid, user-friendly, non-frustrating manner.
What is needed is a methodology and apparatus for rapidly constructing a voice-driven front-end atop an existing, non-voice data navigation system, whereby users can interact by means of intuitive natural language input not strictly conforming to the step-by-step browsing architecture of the existing navigation system, and wherein any errors or ambiguities in user input are rapidly and conveniently resolved. The solution to this need should be compatible with the constraints of a multi-user, distributed environment such as the Internet/Web or a proprietary high-bandwidth content delivery network; a solution contemplating one-at-a-time user interactions at a single location is insufficient, for example.
The present invention addresses the above needs by providing a system, method, and article of manufacture for navigating network-based electronic data sources in response to spoken input requests. When a spoken input request is received from a user, it is interpreted, such as by using a speech recognition engine to extract speech data from acoustic voice signals, and using a language parser to linguistically parse the speech data. The interpretation of the spoken request can be performed on a computing device locally with the user or remotely from the user. The resulting interpretation of the request is thereupon used to automatically construct an operational navigation query to retrieve the desired information from one or more electronic network data sources; the retrieved information is then transmitted to a client device of the user. If the network data source is a database, the navigation query is constructed in the format of a database query language.
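By way of illustration only, the flow from interpreted request to database query might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: speech recognition is assumed to have already produced text, the keyword-based `parse_request` stands in for a real language parser, and the `movies` table, its columns, the `GENRES` vocabulary, and all function names are hypothetical.

```python
import re
import sqlite3

GENRES = ("western", "comedy", "drama")  # hypothetical genre vocabulary


def parse_request(text):
    """Toy stand-in for a language parser: extract genre and director slots."""
    slots = {}
    lowered = text.lower()
    for genre in GENRES:
        if genre in lowered:
            slots["genre"] = genre
    match = re.search(r"directed by ([A-Za-z ]+)", text, flags=re.IGNORECASE)
    if match:
        slots["director"] = match.group(1).strip()
    return slots


def build_query(slots):
    """Construct a parameterized SQL query from whichever slots were filled."""
    columns = sorted(slots)  # column names come only from the parser's fixed keys
    sql = "SELECT title FROM movies"
    if columns:
        sql += " WHERE " + " AND ".join(f"{c} = ? COLLATE NOCASE" for c in columns)
    sql += " ORDER BY title"
    return sql, [slots[c] for c in columns]


def navigate(conn, text):
    """Interpret the request text, run the query, and return matching titles."""
    sql, params = build_query(parse_request(text))
    return [row[0] for row in conn.execute(sql, params)]


def make_demo_db():
    """Build a small in-memory sample database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE movies (title TEXT, genre TEXT, director TEXT)")
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?, ?)",
        [
            ("Unforgiven", "western", "Clint Eastwood"),
            ("Mystic River", "drama", "Clint Eastwood"),
            ("Rio Bravo", "western", "Howard Hawks"),
        ],
    )
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(navigate(db, "I'd like to see a Western film directed by Clint Eastwood"))
```

The query is parameterized so that recognized words are passed as values rather than spliced into the SQL text; only the column names, which are fixed by the parser, appear in the query string.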
Occasionally, errors or ambiguities emerge in the interpretation of the spoken request, such that the system cannot instantiate a complete, valid navigational template. This is to be expected, and one preferred aspect of the invention is the ability to handle such errors and ambiguities in a relatively graceful and user-friendly manner. Instead of simply rejecting such input and defaulting to traditional input modes or simply asking the user to try again, a preferred embodiment of the present invention seeks to converge rapidly toward instantiation of a valid navigational template by soliciting additional clarification from the user as necessary, either before or after a navigation of the data source, via multimodal input, i.e., by means of menu selection or other input modalities including and in addition to spoken input. This clarifying, multimodal dialogue takes advantage of whatever partial navigational information has been gleaned from the initial interpretation of the user's spoken request. This clarification process continues until the system converges toward an adequately instantiated navigational template, which is in turn used to navigate the network-based data and retrieve the user's desired information. The retrieved information is transmitted across the network and presented to the user on a suitable client display device.
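The clarification process described above might be sketched, purely as an illustration, as a loop that keeps whatever slots were already filled and asks the user only about a slot that still distinguishes the remaining candidates. Everything here is an assumption made for the example: the in-memory `CATALOG`, the slot names, and the `ask` callback, which stands in for a menu selection or a follow-up spoken prompt.

```python
# Illustrative sample catalog; a real system would query a network data source.
CATALOG = [
    {"title": "Unforgiven", "genre": "western", "director": "Clint Eastwood"},
    {"title": "Mystic River", "genre": "drama", "director": "Clint Eastwood"},
    {"title": "Rio Bravo", "genre": "western", "director": "Howard Hawks"},
]


def matches(slots):
    """Return the catalog entries consistent with every filled slot."""
    return [
        item for item in CATALOG
        if all(item[name] == value for name, value in slots.items())
    ]


def clarify(slots, ask, refinable=("genre", "director")):
    """Refine a partially filled template until at most one candidate remains.

    `ask(slot_name, options)` is a hypothetical callback that presents the
    options to the user (for example as a menu) and returns the chosen value.
    """
    slots = dict(slots)  # keep the partial information already gleaned
    while True:
        candidates = matches(slots)
        if len(candidates) <= 1:
            return slots, candidates
        for name in refinable:
            options = sorted({item[name] for item in candidates})
            # Ask only about an unfilled slot that actually narrows the result.
            if name not in slots and len(options) > 1:
                slots[name] = ask(name, options)
                break
        else:
            # No remaining slot distinguishes the candidates; return them all.
            return slots, candidates


if __name__ == "__main__":
    def menu(slot_name, options):
        print(f"Which {slot_name}? {options}")
        return options[-1]  # simulate the user picking the last menu entry

    # "Show me the Clint Eastwood movie" is under-constrained: two films match.
    final_slots, results = clarify({"director": "Clint Eastwood"}, menu)
    print(final_slots, [item["title"] for item in results])
```

Because each pass either fills one new slot or returns, the loop terminates after at most as many questions as there are refinable slots.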