The use of personal computers (PCs), personal digital assistants (PDAs), Web-enabled phones, wireline and wireless networks, the Internet, Web-based query systems and engines, and the like has gained relatively widespread acceptance in recent years. This is due, in large part, to the relatively widespread availability of high-speed, broadband Internet access through digital subscriber lines (DSLs) (including asymmetric digital subscriber lines (ADSLs) and very-high-bit-rate digital subscriber lines (VDSLs)), cable modems, satellite modems, and the like. Thus far, user interaction with PCs, PDAs, Web-enabled phones, wireline and wireless networks, the Internet, Web-based query systems and engines, and the like has been primarily non-voice-based, through keyboards, mice, intelligent electronic pads, monitors, printers, and the like. This has limited the adoption and use of these devices and systems somewhat, and it has long been felt that allowing for accurate, precise, and reliable voice-based user interaction, mimicking normal human interaction, would be advantageous. For example, allowing for accurate, precise, and reliable voice-based user interaction would certainly draw more users to e-commerce, e-support, e-learning, etc., and reduce learning curves.
In this context, “mimicking normal human interaction” means that a user would be able to speak a question into a Web-enabled device or the like and the Web-enabled device or the like would respond quickly with an appropriate answer or response, through text, graphics, or synthesized speech. The Web-enabled device or the like would not simply convert the user's question into text and perform a routine search, but would truly understand and interpret the user's question. Thus, if the user speaks a non-specific or incomplete question into the Web-enabled device or the like, the Web-enabled device or the like would be capable of inferring the user's meaning based on context or environment. This is only possible through multimodal input.
Several software products currently allow for limited voice-based user interaction with PCs, PDAs, and the like. Such software products include, for example, ViaVoice™ by International Business Machines Corp. and Dragon NaturallySpeaking™ by ScanSoft, Inc. These software products, however, allow a user to perform dictation, voice-based command-and-control functions (opening files, closing files, etc.), and voice-based searching (using previously-trained uniform resource locators (URLs)), only after time-consuming, and often inaccurate, imprecise, and unreliable, voice training. Their accuracy rates are inextricably tied to the single user who has provided the voice training.
Typical efforts to implement voice-based user interaction in a support and information retrieval context may be seen in U.S. Pat. No. 5,802,526, to Fawcett et al. (Sep. 1, 1998). Typical efforts to implement voice-based user interaction in an Internet context may be seen in U.S. Pat. No. 5,819,220, to Sarukkai et al. (Oct. 6, 1998).
U.S. Pat. No. 6,446,064, to Livowsky (Sep. 3, 2002), discloses a system and method for enhancing e-commerce using a natural language interface. The natural language interface allows a user to formulate a query in natural language form, rather than using conventional search terms. In other words, the natural language interface provides a “user-friendly” interface. The natural language interface may process a query even if there is not an exact match between the user-formulated search terms and the content in a database. Furthermore, the natural language interface is capable of processing misspelled queries or queries having syntax errors. The method for enhancing e-commerce using a natural language interface includes the steps of accessing a user interface provided by a service provider, entering a query using a natural language interface, the query being formed in natural language form, processing the query using the natural language interface, searching a database using the processed query, retrieving results from the database, and providing the results to the user. The system for enhancing e-commerce on the Internet includes a user interface for receiving a query in natural language form, a natural language interface coupled to the user interface for processing the query, a service provider coupled to the user interface for receiving the processed query, and one or more databases coupled to the user interface for storing information, wherein the system searches the one or more databases using the processed query and provides the results to the user through the user interface.
U.S. Pat. No. 6,615,172, to Bennett et al. (Sep. 2, 2003), discloses an intelligent query system for processing voice-based queries. This distributed client-server system, typically implemented on an intranet or over the Internet, accepts a user's queries at the user's PC, PDA, or workstation using a speech input interface. After the user's query is converted from speech to text, a two-step algorithm employing a natural language engine, a database processor, and a full-text structured query language (SQL) database is implemented to find a single answer that best matches the user's query. The system, as implemented, accepts environmental variables selected by the user and is scalable to provide answers to a variety and quantity of user-initiated queries.
U.S. Patent Application Publication No. 2001/0039493, to Pustejovsky et al. (Nov. 8, 2001), discloses, in an exemplary embodiment, a system and method for answering voice-based queries using a remote mobile device, e.g., a mobile phone, and a natural language system.
U.S. Patent Application Publication No. 2003/0115192, to Kil et al. (Jun. 19, 2003), discloses, in various embodiments, an apparatus and method for controlling a data mining operation by specifying the goal of data mining in natural language, processing the data mining operation without any further input beyond the goal specification, and displaying key performance results of the data mining operation in natural language. One specific embodiment includes providing a user interface having a control for receiving natural language input describing the goal of the data mining operation, and receiving that input from the control of the user interface. A second specific embodiment includes identifying key performance results, providing a user interface having a control for communicating information, and communicating a natural language description of the key performance results using the control of the user interface. In a third specific embodiment, input data determining a data mining operation goal is the only input required by the data mining application.
U.S. Patent Application Publication No. 2004/0044516, to Kennewick et al. (Mar. 4, 2004), discloses systems and methods for receiving natural language queries and/or commands and executing the queries and/or commands. The systems and methods overcome some of the deficiencies of other speech query and response systems through the application of a complete speech-based information query, retrieval, presentation, and command environment. This environment makes significant use of context, prior information, domain knowledge, and user-specific profile data to achieve a natural language environment for one or more users making queries or commands in multiple domains. Through this integrated approach, a complete speech-based natural language query and response environment may be created. The systems and methods create, store, and use extensive personal profile information for each user, thereby improving the reliability of determining the context and presenting the expected results for a particular question or command.
U.S. Patent Application Publication No. 2004/0117189, to Bennett (Jun. 17, 2004), discloses an intelligent query system for processing voice-based queries. This distributed client-server system, typically implemented on an intranet or over the Internet, accepts a user's queries at the user's PC, PDA, or workstation using a speech input interface. After the user's query is converted from speech to text, a natural language engine, a database processor, and a full-text SQL database are used to find a single answer that best matches the user's query. Both statistical and semantic decoding are used to assist and improve the performance of the query recognition.
Each of the systems, apparatuses, software products, and methods described above suffers from at least one of the following shortcomings. Several of the systems, apparatuses, software products, and methods require time-consuming, and often inaccurate, imprecise, and unreliable, voice training. Several of the systems, apparatuses, software products, and methods are single-modal, meaning that a user may interact with each of them in only one way, i.e., each utilizes only a single voice-based input. As a result of this single-modality, there is no context or environment within which a voice-based search is performed, and several of the systems, apparatuses, software products, and methods must perform multiple iterations to pinpoint a result or answer related to the voice-based search.
Thus, what is needed are natural language query systems, architectures, and methods for processing voice and proximity-based queries that do not require time-consuming, and often inaccurate, imprecise, and unreliable, voice training. What is also needed are natural language query systems, architectures, and methods that are multimodal, meaning that a user may interact with the natural language query systems, architectures, and methods in a number of ways simultaneously and that the natural language query systems, architectures, and methods utilize multiple inputs in order to create and take into consideration a context or environment within which a voice and/or proximity-based search or the like is performed. In other words, what is needed are natural language query systems, architectures, and methods that mimic normal human interaction, attributing meaning to words based on the context or environment within which they are spoken. What is further needed are natural language query systems, architectures, and methods that perform only a single iteration to pinpoint a result or answer related to a voice and/or proximity-based search or the like.