With the advent of modern computing systems, a variety of personal computing systems and devices have been equipped with conversational systems and applications that allow a user to speak a question to his or her computing device and receive information from the device in response. For example, in a typical setting, a user speaks a question such as “Where is the closest pizza restaurant?” to her handheld mobile telephone or tablet computing device, and the user expects that her device (if equipped with an appropriate application) will respond with a phrase like “I have found three pizza restaurants nearby.” According to some systems, the application may provide the user with addresses and other information responsive to the user's request. In some cases, received questions are processed locally on the user's computing device, for example, where the user's calendar information is interrogated for calendar-oriented questions, where a local weather application is interrogated for weather-oriented information, where a local contacts database is interrogated for contacts-oriented information, and the like. If information responsive to the request cannot be obtained locally, some systems use the received request to conduct an Internet-based information search, and the Internet-based search results responsive to the user's request are returned to the user.
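The local-first handling described above, with an Internet-based search as the fallback, can be sketched as follows. This is only an illustrative sketch: the function `route_question`, the keyword lists, and the stand-in handlers `calendar_lookup`, `weather_lookup`, and `fake_web_search` are all hypothetical names introduced here, and a real system would not rely on simple keyword matching.

```python
def route_question(question, local_handlers, web_search):
    """Try matching local sources first; fall back to a web search."""
    text = question.lower()
    for keywords, handler in local_handlers:
        # Crude keyword match; assumed here purely for illustration.
        if any(keyword in text for keyword in keywords):
            result = handler(question)
            if result is not None:
                return ("local", result)
    # No local source could answer, so search the Internet instead.
    return ("web", web_search(question))


# Hypothetical stand-ins for local data sources and a search service.
def calendar_lookup(question):
    return "Your next meeting is at 3 PM."


def weather_lookup(question):
    return None  # simulates weather data being unavailable locally


def fake_web_search(question):
    return f"Web results for: {question}"


LOCAL_HANDLERS = [
    (("meeting", "calendar", "appointment"), calendar_lookup),
    (("weather", "rain", "forecast"), weather_lookup),
]

if __name__ == "__main__":
    for q in ("When is my next meeting?",
              "Will it rain today?",
              "Where is the closest pizza restaurant?"):
        print(q, "->", route_question(q, LOCAL_HANDLERS, fake_web_search))
```

In this sketch, a calendar-oriented question is answered locally, while a question that no local source can answer (or for which the local source returns nothing) is passed to the search stand-in.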
One of the significant difficulties encountered in the development and implementation of such systems is language understanding, that is, understanding natural language spoken by users well enough that components of a given spoken utterance may be utilized for executing a computer-enabled function. This difficulty is particularly acute for developers of new applications or functions who wish to allow users to utilize those applications or functions by voice interaction.
For example, if a provider of taxi services wishes to offer a software application allowing users to request a taxi by speaking into their handheld computing devices (e.g., mobile telephones), the provider is faced with the daunting task of implementing a language understanding model that understands the many different ways in which a user may speak a request for taxi services, so that the application can provide the requested service. For example, such a computer-enabled taxi service might receive spoken requests such as “I need a taxi,” “Can you get me a car?,” “Is this a taxi service?,” “I need a cab to downtown,” and the like. The problem with such spoken phrases is that their structure, format, wording, and phrasing may vary as widely as the people using the service. That is, the service may receive a request in the form of a spoken utterance that differs from other similar spoken utterances in almost limitless ways.
For example, in the example utterances provided above, several different terms were used to describe the vehicle in which the user would be carried, including taxi, car, cab, and the like. In addition, some of the utterances were posed as questions while others were posed as statements. Some of the utterances could be understood as requesting a taxi service, while others could be understood as a search directed toward purchasing a vehicle. In response to such language understanding difficulties, developers and implementers of language understanding systems typically engage in a very slow, painstaking and labor-intensive effort of teaching the components of a language understanding system the many different variations of terms and phrasing that an application providing a service in response to a speech-based request might be expected to receive. For example, a data engineer collects utterance data that contains instances of a target user intent. A user experience designer creates labeling instructions that explain the new target intent. A crowd-source engineer creates a crowd-sourcing task in which workers apply the labeling instructions to data (various utterances) collected from numerous example users. A machine-learning expert uses this labeled data to build intent detection models that may determine the intent of a user speaking a request into her computing device, as well as entity extraction models that extract entities (e.g., terms in a spoken utterance that may be the subject of the utterance, such as “taxi”).
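To make the two tasks concrete, intent detection and entity extraction can be illustrated with a deliberately naive sketch. It uses hand-written keyword lists in place of the trained models described above, so it is a stand-in rather than a representation of how such models are built. The names `VEHICLE_TERMS`, `REQUEST_CUES`, `detect_intent`, and `extract_entities`, and the intent label `request_taxi`, are hypothetical.

```python
import re

# Hypothetical vocabularies, chosen only for illustration.
VEHICLE_TERMS = {"taxi", "cab", "car"}
REQUEST_CUES = {"need", "get", "book", "want"}


def tokenize(utterance):
    """Lowercase the utterance and split it into word tokens."""
    return re.findall(r"[a-z']+", utterance.lower())


def extract_entities(utterance):
    """Return the vehicle terms found in the utterance, in order."""
    return [token for token in tokenize(utterance) if token in VEHICLE_TERMS]


def detect_intent(utterance):
    """Label the utterance 'request_taxi' if it contains both a vehicle
    term and a request cue; otherwise label it 'unknown'."""
    tokens = set(tokenize(utterance))
    if tokens & VEHICLE_TERMS and tokens & REQUEST_CUES:
        return "request_taxi"
    return "unknown"


if __name__ == "__main__":
    for utterance in ("I need a taxi",
                      "Can you get me a car?",
                      "Is this a taxi service?",
                      "I need a cab to downtown",
                      "I want to buy a car"):
        print(utterance, "->", detect_intent(utterance),
              extract_entities(utterance))
```

Under these assumed keyword lists, “I want to buy a car” is labeled as a taxi request even though it more plausibly expresses an intent to purchase a vehicle. Ambiguity of this kind is what hand-written rules handle poorly, and is one reason labeled utterance data and trained models are used instead.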
In addition, problems with the definition of the intent or entities often surface only at the end of the process, when model performance is measured, thus requiring the whole process to be repeated. Overall, such a process can take weeks or even months and limits the ability of existing providers of language understanding models to extend language understanding to new types of spoken utterances. Such a process also limits the ability of application or functionality providers to integrate their applications or functionalities into conversational systems, because such providers are unable to develop the complex language understanding models needed for spoken utterances to be understood by, and to cause execution of, those applications and functionalities.