Spoken language understanding (SLU) is a key component in human-computer conversational interaction systems. Existing spoken dialog systems operate in single-user scenarios, where a user speaks to the system and the system gives feedback in response to the user's request, such as checking flight status or finding a restaurant. However, no commercially available system is capable of active involvement in multi-user conversations.
Many existing spoken dialog systems are application-specific and capable of responding only to requests within limited domains. Each domain represents a single content area such as search, movie, music, restaurant, shopping, flights, etc. Limiting the number of domains generally allows spoken dialog systems to be more accurate, but requires the user to resort to different resources for different tasks.
While the utterances addressed to the computer are the main focus of multi-user spoken dialog systems, many problems initially defined for human-human conversation understanding are closely related to this task. For example, dialog act labeling defines the function of each dialog act and the relationships among them to provide increased understanding of user intent. Addressee detection attempts to differentiate between utterances addressed to another human and utterances addressed to the computer, allowing users to speak to the computer naturally without an explicit intervention, such as speaking an addressing term (e.g., “computer”) or making an addressing gesture (e.g., pushing a button or looking at a camera) in conjunction with making a request.
Previous work on domain detection studied this problem in a single-user scenario, where only one user speaks to the computer. In the dataset used in that work, the utterances were collected independently and without context, i.e., the user speaks a random query or a series of related queries to the computer. Previous efforts in domain classification have shown that a semi-supervised method exploiting web query click logs outperforms a fully supervised approach trained on limited annotated data. Simple context (e.g., previous dialog act and speaker labels) has been exploited to help dialog act labeling, and topic shift/continue detection has been used to help find relevant answers in a question answering system. However, context (mainly human-addressed context) has not been studied or used for domain detection or for other language understanding tasks, such as user intent detection and slot filling.
It is with respect to these and other considerations that the present invention has been made. Although relatively specific problems have been discussed, it should be understood that the embodiments disclosed herein should not be limited to solving the specific problems identified in the background.