In the field of artificially intelligent computer systems capable of answering questions posed in natural language, cognitive question answering (QA) systems (such as the IBM Watson™ artificially intelligent computer system or other natural language question answering systems) process questions posed in natural language to determine answers and associated confidence scores based on knowledge acquired by the QA system. In operation, users submit one or more questions through a front-end application user interface (UI), application programming interface (API), or other service to the QA system, where the questions are processed using artificial intelligence (AI) and natural language processing (NLP) techniques to provide answers with associated evidence and confidence measures from an ingested knowledge base corpus for return to the user(s). For example, the Watson Conversation Service provides a simple, scalable and science-driven front-end service for developers to build powerful chat bots to conduct dialogs to answer questions from end users or consumers, thereby providing improved customer care to the end user. Existing QA systems use one or more machine learning algorithms to learn the specifics of the problem from sample labeled data and to help make predictions on unlabeled data. These algorithms rely on a “training process” in which the QA system is provided with representative inputs and corresponding outputs, so that the QA system learns by example from input/output pairs which constitute the “ground truth” for the QA system. In such machine learning systems, a classifier service may employ deep learning technologies to extract intent (outputs) from a natural language utterance (inputs), using training data to learn which utterances map to which intents, thereby providing the ability to extract intent not only from utterances it has seen, but also from any utterance based on the similarity of such an utterance to what is available in the training data.
Since intent classifiers are typically limited to an application domain of interest to a client who is building the system using the classifier, this can create challenges when different intent classifiers built for different domains are combined in a conversation system. For example, when individually trained classifiers are combined to compete for an incoming utterance/input with the intent/output results provided to an aggregator algorithm which selects the winning intent according to a pre-set decision rule (e.g., on the basis of the computed confidence measure), there is no guarantee that the application domains covered by each intent classifier are disjoint. When the intents from different classifiers overlap totally or partially, an incoming utterance may receive very similar confidence scores from multiple classifiers having overlapping intents. When this happens, the decision on which classifier should win may become highly unstable and unpredictable, and may be determined more by minor statistical fluctuations of the scores than by genuine differences between the intents. And while there have been some proposals to address potential competition among different classifiers by improving the calibration of confidence scores from different classifiers to make the scores more reliable, there are no existing systems for evaluating independently-trained intent classifiers for overlapping intent definitions, alerting the client of such overlaps, and providing recommended solutions for taking precautions to prevent such overlaps from occurring. 
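The instability described above can be illustrated with a minimal sketch of an aggregator that applies a highest-confidence decision rule. The function names `aggregate` and `winning_margin`, the classifier names, and all scores are hypothetical values chosen for illustration; they are not taken from any particular QA system.

```python
from typing import Dict, Tuple

# Hypothetical per-classifier output: intent name -> confidence score.
Scores = Dict[str, Dict[str, float]]


def aggregate(scores: Scores) -> Tuple[str, str, float]:
    """Pick the (classifier, intent, confidence) triple with the highest confidence."""
    candidates = [
        (clf, intent, conf)
        for clf, intent_scores in scores.items()
        for intent, conf in intent_scores.items()
    ]
    if not candidates:
        raise ValueError("no classifier returned a score")
    return max(candidates, key=lambda c: c[2])


def winning_margin(scores: Scores) -> float:
    """Gap between the best score of the top classifier and that of the runner-up."""
    tops = sorted((max(s.values()) for s in scores.values() if s), reverse=True)
    return tops[0] - tops[1] if len(tops) > 1 else float("inf")


# Two independently trained classifiers whose intents overlap (illustrative numbers).
scores = {
    "banking": {"check_balance": 0.81, "transfer_funds": 0.05},
    "billing": {"view_amount_due": 0.80, "dispute_charge": 0.04},
}
winner_before = aggregate(scores)

# A small fluctuation in one score is enough to flip the decision.
perturbed = {name: dict(s) for name, s in scores.items()}
perturbed["billing"]["view_amount_due"] += 0.02
winner_after = aggregate(perturbed)
```

In this sketch the winner changes from the `banking` classifier to the `billing` classifier after a score change of only 0.02, because the margin between the two top scores is about 0.01. A small margin of this kind is one possible signal that the competing intents overlap.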
Nor are traditional QA systems able to automatically evaluate conflicts between different independently-trained intent classifiers involving multiple sets of potentially overlapping intents without employing simple aggregation procedures where all intents are brought into the same classifier or decision logic. Such procedures require that conflicts be identified through a manual trial-and-error process in which a developer or end user inputs utterances to test a trained classifier and, if the classifier makes an error, the developer manually changes the training data for the classifier or modifies the intent set without the benefit of a resolution recommendation. As a result, existing solutions make it extremely difficult, at a practical level, to efficiently bring different independently-trained intent classifiers into joint use.
A similar difficulty exists even if the intents from two or more application domains are to be used together in a single classifier, so that they compete for an incoming utterance within the joint classifier. Overlaps in the intent definitions, caused by overlaps in the underlying training utterances for the intents, may cause decisions that are unstable and driven more by noise than by the true boundary between the competing intents. Being able to detect and correct for such overlaps when the joint classifier is designed is highly desirable.
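One simple way to look for overlapping training utterances is sketched below, assuming that token-level Jaccard similarity is an acceptable stand-in for utterance similarity. This heuristic, the function names, the threshold of 0.6, and the sample utterances are all assumptions made for illustration; the text above does not prescribe a particular detection method.

```python
import re
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical training data shape: intent name -> list of training utterances.
TrainingSet = Dict[str, List[str]]


def tokens(utterance: str) -> FrozenSet[str]:
    """Lowercase the utterance and return its set of word tokens."""
    return frozenset(re.findall(r"[a-z0-9']+", utterance.lower()))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_intents(
    domain_a: TrainingSet, domain_b: TrainingSet, threshold: float = 0.6
) -> List[Tuple[str, str, float]]:
    """For each intent pair, report the fraction of utterances of the first
    intent that have a near-duplicate (similarity >= threshold) in the second."""
    report = []
    for intent_a, utts_a in domain_a.items():
        for intent_b, utts_b in domain_b.items():
            toks_b = [tokens(u) for u in utts_b]
            hits = sum(
                1
                for u in utts_a
                if any(jaccard(tokens(u), tb) >= threshold for tb in toks_b)
            )
            if utts_a and hits:
                report.append((intent_a, intent_b, hits / len(utts_a)))
    return report


# Illustrative training utterances for two application domains.
banking = {
    "check_balance": ["What is my account balance", "Show my balance"],
    "transfer_funds": ["Send money to a friend"],
}
billing = {
    "view_amount_due": ["What is my account balance due", "How much do I owe"],
}
overlaps = overlapping_intents(banking, billing)
```

Here `overlaps` flags the pair `check_balance` and `view_amount_due`, since one of the two `check_balance` utterances nearly duplicates a `view_amount_due` utterance. A designer could review flagged pairs before training the joint classifier; a production system would likely use a stronger similarity measure than token overlap.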