Technical Field
This disclosure relates generally to information retrieval using question answering systems and, in particular, to obtaining training data for such systems.
Background of the Related Art
Question answering (or “question and answering,” or “Q&A”) is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection), a Q&A system should be able to retrieve answers to questions posed in natural language. Q&A is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval, such as document retrieval, and it is sometimes regarded as the next step beyond search engines. Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and it can be seen as an easier task because NLP systems can exploit domain-specific knowledge that is frequently formalized in ontologies. Open-domain question answering deals with questions about nearly everything, and such systems can rely only on general ontologies and world knowledge. These systems usually have much more data available from which to extract the answer. Systems of this type are implemented as one or more computer programs, executed on a machine (or a set of machines). Typically, user interaction with such a computer program is via either a single user-computer exchange or a multiple-turn dialog between the user and the computer system. Such a dialog can involve one or multiple modalities (text, voice, tactile, gesture, or the like). The challenge in building such a system is to understand the query, to find appropriate documents that might contain the answer, and to extract the correct answer to be delivered to the user.
Question answering systems, such as IBM® Watson, require a large amount of training data. Obtaining high-quality data, however, is very difficult in any real application. In particular, it may be expensive to have a domain expert annotate a large number of answers as either correct or incorrect to provide the training dataset. Machine learning principles, such as active learning, can save the human experts' time (and thus ameliorate this cost to some degree) by identifying discriminative questions automatically. Then, the human experts can label the correct answers for the discriminative questions selected by the machine learning models, or a machine learning model can label the data autonomously.
Known active learning frameworks, however, have several drawbacks. First, the current frameworks work in a general feature space. Further, the decision about which data to use in the training is made at the level of training instances. In question answering systems, training instances correspond to answers, but the selection needs to be done at the level of questions, because otherwise the benefit of such a system would be marginal. Another drawback is that, in the typical solution of this type, the spatial distribution of the training data is not taken into account. Rather, the geometry of the dataset is reflected in nearest neighbor-based algorithms, but these do not translate directly into the question space. In addition, in existing active learning systems, once a data point is labeled, it is automatically included in the training set. This approach does not assure high-quality training data, especially because it does not prevent inclusion of outliers or noise.
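By way of illustration only, the mismatch between instance-level and question-level selection noted above can be sketched as follows. This is a minimal sketch under stated assumptions and is not the technique of this disclosure: the function names `answer_uncertainty` and `select_questions`, the tuple layout of the candidate answers, the uncertainty measure, and the use of a per-question mean are all hypothetical choices made for the example.

```python
from collections import defaultdict


def answer_uncertainty(p_correct):
    """Hypothetical uncertainty score for one candidate answer.

    Returns 1.0 when the model is maximally unsure (p = 0.5) and
    0.0 when it is fully confident (p = 0.0 or p = 1.0).
    """
    return 1.0 - abs(2.0 * p_correct - 1.0)


def select_questions(candidates, k):
    """Pick the k questions whose candidate answers are, on average,
    the most uncertain.

    candidates: iterable of (question_id, answer_id, p_correct) tuples,
    where p_correct is an assumed model estimate that the answer is correct.
    """
    per_question = defaultdict(list)
    for question_id, _answer_id, p_correct in candidates:
        per_question[question_id].append(answer_uncertainty(p_correct))

    # Assumption: aggregate answer-level scores to the question level by mean.
    scores = {q: sum(u) / len(u) for q, u in per_question.items()}

    # Highest score first; ties broken by question id for determinism.
    ranked = sorted(scores, key=lambda q: (-scores[q], q))
    return ranked[:k]


# Illustrative data only: three questions with two candidate answers each.
candidates = [
    ("q1", "a1", 0.95), ("q1", "a2", 0.05),
    ("q2", "a1", 0.55), ("q2", "a2", 0.45),
    ("q3", "a1", 0.50), ("q3", "a2", 0.99),
]

# Instance-level selection looks only at the single most uncertain answer.
most_uncertain_instance = max(candidates, key=lambda c: answer_uncertainty(c[2]))

# Question-level selection ranks whole questions instead.
selected = select_questions(candidates, 1)
```

In this made-up data, the single most uncertain answer belongs to question q3, whereas the question that is most uncertain overall is q2, so the two selection levels disagree on which question to send to the expert.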
Thus, there remains a need to provide techniques to obtain high-quality and highly diversified question-answer pairs to facilitate training of a Q&A system.