Watson and other natural language processing (NLP) applications need to train question answering systems on large amounts of data. One way to meet that need is to use training data created organically by people who ask questions in their natural environment. Machine learning for NLP is especially difficult because it requires a great deal of data that represents natural language queries. Building a chat bot, or any interface that takes input from a person and uses it to retrieve the correct ‘answer’ from a corpus of documents, requires a large amount of training data. Today, collecting that data is often painfully manual: many Subject Matter Experts must spend months collecting questions from users, or else risk a training set that cannot produce valuable results in the real-world application.
A hosted search engine stores user search queries and relevant metadata in an accessible database, and a subset of these queries provides an excellent source of natural language questions. Of the billions of queries, a small subset (about 1%) are fully formed questions with the standard interrogative words “who,” “what,” “when,” “where,” “why,” and “how.” However, many of the remaining queries read naturally as full questions if an interrogative phrase is simply prepended. For example, “side effects of Tylenol” is really “what are the side effects of Tylenol,” except that people have been trained to drop interrogative words from search queries to save time. By utilizing these partially formed questions, training sets for NLP models could be strengthened and the models brought to market faster.
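The idea of separating fully formed questions from partial ones, and then completing the partial ones, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `is_full_question` and `to_question` are hypothetical, a query counts as a full question only if its first word is one of the six interrogatives, and the fixed prefix "what are the" is a placeholder taken from the Tylenol example. How to choose the right interrogative phrase for a given query is not described here and is left open.

```python
# The six standard interrogative words named in the text.
INTERROGATIVES = ("who", "what", "when", "where", "why", "how")


def is_full_question(query: str) -> bool:
    """Return True if the query's first word is an interrogative.

    Assumption: a first-word check is a rough stand-in for
    "fully formed question"; a real system would need more than this.
    """
    words = query.strip().lower().split()
    return bool(words) and words[0] in INTERROGATIVES


def to_question(query: str, prefix: str = "what are the") -> str:
    """Prepend an interrogative phrase to a partially formed question.

    Assumption: `prefix` is a fixed placeholder; selecting the correct
    phrase per query is outside the scope of this sketch.
    """
    cleaned = query.strip()
    if not cleaned or is_full_question(cleaned):
        return cleaned  # already a question (or empty), leave unchanged
    return f"{prefix} {cleaned}"


if __name__ == "__main__":
    for q in ["side effects of Tylenol", "how do vaccines work"]:
        print(to_question(q))
```

With the default prefix, "side effects of Tylenol" becomes "what are the side effects of Tylenol", while a query that already starts with an interrogative is passed through untouched.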