The present disclosure is generally directed to transforming questions and, more specifically, to techniques for transforming questions of a question set to facilitate answer aggregation and display by a data processing system, such as a cognitive system or a question answering system.
Watson is a question answering (QA) system (i.e., a data processing system) that applies advanced natural language processing, information retrieval, knowledge representation, automated reasoning, and machine learning technologies to the field of open-domain question answering. In general, conventional document search technology receives a keyword query and returns a list of documents, ranked in order of relevance to the query (often based on popularity and page ranking). In contrast, QA technology receives a question expressed in a natural language, seeks to understand the question in greater detail than document search technology, and returns a precise answer to the question.
The Watson system reportedly employs more than one hundred different algorithms to analyze natural language, identify sources, find and generate hypotheses, find and score evidence, and merge and rank hypotheses. The Watson system implements DeepQA™ software and the Apache™ Unstructured Information Management Architecture (UIMA) framework. Software for the Watson system is written in various languages, including Java, C++, and Prolog, and runs on the SUSE™ Linux Enterprise Server 11 operating system using the Apache Hadoop™ framework to provide distributed computing. As is known, Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware.
The Watson system employs DeepQA software to generate hypotheses, gather evidence (data), and analyze the gathered data. The Watson system is workload-optimized and integrates massively parallel POWER7® processors. The Watson system includes a cluster of ninety IBM Power 750 servers, each of which includes four 3.5 GHz POWER7 eight-core processors, with four threads per core. In total, the Watson system has 2,880 POWER7 processor cores and 16 terabytes of random access memory (RAM). Reportedly, the Watson system can process 500 gigabytes, the equivalent of one million books, per second. Sources of information for the Watson system include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. The Watson system also uses databases, taxonomies, and ontologies.
Cognitive systems learn and interact naturally with people to extend what either a human or a machine could do on its own. Cognitive systems help human experts make better decisions by penetrating the complexity of ‘Big Data’. Cognitive systems build knowledge and learn a domain (i.e., language and terminology, processes and preferred methods of interacting) over time. Unlike conventional expert systems, which require rules to be hard-coded by a human expert, cognitive systems can process natural language and unstructured data and learn by experience, similar to how humans learn. While cognitive systems have deep domain expertise, they do not replace human experts; instead, cognitive systems act as decision support systems that help human experts make better decisions based on the best available data in various areas (e.g., healthcare, finance, or customer service).
U.S. Patent Application Publication No. 2010/0205180 (hereinafter “the '180 publication”) is directed to techniques for identifying and classifying query intent. The '180 publication attempts to identify queries that use different natural language formations to request similar information. Common intent categories are identified for queries requesting similar information. Intent responses associated with the identified intent categories are then provided. In general, the '180 publication focuses on optimizing a particular query by determining an appropriate intent category and providing appropriate intent responses and, as such, alters a result set.
U.S. Patent Application Publication No. 2013/0187926 (hereinafter “the '926 publication”) is directed to automated presentation of information using infographics. The '926 publication discloses displaying data relating to an entity in the form of an infographic. The disclosed approach takes a body of text that stores data (i.e., data associated with an entity), determines an appropriate schema, prompts a user to supply data missing from the schema, and generates one or more infographics.
U.S. Patent Application Publication No. 2006/0122979 (hereinafter “the '979 publication”) is directed to search processing with automatic categorization of queries. The concepts disclosed in the '979 publication mainly work with simple queries that have a minimal number of words and do not deal with full grammatical queries, as employed in natural language questions.