The present disclosure is generally directed to techniques for natural language generation and, more particularly, to recombination techniques for natural language generation that facilitate test input generation for natural language processing systems.
Watson is a question answering (QA) system (i.e., a data processing system) that applies advanced natural language processing (NLP), information retrieval, knowledge representation, automated reasoning, and machine learning technologies to the field of open domain question answering. In general, conventional document search technology receives a keyword query and returns a list of documents, ranked in order of relevance to the query (often based on popularity and page ranking). In contrast, QA technology receives a question expressed in a natural language, seeks to understand the question in greater detail than document search technology, and returns a precise answer to the question.
The Watson system reportedly employs more than one hundred different algorithms to analyze natural language, identify sources, find and generate hypotheses, find and score evidence, and merge and rank hypotheses. The Watson system implements DeepQA™ software and the Apache™ unstructured information management architecture (UIMA) framework. Software for the Watson system is written in various languages, including Java, C++, and Prolog, and runs on the SUSE™ Linux Enterprise Server 11 operating system using the Apache Hadoop™ framework to provide distributed computing. As is known, Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware.
The Watson system employs DeepQA software to generate hypotheses, gather evidence (data), and analyze the gathered data. The Watson system is workload optimized and integrates massively parallel POWER7® processors. The Watson system includes a cluster of ninety IBM Power 750 servers, each of which includes four 3.5 GHz POWER7 eight-core processors, with four threads per core. In total, the Watson system has 2,880 POWER7 processor cores and 16 terabytes of random access memory (RAM). Reportedly, the Watson system can process 500 gigabytes per second, the equivalent of one million books. Sources of information for the Watson system include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. The Watson system also uses databases, taxonomies, and ontologies.
Cognitive systems learn and interact naturally with people to extend what either a human or a machine could do alone. Cognitive systems help human experts make better decisions by penetrating the complexity of ‘Big Data’. Cognitive systems build knowledge and learn a domain (i.e., language and terminology, processes and preferred methods of interacting) over time. Unlike conventional expert systems, which have required rules to be hard coded by a human expert, cognitive systems can process natural language and unstructured data and learn by experience, similar to how humans learn. While cognitive systems have deep domain expertise, they do not replace human experts; instead, cognitive systems act as decision support systems that help human experts make better decisions based on the best available data in various areas (e.g., healthcare, finance, or customer service).
U.S. Pat. No. 8,543,381 discloses replacing words in a language phrase with synonyms to generate new language phrases. U.S. Pat. No. 7,496,621 discloses replacing text in a phrase based on semantic features to generate new language phrases. U.S. Patent Application Publication No. 2002/0026306 discloses a method for choosing a tree adjoining grammar (TAG) based on a reference grammar and a predictive model with the goal of choosing a best TAG to generate a sentence. A paper entitled “An Overview of SURGE: a Reusable Comprehensive Syntactic Realization Component” describes a general purpose natural language generation approach that requires a nearly complete description of a target language to be useful. SimpleNLG™ employs a natural language generation approach that requires a relatively complete grammar description for a target language. A paper entitled “Asking what no one has asked before: using phrase similarities to generate synthetic web search queries” discloses generating data related to search queries. In general, SURGE and SimpleNLG are implementations of language realization systems that are powerful, but require significant investment in configuring or programming before the systems can be used to generate language.
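For illustration only, the synonym-replacement idea mentioned above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and is not the method of any cited reference: the `SYNONYMS` table and the `synonym_variants` function are hypothetical names introduced here, and a practical system would presumably draw synonyms from a thesaurus or lexical database rather than a hand-written table.

```python
from itertools import product

# Hypothetical synonym table (an assumption for this sketch only).
SYNONYMS = {
    "large": ["big", "huge"],
    "city": ["town"],
}

def synonym_variants(phrase):
    """Return every phrase formed by swapping words for listed synonyms,
    excluding the original phrase itself."""
    words = phrase.split()
    # For each word, the candidates are the word itself plus any listed synonyms.
    options = [[w] + SYNONYMS.get(w.lower(), []) for w in words]
    variants = [" ".join(choice) for choice in product(*options)]
    return [v for v in variants if v != phrase]

variants = synonym_variants("What is the large city")
```

Because each word is replaced independently, the number of generated phrases grows multiplicatively with the number of words that have synonyms, which is one reason simple substitution can yield many test inputs from a single seed phrase.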
TAGs are formal grammars, similar to context-free grammars, that are used to describe natural languages. A paper entitled “Integrated Natural Language Generation with Schema-Tree Adjoining Grammars” describes a complete system for natural language generation that uses TAGs as the formal grammar for defining a target natural language grammar. While using TAGs (as opposed to other formal or ad-hoc languages) to describe rules of a target natural language does yield benefits due to their generative properties, systems that employ TAGs have still required a rather complete definition of the language before the systems can be used to generate sentences for the language. Acrolinx™ is a product whose primary focus is improving writing quality. Acrolinx includes tools for generating language in the context of suggested text replacements to improve readability or better convey a particular message.
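The dependence of grammar-driven generation on the completeness of the grammar can be illustrated with a toy example. The following Python sketch uses a small context-free grammar, not a TAG (the adjoining operation that distinguishes TAGs is not modeled), and the `GRAMMAR` table and `expand` function are hypothetical names assumed for illustration. The generator can only produce sentences that the supplied rules cover.

```python
from itertools import product

# Hypothetical toy context-free grammar (an assumption for this sketch only).
# Each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N": [["system"], ["question"]],
    "V": [["answers"], ["parses"]],
}

def expand(symbol):
    """Return all word sequences derivable from a symbol.
    Assumes the grammar is non-recursive, so enumeration terminates."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal word
    results = []
    for rhs in GRAMMAR[symbol]:
        parts = [expand(s) for s in rhs]
        # Combine one derivation of each right-hand-side symbol, in order.
        for combo in product(*parts):
            results.append([w for part in combo for w in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
```

Any construction absent from the rules (for example, a question form or a prepositional phrase) cannot be generated until a rule for it is written, which reflects the configuration investment described above.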