The present disclosure is generally directed to techniques for determining a semantic distance between subjects and, more specifically, to techniques for determining a semantic distance between subjects, at least one of which may be associated with user input to a data processing system, such as a question answering system.
Watson is a question answering (QA) system (i.e., a data processing system) that applies advanced natural language processing (NLP), information retrieval, knowledge representation, automated reasoning, and machine learning technologies to the field of open domain question answering. In general, conventional document search technology receives a keyword query and returns a list of documents, ranked in order of relevance to the query (often based on popularity and page ranking). In contrast, QA technology receives a question expressed in a natural language, seeks to understand the question in greater detail than document search technology, and returns a precise answer to the question.
The Watson system reportedly employs more than one hundred different algorithms to analyze natural language, identify sources, find and generate hypotheses, find and score evidence, and merge and rank hypotheses. The Watson system implements DeepQA™ software and the Apache™ unstructured information management architecture (UIMA) framework. Software for the Watson system is written in various languages, including Java, C++, and Prolog, and runs on the SUSE™ Linux Enterprise Server 11 operating system using the Apache Hadoop™ framework to provide distributed computing. As is known, Apache Hadoop is an open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware.
The Watson system employs DeepQA software to generate hypotheses, gather evidence (data), and analyze the gathered data. The Watson system is workload optimized and integrates massively parallel POWER7® processors. The Watson system includes a cluster of ninety IBM Power 750 servers, each of which includes a 3.5 GHz POWER7 eight-core processor, with four threads per core. In total, the Watson system has 2,880 POWER7 processor cores and has 16 terabytes of random access memory (RAM). Reportedly, the Watson system can process 500 gigabytes, the equivalent of one million books, per second. Sources of information for the Watson system include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. The Watson system also uses databases, taxonomies, and ontologies.
Cognitive systems learn and interact naturally with people to extend what either a human or a machine could do alone. Cognitive systems help human experts make better decisions by penetrating the complexity of ‘Big Data’. Cognitive systems build knowledge and learn a domain (i.e., language and terminology, processes and preferred methods of interacting) over time. Unlike conventional expert systems, which have required rules to be hard-coded by a human expert, cognitive systems can process natural language and unstructured data and learn by experience, similar to how humans learn. While cognitive systems have deep domain expertise, cognitive systems do not replace human experts. Instead, cognitive systems act as decision support systems that help human experts make better decisions based on the best available data in various areas (e.g., healthcare, finance, or customer service).
Latent Dirichlet allocation (LDA) is a generative statistical model utilized in NLP that allows sets of observations to be explained by unobserved groups (topics). For example, if observations are words collected into documents, LDA assumes that each document is a mixture of a number of topics and that each word in a document is attributable to one of the topics of the document. A topic may be identified using supervised labeling and/or manual pruning. An LDA analysis may be employed to classify a document based on words in the document. As one example, a document about cats has a relatively high probability of including various cat-related words, e.g., ‘milk’, ‘meow’, ‘kitten’, and ‘cat’. As another example, a document about dogs has a relatively high probability of including dog-related words, e.g., ‘puppy’, ‘bark’, ‘bone’, and ‘dog’.
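As one illustration of how an LDA analysis may assign words to topics, the following minimal sketch fits an LDA model to a toy corpus using collapsed Gibbs sampling. The sketch is not taken from the Watson system or from any particular library. The function name lda_gibbs, the hyperparameter values alpha and beta, the iteration count, and the toy documents are all assumptions chosen for illustration. Collapsed Gibbs sampling is only one of several known ways to fit an LDA model.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (theta, phi, vocab): theta[d][k] is the estimated share of
    topic k in document d, and phi[k][v] is the estimated probability
    of vocabulary word v under topic k. All defaults are illustrative.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word v is assigned to topic k
    topic_total = [0] * num_topics                              # total words assigned to topic k
    assignments = []                                            # topic of each word position

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    # Repeatedly resample each word's topic given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab

# Toy corpus built from the cat-related and dog-related words above (illustrative only).
docs = [
    ["cat", "milk", "meow", "kitten", "cat", "meow"],
    ["kitten", "milk", "cat", "meow"],
    ["dog", "puppy", "bark", "bone", "dog", "bark"],
    ["puppy", "bone", "dog", "bark"],
]
theta, phi, vocab = lda_gibbs(docs, num_topics=2)
for d, mixture in enumerate(theta):
    print(d, [round(p, 2) for p in mixture])
```

Each row of theta is the estimated topic mixture of one document, and a document may be classified by selecting the topic with the largest share in its row. The topic indices themselves carry no meaning, which is one reason a topic may subsequently be identified using supervised labeling and/or manual pruning.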