The present disclosure relates generally to the extraction of semantic relations, and more specifically, to using distributional relation detection to extract semantic relations across documents in a corpus.
Much of human communication, whether it is in natural-language text, speech, and/or images, is unstructured. The semantics necessary to interpret unstructured information to solve problems is often implicit and is derived by using background information and inference. Unstructured data is contrasted with structured data, such as data in traditional database tables, where the data is well-defined, and the semantics are explicit. When structured data is used, queries are prepared to answer predetermined questions on the basis of necessary and sufficient knowledge of the meaning of the table headings (e.g., Name, Address, Item, Price, and Date). This is not the case with unstructured information where the semantics are not always explicit and it is often difficult to determine what an arbitrary string of text or an image really means.
With the enormous proliferation of electronic content on the web and within enterprises, unstructured information (e.g., text, images, and speech) is growing far faster than structured information. Whether it is general reference material, textbooks, journals, technical manuals, biographies, or blogs, this content contains high-value knowledge that is often important for informed decision making. The ability to leverage the knowledge latent in these large volumes of unstructured text lies in deeper natural-language analysis that can more directly infer answers to user questions.
Natural-language processing (NLP) techniques, which are also referred to as text analytics, infer the meaning of terms and phrases by analyzing their syntax, context, and usage patterns. Human language, however, is so complex, variable (there are many different ways to express the same meaning), and polysemous (the same word or phrase may mean many things in different contexts) that this presents an enormous technical challenge. Decades of research have led to many specialized techniques each operating on language at different levels and on different isolated aspects of the language understanding task. These techniques include, for example, shallow parsing, deep parsing, information extraction, word-sense disambiguation, latent semantic analysis, textual entailment, and co-reference resolution. None of these techniques is perfect or complete in their ability to decipher the intended meaning. Unlike programming languages, human languages are not formal mathematical constructs. Given the highly contextual and implicit nature of language, humans themselves often disagree about the intended meaning of any given expression.
Detecting semantic relations in text is very useful in both information retrieval and question answering because it enables knowledge bases (KB s) to be leveraged to score passages and retrieve candidate answers. Approaches for extracting semantic relations from text include rule-based methods that employ a number of linguistic rules to capture relation patterns. Other approaches include feature based methods that transform relation instances into a large amount of linguistic features such as lexical, syntactic and semantic features, and that capture the similarity between these features using vectors. Further approaches for extracting semantic relations include those that are kernel-based and focused on using tree kernels to learn parse tree structure related features.