Embodiments of the invention generally relate to electronic natural language processing, and more particularly, to generating, modifying, evaluating, and using knowledge graphs.
One task of interest in the natural language processing (NLP) field is to build knowledge graphs corresponding to a knowledge domain. A knowledge domain may refer to a set of interrelated concepts, where two or more concepts may have some relation to one another. Concepts may be considered elements of the knowledge domain. The knowledge domain may include subdomains that refer to specialized forms of that knowledge domain. The knowledge domain itself may be a subdomain of a more generalized domain. Knowledge graphs are data representations of such domains and subdomains. Knowledge graphs may be stored on a computing device as a data structure, and may be viewable on an input/output (I/O) device as a graph. The data structure itself may be a logical graph, and may simply be referred to as a graph.
Knowledge graphs may be structured as interconnected nodes organized in a hierarchical structure. Constructing a knowledge graph may include identifying words or lexical or syntactic features (for example, phrases) as nodes of the graph, and connecting them according to a known or discovered hierarchy. The nodes may also be connected based on known or discovered relations between them. The connections may also be referred to as edges. Generating, expanding, contracting, or otherwise modifying a knowledge graph (including its edges/connections), then, may involve identifying nodes and relations between them. Other ways of structuring a knowledge graph, such as sets of “is-a” links that define categories in the knowledge graph, and their constituent concepts, are also possible.
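By way of illustration only, the node-and-edge structure described above may be sketched as follows. This is a minimal sketch and not a required implementation: the class name `KnowledgeGraph`, its methods, and the example concepts ("aspirin", "analgesic", and so on) are hypothetical and are chosen solely for this example.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy knowledge graph: nodes are concept strings, edges carry a label."""

    def __init__(self):
        self.nodes = set()
        # edges[source] holds a set of (relation, target) pairs.
        self.edges = defaultdict(set)

    def add_relation(self, source, relation, target):
        """Register both concepts as nodes and connect them with a labeled edge."""
        self.nodes.update((source, target))
        self.edges[source].add((relation, target))

    def ancestors(self, node):
        """Follow "is-a" edges upward and return every broader concept found."""
        found = []
        frontier = [node]
        while frontier:
            current = frontier.pop()
            # .get avoids creating empty entries for leaf nodes.
            for relation, target in sorted(self.edges.get(current, ())):
                if relation == "is-a" and target not in found:
                    found.append(target)
                    frontier.append(target)
        return found


# Hypothetical example data: two hierarchical links and one other relation.
kg = KnowledgeGraph()
kg.add_relation("aspirin", "is-a", "analgesic")
kg.add_relation("analgesic", "is-a", "drug")
kg.add_relation("aspirin", "treats", "headache")

print(kg.ancestors("aspirin"))
```

In this sketch, the "is-a" edges carry the hierarchy, while other labels (here, "treats") carry additional relations between nodes. Modifying the graph amounts to adding or removing entries in `nodes` and `edges`.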
One way to identify relations between concepts in a domain, or nodes in a knowledge graph, is to perform relation extraction on a reference text. Approaches to relation extraction in NLP systems generally fall into four main categories: supervised, unsupervised, distantly supervised, and bootstrapped.
Supervised techniques tend to have the greatest fidelity; they can identify entity relations more accurately than other methods. However, this accuracy comes at a high cost, because these techniques require large bodies of manually annotated text, and the annotation effort must be repeated for each relation and each domain.
Unsupervised techniques fall at the other end of the spectrum, requiring little to no human intervention to discover relations. However, the resulting knowledge graph often includes noise, and mapping the resulting relation clusters onto human-readable relations is non-trivial and can be impractical or nearly impossible. This can prevent proper interpretation of the graph and its merger with existing knowledge graphs.
Bootstrapping and distant supervision sit somewhere in the middle of the spectrum, requiring some known instances of entity relations, but relying heavily on large collections of unlabeled text to facilitate learning.
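For illustration, one simple form of bootstrapping may be sketched as follows, assuming a pattern-based approach in which the text found between a known entity pair is reused as an extraction pattern. The function name `bootstrap`, the seed pair, and the example sentences are hypothetical; practical systems typically add pattern scoring and filtering to limit noise.

```python
import re


def bootstrap(seeds, sentences, rounds=1):
    """Grow a set of (head, tail) pairs from a few seeds and unlabeled text."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: collect the text that appears between known pairs.
        patterns = set()
        for head, tail in known:
            for sentence in sentences:
                match = re.search(
                    re.escape(head) + r"(.+?)" + re.escape(tail), sentence
                )
                if match:
                    patterns.add(match.group(1))
        # Step 2: apply each learned pattern to find new candidate pairs.
        for pattern in patterns:
            regex = re.compile(r"(\w+)" + re.escape(pattern) + r"(\w+)")
            for sentence in sentences:
                for head, tail in regex.findall(sentence):
                    known.add((head, tail))
    return known


# Hypothetical seed instance and unlabeled text.
seeds = {("Paris", "France")}
sentences = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Rome is a city in Italy.",
]

pairs = bootstrap(seeds, sentences)
print(sorted(pairs))
```

Here a single known instance yields the pattern " is the capital of ", which in turn extracts a second pair from the unlabeled sentences; the third sentence matches no learned pattern and contributes nothing.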
Therefore, it would be useful to employ a mechanism to address some or all of these concerns, as recognized and addressed by some embodiments of the invention. It should be noted that addressing these concerns may be a feature of embodiments of the invention, but is not required.