1. Field
The present invention relates generally to computational linguistics and, more specifically, to systems, methods, and techniques for adjusting graphs of relationships between documents.
2. Description of the Related Art
The digitalization of everyday life activities has promoted the use of wearable technology and smartphones, and the widespread use of the Internet as a tool to exchange information has contributed heavily to a great increase in the amount of available data during the last few decades. A large majority of this data is unstructured data, such as plain text. As a consequence, the ability to effectively extract information from corpora of text in an automated way has increased in importance.
One often wishes to draw inferences based on information contained in and distributed among a relatively large collection (i.e., corpus) of documents, e.g., among substantially more documents than one has time to read or the cognitive capacity to analyze. Certain types of inferences implicate relationships between those documents and clusters of documents. For example, it may be useful to organize pairs of documents by the similarity of terms in the documents. In some cases, topics can be derived from such organization. Examples might include organizing restaurants based on restaurant reviews, organizing companies based on content in company web sites, organizing current events or public figures based on news stories, and organizing movies based on dialogue.
One family of techniques for making such inferences is computational linguistic analysis of text (e.g., unstructured text) within the documents of a corpus, such as with natural language processing (NLP) techniques or those based on distributional semantics. Distributional semantics is typically used to characterize semantic similarities between terms (i.e., linguistic items), based on the assumption that terms occurring in the same contexts tend to have similar meanings. Computational linguistics is often used to perform semantic similarity analyses within corpora, gauging pair-wise similarity of the documents according to various metrics, or pair-wise measures of relationships between entities, topics, terms, or sentiments discussed in the documents. Such analyses may be crafted to yield results like those described above. Through the sophisticated use of such techniques, inferences that would otherwise be impractical are potentially attainable (e.g., in a multi-dimensional analysis of at least two vectors), even from relatively large corpora or clusters of documents.
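As a rough illustration of the distributional assumption described above, the following sketch represents each term by the terms that co-occur with it and compares two terms by the cosine of the angle between those context vectors. The function names, the use of a whole sentence as the context window, and the toy sentences are assumptions made for illustration only; they are not drawn from any particular system.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences):
    """Map each term to counts of the other terms sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, term in enumerate(tokens):
            for j, other in enumerate(tokens):
                if i != j:  # skip the term's own position
                    vectors[term][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: "cat" and "dog" appear in similar contexts.
sentences = [
    "cat drinks milk",
    "dog drinks water",
    "cat chases mice",
    "dog chases cars",
]
vectors = context_vectors(sentences)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_milk = cosine(vectors["cat"], vectors["milk"])
```

In this toy corpus, "cat" and "dog" share the contexts "drinks" and "chases", so they score as more similar to each other than "cat" does to "milk", even though "cat" and "dog" never appear in the same sentence.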
Among the tools of computational linguistics are semantic similarity graphs. These are graphs (also referred to as networks) in which nodes represent documents, and edges (also called links) connect respective pairs of the nodes. The edges indicate an amount of semantic similarity between the documents represented by the two nodes each edge connects. Such graphs can be quite complicated for large collections of documents and are powerful tools for representing the underlying relationships in corpora.
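One simple way such a graph might be built is sketched below, assuming bag-of-words term counts, cosine similarity as the edge weight, and a fixed threshold for deciding which edges to keep. These choices, along with the function names and sample documents, are illustrative assumptions rather than a description of any specific technique.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Bag-of-words term counts for one document (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_graph(documents, threshold=0.2):
    """Return {(i, j): weight} for document pairs at or above the threshold.

    Nodes are document indices; each edge weight is the pair's similarity.
    """
    vectors = [term_vector(d) for d in documents]
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        weight = cosine_similarity(vectors[i], vectors[j])
        if weight >= threshold:
            edges[(i, j)] = weight
    return edges


# Hypothetical three-document corpus.
documents = [
    "spicy noodle soup",
    "spicy noodle salad",
    "quarterly earnings report",
]
edges = build_similarity_graph(documents)
```

Here only the first two documents are connected, because they share two of their three terms. Comparing every pair of documents grows quadratically with corpus size, which is one reason such graphs become costly to build and recompute for large collections.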
Semantic similarity graphs (and other document relationship graphs) can be very difficult for users to edit. Often in such graphs, sets of documents will appear to cluster (or explicitly be clustered with subsequent processing) for reasons that are not readily apparent to the user. A user may see that the graph contains a large cluster of documents for which the user would like a more fine-grained representation, or the user may see a set of smaller clusters that group the documents in ways that are unhelpful. Even graphs devoid of clusters can suffer from these issues, as relationships may be represented in the graph that are unhelpful to the user. Devising a strategy to edit the semantic similarity graph to mitigate these types of problems can be difficult for users.
A user may, for example, only have recourse to a technique that expands or contracts a list of stop-words (e.g., manipulating a “blacklist” of words excluded from the similarity analysis). It can also be very difficult for users to identify every word that might be contributing to relationships in the graph that they wish to enhance or suppress (e.g., due to a limited familiarity with a certain lexicon). Further, traditional techniques are often relatively slow and computationally resource intensive when responding to such edits.
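The effect of editing a stop-word list can be sketched as follows, assuming a bag-of-words cosine similarity in which listed words are dropped before comparison. The function name, the sample reviews, and the stop-word set are hypothetical and serve only to illustrate the idea.

```python
import math
from collections import Counter


def similarity(doc_a, doc_b, stop_words=frozenset()):
    """Cosine similarity of term counts, ignoring any terms in stop_words."""
    a = Counter(t for t in doc_a.lower().split() if t not in stop_words)
    b = Counter(t for t in doc_b.lower().split() if t not in stop_words)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Two hypothetical reviews that share only function words.
review_a = "the pasta was great"
review_b = "the service was slow"

before = similarity(review_a, review_b)                               # 0.5
after = similarity(review_a, review_b, stop_words={"the", "was"})     # 0.0
```

Adding "the" and "was" to the list removes the only overlap between the two reviews, so the edge between them would disappear. The sketch also hints at the difficulty noted above: the user has to know which words are responsible, and every pair-wise similarity touched by the change has to be recomputed.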