1. Field
The present invention relates generally to computational linguistics and, more specifically, to using exogenous datasets to measure the accuracy of graphs formed with unsupervised learning techniques.
2. Description of the Related Art
Often people wish to draw inferences based on information contained in, and distributed among, relatively large collections of documents, e.g., substantially more documents than they have time to read or the cognitive capacity to analyze. Certain types of inferences implicate relationships between those documents. For example, it may be useful to organize documents by the subject matter described in the documents, sentiments expressed in the documents, or topics addressed in the documents. In many cases, useful insights can be derived from such organization, for example, discovering taxonomies, ontologies, relationships, or trends that emerge from the analysis. Examples might include organizing restaurants based on restaurant reviews, organizing companies based on content in company websites, organizing current events or public figures based on news stories, and organizing movies based on dialogue.
One family of techniques for making such inferences is computational linguistic analysis of text, such as unstructured text, within the documents of a corpus, e.g., with natural language processing techniques, like those based on distributional semantics. Computers are often used to perform semantic similarity analyses within corpora to gauge pair-wise similarity of the documents according to various metrics, or pair-wise measures of relationships between entities, topics, terms, or sentiments discussed in the documents, which may be crafted to yield results like those described above. Through the sophisticated use of computers, inferences that would otherwise be impractical are potentially attainable, even on relatively large collections of documents.
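By way of illustration only, the pair-wise document similarity analysis described above might be sketched as follows. The sketch assumes cosine similarity over raw term-frequency vectors, which is merely one of the various metrics that could be used and is not specified by the text above. The function names `term_vector`, `cosine_similarity`, and `pairwise_similarity` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Return a bag-of-words term-frequency vector for one document."""
    # Assumption: whitespace tokenization and lowercasing are sufficient here.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_similarity(docs):
    """Map each unordered pair of document indices to a similarity score."""
    vectors = [term_vector(d) for d in docs]
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }
```

Under these assumptions, two restaurant reviews that share most of their terms would receive a higher score than a review paired with an unrelated news item. The number of pairs grows quadratically with the number of documents, which is one reason such analyses are impractical to perform or verify by hand on large corpora.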
In many cases, the collections of documents are relatively large, for example, more than 100 documents, and in many cases more than 10,000 documents, making it difficult to gauge whether computer-implemented analyses are accurate. For instance, an algorithm may work well for certain classes of topics or documents within the corpus, but other classes of topics or documents may yield low-quality results. Further, time and cognitive limitations make it difficult for a human being to manually review each of the documents effectively and compare each document to the algorithm's assessment, causing many analyses that rely solely on human review to lead to false conclusions or misleading results, and making it difficult to compare the performance of algorithms.