1. Field
The present invention relates generally to computational linguistics and, more specifically, to techniques for composing and refining Boolean queries on data sets related to computational linguistics.
2. Description of the Related Art
Often people wish to draw inferences based on information contained in, and distributed among, relatively large collections of documents, e.g., substantially more documents than they have time to read or the cognitive capacity to analyze. Certain types of inferences implicate relationships between those documents. For example, it may be useful to organize documents by the subject matter described in the documents, sentiments expressed in the documents, or topics addressed in the documents. In many cases, useful insights can be derived from such organization, for example, discovering taxonomies, ontologies, relationships, or trends that emerge from the analysis. Examples might include organizing restaurants based on restaurant reviews, organizing companies based on content in company websites, organizing current events or public figures based on news stories, and organizing movies based on dialogue.
One family of techniques for making such inferences is computational linguistic analysis of text, such as unstructured text, within the documents of a corpus, e.g., with natural language processing techniques, like those based on distributional semantics. Computers are often used to perform semantic similarity analyses within corpora to gauge pair-wise similarity of the documents according to various metrics, or pair-wise measures of relationships between entities, topics, terms, or sentiments discussed in the documents, which may be crafted to yield results like those described above. Through the sophisticated use of computers, inferences that would otherwise be impractical are potentially attainable, even on relatively large collections of documents.
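By way of a non-limiting illustration, a pair-wise document similarity analysis of the kind mentioned above might be sketched as follows. The sketch assumes a simple bag-of-words representation and cosine similarity as the metric; neither is prescribed by this description, and the names term_vector, cosine_similarity, and pairwise_similarity are hypothetical helpers introduced only for this example.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Return bag-of-words term counts (lowercased, whitespace-split).

    Assumption: a real system would likely use richer tokenization
    and weighting; this is only an illustrative representation.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def pairwise_similarity(corpus):
    """Map each unordered pair of document ids to a similarity score.

    `corpus` is a hypothetical dict of document id -> document text.
    """
    vectors = {doc_id: term_vector(text) for doc_id, text in corpus.items()}
    return {
        (i, j): cosine_similarity(vectors[i], vectors[j])
        for i, j in combinations(sorted(vectors), 2)
    }
```

The number of pairs grows quadratically with the number of documents, which is one reason such analyses are generally performed by computer rather than by hand.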
In many cases, the collections of documents are relatively large, for example, more than 100 documents, and in many cases more than 10,000 documents, making it difficult for users to effectively explore the results of analyses. One powerful tool for interrogating such a corpus, or an analysis of such a corpus, is a Boolean query.
Boolean queries are used in a variety of contexts, including to express queries for relational databases and queries for searching natural language in unstructured documents. This query format has the advantage of being relatively expressive and precise. Very complex queries can be expressed as combinations of query elements (like keywords or database field values or ranges) and Boolean operators (like “and,” “or,” and “not”). For these reasons, Boolean queries are often favored by developers of software systems.
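As a non-limiting illustration of how query elements and Boolean operators may be combined, the following sketch represents a query as a small expression tree and evaluates it against the keywords of each document. The classes Term, And, Or, and Not, and the functions matches and search, are hypothetical names introduced only for this example, and whitespace tokenization is an assumption rather than a feature of any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    keyword: str  # a query element, here a single keyword


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Or:
    left: object
    right: object


@dataclass(frozen=True)
class Not:
    operand: object


def matches(query, tokens):
    """Evaluate a query tree against the set of tokens in one document."""
    if isinstance(query, Term):
        return query.keyword.lower() in tokens
    if isinstance(query, And):
        return matches(query.left, tokens) and matches(query.right, tokens)
    if isinstance(query, Or):
        return matches(query.left, tokens) or matches(query.right, tokens)
    if isinstance(query, Not):
        return not matches(query.operand, tokens)
    raise TypeError(f"unsupported query node: {query!r}")


def search(query, corpus):
    """Return ids of documents in `corpus` (id -> text) satisfying the query."""
    return [
        doc_id
        for doc_id, text in corpus.items()
        if matches(query, set(text.lower().split()))
    ]
```

For instance, under these assumptions the query And(Term("pasta"), Not(Term("cold"))) would select documents that mention "pasta" but not "cold". Because operators nest arbitrarily, relatively complex conditions can be expressed with only these four node types.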
Many users, however, struggle to properly formulate Boolean queries. Non-technical users are often not trained in formal logic and find Boolean queries to be nonintuitive and frustrating. Compounding this problem, typical use cases for Boolean queries involve iterative query formulation, by which a user submits a query, reviews the results, and then refines their query, in an iterative process until they reach the search results that they desire. Thus, to use this powerful technique, the user formulates multiple queries, adjusting the queries at the margin, a process that can be particularly nonintuitive for less sophisticated users.