Queries run against a full open-domain corpus can return poor results, especially for terms whose meanings are highly context-dependent. For instance, if a user is interested in information about the Association for Computational Linguistics and submits “ACL” as a query term, the user is likely to be overwhelmed by results about sports injuries and the anterior cruciate ligament (colloquially referred to as the “ACL”), none of which relates to the user's original interest.
If the corpus contains information about the functional domain to which each document belongs, then one approach to improving search results is to facet the search, that is, to limit the search to a specific subset of the open-domain corpus. In the example given above, faceting could include excluding documents from the medical domain. Manually categorizing documents by domain, however, can be prohibitively expensive and resource-intensive, especially when dealing with extremely large corpora (e.g., 10 million or more documents). Further, any change in the number or granularity of domains could require re-categorizing the documents of the corpus, leading to further expense.
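By way of illustration only, faceting by domain might be sketched as follows. The sketch assumes that each document already carries a domain label. The `Document` type, the `faceted_search` function, the domain names, and the sample texts are all hypothetical and are not drawn from any particular system. Naive substring matching stands in for a real retrieval engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    domain: str  # hypothetical, pre-assigned domain label
    text: str


def faceted_search(corpus, term, excluded_domains=frozenset()):
    """Return documents whose text contains `term`, skipping excluded domains.

    Matching is a case-insensitive substring test, used here only to keep
    the sketch short; it is not a realistic ranking or retrieval method.
    """
    needle = term.lower()
    return [
        doc
        for doc in corpus
        if doc.domain not in excluded_domains and needle in doc.text.lower()
    ]


# Hypothetical three-document corpus for the "ACL" example.
corpus = [
    Document("d1", "medical", "Recovery time after an ACL tear varies by athlete."),
    Document("d2", "computational_linguistics", "The ACL annual meeting accepts papers on parsing."),
    Document("d3", "medical", "ACL reconstruction is a common knee surgery."),
]

# Unfaceted query: every document mentioning the term is returned.
all_hits = faceted_search(corpus, "ACL")

# Faceted query: documents labeled with the medical domain are excluded.
filtered_hits = faceted_search(corpus, "ACL", excluded_domains={"medical"})
```

In this sketch the filter is only as good as the domain labels, which is why the cost of assigning and maintaining those labels matters.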