Many search engine services allow users to search for information in various data sources. These data sources may be accessible via various communications links such as intranets and the Internet. Web-based search engine services allow users to search for display pages, such as web pages, that may be of interest to them. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, a search engine service may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The search engine service can identify the keywords of any particular web page using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service then creates an index that maps keywords to web pages.
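The crawl-and-index process described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the "web" is a hypothetical in-memory dictionary (`PAGES`), the URLs are invented, and keyword extraction is reduced to lowercasing and splitting on whitespace rather than inspecting headlines, metadata, or highlighted words.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# All URLs and texts are made up for illustration.
PAGES = {
    "root.example/a": ("Operating system semaphores", ["root.example/b"]),
    "root.example/b": ("Books on operating systems",
                       ["root.example/a", "root.example/c"]),
    "root.example/c": ("Semaphores lecture notes", []),
    "unlinked.example/x": ("Unreachable page", []),
}


def crawl(roots, pages):
    """Return the URLs reachable from the root pages, in breadth-first order."""
    visited = set()
    order = []
    queue = deque(roots)
    while queue:
        url = queue.popleft()
        if url in visited or url not in pages:
            continue  # skip pages already seen and dangling links
        visited.add(url)
        order.append(url)
        queue.extend(pages[url][1])  # follow the page's outgoing links
    return order


def extract_keywords(text):
    """Simplified keyword extraction: every lowercased word is a keyword."""
    return {word.lower() for word in text.split()}


def build_index(roots, pages):
    """Build a mapping of keyword -> set of URLs containing that keyword."""
    index = defaultdict(set)
    for url in crawl(roots, pages):
        for keyword in extract_keywords(pages[url][0]):
            index[keyword].add(url)
    return index


index = build_index(["root.example/a"], PAGES)
```

In this sketch, `index["semaphores"]` holds the two reachable pages that mention semaphores, while the page at `unlinked.example/x` is never indexed because no root page links to it.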
Although search engine services enable rapid discovery of general information regarding a topic of interest, they are typically not well suited for in-depth analysis of that topic. When a person wants to explore a topic of interest, that person submits a query containing terms describing the topic. The search engine service uses its index to identify web pages that contain those terms and that may relate to the topic of interest. The search engine service returns hyperlinks to the web pages along with a short description of each web page. Unfortunately, the query result typically includes web pages that are not of interest to the person, and the result may be ordered so that the web pages of interest do not even appear on the first few pages of the query result. For example, a person who is interested in understanding “semaphores” may submit the query “operating system semaphores” to a search engine service. Although the query result will likely contain many web pages that relate to operating system semaphores, it will also include web pages of universities that list semaphores as a topic in an operating system course, web pages offering to sell books on operating systems, web pages of companies that sell operating systems that use semaphores, web pages of authors who have written papers on semaphores, and so on. It can be difficult for a person to search through the pages of a query result to identify a web page of interest.
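The limitation described above can be illustrated with a small sketch of term-based lookup. The index contents and URLs below are hypothetical, and the matching rule (a page must contain every query term) is an assumption chosen for simplicity; real search engine services use more elaborate matching and ranking.

```python
# Hypothetical keyword index: keyword -> set of URLs containing it.
INDEX = {
    "operating": {
        "univ.example/course",    # course syllabus listing semaphores
        "shop.example/books",     # bookseller page
        "vendor.example/os",      # operating system vendor page
        "docs.example/tutorial",  # in-depth semaphore tutorial
    },
    "system": {
        "univ.example/course",
        "vendor.example/os",
        "docs.example/tutorial",
    },
    "semaphores": {
        "univ.example/course",
        "vendor.example/os",
        "docs.example/tutorial",
        "author.example/paper",
    },
}


def search(query, index):
    """Return the sorted URLs of pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())  # keep pages containing this term too
    return sorted(matches)


results = search("operating system semaphores", INDEX)
```

Here `results` contains the course syllabus and the vendor page alongside the tutorial. All three contain the query terms, so term matching alone cannot tell which page actually explains semaphores in depth.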
To make it easier to research a topic of interest, some organizations have collected, organized, and indexed documents for specific domains. These organizations, for example, may collect documents, such as web pages, journal publications, dissertations, and technical reports, to form a corpus of documents for a specific domain. The organizations may use manual techniques to identify and classify documents that should be included in a domain-specific corpus or may attempt to use automated techniques. A person interested in researching a particular topic selects a corpus for a domain related to the topic and then performs queries on that corpus. The usefulness of such a corpus depends in large part on how comprehensively the corpus covers the topics within the domain. For example, a corpus on operating systems that does not include at least one document relating to semaphores would likely not be useful to a person wanting to study semaphores.