For many organizations and institutions, it is common to use a clipping service to monitor topics of interest in conventional print media. For example, companies often employ a clipping service to monitor what the print media is publishing about them or their products.
More recently, clipping services have started to monitor electronic media as well. In a simple semi-automated monitoring system, queries that define what is to be monitored are periodically submitted to one or more Web search engines. In order to get a good "recall," the queries may be constructed to retrieve as many relevant pages as possible.
One widely used electronic publishing medium is the Internet's World-Wide-Web (the "Web"). A service called eWatch offers to monitor documents on the Web; see "http://www.ewatch.com." The eWatch service claims to monitor some 40,000 public bulletin boards and preselected Web sites for some four hundred of the world's largest corporations. There, a key first step is to identify which sites are relevant to a particular client. Because Web pages at the selected sites are retrieved on a daily basis to check whether anything has changed, this can become quite expensive when the number of monitored sites is large.
Dartmouth College offers a Web clipping service called the Informant at "http://informant.dartmouth.edu/." This free service monitors only the top ten relevant pages for a particular query, plus any Web pages at a preselected set of Uniform Resource Locators (URLs), up to a maximum of 35 pages per user. The service computes a hash value for each page currently being monitored and compares it with the hash value of a previous version of the page. If the hash values are different, the content of the Web page has probably changed. The service is limited in the number of pages it monitors, and even a trivial change to a Web page changes the hash value, so that the Web page is flagged as "interesting."
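The hash comparison described above can be illustrated with a minimal sketch. The hash function actually used by the Informant is not stated here; SHA-256 and the function names `page_fingerprint` and `has_changed` are assumptions chosen only for illustration.

```python
import hashlib
from typing import Optional


def page_fingerprint(content: bytes) -> str:
    """Return a hex digest of the raw page bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_hash: Optional[str], content: bytes) -> bool:
    """Report whether the page differs from the previously stored version.

    A page with no stored hash is treated as changed (an assumed policy).
    """
    if previous_hash is None:
        return True
    return page_fingerprint(content) != previous_hash


# Hypothetical page contents; the second differs by a single space.
old = b"<html><body>Price: $10</body></html>"
new_trivial = b"<html><body>Price: $10 </body></html>"

stored = page_fingerprint(old)
print(has_changed(stored, old))          # False: identical bytes, identical hash
print(has_changed(stored, new_trivial))  # True: one extra space changes the hash
```

The second call shows the limitation noted above: a hash comparison can detect that a page differs, but it cannot distinguish a trivial edit from a substantial one.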
In general, monitoring preselected sites is relatively easy; however, monitoring the entire Web, or even a large portion of it, is a much more difficult problem. The number of Web sites is easily counted in the millions, and a large proportion of those sites have pages that change frequently. Active Web "publishers" may change pages on a daily basis, in many cases trivially.
Therefore, the output from the search engine can be quite large. Because humans will eventually have to read and analyze the output, it is desirable to mechanically filter it as much as possible. In particular, it is necessary to eliminate pages that have not changed, or have not substantially changed, since the last retrieval.
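One way to make "substantially changed" concrete is to score how much of a page's text differs between two retrievals and discard pages that fall below a threshold. The following is only a sketch under stated assumptions, not a method described above: the word-level comparison using Python's standard `difflib` module, the function names, and the threshold of 0.2 are all illustrative choices.

```python
import difflib


def change_score(old_text: str, new_text: str) -> float:
    """Dissimilarity between two versions: 0.0 for identical word sequences.

    Splitting on whitespace means spacing-only edits score 0.0.
    """
    matcher = difflib.SequenceMatcher(None, old_text.split(), new_text.split())
    return 1.0 - matcher.ratio()


def substantially_changed(old_text: str, new_text: str,
                          threshold: float = 0.2) -> bool:
    """Keep a page only if its score reaches the (assumed) threshold."""
    return change_score(old_text, new_text) >= threshold


# Hypothetical page texts from two successive retrievals.
old = "Acme announces new widget line for spring"
spacing_only = "Acme  announces new widget line for spring "
rewritten = "Acme recalls every widget after safety complaints from customers"

print(substantially_changed(old, spacing_only))  # False: same words
print(substantially_changed(old, rewritten))     # True: most words differ
```

In such a scheme the threshold trades recall against reader workload: a lower value passes more pages to the human reader, and a higher value suppresses more minor edits.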