Many Web sites organize the information in one or more Web documents into categories, where some categories may relate to lists of sub-categories. For example, classified-ads Web sites organize ads as a directory having a hierarchy of categories and sub-categories. In classified-ads Web sites, some categories group a set of one or more listings of ad postings for products, services, or other virtual content. Typically, the hierarchy of categories and the ad postings within the categories are contained in one or more listing pages. An ad posting listed in a listing page can be contained in its own posting page, in which case the listing page refers to the ad posting by a link to the posting page. A posting page may include an ad posting for a product or service that is referred to by the label of the corresponding link.
As another example, products-related Web sites organize products into categories and sub-categories. Product categories and sub-categories may be grouped as a hierarchy of categories, where some categories group a set of one or more listings of products. A product may be described in its own product page.
A crawling service can crawl and extract ad postings from the various classified-ads Web sites. The crawling service stores the ad postings, indexed by classification category and Web site. The index of ad postings can be searched. Ad postings in the index may expire, and may be deleted from the Web sites.
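As an illustration of such an index, the following is a minimal sketch in Python, assuming an in-memory store keyed by classification category and Web site. The names `AdPosting` and `PostingIndex`, and the fields of a posting, are hypothetical and are not taken from any particular crawling service.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AdPosting:
    # Hypothetical record for one extracted ad posting.
    url: str       # link to the posting page
    site: str      # source Web site
    category: str  # classification category
    label: str     # label of the link in the listing page


class PostingIndex:
    """Stores ad postings indexed by (category, site)."""

    def __init__(self):
        # (category, site) -> {posting URL: AdPosting}
        self._by_key = defaultdict(dict)

    def add(self, posting):
        key = (posting.category, posting.site)
        self._by_key[key][posting.url] = posting

    def remove(self, posting):
        key = (posting.category, posting.site)
        self._by_key[key].pop(posting.url, None)

    def search(self, category, site=None):
        # Return postings in a category, optionally limited to one site.
        results = []
        for (cat, s), postings in self._by_key.items():
            if cat == category and (site is None or s == site):
                results.extend(postings.values())
        return results
```

A search such as `index.search("cars")` returns every stored posting in that category, including any whose posting page has since been deleted from its Web site.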
A problem that can occur is that posting pages that have been deleted from their respective Web sites may continue to exist in the indexed ad postings. Consequently, posting pages that no longer exist in a source Web site may still be returned as results of a search in the index of ad postings.
One solution to this discrepancy, in which Web sites have deleted posting pages while the index still contains the corresponding ad postings, has been to re-crawl the Web sites in order to determine whether a posting page has been deleted. If re-crawling determines that a posting page has expired or has been deleted from a Web site, the ad posting is deleted from the index of ad postings. A problem with this solution is that extensive re-crawling consumes network bandwidth that could instead be used for end-user-oriented tasks. Furthermore, re-crawling can substantially slow the response time of the Web site being re-crawled.
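The re-crawl approach can be sketched as follows, under stated assumptions: the index is a plain dictionary from posting URL to posting record, `fetch_status` is a hypothetical caller-supplied function that returns an HTTP status code for a URL, and a status of 404 or 410 is taken to mean the posting page has been deleted. The text does not specify how deletion is detected, so that rule is illustrative only.

```python
def purge_deleted(index, fetch_status):
    """Re-check every indexed posting page and drop the deleted ones.

    index: dict mapping posting URL -> posting record (assumed layout).
    fetch_status: callable taking a URL and returning an HTTP status
        code (hypothetical; supplied by the caller).
    Returns the list of URLs removed from the index.
    """
    removed = []
    # Iterate over a copy of the keys, since entries are deleted in the loop.
    for url in list(index):
        status = fetch_status(url)  # one request per indexed posting
        if status in (404, 410):    # assumed signal that the page is gone
            del index[url]
            removed.append(url)
    return removed
```

Each pass issues one request per indexed posting, whether or not the page has changed, which is the bandwidth and response-time cost described above.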
In other words, re-crawling Web sites too infrequently can leave the index containing ad postings whose posting pages have been deleted from their respective Web sites. Re-crawling too frequently can overburden communications bandwidth and can degrade the response times of the re-crawled Web sites.