The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
As used herein, a “page” (or “web page”) refers to an online document. An online document may be any set of data including, but not limited to, an image, a Portable Document Format (PDF) document, a set of binary data, and a markup language document. Examples of markup languages include, but are not limited to, HyperText Markup Language (HTML), eXtensible Markup Language (XML), as well as a wide variety of markup languages that are derivatives of the Standard Generalized Markup Language (SGML).
According to current estimates, a significant percentage of the pages on the worldwide web include duplicate content. Correctly identifying pages with duplicate content is important for content search engines because, among other benefits, it can reduce the storage space required for storing content indexes and can improve the quality of search results returned to users.
In one approach, a content search engine uses a shingle-based mechanism for detecting duplicate web pages. As used herein, “shingle” refers to a compact data value that represents a fragment of a page. In this approach, the search engine computes a fingerprint of a given page by computing a collection of shingles, where each shingle in the collection is computed based on a particular fragment that is defined by a sliding window over the content of the given page. The search engine determines that two pages have duplicate content when the two pages have the same or substantially the same fingerprints.
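By way of illustration only, the shingle-based comparison described above may be sketched as follows. The sketch assumes a window measured in words, a truncated SHA-1 hash as the compact data value, set resemblance (intersection over union) as the measure of "substantially the same," and a threshold of 0.9; none of these choices is specified by the approach itself, and the function names are hypothetical.

```python
import hashlib


def shingles(text, window=4):
    """Return the fingerprint of `text` as a set of shingles.

    Each shingle is a truncated SHA-1 hash of one fragment, where a fragment
    is `window` consecutive words under a sliding window (an assumed choice).
    """
    words = text.split()
    # A text shorter than the window yields one fragment (none if it is empty).
    count = max(len(words) - window + 1, 1 if words else 0)
    fragments = (" ".join(words[i:i + window]) for i in range(count))
    return {hashlib.sha1(f.encode("utf-8")).hexdigest()[:16] for f in fragments}


def resemblance(fp_a, fp_b):
    """Fraction of shingles shared by two fingerprints (intersection over union)."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def is_duplicate(page_a, page_b, window=4, threshold=0.9):
    """Classify two pages as duplicates when their fingerprints nearly coincide."""
    return resemblance(shingles(page_a, window), shingles(page_b, window)) >= threshold
```

Under these assumptions, a page compared against itself is classified as a duplicate, while two pages that share no fragment have a resemblance of zero.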
A disadvantage of this duplicate detection mechanism is that, in the presence of site-level page templates, it produces false positives (i.e., classifying pages as having duplicate content when in fact the pages have different content) and false negatives (i.e., classifying pages as non-duplicates when in fact the pages have the same content). One reason for this disadvantage is that the shingles used to detect the pages with duplicate content may have been computed over page fragments that originate from the site-level template shared by the pages on a given site, rather than from the content of the pages themselves.
For example, two different web pages on the same site or host usually share the same site-level template, where the site-level template may be a set of HTML or other markup code that is common to, and determines the layout of, all pages on the particular site or host. When the shingles, which are used by a duplicate detection mechanism to determine whether two pages have duplicate content, originate from the same page portions defined by a site-level template, then the duplicate-detection mechanism would classify the two pages as having duplicate content even though the two pages may in fact have different content. Similarly, when the site-level templates for two different sites are different, the duplicate detection mechanism would classify two pages at the different sites as non-duplicates even though the two pages may in fact have the same content.
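The false-negative case may be illustrated with a small sketch. The two templates, the page text, and the helper names below are hypothetical, and the sketch compares raw three-word fragments rather than hashed shingles purely to keep it short.

```python
def word_shingles(text, window=3):
    """Set of `window`-word fragments produced by a sliding window over `text`."""
    words = text.split()
    return {" ".join(words[i:i + window]) for i in range(max(len(words) - window + 1, 0))}


def resemblance(a, b):
    """Intersection over union of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def render(header, content, footer):
    """Wrap page content in a (hypothetical) site-level template."""
    return " ".join([header, content, footer])


# The same content is published on two sites that use different templates.
content = "recipe for slow cooked tomato soup with fresh basil"
page_on_site_a = render(
    "alpha news home world politics business sports weather",
    content,
    "copyright alpha news all rights reserved",
)
page_on_site_b = render(
    "beta kitchen recipes videos shop newsletter sign in",
    content,
    "beta kitchen terms privacy contact us",
)

content_only = resemblance(word_shingles(content), word_shingles(content))
whole_page = resemblance(word_shingles(page_on_site_a), word_shingles(page_on_site_b))
```

In this sketch the content portions are identical, yet the whole-page resemblance falls well below the content-only resemblance because most shingles come from the differing templates.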
To illustrate, consider FIG. 1, which is a block diagram that illustrates an example layout of a web page. (Different sites or hosts may store web pages that have layouts that are different from the page layout illustrated in FIG. 1; for example, different layouts may include a wide variety of different portions in different page positions. It is noted that the techniques described herein are not limited to detecting duplicate pages having any particular layout defined by any particular site-level template, and for this reason the page layout depicted in FIG. 1 is to be regarded in an illustrative rather than a restrictive sense.) In FIG. 1, a site-level template may be used to define the common portions of a typical page 100 stored on the site. The common page portions may comprise one or more advertisement portions 102A-B, a navigation portion 104, and a contact/copyright portion 106. Each of the one or more advertisement portions 102A-B may be used on each page of the site to display certain ads. The navigation portion 104 is also common to each page on the site and is used to display buttons and links which a user may use to navigate through the site. The contact/copyright portion 106 is also common to each page on the site and is used to display the same copyright and/or contact information. Content portion 108 is used to display the content of each page on the site; hence, the content portion 108 would likely be different for the different pages on the site.
Suppose now that a shingle-based duplicate detection mechanism computes fingerprints for two pages on the site, where the shingles in each fingerprint are computed over fragments from the advertisement portions 102A-B, the navigation portion 104, and the contact/copyright portion 106 (which are common to both pages). The duplicate detection mechanism would compare the fingerprints of the two pages, would find the shingles therein to be the same, and would classify the two pages as having duplicate content even though the content portions 108 of the two pages may be different.
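The scenario above may be sketched as follows. The text placed in each template portion, the two content portions, and the helper names are hypothetical, and raw three-word fragments stand in for hashed shingles to keep the sketch short.

```python
def word_shingles(text, window=3):
    """Set of `window`-word fragments produced by a sliding window over `text`."""
    words = text.split()
    return {" ".join(words[i:i + window]) for i in range(max(len(words) - window + 1, 0))}


def resemblance(a, b):
    """Intersection over union of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Hypothetical text for the common portions of page 100 in FIG. 1.
ADS = "buy now great deals on shoes and hats free shipping today only"      # 102A-B
NAVIGATION = "home products services support blog careers about us contact"  # 104
CONTACT_COPYRIGHT = "copyright example site all rights reserved contact us by email"  # 106


def render(content):
    """Place a content portion 108 inside the shared site-level template."""
    return " ".join([ADS, NAVIGATION, content, CONTACT_COPYRIGHT])


content_a = "review of the new trail running shoe"
content_b = "recipe for slow cooked tomato soup"

# Shingles drawn only from the common portions are identical for both pages.
template_only = resemblance(
    word_shingles(" ".join([ADS, NAVIGATION, CONTACT_COPYRIGHT])),
    word_shingles(" ".join([ADS, NAVIGATION, CONTACT_COPYRIGHT])),
)
# Shingles drawn only from the content portions share nothing.
content_only = resemblance(word_shingles(content_a), word_shingles(content_b))
# Shingles drawn from the whole page are dominated by the template.
whole_page = resemblance(word_shingles(render(content_a)), word_shingles(render(content_b)))
```

Here the content portions share no shingle at all, yet fingerprints restricted to the common portions coincide exactly, and even whole-page fingerprints overlap in a majority of their shingles.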
Based on the foregoing, there is a clear need for techniques that improve the accuracy of duplicate page detection and that overcome the disadvantages of the shingle-based duplicate detection mechanism described above.