A website is a collection of webpages, images, videos or other digital content items that are hosted on one or more web servers and are usually accessible via the Internet. A given webpage of a website is a document, typically written in HTML and accessible via HTTP, a protocol for transferring information from a web server for display in the web browser of a user. The pages of a website can usually be reached from a common root URL, called the homepage, and usually reside on the same physical server.
A given website may include several webpages. While a given webpage of a website may contain portions of unique content, often one or more pages of a given website contain one or more common elements, such as templates, which may be identical or nearly identical and thus contain duplicative content. For instance, the webpages of a website directed towards the sale of electronics may contain a template that has various links to different areas of the website, such as the links “notebook computers,” “televisions,” “desktop computers,” “wireless routers,” and “cellular phones,” which, when selected by a user of a client device, direct the user to the appropriate page of the website. When a search provider utilizes a crawler to index pages that may be used by a search engine to generate a search result set, however, multiple pages of a website containing the same content, such as a template appearing on the one or more pages of the website, may be indexed by the crawler. A search engine may thereafter search the indexed pages to identify pages responsive to a given query. If the content responsive to a given query appears in the template portion of the one or more pages of a given website that have been indexed by the crawler, the search engine may retrieve every page of the website on which the template appears, resulting in the retrieval of duplicative content.
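By way of illustration only, the effect described above can be sketched in a few lines of Python. The page URLs, the page text, and the functions `build_index` and `search` are hypothetical and are not drawn from any particular crawler or search engine; the sketch merely assumes a simple word-level index built over the full text of each page, template included.

```python
from collections import defaultdict

# Hypothetical template text shared by every page of the example site.
TEMPLATE_LINKS = (
    "notebook computers televisions desktop computers "
    "wireless routers cellular phones"
)

# Hypothetical pages: each is the shared template plus some unique content.
PAGES = {
    "/notebooks": TEMPLATE_LINKS + " lightweight laptop with long battery life",
    "/televisions": TEMPLATE_LINKS + " widescreen display with wall mount",
    "/routers": TEMPLATE_LINKS + " dual band access point",
}


def build_index(pages):
    """Map each word to the set of page URLs whose full text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages containing every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


index = build_index(PAGES)

# A term that appears in the template matches every page that carries it.
template_hits = search(index, "televisions")

# A term that appears only in unique content matches a single page.
unique_hits = search(index, "widescreen")

print(template_hits)
print(unique_hits)
```

In this sketch the query “televisions” is satisfied by all three pages solely because the term appears in the shared template, whereas the query “widescreen” is satisfied only by the page whose unique content contains it.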
Retrieving and downloading multiple pages whose duplicative content appears in templates wastes bandwidth, storage and CPU cycles for the search provider, and further yields less accurate search results, as users are presented with multiple pages of a website that may contain identical content. Accordingly, there exists a need for systems, methods and computer program products for detecting templates within the pages of a website.