This specification relates to indexing items available on a network.
Network search engines create an index of items available on the network. Items can include documents, images, audio files, video files, and generally any format that can be transmitted digitally over a network. To catalog the items on the network, a network crawling application (crawler), also sometimes referred to as a robot, attempts to access as many items as can be found on the network. For example, the crawling application can begin by accessing a first page, indexing that page, in some cases saving a cached version of the page, and then proceeding on to other pages or items linked to by the first page. This can continue, iteratively, until all links have been followed.
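By way of illustration only, the link-following process described above can be sketched as a breadth-first traversal. The sketch is not a definitive crawler implementation. The names `crawl`, `fetch`, `PAGES`, and `fetch_from_toy_network` are hypothetical, and the in-memory dictionary stands in for a network so that the example runs without network access.

```python
from collections import deque


def crawl(start_url, fetch):
    """Follow links breadth-first, indexing each reachable item once.

    `fetch` is a hypothetical callable that returns (content, linked_urls)
    for a URL; a real crawler would issue network requests at this point.
    """
    index = {}                    # url -> content; stands in for the search index
    queue = deque([start_url])    # items discovered but not yet accessed
    seen = {start_url}            # guards against revisiting pages in link cycles
    while queue:
        url = queue.popleft()
        content, links = fetch(url)
        index[url] = content      # "index" the item (here: simply store it)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# Toy network (assumed data): each entry maps a URL to (content, outgoing links).
PAGES = {
    "/home": ("Welcome", ["/about", "/videos"]),
    "/about": ("About us", ["/home"]),
    "/videos": ("Video list", []),
    "/unlinked": ("No page links here", []),
}


def fetch_from_toy_network(url):
    return PAGES[url]


crawled = crawl("/home", fetch_from_toy_network)
```

In this sketch the traversal ends when the queue is empty, that is, when every discovered link has been followed. An item that no page links to, such as `/unlinked` here, is never accessed.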
For a public network such as the Internet, website publishers often wish to have their website content included in a search engine's index so that potential visitors interested in the content can locate the website. A publisher can register with one or more search engines to request that their website be included in the index. Registering with a search engine identifies the location of a website to the search engine so that the crawling application can access the site and place it in the search engine index.
A crawling application, however, might not find all of the content that is available at a given website, or might not extract that content with all of the detail that the website's publisher would like included. For example, some website content might not be easily reached by a crawling application because that content is stored in a database application instead of being stored in a file that is linked to by a URL. In other cases, the content can be found by the crawling application, but details such as a title of a video or a caption of an image might not be extracted and associated with the item as desired by the publisher.
To assist a search engine in better indexing a website, a publisher can create a sitemap representing the website. A sitemap can include links to documents and/or other items that the publisher would like to have included in the search engine index. The sitemap can provide details regarding the network items that might not be extracted by a crawling application on its own. A number of established Internet search engine providers have agreed on standard sitemap formats so that publishers can create a single sitemap that is useful to multiple search engines.
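As a minimal sketch of how a publisher might generate such a sitemap, the example below writes an XML document using Python's standard library. It assumes the XML namespace of the publicly documented sitemaps.org protocol and, for the video title, the namespace of a publicly documented video extension. The function name `build_sitemap`, the `video_title` key, and the example URLs are hypothetical. A sitemap accepted by an actual search engine would likely require additional fields that are omitted here.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed from publicly documented sitemap formats.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"


def build_sitemap(entries):
    """Return sitemap XML (as a string) for a list of entry dictionaries.

    Each entry needs a "loc" (the item's URL) and may carry a hypothetical
    "video_title" key holding a detail a crawler might not extract on its own.
    """
    ET.register_namespace("", SITEMAP_NS)       # serialize core tags unprefixed
    ET.register_namespace("video", VIDEO_NS)    # serialize extension tags as video:*
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = entry["loc"]
        title = entry.get("video_title")
        if title:
            video = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
            ET.SubElement(video, f"{{{VIDEO_NS}}}title").text = title
    return ET.tostring(urlset, encoding="unicode")


# Example entries (assumed data, not taken from any real website).
sitemap_xml = build_sitemap([
    {"loc": "https://www.example.com/catalog?item=12"},
    {"loc": "https://www.example.com/videos/tour",
     "video_title": "Factory tour"},
])
```

The first entry illustrates an item that is served from a database query and might not be reachable by following links. The second attaches a title to a video item so that the detail is supplied by the publisher rather than inferred by the crawler.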