1. Field of the Invention
This invention relates to the field of Internet web site searching tools, and particularly to dynamically generating web search engine sitemap files.
2. Description of Background
Before our invention, in order for a search engine to index the web pages of an Internet retailer, a web spider would have to crawl through an entire website, indexing each web page that it discovered along the way. As a solution to such resource-intensive crawling operations, the Sitemap protocol has been developed. The Sitemap protocol allows a webmaster for an Internet retailer to create a sitemap XML file that contains a list of URLs for the retailer's website. In practice, an Internet merchant can place the XML file on a server and thereafter submit the location of the XML file to a search engine. After being notified of the XML file, any web spider implemented by a search engine and supporting the Sitemap protocol can read the retailer's XML file and index all the URLs that are identified in the XML file.
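By way of illustration only, the sitemap XML file described above can be sketched as follows. The sketch assumes the sitemaps.org 0.9 namespace; the function name build_sitemap and the example.com URLs are hypothetical and are not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org, version 0.9).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL.

    Hypothetical helper for illustration only.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical product page URLs for an Internet retailer.
xml_text = build_sitemap([
    "https://www.example.com/product/1",
    "https://www.example.com/product/2",
])
print(xml_text)
```

The resulting file would be placed on the retailer's server, and its location would then be submitted to the search engine.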
Currently, Google™ provides a sitemap generator that generates a Sitemap XML file based on a list of provided URLs, the directory paths of a web server, and the access logs of a web server. However, the tool only converts the URL list into the XML format that conforms to the Sitemap XML schema. To generate a sitemap file from a list of provided URLs, site developers still need to list all the URLs that they want the search engines to index, which is extremely time-consuming and error-prone. It also becomes almost impossible to list the URLs of a site that has thousands or millions of pages to be indexed. Having a large number of pages to index is common for Internet retailers who sell thousands or millions of products.
Further, the tool checks the HTML files in each directory path and creates a URL for each corresponding HTML file. However, this approach does not apply to pages that are dynamically generated by an application server, and unfortunately, it is very common for Internet retailers to use application servers for the dynamic generation of web pages and to handle transactions. The existing tool can also generate a sitemap file based upon the access logs of a web server. The drawback of this approach is that there is no guarantee that all of a website's URLs have been selected (clicked) by users and will therefore be available in the access logs. Nor can the tool ensure that the generated sitemap contains only the pages that the retailers want to be indexed by the search engines. For example, shopping cart checkout pages typically should not be indexed. Furthermore, the tool cannot provide additional sitemap information such as the last modification, priority, and anticipated change frequency of the file.
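As a minimal sketch of what the existing tool is said to lack, the following shows sitemap entries that carry the optional lastmod, changefreq, and priority elements of the Sitemap protocol while omitting pages that should not be indexed. The page records, the excluded path prefixes, and the function name build_filtered_sitemap are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical path prefixes that a retailer would not want indexed.
EXCLUDED_PREFIXES = ("/cart", "/checkout")


def build_filtered_sitemap(base_url, pages):
    """Build a sitemap with optional metadata, skipping excluded pages.

    Each page is a dict with 'path', 'lastmod', 'changefreq' and
    'priority' keys (an assumed record layout, for illustration).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if page["path"].startswith(EXCLUDED_PREFIXES):
            continue  # e.g. shopping cart checkout pages
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = base_url + page["path"]
        ET.SubElement(entry, "lastmod").text = page["lastmod"]
        ET.SubElement(entry, "changefreq").text = page["changefreq"]
        ET.SubElement(entry, "priority").text = page["priority"]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page records.
pages = [
    {"path": "/product/1", "lastmod": "2007-01-15",
     "changefreq": "weekly", "priority": "0.8"},
    {"path": "/checkout/payment", "lastmod": "2007-01-10",
     "changefreq": "never", "priority": "0.1"},
]
filtered_xml = build_filtered_sitemap("https://www.example.com", pages)
print(filtered_xml)
```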
Currently, there exists another tool that is configured to crawl through a website in order to generate a sitemap XML file. However, the tool is very hard to control, thus making it difficult to ensure that a generated sitemap only contains the pages that a retailer wants to be indexed by a search engine. Similarly, the tool is not able to provide additional sitemap information such as the last modification, priority, and change frequency of a file. Additionally, large amounts of CPU resources are required to crawl through the entire site, especially in the case where there are millions of products and multiple stores are hosted by a server. The internal web spiders have no knowledge of when pages are created or updated and must always spend the CPU resources to crawl the entire site. All of these are serious drawbacks for Internet retailers, especially for those who have thousands or millions of products that they need to maintain.
Because of the drawbacks described above, there exists a need for a framework to dynamically generate Search Engine Sitemap XML files for Internet retailers that use application servers to maintain their products and website pages.