Generally speaking, web pages on the Internet can be classified into two categories, static and dynamic. A static web page is typically a document that has been generated in advance, is managed by a file system, and is accessible to a web server, e.g., an HTML file. The content of a static web page is usually associated with a unique document identifier, e.g., a Uniform Resource Locator (URL).
In contrast, a dynamic web page is typically a document generated by a web server on demand, in response to a particular set of parameters that a user specifies in a document fetching request. An important feature distinguishing a dynamic web page from a static web page is that the content of the dynamic web page may no longer be associated with a unique document identifier. Instead, a dynamic web page may be referenced by multiple document identifiers at the same time. A search engine that does not take this feature into account may waste a significant amount of resources, such as network bandwidth, storage space and processing time, by having web crawlers fetch many duplicate copies of dynamically generated web pages that share the same content.
Therefore, there is a need for a system that automatically identifies and manages document identifiers that reference the same content, and thereby reduces the waste of resources on both the search engine side and the web server side.
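By way of illustration only, the problem of multiple document identifiers referencing the same content can be sketched as follows. This is a minimal sketch, not the system described herein: it assumes that duplicate content is byte-for-byte identical and detects it with a SHA-256 fingerprint, and the function names, the example URLs, and the `session` parameter are all hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def content_fingerprint(body: bytes) -> str:
    """Return a fingerprint of the page content (assumption: exact-match hashing)."""
    return hashlib.sha256(body).hexdigest()


def group_by_content(pages: Dict[str, bytes]) -> List[List[str]]:
    """Group document identifiers (URLs) whose fetched content is identical.

    `pages` maps each URL to the content fetched for it. Only groups with
    more than one URL are returned, i.e. the sets of duplicate identifiers.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[content_fingerprint(body)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical example: two URLs differ only in parameter order and in a
# session parameter, yet the server generates the same page for both.
pages = {
    "http://example.com/item?id=7&session=abc": b"<html>Item 7</html>",
    "http://example.com/item?session=xyz&id=7": b"<html>Item 7</html>",
    "http://example.com/item?id=8&session=abc": b"<html>Item 8</html>",
}

duplicates = group_by_content(pages)
for urls in duplicates:
    print("Same content:", urls)
```

In this sketch the two URLs for item 7 fall into one group, so a crawler aware of the grouping would need to fetch and store only one of them. Note that the sketch detects duplicates only after every page has been fetched; avoiding the redundant fetches in the first place is the harder problem that motivates the need stated above.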