The problem of naming, identifying, and accessing material is not new in either the analog or the digital realm. In the analog world, International Standard Book Numbers (ISBNs) assign unique names to books, Universal Product Codes (UPCs) uniquely identify products, and passport numbers identify individual people. In the digital world, one of the most common methods for addressing information is the Uniform Resource Locator (URL). URLs provide a well-defined syntax for addressing resources across an extensible range of protocols and name spaces. URLs are not confined to the digital world; they also appear regularly in the analog world in newspapers, on television, and on billboards.
While URLs are widespread, knowledge of their basic characteristics is limited. Open questions include the average length in bytes of a typical URL, the lengths of the shortest and longest URLs, and how compressible URLs are. Knowledge of these basic characteristics may lead to better services that make intensive use of resource names.
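As a rough illustration of how such questions could be answered for a given collection, the following Python sketch computes the mean, minimum, and maximum URL length in bytes and a simple compression ratio. The function name `url_stats`, the sample URLs, and the choice of zlib (DEFLATE) as the compressor are assumptions made for this example; they are not drawn from any particular data set or study.

```python
import zlib
from statistics import mean


def url_stats(urls):
    """Return basic size statistics for a non-empty list of URL strings."""
    # Measure each URL in bytes (UTF-8), not in characters.
    sizes = [len(u.encode("utf-8")) for u in urls]
    # Compress the whole collection at once, one URL per line, so that
    # prefixes shared between URLs can be exploited by the compressor.
    raw = "\n".join(urls).encode("utf-8")
    packed = zlib.compress(raw, 9)
    return {
        "count": len(urls),
        "mean_bytes": mean(sizes),
        "min_bytes": min(sizes),
        "max_bytes": max(sizes),
        # Compressed size divided by original size; smaller is better.
        "compression_ratio": len(packed) / len(raw),
    }


# Hypothetical sample; a real study would use a large crawl.
sample = [
    "http://example.com/",
    "http://example.com/docs/index.html",
    "http://example.org/docs/v2/2001/report.pdf",
]
stats = url_stats(sample)
```

On a collection this small the compression ratio is dominated by format overhead, so the figure only becomes meaningful for large collections.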
URLs are among the major contributions to the initial development of the World Wide Web (WWW). URLs provided the syntax to glue together the numerous disparate Internet protocols by breaking named resources into protocol, host, and path components. In this manner, different resources within the name space of a host may be named, different hosts identified, different transport protocols addressed, and new transport protocols added when developed. URLs often contain semantic information including the hierarchical nature of resources, descriptive names, version numbers, and temporal information.
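The decomposition into protocol, host, and path can be sketched with the `urlsplit` function from Python's standard library. The helper name `split_url` and the example URL are hypothetical and chosen only to show hierarchical, descriptive, version, and temporal information in the path.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Break a URL into its protocol, host, and path segments."""
    parts = urlsplit(url)
    # Drop empty strings produced by the leading slash in the path.
    segments = [s for s in parts.path.split("/") if s]
    return parts.scheme, parts.hostname, segments


# Hypothetical URL whose path carries a hierarchy, a version, and a year.
protocol, host, segments = split_url(
    "http://www.example.com/docs/v2/2001/report.html"
)
```

Here `protocol` names the transport protocol, `host` identifies the machine, and `segments` lists the levels of the resource hierarchy within that host's name space.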
It is advantageous to store collections of documents such as web pages in order to provide quick access to locations on the WWW. URL length, or more precisely the distribution of lengths, measured in characters, across all URLs, is an important consideration for any such storage scheme. As document collections grow, the problems of managing them efficiently become increasingly complex. Even a task as conceptually simple as determining the location of a file on disk must balance limited main memory against processing efficiency. Addressing this problem requires a way to map large numbers of URLs to physical locations that allows quick searches without requiring excessive storage space.
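One common way to approach such a mapping, shown here only as a minimal sketch and not as the scheme proposed in this work, is to hash each URL to a fixed-size digest and derive the file location from the digest. The location can then be recomputed on demand, so no URL-to-path table needs to be held in main memory. The function name `url_to_location`, the root directory `/cache`, the two-level directory fan-out, and the use of SHA-1 are all assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def url_to_location(url, root="/cache", fanout=2):
    """Map a URL of any length to a fixed-length file path under root."""
    # The digest is always 40 hex characters, whatever the URL length.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # Two directory levels taken from the digest spread files evenly,
    # keeping each directory small enough to search quickly.
    level1 = digest[:fanout]
    level2 = digest[fanout:2 * fanout]
    return PurePosixPath(root, level1, level2, digest)


# Hypothetical URLs used only to exercise the mapping.
loc_a = url_to_location("http://example.com/docs/index.html")
loc_b = url_to_location("http://example.org/docs/v2/2001/report.pdf")
```

The trade-off in this sketch is that the original URL cannot be recovered from the path, so a scheme that must also enumerate or compress the stored URLs would need an additional structure.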