1. Field
The present invention generally relates to usage of Uniform Resource Locators (URLs) and more particularly to machine interpretation, categorization, and usage of URLs for a variety of purposes.
2. Description of Related Art
The Internet is a worldwide system of computer networks and is a public, self-sustaining facility that is accessible to tens of millions of people worldwide. The most widely used part of the Internet is the World Wide Web (WWW). The web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) can be used to specify the contents and format of a hypermedia document (e.g., a web page).
In this context, an HTML file is a file that contains the source code interpretable by a web browser for rendering a particular web page. Unless specifically stated, an electronic or web document may refer to either the source code for a particular web page or the web page itself. Each page can contain embedded references to images, audio, video, scripts, Flash objects, and other kinds of objects, or other web documents.
Search engines index a large number of web pages and provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. These search terms are often referred to as “keywords”. Although there are many popular Internet search engines, they are generally constructed using the same three common parts. First, each search engine has at least one, but typically more, “web crawler” (also referred to as “crawler”, “spider”, “robot”) that “crawls” across the Internet in a methodical and automated manner to locate web documents around the world.
Upon locating a document, the crawler can store the document's URL and follow any hyperlinks associated with the document to locate other web documents. Second, each search engine contains information extraction and indexing mechanisms that extract and index certain information about the documents that were located by the crawler. In general, index information is generated based on the contents of the HTML file associated with the document. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Third, each search engine provides a search tool that allows users, through a user interface, to search the databases in order to locate specific documents that contain information of interest to them, along with each document's location on the web (e.g., its Uniform Resource Locator, or URL).
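The three parts described above can be sketched in miniature as follows. This is an illustrative sketch only, not an actual search-engine implementation; the link graph, page texts, and URLs are hypothetical.

```python
# Minimal sketch of the three common search-engine parts: a crawler that
# follows hyperlinks, an indexing mechanism that builds an inverted index
# mapping keywords to URLs, and a search tool that queries that index.
# The "web" below is a hypothetical in-memory stand-in.

from collections import defaultdict

# Hypothetical web: each URL maps to (page text, outgoing hyperlinks).
WEB = {
    "http://a.example/": ("finance news today", ["http://b.example/"]),
    "http://b.example/": ("nasdaq ticker news", ["http://a.example/"]),
}

def crawl(seed):
    """Part 1: locate documents by following hyperlinks from a seed URL."""
    seen, frontier = set(), [seed]
    while frontier:
        url = frontier.pop()
        if url in seen or url not in WEB:
            continue
        seen.add(url)
        frontier.extend(WEB[url][1])  # follow the document's hyperlinks
    return seen

def build_index(urls):
    """Part 2: extract keywords and index them against document URLs."""
    index = defaultdict(set)
    for url in urls:
        for word in WEB[url][0].split():
            index[word].add(url)
    return index

def search(index, keyword):
    """Part 3: the search tool looks up documents matching a keyword."""
    return sorted(index.get(keyword, set()))

index = build_index(crawl("http://a.example/"))
print(search(index, "news"))  # both hypothetical pages contain "news"
```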
URLs contain a significant amount of information that could be used by applications such as web search, crawling, and sponsored search to improve indexing throughput, the relevance of search results, and ad placement. URLs for web pages may be dynamic or static. A dynamic URL can be a page address resulting from the search of a database-driven web site or the URL of a web site that runs a script. This contrasts with static URLs, in which the contents of the web page remain the same unless changes are hard-coded into the HTML.
Web sites can use dynamic URLs for content display, where parameters and values for the parameters are needed. The content of a web page may or may not vary based on certain of these values and presence of certain parameters that are used in searching databases for information responsive to the parameters. URLs that encode such parameters, which can be generated for example, from terms of a search query, are known as URLs for dynamic web pages. Some parameters may have little or no effect on the content of the web page displayed, but instead may reflect, for example, contents of a query used to arrive at that page.
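The parameters and values encoded in a dynamic URL's query string can be recovered mechanically. As a brief illustration, assuming a hypothetical dynamic URL, Python's standard `urllib.parse` module exposes the key/value pairs:

```python
# Recover the parameters encoded in a dynamic URL's query string.
# The URL below is a hypothetical example, not one from the figures.

from urllib.parse import urlsplit, parse_qs

url = "http://shop.example.com/search.asp?category=cruises&q=bahamas"
query = urlsplit(url).query   # "category=cruises&q=bahamas"
params = parse_qs(query)      # map each parameter to its value(s)

print(params)  # {'category': ['cruises'], 'q': ['bahamas']}
```

Whether any given parameter (here, `category` or `q`) actually affects the displayed content is exactly the kind of question that cannot be answered from the standard syntax alone.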
Dynamic URLs can comply with a standard form; a URL can be considered standardized if it conforms to the URL specification in force at a given time, presently RFC 1738. An example URL according to RFC specifications is shown in FIG. 1A. FIG. 1A illustrates that a URL comprises levels, including a level for identification of the host and domain (finance.yahoo.com), then one or more levels of static information (e.g., nasdaq), then one or more levels comprising scripts (search.asp) and arguments for the scripts (e.g., ticker=YHOO).
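Because the FIG. 1A URL uses only standard delimiters ("/" between path levels; "?", "&", and "=" within the query string), its levels can be separated with routine parsing. A minimal sketch using Python's standard `urllib.parse`:

```python
# Split the FIG. 1A example URL into the levels described above,
# relying only on the standard delimiters of the URL specification.

from urllib.parse import urlsplit, parse_qsl

url = "http://finance.yahoo.com/nasdaq/search.asp?ticker=YHOO"
parts = urlsplit(url)

host = parts.hostname                       # host/domain level
levels = parts.path.strip("/").split("/")   # static level(s) and script
args = dict(parse_qsl(parts.query))         # arguments for the script

print(host)    # finance.yahoo.com
print(levels)  # ['nasdaq', 'search.asp']
print(args)    # {'ticker': 'YHOO'}
```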
If all URLs were presented in standard form, then determining whether two or more URLs actually refer to the same page or are likely to have duplicative information, extracting information from them, or inferring from a URL what its page may be about would be reasonably straightforward.
However, a significant portion of the web represents URLs in a non-standard form, making it difficult to extract relevant information from the URLs by machine, or to determine what components of a URL may mean. Sometimes, non-standard-form URLs may still be reasonably easy to parse, in that the non-standardization is limited. FIGS. 1B-1C illustrate URLs 105 and 110, both of which can be parsed into 4 levels of information, like the URL of FIG. 1A, except that levels 3 and 4 of both URL 105 and URL 110 can be further parsed into sublevels according to one or more non-standard delimiters. The “=” sign is used as a delimiter between the key “dir” and the values “apparel” and “cruises” in URL 105 and URL 110, respectively. Likewise, the change from letters to numbers in level 4 of both URL 105 and URL 110 can be considered a non-standard delimiter allowing further subdivision of those levels of the URL.
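The two non-standard delimiters just described can be handled with simple string and pattern operations. The sketch below is illustrative; the segments "dir=apparel" and "item1234" are hypothetical stand-ins for the level-3 and level-4 segments of the figures' URLs.

```python
# Split a URL level into sublevels using two non-standard delimiters:
# an "=" sign appearing inside a path segment, and a transition from
# letters to digits within a token.

import re

def split_nonstandard(segment):
    """Split a path segment on '=' and on letter/digit boundaries."""
    sublevels = []
    for piece in segment.split("="):
        # Break e.g. "item1234" into "item" and "1234" at the
        # alphabetic/numeric change.
        sublevels.extend(re.findall(r"[A-Za-z]+|\d+", piece))
    return sublevels

print(split_nonstandard("dir=apparel"))  # ['dir', 'apparel']
print(split_nonstandard("item1234"))     # ['item', '1234']
```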
In many cases, however, it is not so easy to identify non-standard delimiters that will allow an appropriate subdivision of more complicated URLs into sub-levels. It would be desirable to have an effective machine-based way to more fully extract information present in non-standard URLs for any of a variety of purposes, including those described above.