A Uniform Resource Locator (URL) is a mechanism for specifying the locations of resources across a network. A URL uniquely identifies both a resource and a protocol for interacting with that resource. Most often, a URL refers to the location of a website, webpage, or document on the World Wide Web (the web) accessible over the Internet. For example, “http://www.example.com” specifies retrieval of a webpage at the location specified by “www.example.com” utilizing the hypertext transfer protocol (“http”). In this scenario, a web browser can accept the URL and display the resulting webpage. Where the URL is unknown, a search engine can be utilized to locate URLs that satisfy a specified query.
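The split between protocol and location described above can be illustrated with a minimal sketch using Python's standard `urllib.parse` module. The helper name `describe_url` is hypothetical and is introduced only for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> tuple[str, str]:
    """Return the (protocol, location) pair encoded in a URL.

    `describe_url` is a hypothetical helper, not part of any library.
    """
    parts = urlsplit(url)
    # `scheme` names the protocol; `netloc` names the network location.
    return parts.scheme, parts.netloc


protocol, location = describe_url("http://www.example.com")
print(protocol)  # http
print(location)  # www.example.com
```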
For a URL to appear in the top “N” search results for a query, several processing steps must occur. At a high level, the page content of the URL goes through document processing and page importance ranking. Further, the query itself is processed. The processed document content can then be matched against the processed query to determine whether the page contains all query keywords. If so, the document becomes a member of the document candidate set. Finally, the content of the document candidate set is ranked to determine the position at which the URL should appear in the results.
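The literal matching step, in which a page enters the document candidate set only if it contains all query keywords, might be sketched as follows. Whitespace tokenization, case folding, and the names `tokenize` and `candidate_set` are simplifying assumptions made for illustration.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def candidate_set(query: str, documents: dict[str, str]) -> list[str]:
    """Return the URLs whose content contains every keyword of the query."""
    terms = tokenize(query)
    # A document qualifies only if the query terms are a subset of its words.
    return [url for url, content in documents.items() if terms <= tokenize(content)]


# Hypothetical example pages, keyed by URL.
documents = {
    "http://www.example.com/a": "a search engine ranks web pages",
    "http://www.example.com/b": "a web crawler captures pages",
}
print(candidate_set("web search", documents))  # ['http://www.example.com/a']
```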
Document processing and page importance ranking involve crawling, indexing, classifying, and ranking. A web crawler or spider can be employed to scour the web for URLs and capture the content at each location. Subsequently, the URL and content are indexed to enable expeditious search. Further, pages are classified and ranked to capture the authority or reliability of their content. For example, a webpage can be deemed reliable if it provides links to other webpages deemed reliable.
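One common way to index crawled content for fast lookup is an inverted index that maps each word to the URLs containing it. The text above does not prescribe a particular index structure, so the following is only a sketch under that assumption, with the hypothetical name `build_index` and naive whitespace tokenization.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of URLs whose content contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, content in pages.items():
        for word in content.lower().split():
            index[word].add(url)
    return index


# Hypothetical crawl output: URL -> captured content.
pages = {
    "http://www.example.com/a": "search engine basics",
    "http://www.example.com/b": "crawler and search",
}
index = build_index(pages)
print(sorted(index["search"]))
```

With such an index, the pages containing a given word can be looked up directly rather than by scanning every document.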
Query processing involves refining the query to facilitate the return of desired results. In one instance, the query can be filtered to remove unacceptable characters or strings (e.g., “_”, “+”, ...). Query alteration can also be applied, in which spell correction, stemming, word breaking, and/or acronym expansion are performed to better capture user intent. At the same time, such processing should avoid alterations that deviate from the original user intent. Finally, more sophisticated query processing can be performed to best capture intent by distinguishing primary query words from secondary words, identifying word proximity, and/or employing natural language understanding, among other things.
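Filtering and a simple form of query alteration might be sketched as below. The set of unacceptable characters follows the two examples given above; the correction table `CORRECTIONS` and the function name `process_query` are hypothetical stand-ins for a real spell corrector.

```python
import re

# Characters treated as unacceptable, following the examples "_" and "+".
UNACCEPTABLE = re.compile(r"[_+]")

# Hypothetical spell-correction table; a real system would use a richer model.
CORRECTIONS = {"serach": "search"}


def process_query(query: str) -> list[str]:
    """Filter unacceptable characters, then apply spell correction per word."""
    cleaned = UNACCEPTABLE.sub(" ", query)
    words = cleaned.lower().split()
    # Words without a known correction pass through unchanged.
    return [CORRECTIONS.get(word, word) for word in words]


print(process_query("web_serach+engine"))  # ['web', 'search', 'engine']
```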
Once a query is issued, a webpage may need to overcome several barriers in order to participate in dynamic ranking. For instance, where the content does not include all of the exact keywords in a query, the page must rely on either query alteration or some form of relaxed document candidate set that uses fuzzy matching, as opposed to literal matching, to enter the document candidate set.
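The contrast between literal and fuzzy matching can be illustrated with Python's standard `difflib` module. The similarity cutoff of 0.8 and the name `fuzzy_contains_all` are assumptions chosen for illustration; string similarity is only one of many possible relaxation strategies.

```python
from difflib import get_close_matches


def fuzzy_contains_all(query_terms: list[str], content: str, cutoff: float = 0.8) -> bool:
    """Return True if every query term has a sufficiently similar word in the content."""
    words = content.lower().split()
    return all(
        get_close_matches(term.lower(), words, n=1, cutoff=cutoff)
        for term in query_terms
    )


content = "a search engine for the web"

# Literal matching fails because "engines" is not an exact word of the content.
print("engines" in content.split())  # False

# Fuzzy matching accepts "engines" as close enough to "engine".
print(fuzzy_contains_all(["search", "engines"], content))  # True
```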
Pages that make it into the document candidate set are dynamically ranked and need to obtain a high enough ranking to make it into the top “N” search results. Additionally, the pages may have to overcome various other restrictions, such as a host-based diversity constraint. A host-based diversity constraint refers to returning only the top “M” URLs from a specific host and collapsing all others. Rank can also be negatively impacted by blacklists, which specify that some URLs or domains are blocked, for example for including spam or malicious content.
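The host-based diversity constraint and blacklist filtering might be sketched as follows. The sketch assumes the input list is already sorted by dynamic rank, treats the blacklist as a set of blocked hosts that are excluded outright, and uses the hypothetical name `top_results`; these are simplifications rather than a definitive implementation.

```python
from urllib.parse import urlsplit


def top_results(ranked_urls: list[str], n: int, m: int,
                blocked_hosts: frozenset[str] = frozenset()) -> list[str]:
    """Return the top n URLs, keeping at most m per host and skipping blocked hosts."""
    per_host: dict[str, int] = {}
    results: list[str] = []
    for url in ranked_urls:
        if len(results) >= n:
            break
        host = urlsplit(url).netloc
        if host in blocked_hosts:
            continue  # blacklisted host (assumed here to be excluded outright)
        if per_host.get(host, 0) >= m:
            continue  # host already has m results; collapse the rest
        per_host[host] = per_host.get(host, 0) + 1
        results.append(url)
    return results


# Hypothetical ranked list, best first.
ranked = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://spam.example/x",
    "http://b.example/1",
]
print(top_results(ranked, n=3, m=2, blocked_hosts=frozenset({"spam.example"})))
```

In this example the third URL from `a.example` is collapsed and the URL from `spam.example` is skipped, so the lower-ranked URL from `b.example` takes the third position.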