A client computer connected to the Internet can download digital information from server computers. Client application software typically accepts commands from a user and obtains data and services by sending requests to server applications running on the server computers. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hyper Text Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the Gopher document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as “the Web.” The Web is an information service on the Internet providing documents and links between documents. It is made up of numerous Web sites located around the world that maintain and distribute electronic documents. A Web site may use one or more Web server computers that store and distribute documents in a number of formats, including the Hyper Text Markup Language (HTML). An HTML document contains text and metadata (commands providing formatting information), as well as embedded links that reference other data or documents. The referenced documents may represent text, graphics, or video.
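By way of illustration only, the embedded links of an HTML document can be collected with a few lines of Python. The sketch below uses the standard-library `html.parser` module; the class name `LinkExtractor`, the function `extract_links`, and the sample document are hypothetical and are not taken from any particular browser or crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the link targets in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links


# Hypothetical sample document with one absolute and one relative link.
SAMPLE = (
    '<html><body><p>See <a href="http://example.com/a.html">A</a> '
    'and <a href="/b.html">B</a>.</p></body></html>'
)

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

A relative target such as `/b.html` would still need to be resolved against the address of the containing document before it could be fetched.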
A Web browser is a client application or, preferably, an integrated operating system utility that communicates with server computers via the FTP, HTTP, and Gopher protocols. Web browsers receive electronic documents from the network and present them to a user.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
The term “search engine” is often used generically to describe both true search engines and directories, although they are not the same. Search engines typically create their listings automatically by directly or indirectly “crawling” the Web. A directory, on the other hand, depends on humans for its listings, i.e., a person submits a short description for an entire site or editors write a description for sites they review. The present invention is particularly suited (although not necessarily limited) for use in a search engine that directly or indirectly gathers information by “crawling” the Web.
Most search engines can be categorized as “simple” search engines, “compilation” search engines, or “complex” search engines. A simple search engine is a coordinated set of programs that generally includes (a) a crawler (also called a “spider” or a “bot”) that goes to every page or representative pages on every accessible Web site, analyzes the data therein (content, metadata, and so forth), and traverses each link thereon; (b) an indexer that creates and maintains a huge index (sometimes called a “catalog”) from the pages that have been crawled; and (c) an interface that interactively receives an end-user search request based on inputted search terms and, using the entries in the index, returns to the user the URLs of Web pages related to the inputted search terms. Some simple search engines also have added functionality that allows an end-user to input a natural language query, corrects misspelled words in search terms, expands searches based on logical synonyms for search terms, or provides other such features. A compilation search engine looks very similar to a simple search engine from the perspective of an end-user, but it is often little more than an enhanced user interface that submits a single query entered by an end-user to multiple simple search engines and then compiles the results and presents them to the end-user as a single list. A complex search engine is both a compilation search engine (compiling search results from other simple search engines) and a simple search engine (conducting its own Web crawls). Like a compilation search engine, a complex search engine also looks very much like a simple search engine from the perspective of an end-user.
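The three components of a simple search engine, together with the compiling step of a compilation search engine, can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any actual search engine: the Web is replaced by a hypothetical in-memory dictionary `TOY_WEB`, and the function names `crawl`, `build_index`, `search`, and `compile_results` are invented for the example.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": (
        "welcome to the apple page",
        ["http://a.example/pie", "http://b.example/"],
    ),
    "http://a.example/pie": ("apple pie recipe", []),
    "http://b.example/": ("banana bread recipe", ["http://a.example/"]),
}


def crawl(web, seed):
    """(a) Crawler: visit pages breadth-first, following each link once."""
    seen, queue, pages = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        pages[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def build_index(pages):
    """(b) Indexer: inverted index mapping each word to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """(c) Interface: return the URLs whose pages contain every search term."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


def compile_results(result_lists):
    """Compilation step: merge result lists from several engines into one
    list, dropping exact-duplicate URLs and keeping first-seen order."""
    merged, seen = [], set()
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


if __name__ == "__main__":
    idx = build_index(crawl(TOY_WEB, "http://a.example/"))
    print(search(idx, "apple recipe"))
```

Note that `compile_results` removes only URLs that are character-for-character identical; two different URLs that lead to the same page would both survive the merge.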
Whether directly or indirectly, all three types of search engines utilize Web page information gathered by crawlers that visit the universe of accessible Web pages, including returning to previously visited Web sites on a regular basis to look for changes. Everything the crawler finds goes into the index, which essentially holds a copy of every Web page that the crawler finds; if a Web page changes, the index is updated with the new information. When an end-user inputs a search query, the interface sifts through the pages recorded in the index to find documents fulfilling the search query and will typically rank the matches in accordance with their relevance.
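The update-on-change and ranking behavior described above can be illustrated with a short sketch. It assumes, purely for illustration, that the index is a dictionary keyed by URL, that a change is detected by comparing SHA-256 digests of the page text, and that relevance is a simple count of search-term occurrences; real search engines use far more elaborate measures. The names `refresh` and `rank` are hypothetical.

```python
import hashlib


def refresh(index, url, fetched_text):
    """Store or update the indexed copy of a page.

    Returns True if the index changed (new page or changed content),
    False if the fetched text matches the copy already held.
    """
    digest = hashlib.sha256(fetched_text.encode("utf-8")).hexdigest()
    entry = index.get(url)
    if entry is not None and entry["digest"] == digest:
        return False
    index[url] = {"text": fetched_text, "digest": digest}
    return True


def rank(index, query):
    """Return matching URLs ordered by a naive relevance score:
    the total number of occurrences of the search terms in the page."""
    terms = query.lower().split()
    scored = []
    for url, entry in index.items():
        words = entry["text"].lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


if __name__ == "__main__":
    idx = {}
    refresh(idx, "http://a.example/", "apple apple pie")
    refresh(idx, "http://b.example/", "apple tart")
    print(rank(idx, "apple"))
```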
Because the same Web page can often be accessed through many different Uniform Resource Locators (URLs), crawlers frequently index numerous copies of the same page. Consequently, it is not uncommon for a search engine query to yield results comprising multiple “listed” URLs that ultimately lead to the same Web page resource (the “display” URL), with each listed URL having a different relevance. For a search engine user, multiple listed URLs to the same resource are not particularly useful, and the industry to date has not adequately addressed this shortcoming in the art.
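The shortcoming can be made concrete with a small, purely illustrative example. The result list `RESULTS`, the relevance scores, and the mapping `DISPLAY_URL` from listed URL to display URL are all hypothetical data invented for the sketch; in practice the display URL would have to be discovered, for example by requesting each listed URL and observing where it leads.

```python
from collections import defaultdict

# Hypothetical query results: (listed URL, relevance score).
RESULTS = [
    ("http://example.com/", 0.91),
    ("http://www.example.com/index.html", 0.74),
    ("http://example.com/home", 0.55),
    ("http://other.example/page", 0.80),
]

# Hypothetical mapping from each listed URL to the display URL it leads to.
DISPLAY_URL = {
    "http://example.com/": "http://www.example.com/index.html",
    "http://www.example.com/index.html": "http://www.example.com/index.html",
    "http://example.com/home": "http://www.example.com/index.html",
    "http://other.example/page": "http://other.example/page",
}


def group_by_display_url(results, display_url):
    """Group listed URLs (with their relevance) under the display URL
    they ultimately lead to. Unknown listed URLs map to themselves."""
    groups = defaultdict(list)
    for listed, relevance in results:
        groups[display_url.get(listed, listed)].append((listed, relevance))
    return dict(groups)


if __name__ == "__main__":
    for display, listed in group_by_display_url(RESULTS, DISPLAY_URL).items():
        print(display, "<-", listed)
```

In this example four result entries correspond to only two distinct resources, and the three listed URLs for the first resource carry three different relevance scores.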