1. Technical Field of the Invention
The present invention relates to computers and computer systems employing search engines for use on the World Wide Web and other sources of distributed information, and, in particular, to a method and system for an improved metasearch engine utilizing the original content from multiple, distributed, heterogeneous information sources to generate search result rankings, summarizations, and categorizations.
Glossary of Terms
User: An agent, human or machine, which is the source of the information request.
Information Resource: Locations where information is stored electronically. This may include text and multimedia information. The information resources can provide search interfaces to the data they contain and/or provide menu-driven interfaces that allow the using agent to browse the information resources.
Hit: An atomic piece of information. A hit is typically used to refer to a specific document that is returned by a search engine. Hits are selected by the search engine from its typically vast set of documents.
Document: Any piece of electronic information. It can be a multimedia document containing text, graphics, video and sound. It can also be a program or other form of binary data.
Query: An encapsulation of what the user wants. A query can consist of the following: keywords, phrases, boolean logic, numbers, SQL statements, paragraphs or segments thereof, pictures, sketches, the context of the search, the types of documents required, and a list of information sources to contact.
2. Description of Related Art
Since the introduction of the personal computer in the early 1980's, the PC has been subject to constant change, ever increasing in capability and usage. From its earliest form, in which the data accessible was limited to that which the user could load from a floppy disk, to the typical multi-gigabyte hard drives common on PCs today, the amount of data and the ease of obtaining this data have been growing rapidly. With the fruition of the computer network, the available data is no longer limited to the user's system or what the user can load on their system. Local Area Networks (LANs) are now common in small businesses, and in such networks users may, in addition to their own local data, obtain data from other local stations as well as data available on the local server. Corporate networks and internetworks may connect multiple LANs, thereby increasing the data available to users. Larger still are Wide Area Networks (WANs) and Metropolitan Area Networks (MANs), the latter of which are designed to cover large cities.
The largest such network, commonly known as the World Wide Web or Internet, has introduced vast amounts of diversified information into the business place and home. The individual networks that make up the Internet include networks which may be served from sources such as commercial servers (.com), university servers (.edu), research networks and other networks of computers (.org, .net), and military networks (.mil). These networks are located throughout the world and their numbers are ever increasing, with an estimated 85,000 new domain registrations presently occurring each month and countless Internet sites spawned from those domains. Recent (1998) estimates on the size of the Internet suggest a staggering 320 million web pages and a U.S. user population over 57 million.
Such dramatic growth, however, is accompanied by a number of difficulties. One of these, familiar to most users who have attempted to recover specific information from the vast amounts of data on the Internet, is the logistical problem of effectively searching for and recovering specific information on a given topic. Much progress has nonetheless been made in Internet navigation and management since the earliest days, in which a user essentially had to know the exact location of specific data. The user's labor was then in entering cryptic command line strings to recover the known, targeted data.
The development and implementation of Hypertext Markup Language (HTML) greatly increased the usability of the Internet by enabling a user to navigate through graphically intensive pages, as opposed to the purely text-based interfaces of the previous decades' devices. This navigation is now facilitated by use of a web browser, e.g., Netscape Navigator, Microsoft Internet Explorer, etc. Hypertext, a method of cross-referencing, is now common on most web sites. A hypertext link appears as a word or phrase distinguishable from the surrounding text by a color or format distinction, or both. A user is able to click on a hypertext link and be transferred to another information service, which is often remote from the site with the originating hypertext link. Through the use of many such hypertext links, sites with similar content can be easily cross-referenced by the web developer allowing a user quick access to supplementary information that is distributed across the Internet.
Further facilitation of information access on the Internet has been made by numerous companies providing information search services, e.g., Infoseek, Yahoo, etc., that provide "engines" to search the Internet, generally at no charge to the user. These companies commonly index the contents of large numbers of web pages, either the page's full text or summaries, and allow a user to search the indices using the search engines provided on the respective companies' web pages.
Search engines may be defined as programs allowing a user to remotely perform keyword searches on the Internet. The searches may cover the titles of documents, Uniform Resource Locators (URLs), summaries, or full text. Usually, information service providers build indices, or databases, of web page contents through automated algorithms. As described, these indices may be of the full text or only a brief synopsis of a web page's text. By utilizing these automated algorithms, commonly referred to as spiders, the compilation of indices of large numbers of pages is possible. By using these index building algorithms, Infoseek was able to index the full text of over 400,000 web pages in August of 1995. Generally, the results or "hits" of the search are presented to the user with hypertext links allowing the user to pick and choose the desired results and then transfer to a particular site associated with the selected search results in order to retrieve the desired information on the web pages therein.
Additionally, search engines commonly perform computations on the results of the user's query in order to generate a ranking of each hit relative to the other hits. The rankings assigned to each hit are intended to provide a measure of relevance of the content of a particular information source, identified as containing potentially relevant information, to the query presented. Relevance algorithms are used in most search engines and are based on simple word occurrence measures. For example, if the word "plastic" occurs within the text of a page, then that page will have an expected relevance to a query containing the word "plastic". The relevance is then assigned a magnitude in the form of a rank. Often the rank is based on a number of factors including the number of times the word occurs in a page, whether the word is in the title of the page, whether the word is in a heading, proximity of multiple search terms appearing in the page's text, etc. More sophisticated relevance algorithms may utilize thesaurus indices to automatically expand on a given query using equivalent phraseology.
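By way of illustration only, a word-occurrence relevance measure of the kind described above may be sketched as follows. The function names, the tokenization rule, and the title weight of 3.0 are assumptions made for this example and are not taken from any particular search engine.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance_score(query, title, body, title_weight=3.0):
    """Score a page by counting occurrences of the query terms.

    An occurrence in the title counts title_weight times as much as an
    occurrence in the body. The weight is an illustrative assumption.
    """
    terms = set(tokenize(query))
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    score = 0.0
    for term in terms:
        score += body_tokens.count(term)
        score += title_weight * title_tokens.count(term)
    return score


def rank_pages(query, pages):
    """Return (score, url) pairs sorted from most to least relevant.

    pages is a list of (url, title, body) tuples.
    """
    scored = [(relevance_score(query, title, body), url)
              for url, title, body in pages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Such a measure registers a hit whenever a term matches, without regard to the context in which the term appears in the page.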
A further and more recent improvement is the creation and usage of a so-called metasearch engine, e.g., MetaCrawler, Dogpile, Savvysearch, etc. A metasearch engine parses and reformats a user query. The reformatted queries are then forwarded to numerous search engines with each discrete search engine receiving an appropriately formatted query pursuant to the protocols for that search engine. After retrieving the results from the individual search engines, the metasearch engine presents them to the user. The obvious advantage of these metasearch engines is the simplification of searching due to the elimination of the need for a user to formulate and submit an individual query for each of a number of discrete search engines, a non-trivial task since the formats and protocols of each individual search engine differ markedly. By using a metasearch engine, the user only has to submit a single query, saving effort and time.
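By way of illustration only, the reformatting step performed by a metasearch engine may be sketched as follows. The two engines, their host names, their parameter names, and their query syntaxes are hypothetical and stand in for the differing formats and protocols of real search engines. The sketch only builds the request URLs and does not contact any server.

```python
from urllib.parse import urlencode


def format_for_engine_a(keywords):
    """Hypothetical syntax: required terms are joined with AND."""
    return " AND ".join(keywords)


def format_for_engine_b(keywords):
    """Hypothetical syntax: each required term is prefixed with '+'."""
    return " ".join("+" + keyword for keyword in keywords)


# Engine name -> (base URL, query parameter name, formatter).
# All three fields are invented for this example.
ENGINES = {
    "engine_a": ("http://engine-a.example/search", "q", format_for_engine_a),
    "engine_b": ("http://engine-b.example/find", "query", format_for_engine_b),
}


def build_requests(user_query):
    """Turn a single user query into one request URL per engine."""
    keywords = user_query.split()
    requests = {}
    for name, (base_url, param, formatter) in ENGINES.items():
        requests[name] = base_url + "?" + urlencode({param: formatter(keywords)})
    return requests
```

For example, build_requests("plastic molding") yields one URL for each hypothetical engine, each carrying the same query in that engine's assumed format.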
Even with the aforedescribed improvements in the search and metasearch systems, most users nonetheless spend a great amount of their search time reviewing and eliminating unwanted and irrelevant search results. Since searches of the various indices register hits merely when a search term and indexed term match, numerous hits are generated that match terms completely out of context and provide the user with meaningless results. Furthermore, present metasearch engines must rely on the individual search engines' results rankings. Each search engine uses its own algorithms to rank its query results, and these algorithms often employ distinctively different ranking techniques. The results presented to the user, therefore, are often non-uniform in the sense that the results, having been obtained through numerous search engines, have relative relevance rankings assigned to them from distinctively different ranking methodologies. Some present-day metasearch engines re-rank the query results so that the result rankings appear to share a common scale. However, these re-ranked results are simply converted from their original form to a common form for presentation purposes, and the re-ranks are, therefore, purely aesthetic.
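By way of illustration only, a presentation-level conversion of this kind may be sketched as a linear rescaling of each engine's scores onto a common range. The 0 to 100 range and the function names are assumptions made for this example. The rescaling preserves each engine's own ordering and consults no document content, so the top hit of every engine receives the same score regardless of how relevant it actually is.

```python
def rescale(results, low=0.0, high=100.0):
    """Linearly map one engine's scores onto the range [low, high].

    results is a list of (url, score) pairs on that engine's own scale.
    """
    if not results:
        return []
    scores = [score for _, score in results]
    smallest, largest = min(scores), max(scores)
    if largest == smallest:
        # All scores are equal, so every hit is placed at the top of the range.
        return [(url, high) for url, _ in results]
    span = largest - smallest
    return [(url, low + (score - smallest) * (high - low) / span)
            for url, score in results]


def merge(engine_results):
    """Rescale each engine's results and merge them into a single list."""
    merged = []
    for results in engine_results.values():
        merged.extend(rescale(results))
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```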
It is, accordingly, a first object of the present invention to provide an improved metasearch engine that uses the original ranks of a query result as assigned by the respective individual search engines and, additionally, re-ranks the results according to the actual content of the information source identified in the search results obtained from the individual search engines.
It is also an object of the invention to provide an improved system and method wherein a metasearch engine provides final ranks for the information sources, as identified by the individual search engines' results, according to a singular relevance algorithm after downloading the full text of the identified information sources, thereby providing a more uniform relevance ranking among the numerous information sources.
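By way of illustration only, the application of a single relevance measure to the full text of every identified information source may be sketched as follows. The scoring rule (a count of query-term occurrences), the function names, and the caller-supplied fetch function are assumptions made for this example. The sketch ranks on content alone and does not represent the relevance algorithm of the invention.

```python
import re


def score_text(query, text):
    """Apply one scoring rule to any document: total query-term occurrences."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(tokens.count(term) for term in set(query.lower().split()))


def rerank(query, engine_hits, fetch):
    """Re-rank hits from several engines by the content of each document.

    engine_hits maps an engine name to a list of URLs in that engine's order.
    fetch is a caller-supplied function that returns the full text for a URL.
    """
    seen = set()
    scored = []
    for urls in engine_hits.values():
        for url in urls:
            if url in seen:
                continue  # The same document was returned by another engine.
            seen.add(url)
            scored.append((url, score_text(query, fetch(url))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored
```

Because every document is scored by the same function, the resulting ranks are comparable across engines, which is the uniformity this object is directed to.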
It is a further object of the invention to provide a system and method wherein query results summaries are produced within the metasearch engine with reference to the respective query and the full text of the information source document.
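By way of illustration only, a summary produced with reference to both the query and the full text may be sketched by selecting the sentences of the document that contain the most query terms. The sentence-splitting rule, the scoring rule, and the function name are assumptions made for this example and do not represent the summarization method of the invention.

```python
import re


def summarize(query, full_text, max_sentences=1):
    """Return the sentences of full_text that contain the most query terms.

    The selected sentences are returned in their original document order.
    """
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", full_text)
                 if s.strip()]

    def overlap(sentence):
        tokens = re.findall(r"[a-z0-9]+", sentence.lower())
        return sum(1 for token in tokens if token in terms)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: overlap(sentences[i]),
                    reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```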