1. Technical Field of the Invention
The present invention relates to computers and computer systems employing search engines for use on the World Wide Web and other sources of distributed information, and, in particular, to a method and system for an improved metasearch engine utilizing contextual data on information sources' contexts.
2. Glossary of Terms
User: An agent, human or machine, which is the source of the information request.
Information Resource: Locations where information is stored electronically. This may include text and multimedia information. The information resources can provide search interfaces to the data they contain and/or provide menu-driven interfaces that allow the using agent to browse the information resources.
Hit: An atomic piece of information. A hit is typically used to refer to a specific document that is returned by a search engine. Hits are selected by the search engine from its typically vast set of documents.
Document: Any piece of electronic information. It can be a multimedia document containing text, graphics, video and sound. It can also be a program or other form of binary data.
Query: An encapsulation of what the user wants. A query can consist of the following: keywords, phrases, boolean logic, numbers, SQL statements, paragraphs or segments thereof, pictures, sketches, the context of the search, the types of documents required, and a list of information sources to contact.
3. Description of Related Art
Since the introduction of the personal computer in the early 1980s, the PC has been subject to constant change, ever increasing in capability and usage. From its earliest form in which the data accessible was limited to that which the user could load from a floppy disk to the typical multi-gigabyte hard drives common on PCs today, the amount of data and the ease of obtaining this data have been growing rapidly. With the fruition of the computer network, the available data is no longer limited to the user's system or what the user can load on their system. Local Area Networks (LANs) are now common in small businesses, and in such networks users may, in addition to their own local data, obtain data from other local stations as well as data available on the local server. Corporate networks and internetworks may connect multiple LANs, thereby increasing the data available to users. Larger still are Wide Area Networks (WANs) and Metropolitan Area Networks (MANs), the latter of which is designed to cover large cities.
The largest such network, commonly known as the World Wide Web or Internet, has introduced vast amounts of diversified information into the business place and home. The individual networks that make up the Internet include networks which may be served from sources such as commercial servers (.com), university servers (.edu), research and other networks of computers (.org, .net), and military networks (.mil). These networks are located throughout the world and their numbers are ever increasing with an estimated 85,000 new domain registrations presently occurring each month with countless Internet sites spawned from those domains. Recent (1998) estimates on the size of the Internet suggest a staggering 320 million web pages and a U.S. user population over 57 million.
Such dramatic growth, however, is accompanied by a number of difficulties, one of which, as witnessed by most users of the Internet attempting to recover specific information from the vast amounts of data therein, is the logistical problem of effectively searching and recovering specific information on a given topic. Much progress has nonetheless been made in Internet navigation and management since the earliest days in which a user essentially had to know the exact location of specific data. The user's labor was then in entering cryptic command line strings to recover the known, targeted data.
The development and implementation of Hypertext Markup Language (HTML) greatly increased the usability of the Internet by enabling a user to navigate through graphically intensive pages, as opposed to the purely text-based interfaces of the previous decades' devices. This navigation is now facilitated by use of a web browser, e.g., Netscape Navigator, Microsoft Internet Explorer, etc. Hypertext, a method of cross-referencing, is now common on most web sites. A hypertext link appears as a word or phrase distinguishable from the surrounding text by a color or format distinction, or both. A user is able to click on a hypertext link and be transferred to another information service, which is often remote from the site with the originating hypertext link. Through the use of many such hypertext links, sites with similar content can be easily cross-referenced by the web developer, allowing a user quick access to supplementary information that is distributed across the Internet.
Further facilitation of information access on the Internet has been made by numerous companies providing information search services, e.g., Infoseek, Yahoo, etc., that provide "engines" to search the Internet, generally at no charge to the user. These companies commonly index the contents of large numbers of web pages, either the page's full text or summaries, and allow a user to search through the indices through the search engines provided on the respective companies' web pages.
Search engines may be defined as programs allowing a user to remotely perform keyword searches on the Internet. The searches may cover the titles of documents, Uniform Resource Locators (URLs), summaries, or full text. Usually, information service providers build indices, or databases, of web page contents through automated algorithms. As described, these indices may be of the full text or only a brief synopsis of a web page's text. By utilizing these automated algorithms, the compilation of indices of large numbers of pages is possible. These algorithms are commonly referred to as spiders. By using these index building algorithms, Infoseek was able to index the full text of over 400,000 web pages in August of 1995. Generally, the results or "hits" of the search are presented to the user with hypertext links allowing the user to pick and choose the desired results and then transfer to a particular site associated with the selected search results in order to retrieve the desired information on the web pages therein.
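By way of illustration only, the index building and keyword searching described above can be sketched as a small inverted index. This is a minimal sketch and not the method of any particular search service: the function names `build_index` and `keyword_search`, the tokenizing rule, and the requirement that a hit contain every keyword are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it.

    `pages` is a dict of URL -> page text; in a real service the text
    would be gathered by an automated crawler rather than supplied by hand.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def keyword_search(index, query):
    """Return the URLs containing every keyword of the query (assumed AND semantics)."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


# Hypothetical pages used only for this example.
pages = {
    "http://a.example/guide": "Plastic recycling guide",
    "http://b.example/metal": "Metal recycling",
}
index = build_index(pages)
plastic_hits = keyword_search(index, "plastic recycling")
```

A full-text index of this kind answers a query without re-reading the pages, which is why indices over large numbers of pages are practical.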
Additionally, search engines commonly perform computations on the results of the user's query in order to generate a relative 'ranking' against the other hits. The rankings assigned to each hit are intended to provide a measure of relevance of the content of a particular information source, identified as containing potentially relevant information, to the query presented. Relevance algorithms are used in most search engines and are based on simple word occurrence measures. For example, if the word 'plastic' occurs within the text of a page, then that page will have an expected relevance to a query containing the word 'plastic'. The relevance is then assigned a magnitude in the form of a rank. Often the rank is quantified on a number of factors including the number of times the word occurs in a page, whether the word is in the title of the page, whether the word is in a heading, proximity of multiple search terms appearing in the page's text, etc. More sophisticated relevance algorithms may utilize thesaurus indices to automatically expand on a given query using equivalent phraseology.
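A word occurrence measure of the kind described above may be sketched as follows. This is an illustrative sketch only: the `Page` structure, the `TITLE_WEIGHT` value, and the scoring formula are assumptions chosen for the example and do not reproduce the relevance algorithm of any actual search engine. Only two of the factors named above (occurrence count and presence in the title) are modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str
    body: str


TITLE_WEIGHT = 3.0  # assumed bonus for a query word appearing in the title


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, query):
    """Sum body occurrences of each query word, plus a bonus per title match."""
    title_words = tokenize(page.title)
    body_words = tokenize(page.body)
    total = 0.0
    for term in tokenize(query):
        total += body_words.count(term)
        if term in title_words:
            total += TITLE_WEIGHT
    return total


def rank(pages, query):
    """Return (score, page) pairs with a nonzero score, highest score first."""
    scored = [(score(page, query), page) for page in pages]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


# Hypothetical pages used only for this example.
pages = [
    Page("http://c.example", "Wood", "oak and pine"),
    Page("http://b.example", "Metals", "steel and plastic"),
    Page("http://a.example", "Plastic molding", "plastic parts and plastic sheets"),
]
ranked = rank(pages, "plastic")
```

In this example the page with 'plastic' in its title and twice in its body ranks above the page that mentions the word once, and the page that never mentions it is not returned as a hit.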
A further and more recent improvement is the creation and usage of a so-called metasearch engine, e.g., MetaCrawler. A metasearch engine parses and reformats a user query. The reformatted queries are then forwarded to numerous search engines with each discrete search engine receiving an appropriately formatted query pursuant to the protocols for that search engine. After retrieving the results from the individual search engines, the metasearch engine presents them to the user. The obvious advantage of these metasearch engines is the simplification of searching due to the elimination of the need for a user to formulate and submit an individual query for each of a number of discrete search engines, a non-trivial task since the formats and protocols of each individual search engine differ markedly. By using a metasearch engine, the user only has to submit a single query, saving effort and time.
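The conventional metasearch flow described above (reformat the query for each engine, forward it, and collect the results) may be sketched as follows. This is a hedged illustration only: the `Engine` structure, the two query formats, the stub search functions, and the removal of duplicate hits are assumptions made for the example. The stubs return canned results rather than contacting any real search service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Engine:
    name: str
    format_query: Callable[[List[str]], str]  # keywords -> engine-specific query
    search: Callable[[str], List[str]]        # engine-specific query -> hit URLs


def metasearch(user_query, engines):
    """Send one user query to every engine and merge the hits in arrival order."""
    terms = user_query.split()
    merged, seen = [], set()
    for engine in engines:
        native_query = engine.format_query(terms)
        for url in engine.search(native_query):
            if url not in seen:  # assumed: the same hit is shown only once
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical engines with differing query formats and canned results.
CANNED_A = {"plastic AND recycling": ["http://a.example/1", "http://shared.example/x"]}
CANNED_B = {"plastic+recycling": ["http://shared.example/x", "http://b.example/2"]}

engine_a = Engine("A", lambda t: " AND ".join(t), lambda q: CANNED_A.get(q, []))
engine_b = Engine("B", lambda t: "+".join(t), lambda q: CANNED_B.get(q, []))

results = metasearch("plastic recycling", [engine_a, engine_b])
```

The user supplies a single query, and each engine receives it in its own format; a hit reported by both engines appears once in the merged list.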
It should be understood that search engines are primarily directed to searchable information resources, i.e., those resources that are either indexed by external search engines or have their own interface for accessing the information stored at that web site. Numerous information resources, however, may not contain a convenient interface or indices to facilitate a search of the contents. Current metasearch engines, e.g., MetaCrawler, provide an interface to several general-purpose search services, e.g., Yahoo, AltaVista, etc., thereby covering a larger portion of the web through use of numerous search engine indices. Another such engine, Jango, provides a metasearch interface to various commercial search services within a particular domain or area, e.g., music shopping. Conventional metasearch engines, however, only reference pre-indexed resources and are inadequate to search the aforedescribed non-indexed information resources.
Another difficulty with conventional search engines is that much of the information on even a popular website is not always indexed. For example, dynamic information such as daily news or hot topics, being transitory in nature, may not be available, i.e., it may not be indexed on a frequent and regular basis, or may not be indexed at all. Further, the information may be out of date or too general and, therefore, irrelevant even though properly indexed. Additionally, a given search engine, even a metasearch engine, covers only a particular portion of the web, i.e., pertinent information resources or sites may be available on one search engine but not on another, or may not be available on any existing search engine.
Accordingly, users of conventional search and metasearch engines must sift through unwanted and irrelevant search results, spending a great deal of time in this winnowing process. Further, despite their name, conventional metasearch engines are not comprehensive and further searches are required to obtain a reliable search result.
It is, accordingly, an object of the present invention to provide an improved system and method by which a user submits a query to a metasearch engine whereupon the metasearch engine searches a plurality of disparate web-based information resources.
It is also an object of the present invention to provide a system and method for searching a variety of information resources whether indexed or not.
It is a further object of the present invention to provide a system and method whereby domain experts create and manage a number of disparate and distributed information resources to optimize search comprehensiveness.
The present invention is directed to a system and method for facilitating the retrieval of information from a system of distributed computers or information resources. In particular, the system and method of the present invention improves upon metasearch techniques by including information resource profiles that provide directives to the metasearch engine for facilitating information recovery. These information resource profiles additionally allow for metasearches to recover information from non-indexed information resources such as when browsing the web. Contextual searching is further provided via a grouping of the information resource profiles. The present invention is further directed to tools for the creation and management of the information resource profiles.