1. Field of the Invention
The present invention is directed to the general field of systems, methods, and computer program products for performing Internet-based searching. In particular, it deals with a search engine tailored to search the Internet and to return fewer irrelevant results than present search engines return.
2. Related Art
It has been said that the Internet/network communities are what are pushing the economy forward these days, and it is a fact that the Internet contains unprecedented volumes of information on just about any topic. The only problem is finding the truly relevant resources. Search engines are what make the Internet useful, because without these tools the chances of finding relevant resources would be significantly diminished. Thus, while the Internet drives the economy, search engines drive the Internet. This is backed by statistics on users' use of the Internet, which show that users spend more online time at search engines than anywhere else, including portals.
Yet current search engine technology often leaves one dissatisfied and frustrated, particularly where one would like to find resources on a given subject in a specific context. For example, suppose that a user would like to find information on the Ford Pinto in a legal context (referring to the product liability cases against Ford due to defects in the Ford Pinto model's design). A general purpose search engine (GPSE) will typically return numerous irrelevant links if one searches on the term “Pinto,” simply because a GPSE cannot recognize a context or a specific subject, e.g., a legal context or law as a subject. This is because GPSEs adopt the strategy of “everything is relevant”; they therefore try to collect and index all pages on the Internet. Their operations are based on this unedited collection of pages.
To gain more insight into the workings of GPSEs, it is first worth noting that the term “search engine” is typically used to cover several different types of search facilities. In particular, “search engines” may be broken up into four main categories: robots/crawlers; metacrawlers; catalogs with search facilities; and catalogs or link collections.
FIG. 1A illustrates the operation of robots/crawlers. These are characterized by having a process (i.e., a crawler) that traverses the Internet 1, as indicated by arrow 4, on a site-by-site basis and sends back to its host machine 2 the contents of each home page it encounters at various sites 3 on its way, as indicated by arrows 5. Then, as shown in FIG. 1B, the host machine 2 indexes the pages 8 sent back by crawler 7 and files the information in its database 9. Any front-end query looks up the search terms in the information stored in the host's database 9. Existing crawlers generally consider all information to be relevant, and therefore, all home pages on all sites traversed are indexed. Examples of such robots/crawlers include Google™, Altavista™, and Hotbot™.
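The crawl, index, and query steps described above can be sketched as follows. This is a minimal illustration only, not the implementation of any search engine named here: the in-memory SITES dictionary stands in for the Internet, and the example URLs and the function names crawl, build_index, and query are all hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the Internet: site URL -> home page text.
SITES = {
    "http://cars.example": "Ford Pinto horse power and specifications",
    "http://law.example": "Product liability cases against Ford over the Pinto",
    "http://horses.example": "Pinto horse breeding and coat patterns",
}


def crawl(sites):
    """Visit each site in turn and send its home page back to the host."""
    for url, page in sites.items():
        yield url, page


def build_index(pages):
    """Index every page received; no relevance filtering is applied."""
    index = defaultdict(set)
    for url, text in pages:
        for term in text.lower().split():
            index[term].add(url)
    return index


def query(index, terms):
    """Return the URLs whose indexed text contains every search term."""
    hits = [index.get(term.lower(), set()) for term in terms.split()]
    return sorted(set.intersection(*hits)) if hits else []


index = build_index(crawl(SITES))
results = query(index, "Pinto")
```

In this sketch a search on “Pinto” matches all three pages, including the one about horses, because the index records only which terms occur on which page and nothing about context or subject.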
Metacrawlers, as illustrated in FIG. 2, are characterized in that they offer the possibility of searching in a single search facility 2 and obtaining replies from multiple search facilities 10. The metacrawler serves as a front end to several other facilities 10 and does not have its own “back end.” Metacrawlers are limited by the quality of the information in the search facilities that they employ. Examples of such metacrawlers include MetaCrawler™, LawCrawler™, and LawRunner™.
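The front-end role of a metacrawler can be sketched in the same way. The two back-end facilities, their canned replies, and the function name metasearch are hypothetical, and the removal of duplicate links is an assumption made for this illustration rather than a behavior described above.

```python
def facility_a(query):
    """Hypothetical back-end search facility with its own index."""
    data = {"pinto": ["http://cars.example", "http://law.example"]}
    return data.get(query.lower(), [])


def facility_b(query):
    """A second hypothetical back-end search facility."""
    data = {"pinto": ["http://law.example", "http://horses.example"]}
    return data.get(query.lower(), [])


def metasearch(query, facilities):
    """Forward the query to each facility and merge the replies.

    The metacrawler keeps no index of its own. Duplicate links are
    dropped (an assumption of this sketch) and first-seen order is kept.
    """
    merged, seen = [], set()
    for search in facilities:
        for url in search(query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


results = metasearch("Pinto", [facility_a, facility_b])
```

Because metasearch only relays and merges replies, its results can be no better than what the underlying facilities return.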
Catalogs, with or without search facilities, are characterized in that they are collections of links structured and organized by hand. In the case of a catalog with a search indexed depends on the particular GPSE. A user can enter a query into the front-end, and the GPSE will search the indexed pages. This procedure is based on the principle of “everything is relevant,” meaning that the crawler will get and save every page it encounters. Similarly, every page saved in memory by the crawler will be indexed. This typical operation of a GPSE is illustrated in FIGS. 1A and 1B, as discussed above (indexing part not shown).