When a user searches for a particular topic with one of the many popular search engines on the Internet, most of the results returned are unrelated or do not contain the information the user is seeking. Typically the user looks through many Web pages before the information is found, if it is found at all. Different search engines search different parts of the Internet (alternately, the “Web” or “www”) and thus give the user different results. The question often becomes: which search engine to use?
A typical scenario follows. A user wishes to perform research in a specific area or perhaps desires to write an article or report. First, many users would come up with an outline of the report or article they are researching. Second, and prior to the pervasiveness of the www, the user would go to the library and find other articles or reports similar to what he is looking for. This is usually done nowadays through on-line card catalogues.
In contrast, today a user logs on, brings up a web browser, and connects to a search tool. However, hours can be expended entering different keywords in various permutations in an attempt to find relevant documents. The user may actually find very little information for his effort. The user then runs a second search tool and repeats the entire process. After hours of searching and typing, the user may have found only a few documents.
Many people believe the Internet is a vast library of interconnected resources and that the only difference between individual search tools is the technique each uses to find relevant documents. What few realize is that each search tool searches against its own database of collected indices. These databases are built by the search tool vendors, starting with a collection of URLs and following each URL on each page until all have been exhausted. This is typically accomplished through the use of spiders or crawlers. Since these tools start at different places, the databases themselves contain different information.
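The effect of starting a crawl from different places can be illustrated with a minimal sketch. The link graph `LINKS`, the `*.example` page names, and the function `crawl` are all hypothetical and stand in for the Web and a vendor's spider; no real search tool is being described.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "a.example/home": ["a.example/news", "b.example/home"],
    "a.example/news": ["a.example/home"],
    "b.example/home": ["b.example/docs"],
    "b.example/docs": [],
    "c.example/home": ["c.example/blog", "b.example/docs"],
    "c.example/blog": [],
}


def crawl(seeds, links=LINKS):
    """Breadth-first crawl: follow every link from the seeds until none remain."""
    indexed = set()
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in indexed:
            continue  # already collected; avoids looping on cyclic links
        indexed.add(url)
        queue.extend(links.get(url, []))
    return indexed


# Two hypothetical vendors that begin from different seed URLs.
index_one = crawl(["a.example/home"])
index_two = crawl(["c.example/home"])
```

In this toy graph the two resulting databases overlap only on `b.example/docs`, so a query answered from one database cannot see most of what the other contains.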
For this reason, meta-search engines are increasing in popularity. These engines access multiple individual search tools and thus multiple databases. The advantage is that by using multiple databases the search covers more of the Internet and hopefully produces better results. A second advantage of using multiple search engines is corroboration among the results they each produce. For example, if two or more sources return the same document, that document is likely more relevant than a document returned by only a single source.
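The idea that a document returned by several sources is likely more relevant can be sketched as a simple merge. This is one possible scoring rule under stated assumptions, not the method of any particular meta-search engine: the function `merge_results`, the source names, and the tie-breaking by best per-source position are all hypothetical.

```python
from collections import defaultdict


def merge_results(results_by_source):
    """Merge per-source ranked URL lists; URLs returned by more sources rank higher."""
    sources_for = defaultdict(set)  # url -> names of the sources that returned it
    best_rank = {}                  # url -> best (lowest) position in any source
    for source, urls in results_by_source.items():
        for rank, url in enumerate(urls):
            sources_for[url].add(source)
            best_rank[url] = min(rank, best_rank.get(url, rank))
    # Sort by number of sources (descending), then best position, then URL.
    return sorted(
        sources_for,
        key=lambda u: (-len(sources_for[u]), best_rank[u], u),
    )


merged = merge_results({
    "engine_a": ["x.example/1", "x.example/2"],
    "engine_b": ["x.example/3", "x.example/2"],
})
```

Here `x.example/2` is placed first because both hypothetical sources returned it, even though neither ranked it first on its own.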
The problem with each of these methods of searching the Internet is that there is no perfect engine when it comes to finding the information that a user may need. Additionally, there is no absolute basis for comparing the engines, as each has its own unique features and databases. A site that is clearly the best on the Internet may not be found by querying only one individual search engine, leaving the user to go to several search engines to perform a search accurately. In principle, a meta-search engine is a good alternative to individual search engines, but each of the hundreds of meta-search engines uses a different algorithm or method of sorting the results. None of these algorithms stands out as being superior to the others. Meta-search engines are also as commercial as the individual search engines, selling a high position in their lists of sites to the highest paying customer. This leaves the user with a poor representation of what is available on the Internet for the topic being searched.
Meta-search engines are not without their own problems. These engines query individual search engines and parse each source's results page. The parser that a meta-search engine uses must be knowledgeable about the source's results page format. If the format changes, the meta-engine's parser can fail. It has been found that the results page format changes on some sources every one to two months. For the most part, maintenance on these parsers is done by the developer and generally requires software code changes. For those meta-search engines that are accessed through a web browser, the maintenance is done centrally. Those that are client-based require a software patch to be downloaded. Besides the maintenance burden, meta-search engines are not user configurable. Users cannot modify or fix a broken parser, nor can they add their own.
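The dependence of a parser on a source's page format can be shown with a minimal sketch. The markup, the class names `result` and `hit`, and the function `parse_results` are hypothetical; real results pages are more complex, and a regular expression is used here only to keep the example short.

```python
import re

# Hypothetical markup a source might use for each hit on its results page.
RESULT_PATTERN = re.compile(r'<a class="result" href="([^"]+)">([^<]+)</a>')


def parse_results(html):
    """Extract (url, title) pairs; works only while the page matches RESULT_PATTERN."""
    return RESULT_PATTERN.findall(html)


old_page = '<li><a class="result" href="http://x.example/1">First hit</a></li>'
# The same hit after a hypothetical redesign: only the class name changed.
new_page = '<li><a class="hit" href="http://x.example/1">First hit</a></li>'

parsed_old = parse_results(old_page)  # one (url, title) pair
parsed_new = parse_results(new_page)  # empty: the parser silently finds nothing
```

Because the pattern is compiled into the program, recovering from the format change in this sketch means editing and redistributing code, which is the maintenance burden described above.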
Today's meta-search engines utilize a concept called Web Scraping. Web Scraping involves the process of querying a source, retrieving the results page and parsing the page to obtain the results. At that point, the meta-search engine normalizes the information and, in many cases, combines it with other results and presents a single ranked list. The problem with this approach is that individual sources change the format of their pages often. Web scrapers break when this happens; therefore maintenance is critical. One approach taken by a majority of meta-search engines is to provide a centralized service. That is, the meta-search engine is hosted on a centralized server and access is through a web browser. A few have opted to provide a client application. In both cases, the user is at the mercy of the developer. If a web scraper breaks, the user must wait for the developer to fix it. Centralized approaches are easier to fix and require no software changes on the part of the user. On the other hand, client applications require a patch to be downloaded. In both cases, the user has no control and cannot add additional sources at will.
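The normalization step can be sketched as follows, assuming that normalization here means reducing trivially different spellings of the same URL to one form so that results from different sources can be combined. The functions `normalize_url` and `combine`, and the specific rules chosen (lowercase host, no trailing slash, no fragment), are illustrative assumptions rather than a description of any existing engine.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce trivially different spellings of a URL to one canonical form."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"  # treat "/docs/" and "/docs" alike
    # Lowercase scheme and host, keep the query, drop the fragment.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def combine(*result_lists):
    """Concatenate result lists, keeping the first occurrence of each normalized URL."""
    seen = set()
    combined = []
    for results in result_lists:
        for url in results:
            key = normalize_url(url)
            if key not in seen:
                seen.add(key)
                combined.append(key)
    return combined


combined = combine(
    ["http://X.example/docs/", "http://x.example/a"],
    ["http://x.example/docs#top", "http://x.example/b"],
)
```

Without the normalization step, the two spellings of the `docs` page would be treated as different documents and would appear twice in the combined list.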
It would be desirable to have a new generation of meta-search engines that allow for easy maintenance, user configuration and the ability to perform multiple queries.