With the rapid growth of the internet and of the number of internet users over the past five years, the amount of information available over the internet has grown just as rapidly. While, at first blush, this explosion in available information would seem a welcome asset, it carries several downsides for the user, not the least of which is the ever-increasing difficulty of sorting through the vast quantities of available information to find those information sources which are most pertinent to the search at hand.
Many search engines, such as Google™ and AltaVista®, are available to users and provide powerful search tools for general use. These search engines enable any user to query the vast repository of public web-based documents that are indexed by these systems. However, the sheer volume of available data causes an undesirable result in many of these general searches: most simple searches return large and unmanageable volumes of hits or results, many of which are not useful or relevant to that which the user is seeking.
Most of the available search engines employ different strategies from one another in attempting to find matches to information which is most relevant to the user-supplied search criteria. Therefore, each search strategy imposes its own bias with regard to the relevancy of documents that are retrieved, and one search engine may provide superior results for any given search, while another search engine may provide superior search results for a second, different search. For example, a search engine may determine the relevance of a document by the number of “hits” or matches of any of the key words in the user-supplied query to actual occurrences of those words (or other search terms) in the document. However, the mere repetition of a relevant term is no guarantee that the document is relevant, and often the content of a document identified in this way has little or no relevance to the subject of interest to the user. This results in great expenditures of time, as the user must open documents which are indicated to be relevant, and read them to make a determination as to whether they are in fact relevant, in effect requiring a great deal of “manual searching” by the user to get to the documents actually needed.
Further, different search engines often set different priorities as to which sites to index, and therefore collect disparate results with regard to the same user-supplied query, even prior to making any relevancy assignments.
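The hit-counting notion of relevance described above can be illustrated with a minimal sketch. The function name, the tokenization rule, and the two sample documents are assumptions made purely for illustration and are not taken from any particular search engine; the sketch only shows how mere repetition of a term can outrank a document that is actually on topic.

```python
import re
from collections import Counter

def hit_count_score(query: str, document: str) -> int:
    """Score a document by counting occurrences of any query keyword.

    Hypothetical helper: tokens are lowercase runs of letters and digits.
    """
    keywords = set(re.findall(r"[a-z0-9]+", query.lower()))
    counts = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    return sum(counts[k] for k in keywords)

# Two invented documents: one repeats a keyword, one is actually on topic.
docs = {
    "spam": "kinase kinase kinase kinase buy kinase posters today",
    "paper": "We report that the kinase phosphorylates its substrate in vivo.",
}

query = "kinase substrate"
ranked = sorted(docs, key=lambda name: hit_count_score(query, docs[name]), reverse=True)
print(ranked)
```

Under this scoring the repetitive document ranks first even though the second document matches more of the distinct query terms, which is the weakness the passage describes.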
Another way of attempting to retrieve relevant documents is by filtering, wherein an interface is provided to allow the user to set parameters to arrive at a set of relevant terms. In this way, the user manually determines which items in a set of relevant items delivered are the most relevant. This approach has the potential of eliminating some of the time required to cull through non-relevant documents that might have otherwise been provided by the previous approach discussed. However, time is still required for manual settings. Additionally, the manual settings may potentially eliminate relevant documents which would have otherwise been presented by the previously described approach.
Metasearch engines are available (for example, metacrawler®, Dogpile®, Search.com, etc.) which act as a “middle-man” between the user and a number of search engines of the types described above. A user submits a single query to a metasearch engine, and the metasearch engine then parses and reformats the query. The reformatted queries are then forwarded to numerous search engines, such as those described above, with each discrete search engine receiving an appropriately formatted query pursuant to the protocols for that search engine. After retrieving the results from the individual search engines, the metasearch engine presents them to the user. Aside from the simplification provided to the user in having to format only one query, a goal of this approach is that, by forming a composite of results, relevant documents that may have been missed by any one search engine employed will be found and retrieved by another.
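The metasearch flow described above (parse one query, reformat it per engine, forward it, and form a composite of the results) might be sketched as follows. Everything here is hypothetical: the two stub engines, their query syntaxes, and the example URLs are invented so the sketch runs without network access, and the removal of duplicate URLs is an added assumption rather than something the engines named above are stated to do.

```python
from typing import Callable, Dict, List, Tuple

def format_plus(terms: List[str]) -> str:
    """Hypothetical syntax for engine A: terms joined by '+'."""
    return "+".join(terms)

def format_and(terms: List[str]) -> str:
    """Hypothetical syntax for engine B: terms joined by ' AND '."""
    return " AND ".join(terms)

def _search(index: Dict[str, str], terms: List[str]) -> List[str]:
    # Return URLs whose text contains every term (simple substring match).
    return [url for url, text in index.items() if all(t in text for t in terms)]

def engine_a(query: str) -> List[str]:
    index = {
        "http://a.example/1": "protein folding review",
        "http://shared.example/x": "protein folding kinetics",
    }
    return _search(index, query.split("+"))

def engine_b(query: str) -> List[str]:
    index = {
        "http://shared.example/x": "protein folding kinetics",
        "http://b.example/2": "folding of protein domains",
    }
    return _search(index, query.split(" AND "))

ENGINES: List[Tuple[Callable[[List[str]], str], Callable[[str], List[str]]]] = [
    (format_plus, engine_a),
    (format_and, engine_b),
]

def metasearch(user_query: str) -> List[str]:
    """Parse once, reformat per engine, forward, and merge the results."""
    terms = user_query.lower().split()
    merged: List[str] = []
    seen = set()
    for formatter, engine in ENGINES:
        for url in engine(formatter(terms)):
            if url not in seen:  # assumption: drop duplicate URLs
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("protein folding"))
```

The composite contains a document that only the second stub engine indexed, which is the stated goal of the approach, but the merged list is otherwise unorganized.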
Although these metasearch engines simplify the query task for the user, and are thus somewhat useful and provide a measure of time savings, they do nothing to categorize or otherwise make sense of the results so as to make them more quickly accessible. As such, the user is usually left with a very large set of raw results (relatively unordered documents) to examine. Further, these metasearch engines search generic indexes such as Google™ (permission and/or license may be required for metasearching on Google™) or AltaVista® and do not include sites of specific relevance to the sciences.
Current web-based search engines that employ data mining capabilities include northernlight.com, huskysearch and vivisimo. These systems generally employ some type of unsupervised clustering to group documents by similar topics. These systems are an improvement over the generic metasearch engines described above in that the user can see the search results provided in clusters or sub-groups and can then potentially eliminate clusters or sub-groups which appear to have low relevance value and/or can more quickly access those documents in sub-groups which appear highly relevant. In none of these examples, however, have data mining algorithms been tuned specifically to the sciences, or more particularly, the life sciences. Thus, when using these types of systems, common scientific terminology which has no real discrimination value in a scientific search will be over-weighted as being significant when it is not. Although it is possible to retrieve information relevant to a scientific search using the above generic types of search engines and data mining tools, it is also likely that many relevant documents will not be found, since searches are not directed to specialized sites (such as PubMed, SwissProt, Entrez, EMBL, etc., in the case of a life sciences search).
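The over-weighting of common scientific terminology noted above can be illustrated with inverse document frequency (IDF), a standard term-weighting scheme. This is only a sketch under stated assumptions: IDF is used here as a stand-in, not as the algorithm any of the named systems actually uses, and both small corpora are invented for illustration.

```python
import math
from typing import List

def idf(term: str, corpus: List[str]) -> float:
    """Smoothed inverse document frequency of a term over a corpus."""
    df = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0

# Invented general-purpose corpus: "protein" is a rare word here.
general_corpus = [
    "cheap flights to paris",
    "football scores and league tables",
    "high protein breakfast recipes",
    "used cars for sale",
]

# Invented life-science corpus: "protein" appears in every document.
life_science_corpus = [
    "protein structure prediction methods",
    "p53 protein mutations in tumors",
    "membrane protein crystallization",
    "protein kinase signaling pathways",
]

print(idf("protein", general_corpus))       # weight learned from generic text
print(idf("protein", life_science_corpus))  # weight learned from domain text
print(idf("p53", life_science_corpus))      # a discriminating domain term
```

Weights derived from generic text treat "protein" as a rare and therefore significant word, whereas weights derived from domain text assign it the minimum weight and reserve higher weights for terms such as "p53" that actually discriminate among scientific documents.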
Attempts at providing domain-specific implementations of metasearch tools have been made, including searchlight.cdlib.org, researchville.com, bio-crawler, gateway.nlm.nih.gov and queryserver.com. Searchlight provides a few scientifically focused metasearches but has no clustering capability. researchville.com provides a medically oriented implementation, but also lacks any clustering capability. bio-crawler appears to provide biology-specific searches in Japanese, but again with no clustering capability. gateway.nlm.nih.gov provides access to various government databases, including medical databases, but also lacks any clustering capability. queryserver.com provides health-oriented metasearches with clustering of results, but is a server-based tool and does not provide the capability of combining both generic and domain-specific searches, nor is categorization performed. Being server-based, its configuration is determined by the server administrator, and it therefore lacks the potential for end-user customization.
Various client-based solutions for searching have also been proposed. webferret.com provides a simple-to-use client application that provides metasearch capabilities, but it provides no data mining capabilities and is restricted to a fixed list of generic search engines. DynaCat and QueryCat (http://www.ics.uci.edu/~pratt/) are applications that use a client tool to query domain-specific information within MedLine. These tools are not metasearch engines and thus do not have the capability of querying multiple search engines.
It would be desirable to have domain-specific tools for efficiently performing scientific metasearches and for organizing the results of such searches to enable the user to quickly identify and access the most relevant information discovered.