The World Wide Web (also referred to as the “Web”) is an extremely large, inter-networked data system connecting hundreds of millions of informational sites and documents and is growing daily. The inter-linked relationships between the sites create a dynamic system of enormous complexity. Despite the information or “content” dependent utility of the Web, the existing Internet addressing system does not locate or identify sites based on their information content. Thus, one of the persistent problems associated with the Web is finding useful information. Indeed, while the rich, decentralized, dynamic and diverse nature of the Web can make casual Web surfing enjoyable, it has made serious navigation aimed at finding specific information extremely difficult.
In response to the aforementioned problem, several types of Internet/Web navigation, location, finding or searching resources have evolved in an attempt to facilitate the presentation of sites based on content. One such resource is an automated information retrieval system, often referred to as an Internet or Web “search engine.” Typical search engines involve at least two specific components. First, the search engines have a database creation component that uses automated collection agents (i.e., software programs generally called “spiders”) to automatically traverse the Web to discover and collect accessible information source items independent of content. The term spider is understood here to include automated user agents, call utilities, Web robots, bots, and autonomous and mobile agents dedicated to the function of automatically retrieving documents, pages, or resources either by traversing the Web or by some other means. In essence, spiders automatically traverse the Web's hypertext link structure, recursively retrieving the documents, pages, or resources that are discovered and returning the items (e.g., Web documents or document addresses (“URLs”)) to populate a confined data structure.
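The traversal described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular search engine: the link graph `LINKS`, the function name `spider`, and the `max_pages` parameter are all hypothetical, and an in-memory dictionary stands in for fetching pages over the network. The sketch uses an explicit queue rather than recursion, and it tracks the addresses already seen so that pages linking back to one another are collected only once.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the URLs it links to.
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],  # links back to a
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def spider(seed, links, max_pages=None):
    """Collect URLs reachable from `seed`, breadth-first, regardless of content."""
    collected = []            # the "confined data structure" being populated
    seen = {seed}             # addresses already queued, so cycles terminate
    frontier = deque([seed])
    while frontier:
        # Optional budget: stop once max_pages items have been collected.
        if max_pages is not None and len(collected) >= max_pages:
            break
        url = frontier.popleft()
        collected.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return collected


if __name__ == "__main__":
    print(spider("http://a.example/", LINKS))
    print(spider("http://a.example/", LINKS, max_pages=2))
```

Without the `seen` set the loop would never finish on this graph, because the first two pages link to each other. The optional `max_pages` budget shows one simple way a traversal might be constrained.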
Second, the search engines provide a query function or component that allows an end-user to access the populated data structure and query that data structure to retrieve resource items based on content (i.e., content related to the supplied query). This second component is referred to herein as an information retrieval system, which refers to the data structure-based functions of storing, ordering, and presenting previously discovered and collected information, as distinct from the processes of discovering and collecting data from the Web. Thus, using an information retrieval system that has been populated with resource items through the use of a spider, end-users may supply queries to the database and, although all of the Web pages that the spider discovers and collects are stored in an undifferentiated manner, the information retrieval system can present to the end-user items that generally relate to the query.
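As a rough illustration of the query component, the sketch below builds a simple inverted index over a few collected items and returns those containing every query term. It assumes plain keyword matching with no ranking, which is far simpler than a production information retrieval system; the sample documents and the names `tokenize`, `build_index`, and `query` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical items previously collected by a spider: URL -> page text.
DOCS = {
    "http://cars.example/golf": "Volkswagen Golf owner manual",
    "http://sports.example/golf": "Golf clubs and golf courses",
    "http://cars.example/beetle": "Volkswagen Beetle history",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def query(index, text):
    """Return the URLs that contain every term of the query, sorted by URL."""
    terms = tokenize(text)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


if __name__ == "__main__":
    index = build_index(DOCS)
    print(query(index, "golf"))             # both the car page and the sport page
    print(query(index, "volkswagen golf"))  # only the car page
```

Because the items are stored without regard to topic, the single-term query matches both the automobile page and the sport page; only the additional term narrows the result.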
One particular drawback of typical search engines is that, because the data structure portion of the information retrieval system is populated with many items that have not been filtered for content, the results of an end-user query generally include a significant number of irrelevant items. One response to the lack of relevancy in search engine results has been the development of “Web directories.” The directories consist of manually created databases (as compared to the automatically created databases of information retrieval systems). People examine each page or resource and determine whether the resource should be included in the directory's database. Web directories are distinguished from search engines in that they only collect or accept content that is relevant to a topic or category within the directory. Although each directory typically has highly relevant resources, the low throughput of manual processing yields directory databases that are unsatisfactorily small, both on the scale of the total Web and when compared to the size of Web search engine information retrieval system databases. Moreover, since people must manually perform the task of accepting or rejecting each and every resource, the cost of maintaining and updating the directories is high.
With respect to either search engines or Web directories, an end-user supplies a query, or search criteria, in order to access information contained in a search engine information retrieval system database or a directory database. Typically, since both search engines and directories give greater weight to the keywords or phrases occurring at the beginning of a query, the order of the keywords or phrases may critically impact the amount of relevant information returned. For example, if a user were attempting to get information about a Volkswagen Golf automobile, the query “Golf and Volkswagen” may return two hundred sites dealing with the game of golf, but none dealing with automobiles. Conversely, the query “Volkswagen and Golf” may return one hundred sites dealing with automobiles, but still return one hundred irrelevant sites dealing with the game of golf. The problem becomes worse when more keywords are added to the query. Therefore, a major problem with current search techniques is that even if a user manually inputs every combination of keywords in an attempt to retrieve relevant sites, the process may still present many irrelevant sites.
The primary reason for the presentation of irrelevant data is the limitations of the search engine's information retrieval system. As mentioned above, directories usually contain relevant information, but the amount of relevant information is small due to manual processing. Although it would be desirable for an information retrieval system to contain every available document by using an “unconstrained” spider, such spidering is impractical. In principle, the entire Web can be discovered and gathered using an unconstrained spider; in practice, however, the process is intractable, and system resources are rapidly used up. For instance, if a spider conducts a long unconstrained traversal, a large amount of memory is required to store the large number of returned results. Problems associated with practical spidering of the Web include the large and highly variable number of links on different pages, the high level of self-referential and recursive linking architectures, and cyclical link paths. Furthermore, spiders do not differentiate documents based on topical content. Instead, each document that is traversed is returned to the database, creating a large, undifferentiated collection of items.
As mentioned above, if the search engine's spider is allowed to conduct an unconstrained search, an extremely large amount of information (both relevant and irrelevant) is retrieved and system memory is consumed quickly. Inasmuch as information retrieval systems have a limited memory capacity, a significant portion of the Web is left untouched by the search engines, and as a result, relevant information remains undiscovered by the user.
If possible, search engine and directory providers would like to populate their information retrieval system and directory databases with every bit of available information. Search engine and directory providers, however, must balance the desire to construct such large databases with the limitations imposed by system resources. Each provider may take a different approach to achieve this balance. As a result, each information retrieval system and directory database may be of a different size, populated with different information, and present the information to the user in different ways. Therefore, a query entered on one search engine or directory may return different results than the same query entered into a second search engine or directory. Ideally, a user would like to take advantage of the different methods for gathering, storing, and retrieving data used by each search engine or directory. Unfortunately, a user must typically enter each query combination into each search engine and/or directory. Furthermore, a user is required to manually filter out all of the irrelevant items returned from each search engine and/or directory.
Additionally, typical search engines provide only a limited number of responses to a particular query. For example, many search engines provide a user only two hundred resources in response to a single query. The reason for the limited number of responses is that a single user is typically unable to review the hundreds or thousands of different resources that may potentially be returned in response to a query. Moreover, each search engine typically ranks relevancy according to its own predetermined criteria, which differ from those of other search engines. Consequently, the same search on different search engines often produces different results. Thus, in order to increase the number of relevant results, multiple queries should be performed on multiple search engines.
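To make the manual effort concrete, the following sketch combines ranked result lists returned by several search engines into a single list without duplicates. It is only an illustrative assumption about how such results might be pooled, using a simple round-robin interleave; the function name `merge_results`, the `limit` parameter, and the sample result lists are hypothetical.

```python
from itertools import zip_longest


def merge_results(result_lists, limit=None):
    """Interleave ranked URL lists rank by rank, keeping each URL only once."""
    merged = []
    seen = set()
    # zip_longest pads the shorter lists with None so no result is dropped.
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is None or url in seen:
                continue
            seen.add(url)
            merged.append(url)
    return merged if limit is None else merged[:limit]


if __name__ == "__main__":
    # Hypothetical results for the same query from two different engines.
    engine_a = ["http://u1.example/", "http://u2.example/", "http://u3.example/"]
    engine_b = ["http://u2.example/", "http://u4.example/"]
    print(merge_results([engine_a, engine_b]))
```

Interleaving by rank gives each engine's top results equal prominence, which is one possible design choice given that the engines rank by different criteria.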
Accordingly, what is needed in the art is a system and method that screens relevant information from subject information or a document and derives from it queries applicable to different search engines.