1. Field of the Invention
This invention relates generally to a method of processing queries, and more particularly to a method of identifying one or more information units in response to a multiple keyword query of a search space.
2. Description of the Related Art
Since the structure of the World Wide Web (the Web) encourages hypertext and hypermedia document authoring (e.g. HTML and XML), Web authors tend to create documents which are composed of multiple pages which are connected via links. A Web document or XML database record may be authored in multiple ways. For example, a document or record may have all of its information contained on a single physical page, or, more commonly, the document may be segmented into multiple parts such as a main page and one or more separate pages containing related information which are linked to the main page. Each of the related pages may, likewise, contain links to additional related pages. In response to keyword queries of the internet or database search spaces, existing search engines return only those physical pages which contain all of the keywords in a given query. Focusing the search upon individual pages in the search space, however, is a significant shortcoming which causes conventional search engines to return deficient results in response to queries comprising a plurality of keywords.
For example, in an attempt to locate the internet sites which feature announcements for recent or upcoming conferences or conventions related to the Web, a user may issue a query which contains, say, three keywords: “web”, “conference”, and “topic”. A typical internet search engine given such a multiple keyword query reports results which are surprisingly inaccurate, omitting many of the most relevant Web sites. The primary reason for such inaccurate reporting (“false drops”) is that the contents of the HTML documents which make up the Web are often distributed among multiple physical pages, and conventional internet search engines do not take this document structure into account when conducting the search. In accordance with the present state of the art related to Web indexing and searching technology, existing search engines retrieve only those individual pages which contain each and every keyword in the query.
The output of a conventional search engine is a list of individual physical pages which satisfy the query by containing all of the requested keywords. If an individual page on the Web or in an XML database does not contain complete information (i.e., all the keywords) for answering the query, however, the page is generally not reported by the search engine. The deficient page is “dropped” even though the document of which that page is a part may be very relevant when all of its various linked pages are viewed as a whole.
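By way of illustration only, the page-level behavior described above may be sketched as follows. The three-page site, its contents, and the names `PAGES`, `tokenize`, and `conjunctive_search` are hypothetical and are not taken from any particular search engine; the sketch merely assumes that a page is reported only when it individually contains every keyword.

```python
import re

# Hypothetical three-page document; names and contents are invented for illustration.
PAGES = {
    "main": "Web conference 2000 call for papers",
    "topics": "Topic list: indexing, search, hypermedia",
    "venue": "Conference venue and travel",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def conjunctive_search(pages, keywords):
    """Return the names of pages that individually contain every keyword."""
    wanted = {k.lower() for k in keywords}
    return sorted(name for name, text in pages.items() if wanted <= tokenize(text))


# No single page holds all three keywords, so every page is dropped,
# although the document as a whole answers the query.
print(conjunctive_search(PAGES, ["web", "conference", "topic"]))
```

With the query limited to “web” and “conference”, the same sketch reports only the main page, since that is the only page containing both words.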
Some current search engines provide for what is known as query relaxation. For example, the search engine may be instructed to identify and to report, on the one hand, pages which contain fewer than all of the keywords in the query, or, on the other hand, pages which contain words which are only similar to keywords in the query, rather than exact matches. If such pages are reported by the search engine in response to a relaxed query, they are typically assigned a lower “rank” or “relevance” than pages which fully satisfy the query. Such a rank may be assigned according to the number of missing or merely similar words in the page, or according to the degree of similarity between the existing word and the missing keyword. Even in the case of the most sophisticated search engines presently employing query relaxation options, however, the search is conducted only for individual physical pages in the search space. No consideration is given to the content of the neighboring pages to which the searched page is linked. By limiting the nature of the search to individual pages rather than considering the structure of the documents searched and the relationships between linked pages, the typical search engine misses many relevant pages, especially in a relaxed query situation where associations between pages and their relative proximity can be very important in the determination of relevance short of a perfect solution to the query.
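A relaxed query of the first kind, in which pages are ranked by the number of missing keywords, may be sketched as follows. This is a simplified illustration under stated assumptions: the pages and the names `PAGES`, `tokenize`, and `relaxed_search` are hypothetical, only exact word matches are counted, and similarity between words is not modeled.

```python
import re

# Hypothetical three-page document; names and contents are invented for illustration.
PAGES = {
    "main": "Web conference 2000 call for papers",
    "topics": "Topic list: indexing, search, hypermedia",
    "venue": "Conference venue and travel",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relaxed_search(pages, keywords):
    """Rank pages matching at least one keyword by how many keywords they lack."""
    wanted = {k.lower() for k in keywords}
    scored = []
    for name, text in pages.items():
        missing = wanted - tokenize(text)
        if len(missing) < len(wanted):  # the page matches at least one keyword
            scored.append((len(missing), name))
    # Fewer missing keywords means a better rank; ties are broken by page name.
    return [(name, count) for count, name in sorted(scored)]


print(relaxed_search(PAGES, ["web", "conference", "topic"]))
```

Each page is still scored in isolation in this sketch: the main page is ranked first because it lacks only one keyword, but nothing records that the missing keyword appears on a page to which it is linked.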
Also, in many cases, search engines are adapted to accommodate altered queries in the form of elimination, addition, or substitution of keywords in a subsequent search of differing scope. An altered query can direct the search engine to identify, on the one hand, more pages where the original search proved uninformative, or, on the other hand, fewer pages where the original search returned an overwhelming amount of information. Such a dynamic process of altering the keywords in the query responsive to the reported results of the original query is an important feature which should be incorporated into every search engine, since this facilitates refining the search and consequently identifying the most useful information in the search space.
As the Web becomes larger and its use becomes even more prevalent than it is today, the search engine chosen for any given search will be required to sort through correspondingly more information. Consequently, efficiency and minimization of inaccurate responses in Web searches will increase in importance, if the searches are to retain any utility at all. Those searching the internet want the search engine to report the most relevant results with little or no extraneous information. Taking into account the structure of the search space, the search engine should minimize unwarranted or false drops of legitimately relevant material by distinguishing pages, as well as combinations of linked pages, which are truly relevant from those pages which should rightfully be dropped as less relevant.
An effective search engine should recognize that one page which does not contain every keyword in a particular query, but which is linked to other pages which contain other keywords, may still be relevant in combination with the pages to which it is linked. Such a situation is common given the nature of the internet, XML databases, and the structure of their documents. The combined set of pages should be identified as a relevant information unit, but such combinations of pages are not considered by existing search engines which only examine the contents of individual pages and ignore the relationships between pages.
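The notion of treating a page together with the pages to which it is linked as a single candidate information unit may be sketched as follows. The sketch is illustrative only and is not presented as the method of the invention: the pages, the link table, and the names `PAGES`, `LINKS`, `tokenize`, and `linked_units` are hypothetical, and only pages one link away are considered.

```python
import re

# Hypothetical three-page document and its outgoing links; all invented for illustration.
PAGES = {
    "main": "Web conference 2000 call for papers",
    "topics": "Topic list: indexing, search, hypermedia",
    "venue": "Conference venue and travel",
}
LINKS = {"main": ["topics", "venue"], "topics": [], "venue": []}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def linked_units(pages, links, keywords):
    """Return (page, contributing neighbors) pairs that jointly cover all keywords."""
    wanted = {k.lower() for k in keywords}
    units = []
    for name in sorted(pages):
        remaining = wanted - tokenize(pages[name])
        used = []
        for neighbor in links.get(name, []):
            found = remaining & tokenize(pages.get(neighbor, ""))
            if found:  # this neighbor supplies at least one keyword still missing
                used.append(neighbor)
                remaining = remaining - found
        if not remaining:
            units.append((name, used))
    return units


print(linked_units(PAGES, LINKS, ["web", "conference", "topic"]))
```

In this sketch the main page and the topics page are reported together for the three-keyword query, whereas a page-by-page conjunctive search over the same pages reports nothing.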
There has been a continuing and growing need, therefore, for a method of processing keyword queries of vast search spaces, such as the Web or an XML database, which takes into account the way in which the information within those search spaces is authored and arranged. Consideration of the structure of HTML and XML documents, as well as the interrelationship between their pages, is crucial with respect to accurate and efficient information retrieval in large, computer-based search spaces.