1. Field of the Invention
The present invention generally relates to information retrieval, and more particularly to a method and system for finding user-relevant documents using a Boolean search engine configured such that relevant results are always found.
2. Discussion of the Background
The number of documents and pages of information in every field now available through Internet searches of the World Wide Web (Web) has grown to prodigious numbers. Theoretically, one can find almost anything conceivable on the Web. Practically, it has become increasingly difficult to find the precise information being sought, in part because of the volume of information and particularly because of the limited capabilities of current search engines. The inherent limitations of search engines, their proneness to inaccuracy, and their failure to find what the user is searching for are common problems. This applies to Web searches as well as to searches in smaller intranet systems used by businesses and institutions.
The most common and familiar type of search engine is a key word driven Boolean search that allows the user to submit one or more key words. The search engine then looks for these key words within the database being searched. Boolean searches are by their very nature restrictive, often eliminating all documents that do not contain all the key words entered. The typical Boolean search method uses the “AND” operator and has been described as an exact match method. It makes no distinction between documents where one of a string of key words is missing and documents where all key words are missing: all documents not containing all the key words are eliminated from the search results. At other times no results are found because, although a large number of documents may contain all the key words up to a given point in the search, the next key word is not found in any of the remaining documents, either because of poor key word selection or because the documents use a different word to describe what is being sought.
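The exact match behavior described above can be illustrated with a minimal sketch. The three-document collection, the function name and_search, and the whitespace tokenization are assumptions made only for illustration; they are not taken from any particular search engine.

```python
# Hypothetical three-document collection; the texts are invented for illustration.
DOCS = {
    "doc1": "solar panel installation cost for residential roof",
    "doc2": "solar panel cost comparison",
    "doc3": "garden furniture sale",
}


def and_search(docs, key_words):
    """Return the ids of documents that contain every key word (exact match)."""
    wanted = {w.lower() for w in key_words}
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if wanted <= set(text.lower().split())  # subset test: all key words present
    )


print(and_search(DOCS, ["solar", "panel", "cost"]))             # ['doc1', 'doc2']
print(and_search(DOCS, ["solar", "panel", "cost", "roof"]))     # ['doc1']
print(and_search(DOCS, ["solar", "panel", "cost", "rooftop"]))  # []
```

In the second query, doc2 (missing one key word) and doc3 (missing every key word) are treated identically. In the third query, a single poorly chosen key word empties the result set.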
If instead the “OR” operator is used in the Boolean search method and a larger number of words is employed, the number of results or hits associated with any single word or phrase is usually large, and collectively the number would be very large. The “OR” operator expands the set of documents returned rather than narrowing the search and making it more specific. Using both the “AND” and “OR” operators may produce results where the “AND” operator alone would give none. However, when the “OR” operator is introduced into a string of numerous key words, the results do not contain the same cohesive string of key words as results obtained using the “AND” operator alone. Generally it is preferable to use the “AND” operator without having to use the “OR” operator. Introduction of the “OR” operator also often renders the ranking algorithms ineffective.
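The contrast between the two operators can be sketched as follows. The four-document collection and the function names or_search and and_search are hypothetical and serve only to illustrate the point.

```python
# Hypothetical four-document collection; the texts are invented for illustration.
DOCS = {
    "doc1": "solar panel installation cost for residential roof",
    "doc2": "cost of living index",
    "doc3": "roof repair after storm",
    "doc4": "garden furniture sale",
}


def or_search(docs, key_words):
    """Return the ids of documents that contain at least one key word."""
    wanted = {w.lower() for w in key_words}
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if wanted & set(text.lower().split())  # non-empty intersection
    )


def and_search(docs, key_words):
    """Return the ids of documents that contain every key word."""
    wanted = {w.lower() for w in key_words}
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if wanted <= set(text.lower().split())  # subset test
    )


KEY_WORDS = ["solar", "panel", "cost", "roof"]
print(or_search(DOCS, KEY_WORDS))   # ['doc1', 'doc2', 'doc3']
print(and_search(DOCS, KEY_WORDS))  # ['doc1']
```

With the “OR” operator, doc2 and doc3 are returned on the strength of a single shared word each, although neither concerns the subject of the query.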
A further problem of Boolean searches is that of the “precision rate” vs. the “recall rate.” The precision rate is the proportion of the documents found that are relevant, while the recall rate is the proportion of the relevant documents in the database being searched that were actually retrieved. If one desires greater precision and specificity, one must narrow the search. One does this by including a greater number of key words to better define the target information. However, in doing so one will exclude more and more relevant documents when using a typical Boolean search methodology. This is because if any single word is not in a document, the document is eliminated. Therefore, if ten words were entered in a typical Boolean search and no documents contained all ten words, but numerous documents contained eight or nine of them, the Boolean search would produce no results. A normal Boolean search cannot return results that are close in terms of the number of key words a document contains. The nature of Boolean logic is well suited to the 0 and 1, yes or no, binary system but is incapable of finding highly probable results in a search.
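The two rates defined above can be computed as in the following sketch. The document identifiers and the function name precision_recall are hypothetical, and returning 0.0 when a set is empty is a convention assumed here, since the ratio is otherwise undefined.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: report 0.0 instead of dividing by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: four relevant documents exist in the database.
RELEVANT = {"d1", "d2", "d3", "d4"}

# A broad search finds three of them along with three irrelevant documents.
print(precision_recall({"d1", "d2", "d3", "d5", "d6", "d7"}, RELEVANT))  # (0.5, 0.75)

# A narrow search finds a single relevant document and nothing else.
print(precision_recall({"d1"}, RELEVANT))  # (1.0, 0.25)

# An over-constrained search finds nothing at all.
print(precision_recall(set(), RELEVANT))  # (0.0, 0.0)
```

Narrowing the search raises the precision rate from 0.5 to 1.0 but lowers the recall rate from 0.75 to 0.25, which is the trade-off described above.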
Because of the exact match nature of a typical Boolean search, when no results are produced, there is no way of knowing which word or words were the cause of the failure. As a result, the searcher must repeat the search, possibly eliminating some words and using the “OR” operator, to try to refine the search in order to produce results. This can be a lengthy and tedious process and still produce only limited results.
In addition, there is the problem of synonyms. Relevant documents can easily be overlooked because the document author and the searcher use different terms to describe the same thing. Including synonyms in a Boolean search increases the chances that no results will be found, since increasing the number of words eliminates documents that do not contain all of them.
There is also the problem of misspelled words. Some systems can recognize misspelled words and offer corrections, but these spelling algorithms are not always effective. In addition, many systems do not have spelling correction capability. Because the misspelled word does not exist in the database being searched, the search is terminated with no results.
When the “OR” operator is used, the results frequently become difficult to manage effectively because of their volume and the large number of irrelevant documents.
Often, complex algorithms using proximity analysis, past user preferences, frequency analysis, and other methodologies are used to attempt to sort or rank the hits in order of relevance. These methods have proven to be of limited effectiveness, since many irrelevant documents frequently accompany the relevant ones. This is particularly true when the number of key words is large.
As explained earlier, in a Boolean system increasing the number of key words will enhance specificity, reducing irrelevant documents, but at the same time relevant documents will be eliminated. Boolean searches commonly bring up a list of relevant and irrelevant documents with widely varying degrees of relevance. There might be an extremely relevant document among the documents searched, but because of the use of the “AND” operator it may be excluded because one of many words is missing. In a Boolean search the only means of narrowing the search to find relevant documents is through the use of the “AND” operator. There is a need for a more effective method for a user to find all or a much larger portion of relevant documents within databases being searched.
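The exclusion of a nearly matching document can be illustrated by counting how many key words each document contains. This is a minimal sketch of the problem only; the ten placeholder key words, the two documents, and the function name matched_count are hypothetical.

```python
# Ten placeholder key words, invented for illustration.
KEY_WORDS = [
    "alpha", "bravo", "charlie", "delta", "echo",
    "foxtrot", "golf", "hotel", "india", "juliet",
]

# One document contains nine of the ten key words; the other contains none.
DOCS = {
    "near_miss": " ".join(KEY_WORDS[:9]),
    "unrelated": "kilo lima mike",
}


def matched_count(text, key_words):
    """Count how many of the key words occur in the document text."""
    words = set(text.lower().split())
    return sum(1 for w in key_words if w.lower() in words)


counts = {doc_id: matched_count(text, KEY_WORDS) for doc_id, text in DOCS.items()}

# An "AND" search keeps only documents that contain every key word.
and_hits = [doc_id for doc_id, n in counts.items() if n == len(KEY_WORDS)]

print(counts)    # {'near_miss': 9, 'unrelated': 0}
print(and_hits)  # []
```

The document containing nine of the ten key words and the document containing none are both absent from the “AND” result, so the searcher receives no indication that a close match exists.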
There is a great need for a search engine that can overcome these drawbacks and provide the user with results that match more accurately the information being sought.
Accordingly, in illustrative aspects of the present invention there is provided a system, method, and computer program product for a Boolean search engine utilizing a large number of key words or phrases, configured in a manner to eliminate search termination caused by the absence of one or more key words or phrases. Searches conducted using the method and system described will always produce the optimum achievable result with the key words employed in the search.
Still other aspects, features, and advantages of the present invention are readily apparent from the following detailed description, by illustrating a number of illustrative embodiments and implementations, including a preferred mode contemplated for carrying out the present invention. The present invention is also capable of other and different embodiments, and its several details can be modified in various respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature, and not as restrictive.