1. Field of the Invention
The present invention generally relates to information retrieval, and more particularly to a method and system for finding user-relevant documents having high specificity and relevancy, in much greater numbers than current methodologies provide.
2. Discussion of the Background
The number of documents and pages of information in every field now available through Internet searches of the World Wide Web (Web) has grown to prodigious numbers. Theoretically, one can find almost anything conceivable on the Web. Practically, it has become increasingly difficult to find the precise information being sought, in part because of the volume of the information, and particularly because of the limited capabilities of current search engines. The inherent limitations of search engines, and their proneness to inaccuracy in finding what the user is searching for, are a common problem. This applies to Web searches as well as to searches in smaller intranet systems used by businesses and institutions.
The most common and familiar type of search engine is a key word driven Boolean search, which allows the user to submit one or more key words. The search engine then looks for these key words within the database being searched. Boolean searches are, by their very nature, very restrictive, often eliminating all documents that do not contain all the key words entered. The typical Boolean search method uses the “AND” operator and has been described as an exact match method. It makes no distinction between documents where one of a string of key words is missing and documents where all key words are missing. All documents not containing all the key words are eliminated from the search results. This eliminates documents that contain almost all of the key words and would be very relevant if found. At other times no results are found at all.
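The exact match behavior described above can be illustrated with a minimal sketch. The function name `boolean_and_search`, the whitespace tokenization, and the sample documents are assumptions made for illustration only; they are not taken from any particular search engine.

```python
def boolean_and_search(documents, keywords):
    """Return ids of documents that contain every key word (exact-match AND)."""
    wanted = {k.lower() for k in keywords}
    # A document qualifies only if the full key word set is a subset of its words.
    return [doc_id for doc_id, text in documents.items()
            if wanted <= set(text.lower().split())]


# Hypothetical sample documents.
DOCS = {
    "d1": "solar panel efficiency in cold climates",
    "d2": "solar panel efficiency in desert climates",
    "d3": "history of the steam engine",
}

# Three key words, all present in d1 only.
narrow_hits = boolean_and_search(DOCS, ["solar", "panel", "cold"])

# Five key words: d1 and d2 each miss exactly one word, d3 misses all five,
# yet all three documents are rejected in the same way.
and_hits = boolean_and_search(
    DOCS, ["solar", "panel", "efficiency", "cold", "desert"]
)
```

In this sketch `narrow_hits` contains only `d1`, while `and_hits` is empty: the two documents that match four of the five key words are treated no differently from the document that matches none.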
If instead the “OR” operator is used in the Boolean search method and a larger number of words are employed, the number of results or hits associated with any single word or phrase is usually large, and collectively the number would be very large. The “OR” operator expands the pool of matching documents rather than narrowing it and making the search more specific. Using both the “AND” and “OR” operators may produce results where the “AND” operator alone would give no results. However, when the “OR” operator is introduced into a string of numerous key words, the results no longer contain the same cohesive string of key words that they would with the “AND” operator alone. Introduction of the “OR” operator also often renders the ranking algorithms ineffective.
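For comparison, a minimal sketch of an “OR” search over a small hypothetical collection follows. The function name `or_search` and the sample documents are illustrative assumptions; matching is done on whitespace-separated words only.

```python
def or_search(documents, keywords):
    """Return ids of documents that contain at least one key word (OR)."""
    wanted = {k.lower() for k in keywords}
    # Any overlap between the key words and the document's words is a hit.
    return [doc_id for doc_id, text in documents.items()
            if wanted & set(text.lower().split())]


# Hypothetical sample documents; only d1 is about the intended topic.
DOCS = {
    "d1": "solar panel efficiency in cold climates",
    "d2": "wind turbine efficiency study",
    "d3": "cold storage for vaccines",
    "d4": "history of the steam engine",
}

or_hits = or_search(DOCS, ["solar", "panel", "efficiency", "cold"])
```

Here `or_hits` contains `d1`, `d2`, and `d3`. Two of the three hits share only a single key word with the query, which illustrates how the result set grows while its cohesion falls.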
A further problem of Boolean searches is that of the “precision rate” vs. the “recall rate.” The precision rate is the proportion of the documents found that are relevant, while the recall rate is the proportion of the relevant documents in the database being searched that were actually retrieved. If one desires greater precision and specificity, one must narrow the search. One does this by including a greater number of key words to better define the target information. However, in doing so one will exclude more and more relevant documents when using a typical Boolean search methodology, because a document is eliminated if any single word is missing from it. Therefore, if ten words were entered in a typical Boolean search and no document contained all ten words, but numerous documents contained nine or eight of them, the Boolean search would produce no results. A Boolean search cannot return results that are merely close in terms of the number of key words a document contains. Boolean logic is well suited to the zero-and-one, yes-or-no binary system, but it is incapable of finding highly probable results in a search.
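The two rates and the ten-word example above can be made concrete with a short sketch. The names `precision_recall` and `matched_keyword_count`, the placeholder words `w1` to `w10`, and the two sample documents are hypothetical. The per-document count is shown only as a diagnostic of what the exact-match test discards; it is not presented as the method of the invention.

```python
def precision_recall(retrieved, relevant):
    """Precision: relevant retrieved / all retrieved.
    Recall: relevant retrieved / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def matched_keyword_count(text, keywords):
    """Count how many of the key words occur in the text (diagnostic only)."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k.lower() in words)


# Hypothetical ten-word query and two documents that miss one or two words.
QUERY = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10"]
DOCS = {
    "nine_of_ten": "w1 w2 w3 w4 w5 w6 w7 w8 w9",
    "eight_of_ten": "w1 w2 w3 w4 w5 w6 w7 w8",
}

counts = {d: matched_keyword_count(t, QUERY) for d, t in DOCS.items()}

# Exact-match AND: a document qualifies only if it contains all ten words.
and_hits = [d for d, n in counts.items() if n == len(QUERY)]

# Assuming both documents are in fact relevant, measure the AND result.
precision, recall = precision_recall(and_hits, DOCS.keys())
```

With these inputs `and_hits` is empty and both rates are zero, even though `counts` shows that the documents contain nine and eight of the ten key words respectively.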
Because of the exact match nature of a typical Boolean search, when no results are produced there is no way of knowing which word or words caused the failure. As a result, the searcher must repeat the search, possibly eliminating some words and using the “OR” operator, in an attempt to refine the search and produce results. This can be a lengthy and tedious process and may still produce only limited results.
In addition, there is the problem of synonyms. Relevant documents can easily be overlooked because the document author and the searcher use different terms to describe the same thing. Including synonyms in a Boolean search increases the chances that no results will be found, since each added word eliminates every document that does not contain all of the words.
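The effect of adding a synonym to an “AND” query can be sketched as follows. The function name `and_search` and the two sample documents are illustrative assumptions only.

```python
def and_search(documents, keywords):
    """Return ids of documents that contain every key word (exact-match AND)."""
    wanted = {k.lower() for k in keywords}
    return [doc_id for doc_id, text in documents.items()
            if wanted <= set(text.lower().split())]


# Hypothetical documents whose authors use different terms for the same thing.
DOCS = {
    "d1": "car repair manual",
    "d2": "automobile repair guide",
}

# Without the synonym, the document using the other term is overlooked.
without_synonym = and_search(DOCS, ["car", "repair"])

# With the synonym added, no document contains all three words.
with_synonym = and_search(DOCS, ["car", "automobile", "repair"])
```

In this sketch `without_synonym` finds only `d1`, and `with_synonym` finds nothing: adding the synonym removes the one hit rather than recovering the overlooked document.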
When the “OR” operator is used, the results become difficult to manage effectively because of their volume and the large number of irrelevant documents among them.
Often complex algorithms using proximity analysis, past user preferences, frequency analysis, and other methodologies are used to attempt to sort or rank the hits in a relevant order. These methods have proven to be inefficient since frequently many irrelevant documents accompany relevant documents. This is particularly true when the number of key words is not large.
As explained earlier, in a Boolean system increasing the number of key words enhances specificity and reduces the number of irrelevant documents, but at the same time eliminates relevant documents. Boolean searches commonly bring up a list of relevant and irrelevant documents with widely varying degrees of relevance. An extremely relevant document may exist in the database being searched, but because of the use of the “AND” operator it may be excluded when only one of many words is missing. In a Boolean search, the only means of narrowing the search to find relevant documents is through the use of the “AND” operator. There is a need for a more effective method for a user to find all, or a much larger portion, of the relevant documents within the databases being searched.
There is a great need for a search engine that can overcome these drawbacks and provide the user with results that match more accurately the information being sought.
Accordingly, in illustrative aspects of the present invention there is provided a system, method, and computer program product for a search engine utilizing a large number of key words or phrases generating a much larger number of highly relevant documents.
Still other aspects, features, and advantages of the present invention are readily apparent from the following detailed description, by illustrating a number of illustrative embodiments and implementations, including a preferred mode contemplated for carrying out the present invention. The present invention is also capable of other and different embodiments, and its several details can be modified in various respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and descriptions are to be regarded as illustrative in nature, and not as restrictive.