1. Field of the Invention
The present invention generally relates to information retrieval in data processing systems and, more particularly, to a document matcher that matches new documents to a database of stored documents in order to find the most relevant matches.
2. Background Description
The problem of matching a document against documents stored in a database has been addressed before, but the previous approaches require substantial storage and computing resources. They employ much more complicated document representations and document matching algorithms.
U.S. Pat. No. 4,358,824 to Glickman et al. discloses an office correspondence storage and retrieval system. Keywords are selected from a document using a part-of-speech dictionary. Comparison between a document and a query uses the part of speech and position of occurrence in the document, the number of pages in the document, and whether or not the document includes a month and year. The present invention does not use any of these features.
U.S. Pat. No. 4,817,036 to Millett et al. discloses a computer system and method for data base indexing and information retrieval. In this system, an inverted index of the document database is computed and stored. Query keywords are looked up in the index, and the resulting bit strings are manipulated to produce an answer vector from which the matching documents can be found. Aside from the generic use of keywords, this system is entirely different from the present invention.
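The general idea of an inverted index with bit strings can be illustrated with a minimal sketch. This is a generic illustration only, not the method of the Millett et al. patent: the function names (build_index, answer_vector, matching_docs), the whitespace tokenization, and the choice of a bitwise AND to combine keywords are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted index mapping each keyword to a bit string.

    The bit string is held in a Python int: bit i is set when
    document i contains the keyword. (Hypothetical helper.)
    """
    index = defaultdict(int)
    for doc_id, text in enumerate(docs):
        # Naive tokenization: lowercase and split on whitespace (assumption).
        for word in set(text.lower().split()):
            index[word] |= 1 << doc_id
    return index


def answer_vector(index, query_keywords):
    """Combine the bit strings of all query keywords with bitwise AND.

    A keyword missing from the index contributes an all-zero bit string,
    so the answer vector is zero when any keyword is unknown.
    """
    bit_strings = [index.get(word.lower(), 0) for word in query_keywords]
    if not bit_strings:
        return 0
    vector = bit_strings[0]
    for bits in bit_strings[1:]:
        vector &= bits
    return vector


def matching_docs(vector):
    """Return the document numbers whose bits are set in the answer vector."""
    return [i for i in range(vector.bit_length()) if (vector >> i) & 1]


docs = [
    "invoice payment overdue",
    "payment received thanks",
    "meeting agenda attached",
]
index = build_index(docs)
print(matching_docs(answer_vector(index, ["payment"])))
print(matching_docs(answer_vector(index, ["payment", "overdue"])))
```

Each query touches only the bit strings of its own keywords, which illustrates why such an index must be computed and stored ahead of time for the whole document collection.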
U.S. Pat. No. 5,371,807 to Register et al. discloses a method and apparatus for text classification. This patent describes a system in which recognized keywords are used to deduce further facts about the document, which are then used to compute category membership. The present invention does not use a fact database for any purpose.
U.S. Pat. No. 5,418,948 to Turtle discloses concept matching of natural language queries with a database of document concepts. In this invention, query words are stemmed and sequences of stems are looked up in a phrase dictionary. The stemmed words and found phrases are used as query nodes in a query network, which is matched against a document network. The present invention uses neither phrases nor query networks.
U.S. Pat. No. 5,694,559 to Hobson et al. discloses an on-line help method and system utilizing free text query. After identifying query keywords, this invention performs disambiguation and other forms of analysis. Each keyword is then associated with a concept. Each concept has a likelihood of being associated with a help topic. The present invention does not require analysis of identified keywords and does not have a defined set of concepts and probabilities associated with help topics.