Over the years, many different methods have been used to identify and classify documents by computer automation. Across many industries, there exists a need to find other documents that are similar to a specified document. There also exists a need to take a document and assign it to one of a multiplicity of document categories. Generically, these are known as document search and document classification functions.
One example of the usefulness of document search and classification can be found in the industry associated with intellectual property, specifically patents. The ideas that describe an invention are contained in a document that often includes figures and a textual description that allow others to understand its concepts. When applying for a patent, this description is the basis for assigning the document (classifying it) to one or more technology fields. As part of the process of obtaining the patent, other documents are searched to determine whether the concepts of the new invention are indeed novel and unique.
Many of the algorithms used to perform the search and classification functions operate on individual words. In some prior-art systems, a list of the unique words is obtained and the number of times each word occurs is noted. This list of words is often used as the basis for performing the document search. For example, in U.S. Pat. No. 6,189,002, the frequency count (the number of times the word is used in a document) is either the actual occurrence value if the word occurs fewer than six times, or 6 + log10(Count) if it occurs more than six times.
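The word-list construction and damped frequency count described above can be sketched as follows. This is a minimal illustration, not the implementation of the cited patent: the function names, the letters-only tokenizer, and the lowercasing are assumptions, and because the description does not say how a count of exactly six is handled, the sketch assumes the raw value is kept in that case.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Return the unique lowercase words in `text` with their occurrence counts.

    Assumption: a word is any run of ASCII letters; real systems tokenize
    differently.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def damped_count(count):
    """Damp large counts as described: raw value when small, 6 + log10 otherwise.

    Assumption: a count of exactly six keeps its raw value, since the
    description leaves that case unspecified.
    """
    if count <= 6:
        return float(count)
    return 6 + math.log10(count)


def damped_word_list(text):
    """Map each unique word to its damped frequency count."""
    return {word: damped_count(n) for word, n in word_counts(text).items()}
```

With this sketch, a word that occurs 100 times is weighted 8 rather than 100, so very frequent words no longer dominate the list.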
In evaluating how well a reference document matches other documents, a metric relating the reference document and the unknown document is typically required. In these types of prior-art systems, the word list extracted from the reference document is compared to the unknown document. The selection of the words that are extracted from the reference document and the algorithm used to compare the selected sets of words between the two documents are principal points of distinction for these existing systems.
Methods known in the art for selecting these words include using all of the words found in the document, excluding a small subset of the most common words (e.g., the, and, or, is), or selecting words based on the number of times they occur. Every word may be given the same importance, or the importance of a word may be adjusted based on the number of times it occurs in a document. Because the selection and weighting are this simple, matching algorithms that compare the word list of the reference document against the word list of the unknown document often produce less than satisfactory results.
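A simple matcher of the kind described above might look like the following sketch. It excludes a small set of common words and gives every remaining word equal importance. The overlap measure used here (shared words divided by all distinct words, i.e. Jaccard similarity) is an illustrative choice and is not attributed to any particular prior-art system; the stop-word set and function names are likewise assumptions.

```python
import re

# Assumption: a tiny stop-word set taken from the examples in the text.
STOP_WORDS = frozenset({"the", "and", "or", "is"})


def significant_words(text, stop_words=STOP_WORDS):
    """Return the set of unique lowercase words in `text`, minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in stop_words}


def overlap_score(reference, unknown):
    """Score two documents by word-set overlap (Jaccard similarity).

    Every word counts equally; returns a value between 0.0 and 1.0.
    """
    ref_words = significant_words(reference)
    unk_words = significant_words(unknown)
    if not ref_words or not unk_words:
        return 0.0
    return len(ref_words & unk_words) / len(ref_words | unk_words)
```

For example, "the cat is black" and "the cat is white" share one of three distinct significant words, so the score is 1/3 even though the sentences differ in only a single word, which hints at why such simple matching can be unsatisfactory.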
It is also common for the reference document and the unknown documents to differ significantly in length, which causes problems when the matching algorithm is wholly or partly dependent upon the number of times a word occurs. In a longer document, the count (number of occurrences) for a word is likely to be higher than in a shorter document. It is also often difficult to establish the true length of a document when it contains tables of data, code listings, figures, a table of contents, a glossary of terms, or an index showing where topics appear in the document.
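The length effect can be illustrated with a short sketch that contrasts raw counts with counts divided by the total number of words. Dividing by document length is one common normalization, shown here only as an assumption for illustration; it does not address the separate difficulty of deciding which parts of a document (tables, listings, indexes) should count toward its length. The function names are hypothetical.

```python
import re
from collections import Counter


def raw_counts(text):
    """Return the occurrence count of each lowercase word in `text`."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def relative_frequencies(text):
    """Return each word's count divided by the total number of words.

    Assumption: document length is simply the number of tokens found.
    """
    counts = raw_counts(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


short_doc = "engine valve"
long_doc = "engine valve " * 10  # same content repeated ten times

# Raw counts differ by a factor of ten; relative frequencies are identical.
raw_gap = raw_counts(long_doc)["engine"] - raw_counts(short_doc)["engine"]
same_share = (
    relative_frequencies(long_doc)["engine"]
    == relative_frequencies(short_doc)["engine"]
)
```

Here the raw count of "engine" is 1 in the short document and 10 in the long one, while its relative frequency is 0.5 in both.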
The problems mentioned above for comparing two documents can also occur when a document must be classified into one or more specific categories. In the example of submitting a document during the patent application process, a determination must be made as to the major and minor patent classification categories. Should the document become a patent, it will be assigned a primary and perhaps one or more secondary patent classifications. It is common for a patent to contain ideas that span multiple patent classification categories, which requires a decision as to which of the ideas takes precedence in assigning the primary classification number.
A need exists to better identify and extract the significant words from a document and to provide a better technique for creating a metric that accurately reflects the degree of similarity of two or more documents.