The invention relates generally to the field of classification of documents contained within large enterprise databases. More particularly, the invention relates to classification engines that classify documents by performing similarity searches to match classification profile data to data found in external databases containing known class data.
Information resources often contain large amounts of information that may be useful only if there exists the capability to segment the information into manageable and meaningful packets. Database technology provides adequate means for identifying and exactly matching disparate data records to provide a binary output indicative of a match. However, in many cases, users wish to classify information contained in databases based on inexact but similar attributes. This is particularly true where the database records may be incomplete, may contain errors, or may be inaccurate. It is also sometimes useful to be able to reduce the number of irrelevant classifications reported by traditional database classification programs. Traditional classification methods that make use of exact, partial, and range retrieval paradigms do not satisfy the content-based retrieval requirements of many users.
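The contrast drawn above between exact matching, which yields only a binary result, and classification on inexact but similar attributes can be illustrated with a minimal sketch. The sketch is illustrative only and is not the similarity search of the invention: it assumes Python's standard difflib.SequenceMatcher as a stand-in similarity measure, and the function names, the sample records, and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Traditional exact comparison: the only possible outputs are True and False."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Graded similarity in [0.0, 1.0]; difflib is an assumed stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(record: str, known: dict, threshold: float = 0.8):
    """Return (label, score) for the most similar known record.

    The label is None when no known record reaches the threshold, so that
    weak, likely irrelevant matches are not reported as classifications.
    """
    best_label, best_score = None, 0.0
    for text, label in known.items():
        score = similarity(record, text)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label, best_score
    return None, best_score


# Hypothetical known class data and a record containing a typographical error.
known_records = {"Acme Corporation": "vendor", "John Smith": "customer"}
print(exact_match("Acme Corporaton", "Acme Corporation"))
print(classify("Acme Corporaton", known_records))
```

In this sketch the misspelled record fails the exact comparison yet is still assigned the class of its closest known record, and the returned score indicates how strong the match was.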
Many existing classification systems require significant user training and model building to make effective use of the system. These models are very time-consuming to generate and to maintain. Another disadvantage of many model-based classification systems is that they appear as a black box to a user and only provide the resulting class or grade without any explanation of how the resultant conclusion was reached. Information about how the conclusion was reached is valuable if additional analysis is required to validate the conclusion. Some classification systems use a large set of complex rules that process data directly. These rules are difficult to generate and even more difficult to maintain because they contain many complex attributes.