This disclosure relates generally to data processing, and more particularly to a system and method for managing a knowledge base. In an adaptive workflow modeling project, the domain knowledge model must remain flexible and adaptive as new information becomes available. For example, in a production printing workflow domain, a comprehensive knowledge model captures multiple layers of semantics about user constraints, a wide range of product offerings and their capabilities, production printing workflow patterns, business partners and competitors, etc. The knowledge model may be built on current subject matter expertise in five market-defined production workflow environments: book printing, print-on-demand, personal communication, transactional and promotional printing, and unified offset and digital printing. However, as the market and technology constantly evolve, new products or devices become available, new partnerships are formed around the world, and new markets and competitors emerge.
Accordingly, in an adaptive knowledge base system, as information evolves, new instances of knowledge must be entered into the repository or knowledge base without redundancy. Algorithms exist for determining whether a knowledge instance to be entered into the knowledge base already exists, thereby avoiding instance redundancy. A number of algorithms for preventing entry of a redundant information instance are described by A. E. Monge and C. P. Elkan in "The Field Matching Problem: Algorithms and Applications", Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 267-270, 1996. Specifically, Monge et al. describe algorithms for finding matching information which indicates redundancy, including a basic field matching algorithm for string matching and a recursive algorithm for finding abbreviations which match a non-abbreviated knowledge instance. However, the basic field matching algorithm does not handle abbreviations, and the recursive algorithm has quadratic time complexity.
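For purposes of illustration only, the general flavor of such a field matching redundancy check may be sketched as follows. This sketch is not taken from the cited paper; the matching criterion (exact word match or a crude prefix-based abbreviation match) and all function names are assumptions introduced here for clarity.

```python
# Illustrative sketch of field-matching-style redundancy detection.
# The prefix test below is a simplistic stand-in for true abbreviation
# handling; it is an assumption of this sketch, not the cited algorithm.

def words(field: str) -> list[str]:
    """Split a field into lowercase words, treating periods as separators."""
    return field.lower().replace(".", " ").split()

def word_matches(a: str, b: str) -> bool:
    # Words match if equal, or if one is a prefix (crude abbreviation) of the other.
    return a == b or a.startswith(b) or b.startswith(a)

def fields_match(f1: str, f2: str) -> bool:
    """Two fields match when every word of the shorter field matches
    some word of the longer field (a nested scan, hence quadratic)."""
    w1, w2 = words(f1), words(f2)
    if len(w1) > len(w2):
        w1, w2 = w2, w1
    return all(any(word_matches(a, b) for b in w2) for a in w1)

def is_redundant(new_instance: str, knowledge_base: list[str]) -> bool:
    """A new instance is redundant if it matches any stored instance."""
    return any(fields_match(new_instance, kb) for kb in knowledge_base)
```

As the comments note, the nested word comparison exhibits the quadratic behavior mentioned above, and the prefix test only approximates abbreviation handling.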
Another algorithm for preventing entry of a redundant information instance is described by Mong Li Lee, Hongjun Lu, Tok Wang Ling and Yee Teng Ko in “Cleansing Data for Mining and Warehousing”, Proceedings of the 10th International Conference on Database and Expert Systems Applications (DEXA), Florence, Italy, August 1999, for finding matching information and determining the existence of redundancy. However, the algorithm described does not take character sequence into account.
In a process known as rule mining, patterns, relationships and associations within a knowledge base are uncovered. The knowledge base holds a set of values or items, wherein a subset of the knowledge base comprising a particular set of items is known as an itemset. The percentage of occurrences of a particular itemset is known as the support for the itemset. Itemsets whose support exceeds a predetermined threshold are known as large itemsets. The ratio of the frequency of occurrence of a large itemset to the frequency of occurrence of a subset of the large itemset in the knowledge base is used for establishing an association rule, where a confidence factor for the rule is related to the strength of the rule.
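The support and confidence measures defined above may be illustrated with a small example. The transaction data and item names below are hypothetical, introduced solely to make the definitions concrete.

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"printer", "toner", "paper"},
    {"printer", "toner"},
    {"printer", "paper"},
    {"toner", "paper"},
]

def support(itemset: set, db: list) -> float:
    """Fraction of transactions in db that contain the entire itemset."""
    return sum(itemset <= t for t in db) / len(db)

# Confidence of the rule {printer} -> {toner}: the support of the full
# itemset divided by the support of its antecedent subset.
conf = support({"printer", "toner"}, transactions) / support({"printer"}, transactions)
```

Here the itemset {printer, toner} appears in 2 of 4 transactions (support 0.5), while {printer} appears in 3 of 4 (support 0.75), giving the rule a confidence of 2/3.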
The support and confidence factors associated with established association rules are indicative of patterns, relationships and associations within the knowledge base. As new knowledge instances are added to the knowledge base, new association rules must be established and existing association rules must be updated. Algorithms for rule mining are described by R. Agrawal, T. Imielinski, and A. Swami in "Mining Association Rules Between Sets of Items in Large Databases", Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May 1993; and by M. Houtsma and A. Swami in "Set-Oriented Mining of Association Rules", Research Report RJ 9567, IBM Almaden Research Center, San Jose, Calif., October 1993. However, the described algorithms are inefficient in that the ratio of potential large itemsets to the final output of itemsets from which the rules are derived is exceedingly large.
The well-known Apriori algorithm, described by R. Agrawal and R. Srikant in "Fast Algorithms for Mining Association Rules", Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, September 1994, reduces the number of itemsets that need to be counted when generating large itemsets. The Apriori algorithm makes multiple passes over data stored in the knowledge base. In the first pass, the support values of individual itemsets are counted to determine whether those itemsets are large. In subsequent passes, the itemsets to be processed include only the large itemsets found in the previous pass. For each pass, a new set of potentially large itemsets, known as candidate itemsets, is generated, where the candidate itemsets are used as seeds for the next pass. The process continues until no new large itemsets are found. However, the Apriori algorithm is inefficient in that candidate itemsets are typically formed of items that would not be combined into an actual set.
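The multi-pass structure of the Apriori algorithm described above may be sketched as follows. This is a minimal, simplified rendering for illustration, not the published implementation; in particular, the candidate-generation join is written naively rather than in the paper's optimized form.

```python
from itertools import combinations

def apriori(db: list, min_support: float) -> list:
    """Minimal Apriori sketch: each pass counts only candidate itemsets
    built from the large itemsets found in the previous pass."""
    n = len(db)
    # Pass 1: count individual items and keep those that are large.
    items = sorted({i for t in db for i in t})
    large = [frozenset([i]) for i in items
             if sum(i in t for t in db) / n >= min_support]
    all_large = list(large)
    k = 2
    while large:
        prev = set(large)
        # Candidate generation: join large (k-1)-itemsets into k-itemsets,
        # then prune any candidate with a (k-1)-subset that was not large.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Counting pass: keep candidates whose support meets the threshold.
        large = [c for c in candidates
                 if sum(c <= t for t in db) / n >= min_support]
        all_large.extend(large)
        k += 1
    return all_large
```

The pruning step reflects the algorithm's key observation that every subset of a large itemset must itself be large; even so, as noted above, many surviving candidates may still fail the support threshold when counted.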