A. Field of the Invention
The present invention relates generally to data processing and analysis systems and methods and, more particularly, to (1) a computer-based data processing and analysis system which performs data exploration, knowledge acquisition, and reasoning under conditions of uncertainty for discovering implicit or latent relationships in data and (2) a method of using the same.
B. Description of the Related Art
Data exploration (sometimes called "data mining") involves the development and use of tools that analyze large data sets in order to extract useful, but often hidden (or "latent") information from them. Information extracted from a particular database can be used to identify patterns of characteristics (features) and groupings (classes) of samples in the data. If the feature patterns of samples in each class are sufficiently similar within that class and are sufficiently dissimilar to the overall feature patterns of the other classes, then the feature patterns of each class may be used to develop classification rules for separating the different classes within that domain. The resulting classification rules may then be used to predict to which class a new and unclassified sample may belong based upon that new sample's feature pattern. A "classifier," or classification tool, is the culmination of such classification rules that are generated from input data called a training set.
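By way of illustration only, the relationship between a training set, the resulting classification rule, and the prediction for a new sample may be sketched as follows. The sketch assumes a nearest-centroid rule, which is one generic conventional technique chosen here for brevity and is not the method of the invention; the function names `train_centroids` and `classify` and the sample data are hypothetical.

```python
from collections import defaultdict
from math import dist


def train_centroids(training_set):
    """Average the feature vectors of each class in the training set."""
    grouped = defaultdict(list)
    for features, label in training_set:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, sample):
    """Assign the sample to the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], sample))


# Hypothetical training set: (feature pattern, class label) pairs.
training_set = [
    ((1.0, 1.2), "A"), ((0.8, 1.0), "A"),
    ((4.0, 4.2), "B"), ((4.2, 3.8), "B"),
]
centroids = train_centroids(training_set)

# A new, unclassified sample is assigned to the class it most resembles.
predicted = classify(centroids, (0.9, 1.1))
```

Such a rule works only to the extent that the feature patterns within each class are similar to one another and dissimilar to those of the other classes, which is the condition stated above.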
Conventional classification techniques typically include some kind of data exploration method that derives the classification rules. Although many classification methods already exist, they are all affected by one or more of three factors: (1) lack of interpretability, (2) assumptions made about the data when building a classifier, and (3) data requirements. The first factor is a question of how semantically interpretable the classification rules and analysis results are. In some cases, such as chemical process monitoring, it is vital that a user be able to understand exactly what factors will allow discrimination between the classes. In other situations, however, only the result is of importance and, therefore, the semantic interpretability is not as important an influence on the choice of classification method. The second factor limits the usefulness of the resulting classifier if the assumptions made when applying the classification technique are inappropriate for the given data set. The third factor affects those classification methods that require a specific size data set, or require the classes to have equivalent properties in terms of membership number or other properties such as covariance structure.
For example, classifiers such as neural networks and soft independent modeling of class analogy (SIMCA) result in classification rules that are quite challenging to interpret semantically, and usually require large amounts of data. See B. Ripley, Pattern Recognition and Neural Networks (Cambridge University Press 1996); and M. Sharaf et al., Chemometrics (Wiley 1986). Standard statistical discriminant analysis techniques, although fairly easy to interpret, inherently make assumptions about the structures of the underlying classes in the data, which limits their validity and effectiveness when these assumptions cannot be justified with real-world data. See Ripley, supra; and G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition (Wiley 1992).
None of the classification techniques mentioned above performs reasoning under uncertainty (a term that, in this case, refers to providing a classification when not all of the evidence about a sample is either known or known with absolute certainty). Although many classification techniques have been developed using fuzzy logic for use in such situations, as with many other classification methods, the fundamental assumptions of fuzzy logic and the shape of the classes' membership functions cannot always be justified with real-world data. Recently, Bayesian networks (or belief networks) have been used as classifiers to perform classifications under conditions of uncertainty. See, e.g., P. Langley et al., Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI Press 1992); P. Langley and S. Sage, Proceedings of the Tenth National Conference on Uncertainty in Artificial Intelligence (Morgan Kaufmann 1994); D. Heckerman and C. Meek, Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (Morgan Kaufmann 1997); and D. Heckerman, 1 Data Mining and Knowledge Discovery 79 (1997). Unfortunately, Bayesian networks require making assumptions about the distributions of the underlying classes in the data, and are therefore not an optimal choice for many real-world applications, as can be seen from the results reported in C. Wellington and D. Bahler, Predictive Toxicology of Chemicals: Experiences and Impact of AI Tools (AAAI 1999).
An object of the invention is to provide a data exploration and analysis method that will discover implicit, or hidden, relationships in data, upon which classifiers may be constructed, without falling prey to limitations due to unfounded assumptions, lack of interpretability, or restrictive data requirements shown in conventional classification techniques and data exploration methods.
Additional objects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
To achieve the objects and in accordance with the purpose of the invention, as embodied and broadly described herein, the invention comprises a data processing and analysis system for discovering implicit relationships in data, including: means for inputting empirical data, expert domain knowledge, domain conditions, and new sample data; a computing means for receiving the empirical data and expert domain knowledge from the inputting means, the computing means having a memory means connected to a processing means, wherein the processing means stores the empirical data and expert domain knowledge in the memory means, pre-processes the empirical data and expert domain knowledge, selects and extracts features from the empirical data and expert domain knowledge, generates correlation matrices, derives conditional probability tables, calculates posterior probabilities, creates a domain model, incorporating any user-defined domain conditions, for storage in the memory means, provides an output signal representative of the domain, and provides an output signal representing a classification of the new sample data; and means for receiving the output signal representative of the domain and the output signal representing the classification of the new sample data, for graphically displaying the representation of the domain, and for displaying the classification of the data set.
To further achieve the objects, the present invention comprises a data processing and analysis method in a computer system for discovering implicit relationships in data using an input means and a display means connected to a computing means having a memory means connected to a processing means, the method including the steps of: inputting empirical data, expert domain knowledge, domain conditions, and sample data to the computing means, via the input means; receiving the empirical data and expert domain knowledge in the computing means; utilizing the processing means to store the empirical data and expert domain knowledge in the memory means, to pre-process the empirical data and expert domain knowledge, to select and extract features from the empirical data and expert domain knowledge, to generate correlation matrices, to derive conditional probability tables, to calculate posterior probabilities, to create a domain model, incorporating any user-defined domain conditions, for storage in the memory means, to provide an output signal representative of the domain, and to provide an output signal representing a classification of the new sample data; receiving the output signal representative of the domain and the output signal representing the classification of the new sample data in the display means; graphically displaying the representation of the domain on the display means; and displaying the classification of the data set on the display means.
To still further achieve the objects, the present invention comprises a computer program product for use with a computer system for directing the system to discover implicit relationships in data, the computer program product including: a computer readable medium; means, provided on the computer readable medium, for directing the system to receive empirical data, expert domain knowledge, domain conditions, and new sample data; means, provided on the computer readable medium, for storing the empirical data and expert domain knowledge in the computer readable medium; means, provided on the computer readable medium, for pre-processing the empirical data and expert domain knowledge; means, provided on the computer readable medium, for selecting and extracting features from the empirical data and expert domain knowledge; means, provided on the computer readable medium, for generating correlation matrices; means, provided on the computer readable medium, for deriving conditional probability tables; means, provided on the computer readable medium, for calculating posterior probabilities; means, provided on the computer readable medium, for creating a domain model, incorporating any user-defined domain conditions, for storage in the computer readable medium; means, provided on the computer readable medium, for providing an output signal representative of the domain, and an output signal representing a classification of the new sample data; means, provided on the computer readable medium, for receiving the output signal representative of the domain and the output signal representing the classification of the new sample data in the display means; means, provided on the computer readable medium, for graphically displaying the representation of the domain on a display means; and means, provided on the computer readable medium, for displaying the classification of the data set on the display means.
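By way of illustration only, three of the processing steps recited above (generating correlation matrices, deriving conditional probability tables, and calculating posterior probabilities) may be sketched as follows. The sketch rests on assumptions not found in this description: features are discrete, the correlation is a Pearson correlation, the tables are estimated from counts with Laplace smoothing, and features are treated as conditionally independent given the class. All function names and the sample data are hypothetical, and the sketch should not be read as the particular computation performed by the invention.

```python
from collections import Counter, defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Feature-by-feature correlation matrix of the empirical data."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]


def conditional_probability_tables(rows, labels, alpha=1.0):
    """Estimate P(feature j = v | class c) from counts (assumed smoothing)."""
    n_features = len(rows[0])
    values = [sorted({row[j] for row in rows}) for j in range(n_features)]
    class_counts = Counter(labels)
    counts = defaultdict(float)
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(j, v, c)] += 1
    cpt = {}
    for c, n_c in class_counts.items():
        for j in range(n_features):
            for v in values[j]:
                cpt[(j, v, c)] = (counts[(j, v, c)] + alpha) / (
                    n_c + alpha * len(values[j])
                )
    return cpt, class_counts


def posterior(cpt, class_counts, sample):
    """Posterior class probabilities; features given as None are skipped."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total  # prior estimated from class frequencies
        for j, v in enumerate(sample):
            if v is not None:  # unknown evidence contributes nothing
                p *= cpt[(j, v, c)]
        scores[c] = p
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Hypothetical empirical data: two binary features, two classes.
rows = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (0, 1)]
labels = ["A", "A", "B", "B", "A", "B"]

corr = correlation_matrix(rows)
cpt, class_counts = conditional_probability_tables(rows, labels)

# Classification with the second feature unknown: P(A | f0 = 1) is about 0.8.
post_partial = posterior(cpt, class_counts, (1, None))
post_full = posterior(cpt, class_counts, (1, 0))
```

Skipping features whose value is `None` is one simple way to return a classification when not all of the evidence about a sample is known. Feature values absent from the training data are not handled in this sketch.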
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.