A core aspect of knowledge is the use of observations of objects in different environments to identify relationships between the observed objects, or between properties associated with them, and the use of this understanding to identify cause-effect relationships. With the amount of electronic information increasing exponentially, there is a growing need for extracting knowledge through data mining.
The extraction of knowledge through data mining requires identifying relationships among a frequently large number of heterogeneous observations regarding attributes, characteristics, color, shape, virtues, merits, capacities, features, diagnostic fingerprints, properties, dependencies, actions, behavior, qualities, nearest neighbor interactions, functional relationships, functions, forces, purpose, and effects that are associated with large numbers of objects, often having very different units of measurement and scales (molecules, proteins, cells, tissues, organs, organisms, plants, animals, humans, planets, solar systems), and distinguishing observations that are meaningful for context recognition (the discernment of close relationships) from observations that are not. [Rajaraman, A.; Ullman, J. D. (2011). “Data Mining”. Mining of Massive Datasets. pp. 1-17.] Likewise, the extraction of knowledge through data mining also requires identifying objects that share characteristic observations and translating this information into knowledge. In this respect, knowledge is a continuum ranging from implicit knowledge, such as statistical co-occurrences between objects and observations, to more explicit knowledge, such as causal relations.
Thus, a first problem in knowledge extraction through data mining is to distinguish observations that are infrequent but meaningful from observations that are frequent but meaningless for context or group recognition. This is by no means an easy task, and several methods have been developed to address it [Yang, Guang-Zhong, and Magdi Yacoub. “Body sensor networks.” (2006): 500; Robert Taaffe et al., “Displaying demographic information of members discussing topics in a forum,” U.S. Pat. No. 8,462,160].
A second and more difficult problem in the extraction of knowledge through data mining is the identification of relationships between observations that have very different scales or units of measurement (Fayyad, Usama, Gregory Piatetsky-Shapiro, and Padhraic Smyth. “From data mining to knowledge discovery in databases.” AI Magazine 17.3 (1996): 37.).
A third and even more difficult problem in extracting knowledge through data mining is the identification of cause-effect linkages across disparate environments, which requires quantifying similarities between observations having different units of measurement and scales (Fliri, Anton F., William T. Loging, and Robert A. Volkmann. “Analysis of system structure-function relationships.” ChemMedChem 2.12 (2007): 1774-1782).
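The effect of differing units and scales on a similarity measure can be illustrated with a minimal sketch. It uses z-score standardization, a common general-purpose technique chosen here as an assumption and not the method of the works cited above. The four objects, their two features (a mass in daltons and a unitless score between 0 and 1), and the helper names `standardize` and `nearest` are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation.

    Assumes no column is constant, since that would give a zero divisor.
    """
    columns = list(zip(*rows))
    centers = [mean(c) for c in columns]
    spreads = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, centers, spreads))
            for row in rows]

def nearest(rows, names, target):
    """Name of the object with the smallest Euclidean distance to `target`."""
    i = names.index(target)
    others = [(dist(rows[i], rows[j]), names[j])
              for j in range(len(rows)) if j != i]
    return min(others)[1]

# Made-up observations: (mass in daltons, unitless score in [0, 1]).
names = ["A", "B", "C", "D"]
raw = [(300.0, 0.10), (310.0, 0.90), (400.0, 0.12), (500.0, 0.95)]

print(nearest(raw, names, "A"))               # mass dominates the raw distance
print(nearest(standardize(raw), names, "A"))  # both features now contribute
```

On the raw values the large-scale feature decides the nearest neighbor almost alone, whereas after standardization the small-scale feature carries comparable weight and the nearest neighbor of A changes from B to C.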
A fourth problem arises from information gaps, which are very common. To remedy this particular problem, methods have been explored for automatically inferring cause-effect relationships. These approaches are usually divided into two main categories: proxy methods and natural language processing (NLP) based methods. Proxy methods attempt to use secondary observations associated with objects or events to infer cause-effect relationships. For example, Burton and Simonaitis used this approach for inferring cause-effect relationships of medications using drug indication data [Burton, M. M., Simonaitis, L., and Schadow, G. Medication and indication linkage: a practical therapy for the problem list?. Proc AMIA Symp. 2008; 86-90]. Lin and Haug described a more sophisticated system based on Bayesian networks [Lin, J. H. and Haug, P. J. Exploiting missing clinical data in Bayesian network modeling for predicting medical problems. J Biomed Inform. 2008; 41: 1-14]. Bayesian networks can model the uncertainty that arises from information gaps, and their analysis is based on a graphical formalism in which each variable is modeled as a node and a causal relationship between two variables may be represented as a directed arc. For each node, a conditional probability table or formula is supplied that represents the probability of each value of this node, given the values of its parents. Application of this tool usually requires expert knowledge and training sets.
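The graphical formalism described above can be sketched in a few lines. The following is a minimal illustration, not the system of Lin and Haug: the three-node network (Condition as parent of Medication and LabResult), all probability values, and the function names `joint` and `posterior` are hypothetical. It shows how a conditional probability table per node yields a posterior even when one observation is missing, by summing over the unobserved variable.

```python
from itertools import product

# Hypothetical network: Condition -> Medication, Condition -> LabResult.
# Each node lists its parents; a directed arc runs from parent to child.
PARENTS = {
    "Condition": (),
    "Medication": ("Condition",),
    "LabResult": ("Condition",),
}

# Conditional probability tables: P(node is True | parent values).
# All numbers are made-up illustrative values, not clinical data.
CPT = {
    "Condition": {(): 0.1},
    "Medication": {(True,): 0.8, (False,): 0.05},
    "LabResult": {(True,): 0.7, (False,): 0.1},
}

def joint(assignment):
    """Probability of one complete assignment, as a product of CPT entries."""
    p = 1.0
    for node, parents in PARENTS.items():
        p_true = CPT[node][tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

def posterior(query, evidence):
    """P(query is True | evidence), summing over every unobserved variable."""
    hidden = [n for n in PARENTS if n != query and n not in evidence]
    totals = {True: 0.0, False: 0.0}
    for q_val in (True, False):
        for values in product((True, False), repeat=len(hidden)):
            assignment = {**evidence, query: q_val, **dict(zip(hidden, values))}
            totals[q_val] += joint(assignment)
    return totals[True] / (totals[True] + totals[False])

# LabResult is missing here, so it is marginalized out rather than guessed.
print(posterior("Condition", {"Medication": True}))
# With both observations available, the same function uses them directly.
print(posterior("Condition", {"Medication": True, "LabResult": True}))
```

Exhaustive enumeration is used only because the example is tiny; its cost grows exponentially with the number of unobserved variables, so larger networks call for dedicated inference algorithms.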
In addition to proxy methods, a variety of natural language processing (NLP) methods have also been proposed. Such systems extract information from unstructured text, such as clinical study progress notes, and use association statistics to ascertain cause-effect relationships; however, one shortcoming of this approach is that NLP tools need customized dictionaries, which limits the usefulness of NLP applications in broad text-mining-based cause-effect analysis.