During the last decade, there has been explosive growth in the capability to both generate and collect data. Advances in data collection, the widespread use of bar codes for most commercial products, and the computerization of many business and government transactions have flooded us with information. It is estimated that about 1 exabyte (= 1 million terabytes) of data was generated in calendar year 2000, and the trend is accelerating. The data collected could be a source of valuable information. However, finding valuable information and synthesizing the useful knowledge hidden in the data is a non-trivial task. Without adequate means to explore large amounts of data, the data becomes useless and the databases become data “dumps”.
There is an urgent need for new techniques and tools that can intelligently and automatically assist a user in transforming data into useful knowledge. The emerging field of data mining and knowledge discovery in databases (KDD) has attracted significant research and product interest. Data mining can be defined as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data”. Statistics, databases, machine learning, artificial intelligence and visualization techniques are applied in an attempt to discover and present knowledge in a form that is easily comprehensible to a human. Data mining research spans classification and clustering, trend and deviation analysis, dependency modeling, integrated discovery systems, next generation database systems, visualization, and application case studies. Many tools and services are commercially available, such as Decisionsite from Spotfire (Spotfire, http://www.spotfire.com/products/decision.asp), Insightful Miner from Insightful (Insightful, http://www.insightful.com/products/product.asp?PID=26), Clementine from SPSS (SPSS, http://www.spss.com/spssbi/clementine/index.htm), VisuaLinks from Visual Analytics (Visual Analytics, Inc. www.visualanalytics.com), and Enterprise Miner from SAS (SAS Institute Inc. www.sas.com). However, there has been only limited success in adopting data mining technologies and tools for practical applications.
Prior art approaches fall into two extremes. At one extreme, the approach relies heavily on a human's ability to search the database, to understand the detailed meaning of feature attributes, and to comprehend statistics and learning methods. We call this approach the human dominated method. A visual data mining method (Keim Daniel, “Information Visualization and Visual data Mining”, IEEE Trans. on Visualization and Computer Graphics, Vol. 7, No 1, Jan-March 2002) was developed that uses special visualization techniques to facilitate users' direct involvement in the data mining process. Visual data mining techniques prioritize and display relations between data fields to harness the enormous human visual information processing capacity in order to rapidly traverse large information spaces and facilitate comprehension with reduced anxiety. However, the method falls short of empowering users to harness vast data for efficient discovery of novel and important information. For noisy and inhomogeneous data sets it becomes ineffective because it cannot help the human separate strong data from weak data or exhibit the effects of strong or weak decisions. Unfortunately, some of the most important opportunities for data mining (e.g., geology, natural resource exploration, biomedical drug discovery, experimental physics) are characterized by weak and noisy data. This results in inconsistent data mining performance and makes it difficult to create highly novel concepts and knowledge. This approach is also extremely inefficient when the database being explored is large.
The other extreme of the prior art relies heavily on a computer to automatically generate rules and discover knowledge from data (Ian H. Witten, Eibe Frank “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”, Morgan Kaufmann, October 1999). We call this approach the computer dominated method. The automatic method relieves a human of the responsibility for deciding on the significance of, and the relationships between, the data. Unfortunately, the methods are very limited and rely on assumptions that are often not valid in practical applications. When a mismatch exists between the assumptions and the application situation, the automatically generated knowledge could be worse than useless, since the knowledge indicated is not valid and may mislead the human. As application demand and data complexity increase, a general-purpose, fully automatic data mining/knowledge discovery technology is not in sight. The path to success is the integration of human direction with computer inputs from automatic learning results. Existing software allows users to create data models and reach conclusions with measurable confidence only through arduous icon-based programming tasks, and the resulting data models are difficult to modify and understand. This interaction is reluctant, slow, costly and manual. Furthermore, most of the automatic learning methods do not support incremental update, so human feedback is not easily incorporated to refine the automatically generated knowledge. This invention bridges the gap between the human dominated method and the computer dominated method. It lays the foundation for next generation integrated intelligent human/computer interactive data mining.
The effectiveness of human data mining could be greatly improved if the visualization of data could be effectively ranked and clustered according to the strength of the data and the strength of the decision processes. Furthermore, counter examples could be shown through a contrasting approach that facilitates human discovery of subtle differences. The hierarchic structure of the regulation tree of this invention naturally maps to information granularity. This is an ideal representation that supports a multi-level abstraction data mining process: overview, zoom and filter, details-on-demand.
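As an illustration only, the following minimal sketch shows one way that ranking by strength, a contrasting counter example, and a coarse overview level might be computed for a small data set. It is not the regulation tree of this invention. The `Sample` record, its `strength` score, the Euclidean distance used to pick the counter example, and the sample values are all hypothetical assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from math import dist
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: features, a class label, and a strength score."""
    name: str
    features: Tuple[float, ...]
    label: str
    strength: float  # assumed confidence in [0, 1]; higher means stronger data


def rank_by_strength(samples: List[Sample]) -> List[Sample]:
    """Order samples from strongest to weakest for display."""
    return sorted(samples, key=lambda s: s.strength, reverse=True)


def contrast_example(target: Sample, samples: List[Sample]) -> Optional[Sample]:
    """Return the nearest sample with a different label (a counter example)."""
    rivals = [s for s in samples if s.label != target.label]
    if not rivals:
        return None
    return min(rivals, key=lambda s: dist(s.features, target.features))


def overview(samples: List[Sample]) -> Dict[str, Tuple[int, float]]:
    """Coarse level: per-label sample count and mean strength."""
    groups: Dict[str, List[float]] = {}
    for s in samples:
        groups.setdefault(s.label, []).append(s.strength)
    return {label: (len(v), sum(v) / len(v)) for label, v in groups.items()}


# Made-up example data, for illustration only.
samples = [
    Sample("a", (1.0, 1.0), "ore", 0.9),
    Sample("b", (1.2, 0.9), "ore", 0.4),
    Sample("c", (1.1, 1.3), "barren", 0.7),
    Sample("d", (5.0, 5.0), "barren", 0.95),
]

ranked = rank_by_strength(samples)              # strongest samples first
rival = contrast_example(samples[0], samples)   # closest differently labeled sample
summary = overview(samples)                     # overview level before zooming in
```

In this sketch a user would start from `summary`, narrow to one label, and then inspect individual samples in `ranked` order alongside their counter examples, loosely mirroring the overview, zoom and filter, details-on-demand progression.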