Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items. The best known examples of data mining applications are in database marketing, wherein the customer database is analyzed, using techniques such as interactive querying, segmentation, and predictive modeling, to select potential customers in a more precisely targeted way; in financial investment, wherein predictive modeling techniques are used to create trading models, select investments, and optimize portfolios; and in production manufacturing, wherein production processes are controlled and scheduled to maximize profit.
Data mining has been appropriate for these areas because, while significant amounts of data are present for analysis, the datasets are small enough that analysis can be performed quickly and efficiently using standard data mining techniques such as association rule mining (ARM), classification, and cluster analysis. This has not been the case in other data collection areas. For instance, bioinformatics, where analysis of microarray expression data for DNA is required; nanotechnology, where data fusion must be performed; VLSI design, where circuits containing millions of transistors must be tested for accuracy; spatial data, where data representative of detailed images can comprise millions of bits; and other such areas present datasets so large that mining implicit relationships among the data can be prohibitively time consuming with traditional methods.
The initial problem in establishing data mining techniques for these extremely large datasets is organizing the large amounts of data into an efficiently usable form that facilitates quick computer retrieval, interpretation, and sorting of the entire dataset or a subset thereof. The organizational format of the data should recognize that different bits of data can contribute to value in different degrees, i.e., in some applications the high-order bits alone may provide the information necessary for data mining, making the retention of all data unnecessary. The organizational format should also facilitate the representation of a precision hierarchy, i.e., a band may be well represented by a single bit or may require eight bits to be appropriately represented. Finally, the organizational format should facilitate the creation of an efficient, lossless data structure that is data-mining-ready, i.e., a data structure suited for data mining techniques.
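The requirements above, namely bits of differing significance, a precision hierarchy, and lossless storage, can be illustrated with a minimal sketch that splits each value of a band into separate bit planes. This is only one possible organization under stated assumptions, not necessarily the format contemplated here: the function names `to_bit_planes` and `from_bit_planes`, the 8-bit width, and the sample band values are all hypothetical.

```python
def to_bit_planes(values, width=8):
    """Split each integer into `width` bit planes, highest-order plane first."""
    planes = []
    for pos in range(width - 1, -1, -1):
        # Plane for bit position `pos`: one 0/1 entry per value in the band.
        planes.append([(v >> pos) & 1 for v in values])
    return planes


def from_bit_planes(planes, keep=None):
    """Rebuild values from the first `keep` planes (all planes by default).

    Any low-order planes that are left out are treated as zero bits.
    """
    width = len(planes)
    keep = width if keep is None else keep
    values = [0] * len(planes[0])
    for i in range(keep):
        shift = width - 1 - i
        values = [v | (bit << shift) for v, bit in zip(values, planes[i])]
    return values


# Hypothetical 8-bit band values, for illustration only.
band = [200, 13, 129, 64]
planes = to_bit_planes(band)

# Using every plane reproduces the band exactly (lossless).
assert from_bit_planes(planes) == band

# Using only the two high-order planes gives a coarser view of the band.
coarse = from_bit_planes(planes, keep=2)
print(coarse)  # [192, 0, 128, 64]
```

Because each bit position is stored separately, an analysis that needs only coarse precision can read the high-order planes alone, while the full set of planes still reconstructs the original data without loss.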