1. Field of Endeavor
The present invention relates to data mining and more particularly to parallel object-oriented data mining.
2. State of Technology
U.S. Pat. No. 5,692,107 for a method for generating predictive models in a computer system by Simoudis et al., patented Nov. 25, 1997, provides the following information, “Accurate forecasting relies heavily upon the ability to analyze large amounts of data. This task is extremely difficult because of the sheer quantity of data involved and the complexity of the analyses that should be performed. The problem is exacerbated by the fact that the data often resides in multiple databases, each database having different internal file structures. Rarely is the relevant information explicitly stored in the databases. Rather, the important information exists only in the hidden relationships among items in the databases. Recently, artificial intelligence techniques have been employed to assist users in discovering these relationships and, in some cases, in automatically discovering the relationships. Data mining is a process that uses specific techniques to find patterns in data, allowing a user to conduct a relatively broad search of large databases for relevant information that may not be explicitly stored in the databases. Typically, a user initially specifies a search phrase or strategy and the system then extracts patterns and relations corresponding to that strategy from the stored data. These extracted patterns and relations can be: (1) used by the user, or data analyst, to form a prediction model; (2) used to refine an existing model; and/or (3) organized into a summary of the target database. Such a search system permits searching across multiple databases. There are two existing forms of data mining: top-down; and bottom-up. Both forms are separately available on existing systems.
Top-down systems are also referred to as “pattern validation,” “verification-driven data mining” and “confirmatory analysis.” This is a type of analysis that allows an analyst to express a piece of knowledge, validate or invalidate that knowledge, and obtain the reasons for the validation or invalidation. The validation step in a top-down analysis requires that data refuting the knowledge as well as data supporting the knowledge be considered. Bottom-up systems are also referred to as “data exploration.” Bottom-up systems discover knowledge, generally in the form of patterns, in data. Existing systems rely on the specific interface associated with each database, which further limits a user's ability to dynamically interact with the system to create sets of rules and hypotheses that can be applied across several databases, each having separate structures. For large data problems, a single interface and single data mining technique significantly inhibits a user's ability to identify all appropriate patterns and relations. The goal of performing such data mining is to generate a reliable predictive model that can be applied to data sets. Furthermore, existing systems require the user to collect and appropriately configure the relevant data, frequently from multiple and diverse data sources. Little or no guidance or support for this task is produced. Thus, there remains a need for a system that permits a user to create a reliable predictive model using data mining across multiple and diverse databases.”
U.S. Pat. No. 5,758,147 for efficient information collection method for parallel data mining by Chen et al., patented May 26, 1998, provides the following information, “The importance of database mining is growing at a rapid pace. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data. Catalog companies can also collect sales data from the orders they receive. A record in such data typically consists of the transaction date, the items bought in that transaction, and possibly the customer-id if such a transaction is made via the use of a credit card or customer card. Analysis of past transaction data can provide very valuable information on customer buying behavior, and thus improve the quality of business decisions such as: what to put on sale; which merchandise should be placed on shelves together; and how to customize marketing programs; to name a few. It is, however, essential to collect a sufficient amount of sales data before any meaningful conclusions can be drawn therefrom. It is therefore important to devise efficient methods of communicating and mining the ‘gold’ in these often enormous volumes of partitioned data. The most important data mining problem is mining association rules. By mining association rules it is meant that given a database of sales transactions, the process of identifying all associations among items such that the presence of some items in a transaction will imply the presence of other items in the same transaction. It is known that mining association rules can be decomposed into two subproblems. First, all sets of items (itemsets) that are contained in a sufficient number of transactions above a minimum (support) threshold are identified. These itemsets are referred to as large itemsets. Once all large itemsets are obtained, the desired association rules can be generated therefrom in a straightforward manner.
Database mining in general requires progressive knowledge collection and analysis based on a very large transaction database. When the transaction database is partitioned across a large number of nodes in a parallel database environment, the volume of inter-node data transmissions required for reaching global decisions can be prohibitive, thus significantly compromising the benefits normally accruing from parallelization. It is therefore important to devise efficient methods for mining association rules in a parallel database environment.”
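The two subproblems named in the passage quoted above (finding large itemsets, then generating rules from them) can be illustrated with a minimal single-node sketch. It is not the parallel method of the cited patent. The function names, the brute-force enumeration of itemsets, and the `min_confidence` threshold are assumptions introduced here for illustration only; the enumeration is exponential in transaction size and suits only tiny examples.

```python
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Subproblem 1: count every itemset and keep those that occur in at
    least `min_support` transactions (brute force, illustration only)."""
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def association_rules(large, min_confidence):
    """Subproblem 2: derive rules lhs -> rhs from the large itemsets.
    `min_confidence` is a hypothetical parameter, not taken from the text."""
    rules = []
    for itemset, count in large.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                lhs = frozenset(combo)
                # Every subset of a large itemset is itself large,
                # so large[lhs] is always present.
                confidence = count / large[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


if __name__ == "__main__":
    sales = [
        ["bread", "milk"],
        ["bread", "butter"],
        ["bread", "milk", "butter"],
        ["milk"],
    ]
    found = large_itemsets(sales, min_support=2)
    for lhs, rhs, conf in association_rules(found, min_confidence=0.9):
        print(sorted(lhs), "->", sorted(rhs), conf)
```

In a partitioned setting, each node would have to exchange its local counts before the support threshold could be applied globally, which is the inter-node traffic the quoted passage is concerned with.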
U.S. Pat. No. 5,787,425 for an object-oriented data mining framework mechanism by Joseph Phillip Bigus, patented Jul. 28, 1998, provides the following description, “The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely sophisticated devices, capable of storing and processing vast amounts of data. As the amount of data stored on computer systems has increased, the ability to interpret and understand the information implicit in that data has diminished. In the past, data was stored in flat files, then hierarchical and network data based systems, and now in relational or object oriented databases. The primary method for analyzing that data has been to form well structured queries, for example using SQL (Structured Query Language), and then to perform simple aggregations or hypothesis testing against that data. Recently, a new technique called data mining has been developed, which allows a user to search large databases and to discover hidden patterns in that data. Data mining is thus the efficient discovery of valuable, non-obvious information from a large collection of data and centers on the automated discovery of new facts and underlying relationships in the data. The term “data mining” comes from the idea that the raw material is the business data, and the data mining algorithm is the excavator, sifting through the vast quantities of raw data looking for the valuable nuggets of business information. Because data can be stored in such a wide variety of formats and because the data values can have such a wide variety of meanings, data mining applications have in the past been written to perform specific data mining operations, and there has been little or no reuse of code between application programs. Thus, each data mining application is written from scratch, making the development process long and expensive.
Although the nuggets of business information that a data mining application discovers can be quite valuable, they are of little use if they are expensive and untimely discovered. Returning to the mining analogy, even if gold is selling for $900 per ounce, nobody is interested in operating a gold mine if it takes two years and $901 per ounce to get it out of the ground.”
U.S. Pat. No. 6,049,861 for locating and sampling of data in parallel processing systems by Bird et al., patented Apr. 11, 2000, provides the following information, “Parallel processing techniques are known, in which a plurality of data processing units are provided and a separate processing unit is assigned, for example, to its own mutually exclusive set of local data items to process. This can greatly reduce the overall processing time as compared with serial processing. The ‘nodes’ of a parallel processing system are the separate processing units, which each have their own processor and their own storage (or at least access to shared storage). Two models exist for processing data which is shared across a plurality of nodes of a parallel processing system. That is, where a dataset is to be processed in parallel, it is loaded into the storage of the plurality of parallel processing units of the system. In a first one of these models, known as the ‘master-slave’ model, processing is under the control of a master node, which may have its own share of the data. There is generally no more than one master node. The other nodes are referred to as slaves. In the second model, there is generally no one node which is in control: all nodes are communicating with each other in an ‘any-to-any’ model. With both of these models, if information is to be extracted from a dataset by selecting data items in a specific sequence and performing operations on the selected data, while ensuring adequate coverage of the data on each of the nodes, then a fast and efficient method is required for locating the required data items. One possible method of locating specific data items within a dataset which is shared across multiple nodes involves polling of all the individual nodes. A first node (generally a controller node) sends a query to all nodes to determine which has, say, item number 15 of the set of data items.
One of the nodes should reply with a confirmation that it has this required item. These inter-node communication steps are repeated for each required data item. However, such communication between the nodes entails both undesirable overheads and delays. Furthermore, associated with such inter-node communication is the necessity for status and error checking plus corrective operations to ensure that any communication failures cannot result in out-of-step processing. This entails a significant additional processing overhead. It is thus desirable to avoid any unnecessary communication between the nodes and so a method and a system are required which are not reliant on polling of individual nodes to determine the location of a required data item. Although polling has these disadvantages, there is also a significant problem with locating and sampling of data items in a parallel system if polling is not used. Difficulties arise because the locations of data items within a dataset which is shared across a number of nodes are dependent on the number of nodes available (or the number selected from the available nodes) for performance of a particular operation and on the chosen type of data partitioning, both of which may be subject to change. The number of nodes across which the dataset is shared may vary, for example, because a number of nodes which were available when an operation was performed for a first time may be unavailable when the operation is subsequently re-run. The data may also be partitioned in different ways across the nodes according to a data analyst's selection. For example, data items may be striped across a number of nodes or each node may hold a contiguous block of data. The analyst may wish to change the partitioning of the dataset across the nodes when an operation is repeated (for example, because of temporal trends identified when the operation was first performed).
Thus, each time a particular operation is repeated by the parallel processing system, data items may be located on different nodes than when the operation was previously performed. This makes locating of a particular data item and reproducible sampling of the dataset without polling of all nodes difficult. A second alternative which may be considered is to provide a look-up mapping table on each node which identifies the items held there (for example, listing their global item numbers within the dataset as a whole and corresponding local item numbers). A master node or every node of the system can be provided with a full list of which nodes hold which items. This is unacceptable, since for any large size database where data mining is likely to be used the data item location tables will also be very large and will consume far too much of the available storage space. Also, generating the look-up tables entails significant overhead. If efficient reproducible sampling is to be achieved, then there is a need for methods and systems which enable locating of particular selected data items despite any changes to the partitioning of the data set across a variable number of nodes. No method or system has previously been made available which provides efficient automatic determination by a single node of a parallel processing system of the location of items of a dataset which is shared across the system nodes, which does not involve polling of other nodes and which takes account of changes to the data partitioning.”
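The dependence of an item's location on both the node count and the partitioning scheme, as described in the passage quoted above, can be made concrete with a short sketch. This is only an illustration of the two partitionings the passage mentions (striped and contiguous block), not the method of the cited patent. The function names, the zero-based item numbering, and the rule that leftover items go to the lowest-numbered nodes are all assumptions made here.

```python
def locate_striped(global_index, num_nodes):
    """Round-robin striping (assumed layout): item i is held by node
    i % num_nodes at local position i // num_nodes."""
    if global_index < 0 or num_nodes < 1:
        raise ValueError("index must be >= 0 and num_nodes >= 1")
    return global_index % num_nodes, global_index // num_nodes


def locate_block(global_index, num_items, num_nodes):
    """Contiguous blocks (assumed layout): items are split as evenly as
    possible, and the first (num_items % num_nodes) nodes hold one extra."""
    if num_nodes < 1 or not 0 <= global_index < num_items:
        raise ValueError("index out of range or invalid node count")
    base, extra = divmod(num_items, num_nodes)
    boundary = extra * (base + 1)  # first index held by a node of size `base`
    if global_index < boundary:
        return divmod(global_index, base + 1)
    node_offset, local = divmod(global_index - boundary, base)
    return extra + node_offset, local


if __name__ == "__main__":
    # The same global item lands on different nodes as the layout changes.
    print(locate_striped(15, 4))
    print(locate_striped(15, 3))
    print(locate_block(15, 100, 4))
```

Because each function is pure arithmetic on the global item number, any node could evaluate it locally without polling the others or storing a per-item look-up table, provided every node agrees on the node count and the partitioning scheme in use.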
The present invention provides a data mining system that uncovers patterns, associations, anomalies and other statistically significant structures in data. The system comprises reading and displaying data files with the data files containing objects that have relevant features. The objects in the data files are identified. Relevant features for the objects are extracted. Patterns among the objects are recognized based upon the features.
An embodiment of the invention was successfully tested in the field of astrophysics where vast quantities of data are gathered during surveys of the sky. The embodiment was tested in examining data from the Faint Images of the Radio Sky at Twenty Centimeters (FIRST) sky survey. This test was conducted on data collected at the Very Large Array in New Mexico which seeks to locate a special type of quasar (radio-emitting stellar object) called bent doubles. The FIRST survey has generated more than 32,000 images of the sky to date. Each image is 7.1 megabytes, yielding more than 100 gigabytes of image data in the entire data set. Searching for bent doubles in this mountain of images is as daunting as searching for the needle in the proverbial haystack.
The present invention has an enormous number of uses. It provides a data mining system for scientific, engineering, business and other data. The system has applications which include, but are not limited to, the following: astrophysics, detecting credit card fraud, assuring the safety and reliability of the nation's nuclear weapons, nonproliferation and arms control, climate modeling, the human genome effort, detecting computer network intrusions, revealing consumer buying patterns, recognizing faces, recognizing eyes, recognizing fingerprints, analyzing optical characters, analyzing the makeup of the universe, analyzing atomic interactions, web mining, text mining, multi-media mining, and analyzing data gathered from simulations, experiments, or observations.
Embodiments of the present invention provide scientific researchers with tools for use in plowing through enormous data sets to turn up information that will help them better understand the world around us and assist them in performing a variety of scientific endeavors. Other embodiments of the present invention provide academic and business users with tools for use in plowing through enormous data sets to turn up information that will help them perform a variety of endeavors.
Another embodiment of the present invention is visualized for use in “The MACHO Project,” which is a collaboration between scientists at the Mt. Stromlo and Siding Spring Observatories, the Center for Particle Astrophysics at the Santa Barbara, San Diego, and Berkeley campuses of the University of California, and the Lawrence Livermore National Laboratory. Applicants' primary aim was to test the hypothesis that a significant fraction of the dark matter in the halo of the Milky Way is made up of objects like brown dwarfs or planets: these objects have come to be known as MACHOs, for MAssive Compact Halo Objects. The signature of these objects is the occasional amplification of the light from extragalactic stars by the gravitational lens effect. The amplification can be large, but events are extremely rare: it is necessary to monitor photometrically several million stars for a period of years in order to obtain a useful detection rate.
The invention is susceptible to modifications and alternative forms. Specific embodiments are shown by way of example. It is to be understood that the invention is not limited to the particular forms disclosed. The invention covers all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the claims.