This invention relates in general to a network of databases and, in particular, to collective data mining from a distributed, vertically partitioned feature space.
Distributed data mining (DDM) is a fast-growing area that deals with the problem of finding data patterns in an environment where both data and computation are distributed. Although most data analysis systems today require centralized storage of data, the increasing merger of computation with communication is likely to demand data mining environments that can exploit the full benefit of distributed computation. For example, consider the following cases.
1. Example I: Imagine an epidemiologist studying the spread of hepatitis-C in the U.S. She is interested in detecting any underlying relation between the emergence of hepatitis-C in the U.S. and weather patterns. She has access to a large hepatitis-C database at the Centers for Disease Control (CDC) and an environmental database at the EPA. However, the two databases are at different locations, and analyzing the data from both of them using conventional data mining software would require combining the databases at a single location, which is quite impractical.
2. Example II: Two major financial organizations want to cooperate to prevent fraudulent intrusion into their computing systems. They need to share data patterns relevant to fraudulent intrusion. However, they do not want to share the data itself, since it is sensitive. Therefore, combining the databases is not feasible. Existing data mining systems cannot handle this situation.
3. Example III: A defense organization is monitoring a situation. Several sensor systems are observing the situation and collecting data. Fast analysis of incoming data and quick response are imperative. Collecting all the data at a central location and analyzing it there consumes time, and this approach does not scale to state-of-the-art systems with a large number of sensors.
4. Example IV: A drug manufacturing company is studying the risk factors of breast cancer. It has a mammogram image database and several databases containing patient tissue analysis results, food habits, age, and other particulars. The company wants to find out whether there is any correlation between the breast cancer markers in the mammogram images and the tissue features, age, or food habits.
5. Example V: A major multi-national corporation wants to analyze its customer transaction records in order to develop a successful business strategy quickly. It has thousands of establishments throughout the world. Collecting all the data in a centralized data warehouse and then analyzing it with existing commercial data mining software takes the data warehouse team about a month.
DDM offers an alternative approach to the analysis of distributed data that requires minimal data communication. Typically, DDM algorithms involve local data analysis followed by generation of a global data model that combines the results of the local analyses. Unfortunately, naive approaches to local analysis may be ambiguous and incorrect, producing an incorrect global model. This problem becomes critical in the general case, where different sites observe different sets of features. Developing a well-grounded methodology to address this general case is therefore important. This paper offers a viable approach to the analysis of distributed, heterogeneous databases with distinct feature spaces using the so-called collective data mining (CDM) technology.
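The failure of naive local analysis can be illustrated with a minimal sketch. Everything in it is a hypothetical setup chosen for illustration and is not taken from this paper: two sites each observe one feature (x1 at site A, x2 at site B), both are assumed to see the target value, and the target is the cross-term y = x1 * x2. Each site fits its own one-feature least-squares line, and the "global" model simply adds the local terms together.

```python
from itertools import product

def fit_line(xs, ys):
    """Least-squares slope and intercept of ys on a single feature xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: every combination of two features taking values -1 or +1.
rows = list(product([-1.0, 1.0], repeat=2))
# Assumed target: a pure cross-term that neither site can see on its own.
y = [x1 * x2 for x1, x2 in rows]

# Site A observes only x1; site B observes only x2 (vertical partitioning).
slope_a, intercept_a = fit_line([r[0] for r in rows], y)
slope_b, intercept_b = fit_line([r[1] for r in rows], y)

def naive_global(x1, x2):
    """Naive combination: add the two locally fitted linear terms."""
    return intercept_a + slope_a * x1 + slope_b * x2

# Mean square error of the naive global model over the full data set.
mse = sum((naive_global(x1, x2) - t) ** 2
          for (x1, x2), t in zip(rows, y)) / len(rows)

print("site A slope:", slope_a)
print("site B slope:", slope_b)
print("naive global MSE:", mse)
```

In this setup each local model finds no relationship between its own feature and the target, so the combined model predicts a constant even though the target is fully determined by the two features together. The sketch shows only the failure mode; it is not an implementation of CDM.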
Section 2 describes the DDM problem considered here and some of the problems of naïve data analysis algorithms in a DDM environment. In Section 3, the foundation of CDM is presented, followed by a discussion of the construction of orthonormal representations from incomplete domains and the relation of such representations to mean square error. Sections 4 and 5 present the development of CDM versions of two popular data analysis techniques, decision tree learning and regression. Section 6 presents an overview of a CDM-based experimental system called BODHI, which is currently under development. Section 7 summarizes the CDM work presented here, including the BODHI system, and discusses future research directions.