In recent years, the progress of information automation has increased the computer databases of modern businesses to the point where a blizzard of numbers, facts and statistics are collected and stored, but less information of any significance is extracted from the massive amounts of data. The problem is that conventional computer databases are powerful in the manner in which they house data, but unimaginative in the manner of searching through the data to extract useful information. Simply stated, the use of computers in business and scientific applications has generated data at a rate that has far outstripped the ability to process and analyze it effectively.
To address this problem, a practice known as data "mining" is being developed to identify and extract important information through patterns or relationships contained in available databases. Humans naturally and quickly "mine" important concepts when interpreting data. A person can scan a magazine's table of contents and easily select the articles related to the subject of interest. A person's ability to extract and identify relationships goes beyond the simple recognizing and naming of objects; it includes the ability to make judgments based on overall context or subtle correlations among diverse elements of the available data. Computers, on the other hand, cannot efficiently and accurately undertake the intuitive and judgmental interpretation of data. Computers can, however, undertake the quantitative aspects of data mining because they can quickly and accurately perform certain tasks that demand too much time or concentration from humans. Computers, using data mining programs and techniques, are ideally suited to the time-consuming and tedious task of breaking down vast amounts of data to expose categories and relationships within the data. These relationships can then be intuitively analyzed by human experts.
Data mining techniques are being used to sift through immense collections of data such as marketing, customer sales, production, financial and experimental data, to "see" meaningful patterns or groupings and identify what is worth noting and what is not. For example, the use of bar-code scanners at supermarket checkouts typically results in millions of electronic records which, when mined, can show purchasing relationships among the various items shoppers buy. Analysis of large amounts of supermarket basket data (the items purchased by an individual shopper) can show how often groups of items are purchased together, such as fruit juice, children's cereals and cookies. The results can be useful for decisions concerning inventory levels, product promotions, pricing, store layout or other factors which might be adjusted to changing business conditions. Similarly, credit card companies, telephone companies and insurers can mine their enormous collections of data for subtle patterns within thousands of customer transactions to identify risky customers or even fraudulent transactions as they are occurring. Data mining can also be used to analyze the large volume of alarms that occur in telecommunications and networking alarm data.
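As an illustration of the supermarket basket analysis described above, the following minimal sketch counts how often each pair of items appears in the same basket. The function name `pair_counts` and the sample baskets are hypothetical and serve only to make the idea concrete; a practical system would read transaction records from a database and would typically consider larger item groups as well.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so that each pair has a single canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical basket data, one list of items per shopper.
baskets = [
    ["fruit juice", "children's cereal", "cookies"],
    ["fruit juice", "cookies"],
    ["children's cereal", "milk"],
    ["fruit juice", "children's cereal", "cookies", "milk"],
]
counts = pair_counts(baskets)
most_frequent_pair, frequency = counts.most_common(1)[0]
```

In this sample, cookies and fruit juice appear together more often than any other pair, which is the kind of relationship that could inform promotions or store layout.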
The size of the data set is essential in data mining: the larger the database, the more reliable the relationships which are uncovered. Large databases, unfortunately, have more records to sift through and require more time to pass through the records to uncover any groupings or pattern regularities. The number of items for which the relationships are sought is also important to the efficiency of data mining operations: the larger the number of items, the more time is required to pass through the records and extract reliable information.
A fundamental problem encountered in extracting useful information from large data sets is the need to detect natural groupings that can be targeted for further study. This problem is frequently encountered when data mining is applied to image processing or pattern recognition. For example, pattern recognition in the social sciences, such as classifying people with regard to their behavior and preferences, requires the grouping of data that may not be uniformly distributed in the dataset being analyzed. This is also true in image recognition, which attempts to group and segment pixels with similar characteristics (such as text or graphics) into distinct regions.
In data mining, the act of finding groups in data is known as clustering. Data clustering identifies within the dataset the dense spaces in a non-uniform distribution of data points. The partitioning of a given set of data objects (also referred to as "pattern vectors") is done such that objects in the same cluster share similar characteristics and objects belonging to different clusters are distinct. Because clustering results vary with the application requirements, several different approaches exist in the prior art. These can be broadly classified as follows:
Partitioning operations: These operations partition a dataset into a set of k clusters, where k is an input parameter for the operation. Each cluster is represented either by the gravity center of the cluster (known as a k-means method) or by one of the objects of the cluster located near its center (known as a k-medoid method). The k representatives are iteratively modified to minimize a given objective function. At each iteration, the cluster assignment of each of the remaining points is changed to the closest representative.
Density-based operations: These operations use information about the proximity of points for performing the clustering. The formulations are based on the assumption that all objects which are spatially proximate belong to the same cluster. This definition is typically extended for transitivity (i.e., if A is proximate to B and B is proximate to C then A and C are also proximate).
Hierarchical operations: These operations create a hierarchical decomposition of the entire data set. This is typically represented by a tree. The tree can be built bottom-up (agglomerative approach) by combining a set of patterns into larger sets, or top-down (divisive approach) by iteratively decomposing a set of patterns into smaller subsets.
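The density-based and hierarchical categories above can be sketched in a few lines each. The following is an illustrative sketch only, not a method taken from any particular prior-art system: the proximity threshold `eps`, the use of squared Euclidean distance, the single-linkage merge rule and all function names are assumptions chosen for brevity. The first function groups points that are proximate either directly or through a chain of proximate points (the transitivity extension); the second builds a bottom-up (agglomerative) tree by repeatedly merging the two closest clusters.

```python
from collections import deque

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def proximity_clusters(points, eps):
    """Density-based sketch: points within eps of each other, directly or
    through a chain of such points, receive the same cluster label."""
    labels = [None] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:  # breadth-first walk applies the transitivity rule
            i = queue.popleft()
            for j in range(len(points)):
                if labels[j] is None and sq_dist(points[i], points[j]) <= eps * eps:
                    labels[j] = next_label
                    queue.append(j)
        next_label += 1
    return labels

def agglomerate(points):
    """Hierarchical (bottom-up) sketch: repeatedly merge the two closest
    clusters; returns the tree as (left_id, right_id, new_id) merges."""
    clusters = {i: [i] for i in range(len(points))}  # leaf ids are point indices
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(sq_dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))
        next_id += 1
    return merges

points = [(0.0,), (1.0,), (2.0,), (10.0,)]  # hypothetical one-dimensional data
labels = proximity_clusters(points, eps=1.5)
tree = agglomerate(points)
```

With this sample data, the first and third points are farther apart than `eps` yet share a cluster because each is proximate to the second point. The brute-force pairwise scans shown here are quadratic or worse in the number of points and are intended only to make the definitions concrete.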
Clustering operations are compared based on the quality of the clustering achieved and the time requirements for the computation. Depending on the application requirements, one or more of the above features of the clustering operation may be relevant.
The k-means method has been shown to be effective in producing good clustering results for many practical applications. However, a direct implementation of the k-means method requires computational time proportional to the product of the number of objects and the number of clusters per iteration. For large databases, the repetitive iterations needed to yield good clustering results can be time-consuming and computationally intensive.
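A direct implementation of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the random choice of initial representatives, the squared Euclidean distance, the stopping rule and the names `kmeans` and `distance_evaluations` are all choices made for this sketch. The counter makes the cost visible, since every iteration computes one distance for each object and cluster pair.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iterations=100, seed=0):
    """Direct k-means sketch; returns centers, assignments and the total
    number of distance evaluations performed."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initial representatives
    assignment = None
    distance_evaluations = 0
    for _ in range(max_iterations):
        # Assignment step: every point is compared against every center.
        new_assignment = []
        for p in points:
            distances = [sq_dist(p, c) for c in centers]
            distance_evaluations += k
            new_assignment.append(distances.index(min(distances)))
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each center to the gravity center of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment, distance_evaluations

# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment, distance_evaluations = kmeans(points, k=2)
```

The total number of distance evaluations is the number of objects multiplied by the number of clusters and by the number of iterations executed, which is why the direct approach becomes expensive as the database grows.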
What is needed, then, is an improved method and apparatus for implementing k-means data clustering which produces clustering results that are the same as, or comparable to (owing to round-off errors), those of the k-means data clustering method as currently practiced, while reducing the computational requirements.
Accordingly, it is an object of the present invention to provide a data mining method and apparatus for initially organizing the data objects so that all of the objects which are actually closest to a preselected sample cluster are identified more quickly and efficiently.
It is another object of the present invention to provide a data mining method and apparatus which reduces the number of distance calculations and overall time of computation without affecting the overall accuracy of performing k-means data clustering over a given dataset, as compared with the prior art methods.
It is still another object of the present invention to accomplish the above-stated objects by utilizing a data mining method and apparatus which is simple in design and use, and efficient to perform with regard to data clustering computations.
The foregoing objects and advantages of the invention are illustrative of those which can be achieved by the present invention and are not intended to be exhaustive or limiting of the possible advantages which can be realized. Thus, these and other objects and advantages of the invention will be apparent from the description herein or can be learned from practicing the invention, both as embodied herein or as modified in view of any variation which may be apparent to those skilled in the art. Accordingly, the present invention resides in the novel methods, arrangements, combinations and improvements herein shown and described.