This invention relates to adaptive pattern recognition methodologies, and more particularly relates to an improved apparatus and method for clustering data points in continuous feature space by adaptively separating classes of patterns.
It is well known in the prior art that there are myriad applications related to recognizing and classifying collections or groups of multivariate data. Some representative examples include speech recognition in acoustics; classifying diseases based upon symptomatology in health care; classifying artifacts in archaeology; interpreting extraterrestrial electrical signals in space; and interpreting radar.
In the instance of interpreting extraterrestrial electromagnetic signals from space, perhaps pursuant to searching for extraterrestrial intelligence, an adaptive methodology is needed which performs its recognition and classification functions in the absence of a priori templates of these signals. Since the underlying distribution thereof is unknown, it would be advantageous for those skilled in the art to use a methodology which presumes none. An effective search of such a multi-dimensional feature space would preferably depend upon all of the collected signals, enabling efficient searching for repetitive and drifting signals intermingled with random noise.
Similarly, in the instance of radar clutter removal, such a methodology could function as an adaptive filter to eliminate random noise signals, i.e., outliers, thereby enhancing signals with high information content. Another application is image restoration which attempts to correlate edge-information between successive images. Such information could be used to enhance edges within an image or across multiple images in real-time.
Other uses of such a methodology include rapid image matching and other processing for locating targets using small infrared detectors and radar transmitters and receivers for military purposes, and in the field of cryptology wherein the goal is to reduce computation times for calculating the greatest common divisor to encrypt or decrypt messages and codes. Another particularly advantageous application is prescreening of data received from monitoring satellites without prematurely dismissing data designated as being spurious.
In still another significant application, there are optical character recognition problems, in which documents are scanned as bitmap images and then converted to machine-readable text. After the images of individual characters have been separated, they are classified as particular characters based upon a multidimensional feature space. Typical features include darkened pixel density, width-to-height ratio, and Fourier components of the image density projected onto the horizontal and vertical axes. Another interesting illustration is a robotic control system in which feedback from a video camera integrated into a real-time image processing system is used to identify the robot's position in space. This position is obtained from a twofold description thereof: a kinematic vector in three-dimensional space and a visual vector in a twelve-dimensional space of image dot locations.
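As a minimal sketch of how such character features might be computed, the following example derives the darkened pixel density, the width-to-height ratio and the two projection profiles from a small bitmap. The function name glyph_features, the dictionary keys and the toy letter are illustrative assumptions and are not taken from any cited system; the Fourier analysis of the profiles is omitted.

```python
from typing import Dict, List


def glyph_features(bitmap: List[List[int]]) -> Dict[str, object]:
    """Compute simple features for one separated character image.

    `bitmap` is a rectangular grid of 0 (background) and 1 (darkened pixel).
    All names here are hypothetical and chosen for illustration only.
    """
    dark = [(r, c) for r, row in enumerate(bitmap)
            for c, v in enumerate(row) if v]
    if not dark:
        raise ValueError("bitmap contains no darkened pixels")

    # Bounding box of the darkened pixels.
    rows = [r for r, _ in dark]
    cols = [c for _, c in dark]
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)
    height = bottom - top + 1
    width = right - left + 1

    # Image density projected onto the vertical axis (one count per row)
    # and onto the horizontal axis (one count per column).
    row_profile = [sum(bitmap[r][left:right + 1])
                   for r in range(top, bottom + 1)]
    col_profile = [sum(bitmap[r][c] for r in range(top, bottom + 1))
                   for c in range(left, right + 1)]

    return {
        "density": len(dark) / (width * height),
        "width_to_height": width / height,
        "row_profile": row_profile,
        "col_profile": col_profile,
    }


# A 5x3 letter "L" as a toy example.
LETTER_L = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
features = glyph_features(LETTER_L)
print(features)
```

The two profiles are the sequences to which a Fourier transform would be applied in order to obtain the Fourier components mentioned above.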
G-C. Rota in his report entitled "Remarks on the Present of Artificial Intelligence", prepared under the auspices of the Massachusetts Institute of Technology Industrial Liaison Program and published May 11, 1985 as Document No. MIT ILP 500, observed that "the estimation procedures used in today's expert systems are not qualitatively different from the mathematical technique for grouping and clustering that for several years now have been used to automatically catalogue and look up journal articles and scientific citations." Ergo, cataloging and retrieving journal articles based on keywords, as is commonly performed in the art, may be construed as being tantamount to performing cluster analysis on nearby keywords.
Typically, in the prior art, analysis proceeds by applying conventional statistical methodology to sets of measurements of observable characteristics. As is familiar to those skilled in the art, pattern recognition problems are characterized as having data obtained from groups which are either known or unknown a priori. Classifying data into groups which are already known is referred to as supervised pattern recognition or learning with a teacher. The statistical methodology pertinent to this supervised pattern recognition is referred to as discriminant analysis. On the other hand, classifying data into groups which are unknown a priori is referred to as unsupervised pattern recognition or learning without a teacher. The statistical methodology pertinent to this unsupervised pattern recognition is referred to as cluster analysis.
Cluster analysis as heretofore practiced in the prior art has lacked the benefit of unifying principles. Indeed, the various attempts to effectuate cluster analysis have been directed toward heuristically applying procedures to solve specific problems, resulting in limited success.
The state of the art of cluster analysis has been articulated by the distinguished members of the NRC Panel on Discriminant Analysis, Classification, and Clustering, in the paper entitled "Discriminant Analysis and Clustering" which appeared in Statistical Science, February 1989, vol. 4, no. 1, pp. 34-69. As is well known to those skilled in the art, like discriminant analysis, cluster analysis has received considerable attention, with the objective of developing reliable methodologies therefor. However, unlike discriminant analysis, cluster analysis is lacking in a proven methodology, and virtually devoid of concomitant underlying theories therefor. Indeed, in addition to there being no commonly accepted definition of clustering, the clustering methodologies which have been taught in the prior art have suffered from the disadvantage of being limited to the vagaries and unreliability of the ad hoc nature thereof.
More particularly, the inherent complexity and elusiveness of performing an effective cluster analysis are attributable to the data-dependent nature of a set of unknown clusters, and their dispersion and relative positions in the data space. Cluster analysis focuses on the underlying organization of collections of observations, whereby similar observations are preferably grouped together into the same "cluster." Thus, clustering may be visualized as a process of homogenizing collections of data into groups of closely related observations. Accordingly, after such a homogenization is performed, each of a myriad collection of observations is associated with one and only one cluster in a data space which is partitioned into a plurality of separate heterogeneous clusters.
The prior art teaches three methodologies for clustering data: hierarchical, partitioning and overlapping. The hierarchical methodology represents data in a tree structure called a dendrogram. At the apex of the tree, each observation is represented as a separate cluster. At intermediate levels of the tree, observations are aggregated into correspondingly fewer clusters. At the base of the tree, observations are, in turn, aggregated into one cluster. If the dendrogram is horizontally cut at a particular level in its structure, a partitioning of its underlying data is achieved. While there are other hierarchically oriented approaches known in the prior art, the paucity of reliable joining and splitting criteria and, of course, the resulting expense of applying these criteria to the observations, limit the derivation of the cluster or clusters which characterize the data.
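The hierarchical approach described above may be illustrated by the following minimal sketch, which merges the two nearest clusters repeatedly until a requested number of clusters remains. The joining criterion used here (single linkage, i.e., the distance between the closest pair of members) is only one of many possible criteria and is an assumption of this example, as are the function names and the sample points.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def single_link(a: List[int], b: List[int], pts: List[Point]) -> float:
    """Distance between two clusters: the closest pair of members."""
    return min(math.dist(pts[i], pts[j]) for i in a for j in b)


def agglomerate(pts: List[Point], k: int) -> List[List[int]]:
    """Merge the two nearest clusters until only k clusters remain.

    Starting from one cluster per observation and stopping at k clusters
    corresponds to cutting the dendrogram at the level having k branches.
    """
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest joining distance.
        _, i, j = min(
            (single_link(clusters[i], clusters[j], pts), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
result = agglomerate(points, 2)
print(result)  # [[0, 1, 2], [3, 4]]
```

Every merge in this naive form re-examines all pairs of clusters, which illustrates the expense of applying a joining criterion to a large collection of observations.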
The partitioning methodology purports to organize observations based upon presumed cluster-centers. After an initial assumption is made regarding the location of the cluster-centers, data is partitioned by assigning each observation to its nearest cluster-center. In this approach, the location of the cluster-centers is iteratively refined, and then the observations are reassigned thereto. These iterative refinements continue until predefined stability criteria are reached. As is well known in the prior art, the partitioning method is slow and depends, inter alia, upon the initial selection of cluster-centers and their susceptibility to change.
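The iterative refinement described above may be sketched as follows. This is a minimal example in the manner of the well-known k-means procedure, assuming Euclidean distance, centers recomputed as the mean of their members, and a stability criterion of unchanged assignments; the function names and sample data are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, ...]


def nearest(p: Point, centers: List[Point]) -> int:
    """Index of the cluster-center closest to observation p."""
    return min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))


def partition(pts: List[Point], initial: List[Point],
              max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate assignment and center refinement until assignments settle."""
    centers = list(initial)           # do not modify the caller's list
    labels: List[int] = []
    for _ in range(max_iter):
        new_labels = [nearest(p, centers) for p in pts]
        if new_labels == labels:      # stability criterion: nothing moved
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(pts, labels) if lab == c]
            if members:               # an empty cluster keeps its old center
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = partition(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, centers)
```

Because the result is driven by the initial list of centers, a different starting guess may settle on a different partition, which is the dependence on initial selection noted above.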
The overlapping methodology, at least conceptually, purports to accommodate observations which presumably have overlapping clusters, and accordingly, which cannot be derived from either of the hereinbefore described hierarchical or partitioning methodologies. This approach has, unfortunately, received minimal attention by those skilled in the art.
It should be clear to those skilled in the art that, notwithstanding these limited algorithmic approaches to clustering data, there have been other approaches successfully applied to achieve a grouping of observations. For example, after a presumption is made about the underlying nature of the multivariate data, whereby it is projected onto a two or three dimensional space of greatest dispersion, one method then seeks to ascertain the clusters thereof by visual recognition. Another such methodology is the dynamic graphic system, which seeks clusters by enabling the underlying data to be viewed from a plurality of different perspectives. J. W. Tukey and P. A. Tukey, in their paper entitled "Computer Graphics and Exploratory Data Analysis: An Introduction", published in the Proceedings of the Sixth Annual National Computer Graphics Conference, Dallas, Tex. on Apr. 14-18, 1985, pp. 772-784, discuss the prior art's inability to emulate via computer the human eye's perception of dominant patterns and deviations therefrom in multiple-aspect data, without any a priori ideas or information about such patterns. After emphasizing the need for computers to sort out myriad displays by calculating diagnostic quantities for guidance purposes (coined "cognostics"), and the advantage of combining numerical information inherent in distances with the ordered information inherent in ranks, they indicate that, inter alia, the prior art would be advanced by guidance regarding what, how and why to compute, and how to choose what to display. Accordingly, they conclude that good cognostics is a prerequisite, providing useful ideas about types-and-styles of display.
Developing the ability to systematically project the observations upon axes apt to provide clustering direction, and coordinating it with the hereinbefore described human visual identification of data patterns, could provide an effective clustering methodology, particularly in view of the improvements in multi-processor computers, disk storage capacity and access speeds, and the availability and cost of memory.
These clustering methodologies known in the prior art are implemented in three distinct steps: input, process (clustering), and output. In the input step, there occurs selecting, transforming and scaling of variables. Frequently, commitment to a particular distance metric occurs in this input step. Accordingly, the selection of relevant variables is integral to an effective clustering analysis. Unfortunately, it is a limitation of the prior art that there is a paucity of statistical procedures to provide guidance in selecting variables for clustering. To compensate for this shortcoming, there have been attempts by those skilled in the art to commence clustering with an abundance of variables, with only minimal selection criteria therefor. This approach has tended merely to dilute the analysis and typically to interfere with the normal behavior of the underlying algorithms. Another problem which continues to plague practitioners in the prior art is how to normalize variables across disparate measurements. Furthermore, the said analyses known in the prior art are limited by and intertwined with the scaling of the features or establishing equivalences, and the selection of the distance measure therefor.
It is generally accepted in the prior art that classification precedes measurement. Known or prior classification is prerequisite for statistically meaningful calculations and for making probability judgments. Ergo, averaging apples with oranges clearly yields a meaningless statistic. A simple example is predicting that a toss of a coin yields "heads" with a probability of 1/2 because it is classified with other recalled similar coin tosses, half of which yielded "heads" and the other half of which yielded "tails." But statistics cannot provide a complete foundation for classification. It provides, in the context of clustering, an a posteriori means for examining and testing the efficacy of purported clustering results. That is, statistics explores how well particular methodologies function when applied to certain collections of observations.
Misclassification of data points into clusters can occur for two different reasons. First, two distinct neighboring clusters may overlap on their boundaries and, as a result, these clusters may not be recognized as separate. In this clustering concept, the term "fuzz" is used to denote these misleading boundary data points. Second, spurious data points, belonging to no cluster, may be dispersed throughout the data space. Assigning these spurious data points to the nearest cluster distorts the shape and, thus, the statistics which characterize this cluster. In this clustering concept, the term "fuzz" is also used to denote these misleading spurious data points which belong to no cluster.
The remaining data points, i.e., the non-fuzzy data points, have a high probability of correct classification into clusters. However, because classification precedes measurement, it is generally impossible to distinguish these two types of misleading fuzzy data points from non-fuzzy data points until after the clusters are identified. It is advantageous to identify both types of fuzzy data points during the clustering, and to use this information to better recognize "true" cluster boundaries. Accordingly, it would be advantageous if a clustering methodology were available in which classification does not precede measurement of fuzz, and which treated fuzz as a Gestalt: as a function of its position in the data space.
As is well known to those skilled in the art, agglomerative methods of clustering can be poor estimators of high density clusters. They typically invoke the complete-linkage method, representing the distance between clusters as the maximum distance between points in adjacent clusters. Unfortunately, being easily upset by the fuzz of observations between high density regions, this complete-linkage approach is the worst of all the conventional methods for determining high density clusters. During the workings of this method, pertinent information about the distribution of points is lost. By contrast, the average-linkage method, representing the distance between clusters as the average distance between pairs of points in adjacent clusters, is more sensitive than the complete-linkage to population distribution because the distance metric is affected by the number of points in the clusters. Thus, if two neighboring clusters are formed in the vicinity of a high density region, then the intercluster distance will be smaller than usual because of the plurality of proximal points, whereby these neighboring clusters will tend to be aggregated.
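The two intercluster distances contrasted above may be compared with the following minimal sketch, which assumes Euclidean distance between points. The function names and the sample clusters are illustrative only.

```python
import math
from itertools import product
from typing import List, Tuple

Point = Tuple[float, float]


def complete_linkage(a: List[Point], b: List[Point]) -> float:
    """Maximum distance over all pairs drawn from the two clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_linkage(a: List[Point], b: List[Point]) -> float:
    """Mean distance over all pairs drawn from the two clusters."""
    dists = [math.dist(p, q) for p, q in product(a, b)]
    return sum(dists) / len(dists)


cluster_a = [(0.0, 0.0), (1.0, 0.0)]
# The last point of cluster_b is a stray observation far from the rest.
cluster_b = [(3.0, 0.0), (4.0, 0.0), (10.0, 0.0)]

d_complete = complete_linkage(cluster_a, cluster_b)
d_average = average_linkage(cluster_a, cluster_b)
print(d_complete, d_average)
```

In this example the single stray point alone determines the complete-linkage value, whereas the average-linkage value reflects all six pairs and therefore the number of points in each cluster.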
The prior art further teaches that there are disjoint high density clusters only if the density is multimodal. Accordingly, a rough test for the presence of clusters is seeking multimodality in the data space or some lower dimension projection; this modus operandi corresponds to a kind of visual, squinting search for clustered observations. In the case of a one dimensional data space, these unimodal and bimodal densities may be estimated by a maximum likelihood fit. But, as is known to those skilled in the art, it is difficult to deal with the disproportionate influence upon this density fit by small intervals between neighboring observations. Two or three extremely close data points not near any mode distort this statistic, which presumes that points disposed remote from a mode are, in turn, disposed apart from each other. It is typically a superior approach to use the dip test, which measures the maximum difference between the empirical distribution function, and the unimodal distribution function chosen to minimize the said maximum difference. The dip approaches zero for unimodal distributions, and a non-zero value for multimodal distributions. There has been posited a theory that the uniform distribution is the appropriate null modal distribution because the dip is asymptotically stochastically larger for the uniform than for other unimodal distributions.
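The verbal definition of the dip given above may be written compactly as follows, where the symbols are introduced here only for illustration: F_n denotes the empirical distribution function of the n observations and U denotes the class of all unimodal distribution functions.

```latex
D(F_n) \;=\; \inf_{G \in \mathcal{U}} \; \sup_{x} \, \bigl| F_n(x) - G(x) \bigr|
```

The inner supremum is the maximum difference between the empirical distribution function and a candidate unimodal distribution function G, and the outer infimum selects the candidate which minimizes that maximum difference.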
It is still another limitation of the prior art that the dip test does not generalize to multidimensional data space. The minimum spanning tree method provides a distance ordering of n sample points for which a dip statistic may be calculated. While a purported optimal mode is located with an estimate of the fit of this unimodal hypothesis, which appears to correspond to the one dimensional dip test, the asymptotic extension of the dip test to multidimensional data is unknown in the art.
If the components of a normal mixture in a multidimensional data space are sufficiently separated, there is typically one multidimensional mode for each such component. It should be observed that several modes may coincide after projection onto a subspace or an axis. Hence, the number of one dimensional modes of a single variable may be less than the number of components or clusters. As should be apparent to those skilled in the art, the number of clusters and the number of components or multidimensional modes will be equivalent. Clusters separated in n-dimensions may project onto the same one dimensional mode in the graph of a single variable, termed "spectrum".
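The coincidence of modes after projection may be illustrated by the following minimal sketch, in which two clusters that are well separated in two dimensions are projected onto each axis in turn. Counting groups of values separated by more than a fixed gap is a crude stand-in for counting modes, adopted here only for illustration; the function name, the gap value and the sample points are assumptions.

```python
from typing import List


def count_groups(values: List[float], gap: float) -> int:
    """Count runs of sorted values separated by more than `gap`."""
    ordered = sorted(values)
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b - a > gap)


# Two clusters separated along the second variable only.
cluster_a = [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.2), (0.1, -0.2)]
cluster_b = [(0.1, 5.0), (-0.2, 5.1), (0.0, 4.9), (0.2, 5.2)]
points = cluster_a + cluster_b

x_groups = count_groups([p[0] for p in points], gap=1.0)
y_groups = count_groups([p[1] for p in points], gap=1.0)
print(x_groups, y_groups)  # 1 2
```

The projection onto the first variable shows a single group although two clusters are present, so the number of one dimensional modes of a single variable can understate the number of clusters.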
For instance, a paper entitled "O(log n) Bimodality Analysis" by T. Phillips, A. Rosenfeld and A.C. Sher, which was published in Pattern Recognition, vol. 22, no. 6, pp. 741-746 (1989), describes a method to split a population into two subpopulations based upon exhaustive divide-and-conquer calculations. This method attempts to detect the presumed bimodality of a large population of visual features by applying the Gestalt principle of similarity-grouping, whereby a display is construed to be comprised of an admixture of two different species of elements, e.g., large and small dots, horizontal and vertical lines. The crux of this methodology is the definition of a measure of the bimodality of a population based upon its partitioning into two subpopulations that have maximal Fisher distance, provided that the underlying population is an admixture of two Gaussians with sufficiently large Fisher distances. The said divide-and-conquer methodology involves computing a set of subpopulation sizes, means and variances. More particularly, the population is recursively partitioned until the maximal value of the square of the Fisher distance thereof is obtained. In a single Gaussian population, this metric reaches a maximum where the population is partitioned into an outlier farthest from the mean and the remainder thereof. Unfortunately, this one dimensional splitting modus operandi provides no clue whether the underlying population is indeed an admixture of Gaussians, or whether it is bimodal.
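The splitting criterion described above may be sketched as follows. This minimal example scans every split point of the sorted one dimensional data exhaustively and is not the O(log n) divide-and-conquer procedure of the cited paper. It assumes that the squared Fisher distance of two subpopulations is the squared difference of their means divided by the sum of their variances; that formula, the function names and the sample data are assumptions made for illustration.

```python
from statistics import fmean, pvariance
from typing import List, Tuple


def fisher_sq(left: List[float], right: List[float]) -> float:
    """Squared Fisher distance, assumed here as (m1 - m2)^2 / (v1 + v2)."""
    diff = fmean(left) - fmean(right)
    spread = pvariance(left) + pvariance(right)
    if spread == 0.0:
        # Both parts have zero variance: infinitely separated unless equal.
        return float("inf") if diff else 0.0
    return diff * diff / spread


def best_split(values: List[float]) -> Tuple[List[float], List[float], float]:
    """Return the two-way split of the sorted values with maximal score."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    score, i = max((fisher_sq(ordered[:i], ordered[i:]), i)
                   for i in range(1, len(ordered)))
    return ordered[:i], ordered[i:], score


data = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2]
low, high, score = best_split(data)
print(low, high, score)
```

A large score by itself does not establish bimodality, since a score is returned for any population, including one in which the best split merely isolates a single extreme observation.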
Furthermore, the authors acknowledge that applying this methodology to large bimodal populations with more than two clusters of visual properties is difficult and may be impractical because it engenders a combinatorial problem of subdividing populations into more than two partitions. If an outlier event comprises the second partition, then it is not clear whether the population is actually bimodal. Since the bimodal designation implies that there are two clusters in a population, on what basis can a singleton partition be discarded? By not considering the events as a Gestalt, merely a possible component of the boundary fuzz is isolated. As an example, this procedure may falsely classify events as outliers, notwithstanding such events being simply offset from their respective cluster center in one dimension. If, however, there exists an offset in all dimensions, then the cumulative fuzz is more significant, and a spurious outlier event may in actuality exist.
Those skilled in the art are familiar with software packages which implement algorithms and procedures purportedly designed to accomplish cluster analysis. For example, the NT-SYS and NTSYS-pc packages contain the conventional hierarchical agglomerative methods of cluster analysis. As other examples, the BMDP, SAS, and SPSS-X packages provide procedures for conventional hierarchical agglomerative methods and iterative partitioning. Another, more versatile software package for cluster analysis is CLUSTAN, which not only offers procedures for conventional hierarchical agglomerative methods and iterative partitioning, but also offers procedures for decomposition of normal multivariate mixtures using the minimum spanning tree. CLUSTAN provides a plurality of cluster-oriented utilities including validation criteria, diagnostics, and similarity coefficients. Notwithstanding this diverse assortment of clustering procedures and functions, it unfortunately affords merely minimal direction for selecting a procedure to apply to a set of observations.
Another software package, MICRO-CLUSTER, provides several hierarchical agglomerative methods and an iterative partitioning method to effect clustering. A particularly interesting class of cluster analysis computer programs are those intended to process large collections of data, where there are at least 500 data points. In spite of this intent to entertain real world, large-scale problems, clustering methods known in the prior art are limited to analyzing only about 200 cases simultaneously.
Furthermore, there are few packages which estimate local density, with no probability distribution assumptions. For example, ALLOC, a package which computes allocation rules as a function of density estimation, uses multivariate normal kernels with a diagonal covariance matrix. IMSL, a package which performs linear and quadratic discriminant analysis, is capable of performing density estimate analysis. The IMSL nearest neighbor procedures are particularly useful when distributions deviate far from normal. These nearest neighbor procedures, functionally related to nonparametric density estimates, are also available in the SAS package.
As is apparent to those knowledgeable in the art, none of these available software packages provides guidance for selecting clustering method and relevant data, organizing the observation data, resolving outlier criteria, or selecting similarity metrics. Hence, there is a need in the prior art for an effective interactive exploratory graphic package for ascertaining classifications of large numbers of observations. It should also be apparent that a useful clustering tool is also needed to facilitate selecting relevant variables and to distinguish observations as belonging to a particular cluster or being an outlier thereof.
It should be clear that the prior art lacks a foundation and a standard, reliable methodology for cluster analysis. Indeed, a methodology is needed whose results are independent of the vagaries of the multivariate observations in n-dimensional data space, such as scaling. As hereinbefore described, each of the clustering methodologies heretofore known to those skilled in the art yields different classes of patterns depending upon the algorithms invoked thereby, with their concomitant presumptions regarding a distance metric and scaling. Unfortunately, the prior art lacks an objective mechanism for measuring the efficacy of such classifications determined in the absence of a priori knowledge.
It is accordingly a disadvantage of the teachings of the prior art that insufficient experience is available pertaining to attempting clustering under circumstances for which there is limited, if any, a priori knowledge about the distribution of the underlying observations. It should be clear that unsupervised learning involves considerably more sophistication, including convergence criteria, than does supervised learning.
An attempt to overcome the limitations of the prior art is described by M. D. Eggers and T. S. Khoun of the Massachusetts Institute of Technology Lincoln Laboratory in their paper entitled "Adaptive Preprocessing of Nonstationary Signals" which was published in Technical Report 849 on May 9, 1989. Eggers and Khoun sought to produce a compact feature vector by providing a preprocessor for large volumes of signal data as a precursor to automated decision-making, without any foreknowledge of the underlying statistical distribution or expert rules. In particular, their methodology sought to reduce the said large data volumes without destroying the integrity of the information contained therein. As an improvement over the single window method known to those skilled in the art, there is used a dual window approach incorporating a cumulative sum statistic therein, whereby drift properties are exploited based upon a functional relationship with changes in spectral characteristics. This approach purports to overcome the inherent disadvantages of the said single window procedure caused by using the difference in the absolute mean as a measure of spectral behavior. However, its adaptiveness quotient is limited because a priori learning parameters must be presumed before the window sizes may be established. Furthermore, neither of these windows is adjusted during the execution of the procedure. Accordingly, while the Eggers-Khoun method affords an adaptable procedure for producing a compressed feature vector, it fails to provide an adaptive clustering technique heretofore unknown in the art, particularly a methodology with a natural grouping based upon and adaptively interdependent with the totality of data space and, of course, not restricted to windowing assumptions.
In U.S. Pat. No. 3,457,552, Asendorf discloses a learning method characterized as an "adaptive self-organizing pattern recognition system." Modifying output responses during its training phase to enable known patterns to be recognized, the Asendorf system strives to produce a unique output signal associated with a particular known pattern. Prerequisite to this method, however, is that patterns are recognized based upon pretaught patterns whose classification is known with absolute accuracy. Furthermore, the training aspect of the Asendorf invention is limited to data for which output signals related to different patterns may be distinguished from each other by a human teacher, particularly without increasing the complexity and cost of the system. Thus, providing the means to learn to recognize or classify patterns based solely upon a priori data, Asendorf's system is not adaptive, but adaptable in the sense that the threshold for screening output signals is adjusted to enable the classification thereof.
Cooper, et al., in U.S. Pat. No. 4,326,259, teach a system for a self-organizing general pattern class separator and identifier. In a multidimensional event space, each event is represented by a signal vector which is received by a plurality of input terminals interconnected with a plurality of junction elements and a summing device. Notwithstanding this "self organizing" nomenclature, the Cooper system cannot function in its normal trained mode of operation without its precursor training mode. Ergo, human intervention is prerequisite to iteratively selecting a proper scalar factor from observed events during this training mode, to properly filter out unwanted outputs and thereby eliminate errors of separation and identification. Indeed, based upon classification of events by an omniscient observer, a feedback loop between the summing device and junction elements effects the variation of the elements' transfer function, thereby modifying the said scalar factor. It should be apparent that the self-organizing feature of the Cooper invention, purportedly to separate and identify a particular group of events of related classes of events, does not attach until virtually all of the pattern recognition has been learned and organized.
By contrast, in U.S. Pat. No. 4,038,539, Van Cleave teaches a means and method for adaptively filtering input pulse signals occurring at unknown times and having unknown durations. By controlling the characteristic of the filtering means, the output thereof delivers signals corresponding to pulse signals of the input signals with concomitant minimum noise signals. The Van Cleave method invokes polynomial transformations for processing of the input signals and adaptively filtering the orthogonal signal components for providing at the output thereof, pulse signals present in the input signals with reduced noise therein. It is a disadvantage of this method, however, that there are inherent time delays during the signal analysis and orthogonal component filtering steps, and that a priori selection of preprogrammed threshold constants must be made prior to commencing operation thereof.
In U.S. Pat. No. 4,730,259, Gallant discloses an expert system in which an inference engine is controlled by a matrix of learning coefficients. Prerequisite to its operation, however, the Gallant engine's learning must be supervised. More particularly, its operation is governed by a matrix generated by training examples or a priori rules. Not only must its primary and goal variables be known a priori, but also the system goals are not determined adaptively. Gallant's intermediate and goal variables may require expert-generated dependency lists for their operation, but there is nevertheless no assurance of convergence thereof. That is, as stated therein, it may not always be possible to construct a matrix satisfying all of the prerequisite training examples. This could occur if some clusters overlapped in their outer regions.
It should be apparent that another limitation of Gallant's engine is that it seeks cause and effect relationships which can easily be destroyed by misclassified outliers and the like. In addition, it inherently has no internal provision for evaluating the integrity and reliability of a learning example. This shortcoming, when unreliable examples are invoked, clearly will produce contradictory supervisory rules. In view of these disadvantages and limitations, Gallant's invention functions more like an adaptable assistant, requiring human intervention for even a chance at convergence, rather than an inference engine as originally described therein.
Due to the requirement that its features be specified in advance as primary variables, the Gallant engine precludes adaptively finding goals. This, of course, follows from its a priori definition of goal variables as well as primary variables, and its underlying model of cause and effect relationships. It is also a limitation of Gallant that his engine would probably fail with increasing examples because of the corresponding increasing probability of misclassification of outliers, and the consequent contradictory rules and uncertain convergence there related.
Another attempt to improve the prior art is disclosed by Penz, Gately and Katz, in U.S. Pat. No. 4,945,494, in which neural networks are applied to cluster a plurality of radar signals into classes and a second neural net to identify these classes as known radar emitters. As is well known in the prior art, when neural nets are implemented on digital computers, the training procedures (learning phases) tend to be very slow. In addition, the proper choice of multiple hidden layers of neurons lacks guidance so that predicting the outcome of these choices is virtually impossible. Penz overcomes these two limitations by using programmable resistive arrays to implement the neural network instead of a digital computer. This resistive array has no hidden layers and can be trained very quickly.
An unavoidable limitation of neural networks, however, is their tendency to saturate. As Penz points out, the well learned Eigenvector, i.e., the cluster with many observations, will be recalled, while the poorly learned Eigenvector, i.e., the cluster with few observations, will be lost. After the input vectors are learned by the neural network, the clusters of vectors must still be separated from each other. Unfortunately, this separation can fail due to saturation. A sparse, small cluster bordering a large, dense cluster will tend to be absorbed and thereby lost during this separation phase.
But a limitation more critical than saturation is neural networks' inability to be effectively scaled. The number of connections required, one between each pair of neurons, precludes the use of neural networks to cluster data vectors having thousands of components. In other words, building resistive neural networks to handle very large sensor arrays is a problem because of the huge number of connections. For example, a resistive array having 1,000 neurons requires about half a million connections to represent the matrix of T_ij values, and this exceeds the capability of today's chip fabricators. This is clearly elucidated in the Defense Advanced Research Projects Agency (DARPA) Neural Network Study conducted under the auspices of the MIT Industrial Liaison Program, published by Lincoln Laboratory, P. J. Kolodzy and J. Sage, September 1989. Two approaches still under development for overcoming this limitation on scaling are to fabricate 3-D chips and to connect large numbers of chips optically.
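The figure of about half a million connections can be checked by counting one connection per unordered pair of neurons. The following minimal sketch assumes a fully interconnected array with a symmetric connection matrix and no self-connections; that assumption, and the function name `pairwise_connections`, are illustrative and are not taken from the cited references.

```python
def pairwise_connections(n_neurons: int) -> int:
    """Count unordered neuron pairs, i.e. n * (n - 1) / 2.

    Assumes every neuron is connected to every other neuron exactly once
    (symmetric matrix, no self-connections). This is an illustrative
    assumption, not a detail stated in the cited study.
    """
    return n_neurons * (n_neurons - 1) // 2


if __name__ == "__main__":
    # The count grows quadratically with the number of neurons.
    for n in (10, 100, 1_000, 10_000):
        print(f"{n:>6} neurons -> {pairwise_connections(n):>10} connections")
```

Under this assumption, 1,000 neurons give 499,500 connections, consistent with the stated figure, and a tenfold increase in neurons multiplies the connection count roughly one hundredfold.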
The present invention overcomes these limitations of saturation and scaling in neural networks. In addition, the digital implementation of the present invention is as fast in the learning phase as the implementation of a neural network using resistive arrays. Moreover, the present invention implemented with content addressable memories is faster than any neural network, and is more intuitive regarding the prediction of the outcome of the clustering based upon parameter choices.
Thus, as is well known by those skilled in the art, there has heretofore been no unsupervised pattern recognition system applicable to large data sets of at least one thousand (1,000) observations. Those systems which have been developed for small data sets, typically constituting two to five hundred (200-500) observations, suffer from memory and execution constraints. In the book, "Clustering of Large Data Sets" by J. Zupan, it is reported that such an attempted clustering of five hundred (500) objects representing infrared spectra using one hundred sixty (160) features consumed forty (40) hours of computer time on a PDP 11/34 machine. Clearly, a means and method are needed which can solve classification problems of these and even larger magnitude within minutes or less by using parallel processors.
As hereinbefore described, it is also a limitation of the prior art that the separations and concomitant classifications achieved by clustering methodologies known in the prior art can and usually do depend upon the selection of feature scaling factors and distance metrics. It should be apparent that the pattern of classes obtained from a clustering methodology should preferably be independent of such selections.
The present invention can also be scaled to cluster vectors having an unlimited number of components with no loss of speed and no increase in circuit complexity, i.e., neuron-neuron connections. This scaling is achieved by using a separate, isolated processing unit for each component of the data vector, which, of course, cannot be done within the paradigm of neural networks, since each neuron must communicate with every other neuron.
The present invention also avoids the problem of saturation since it does not seek to identify classes of vectors before separating them. Indeed, the present invention partitions the data space into regions based upon voids in the data space. This contrasts with neural network data space partitions which essentially seek the histograms of the clusters before attempting to separate them. Saturation is the overwhelming effect that a well learned cluster has upon a poorly learned nearby cluster. If the two histograms are not different enough to cause a separation of these clusters, then the poorly learned histogram will be merged with the well learned histogram, resulting in a single cluster. Thus, the present invention seeks to separate clusters without attempting to identify the modes in the multidimensional histograms which characterize these clusters. Instead, the large gaps between the modes in the histogram are sought, and, only after this separation is finished, are the histograms used to determine the centers or modes of these clusters.
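The idea of separating clusters by seeking voids, rather than by first identifying modes, can be illustrated with the following minimal sketch. It is not the method of the present invention as claimed; it is a simplified illustration under stated assumptions. Each component of the data vectors is scanned independently for its widest gap between neighboring values, and the data are split once at the midpoint of the widest gap found. The names `largest_gap` and `split_on_widest_gap` and the threshold `min_gap` are hypothetical, and comparing raw gap widths across components is a simplification that does depend upon feature scaling.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def largest_gap(values: Sequence[float]) -> Optional[Tuple[float, float]]:
    """Return (width, midpoint) of the widest void between sorted neighbors,
    or None when fewer than two values are given."""
    ordered = sorted(values)
    best: Optional[Tuple[float, float]] = None
    for lo, hi in zip(ordered, ordered[1:]):
        width = hi - lo
        if best is None or width > best[0]:
            best = (width, (lo + hi) / 2.0)
    return best


def split_on_widest_gap(points: List[Point], min_gap: float) -> List[List[Point]]:
    """Split the points once at the widest void found in any single component.

    Each component is examined on its own, with no reference to the other
    components. No cluster centers or modes are estimated before the split.
    Returns [points] unchanged when no void is at least min_gap wide.
    """
    if not points:
        return [points]
    best: Optional[Tuple[float, int, float]] = None  # (width, component, cut)
    for k in range(len(points[0])):
        gap = largest_gap([p[k] for p in points])
        if gap is not None and (best is None or gap[0] > best[0]):
            best = (gap[0], k, gap[1])
    if best is None or best[0] < min_gap:
        return [points]
    _, k, cut = best
    left = [p for p in points if p[k] < cut]
    right = [p for p in points if p[k] >= cut]
    return [left, right]


if __name__ == "__main__":
    # A dense cluster of 20 points beside a sparse cluster of 3 points.
    dense = [(0.1 * i, 0.0) for i in range(20)]
    sparse = [(10.0, 0.0), (10.2, 0.1), (10.4, 0.0)]
    parts = split_on_widest_gap(dense + sparse, min_gap=1.0)
    print([len(part) for part in parts])
```

In this example the three-point cluster is kept apart from the twenty-point cluster because the split depends only on the void between them, not on how many observations each cluster contains.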
Accordingly, these limitations and disadvantages of the prior art are overcome with the present invention, and improved means and techniques are provided which are especially useful for clustering of large numbers of data points in continuous real numerical feature space by adaptively separating classes of patterns.