Complex distributions of computer data inputs are often modeled using a mixture of simpler distributions. Clustering is one of the mathematical tools used to reveal the structure of this mixture. The same is true of data sets with chosen response variables on which a regression analysis can be run. Unless clusters having very different response properties are separated, the residual error of a regression function is large. The mixture can also mislead input variable selection toward unnecessarily complex models.
In Regression-Clustering (RC), K (>1) regression functions are simultaneously applied to a dataset to guide the clustering into K subsets. Each subset has a simpler distribution that matches the subset's guiding function. Each function is regressed on its own subset of data, thereby resulting in a much smaller residual error. Both the regressions and the clustering optimize a common objective function.
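The alternation between regression and clustering described above can be sketched as follows. This is a minimal illustration only, assuming linear regression functions fitted by least squares and a K-Means-style hard reassignment of each data point to the function with the smallest squared residual. The function name rc_kmeans and its parameters are hypothetical and are not taken from any of the cited works.

```python
import numpy as np

def rc_kmeans(X, y, K, n_iter=50, seed=0):
    """Hypothetical K-Means-style regression clustering with linear models.

    X: (n, d) inputs, y: (n,) responses, K: number of regression functions.
    Returns (coefs, labels, total squared residual).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = np.column_stack([X, np.ones(n)])   # append an intercept column
    labels = rng.integers(0, K, size=n)    # random initial partition (assumption)
    coefs = np.zeros((K, A.shape[1]))
    for _ in range(n_iter):
        # Regression step: fit each function on its own subset only.
        for k in range(K):
            mask = labels == k
            if mask.sum() >= A.shape[1]:   # skip subsets too small to fit
                coefs[k], *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        # Clustering step: move each point to its best-fitting function.
        resid = (A @ coefs.T - y[:, None]) ** 2
        new_labels = resid.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    # Common objective: total squared residual under the final assignment.
    resid = (A @ coefs.T - y[:, None]) ** 2
    total = float(resid[np.arange(n), labels].sum())
    return coefs, labels, total
```

In this sketch both steps act on the same quantity, the total squared residual, and neither step increases it, so the loop settles at a local optimum that depends on the initial partition.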
Two important data mining techniques are regression on data sets with chosen response variables, and clustering on data sets that do not have response information. An RC process is directed at handling the case in between, e.g., data sets that have response variables but do not contain enough information to guarantee high quality learning. The missing part of the response is essential. Missing information is generally caused by insufficiently controlled data collection, due to a lack of means, a lack of understanding or other reasons. For example, sales or marketing data collected on all customers may not have a label on a proper segmentation of the customers. Clustering processes partition a dataset into a finite number of subsets, each containing similar data points. Dissimilarity labeled by the index of the partitions provides additional supervision of the K regressions, running in parallel, so that each regression works on a subset of similar data. The K regressions in turn provide the model of dissimilarity for clustering to partition the data. A “linkage” is a common objective function shared between the regressions and the clustering. Neither can be done properly without the other.
Regression-Clustering is not limited to linear regressions. When RC is compared with center-based clustering processes such as KM (K-Means), KHM (K-Harmonic Means), and EM (Expectation Maximization), the centers are replaced by regression functions; RC thus refers to a regression-function-centered clustering process. “Clusterwise Linear Regression” uses linear regression and partitioning of the dataset in a process that locally minimizes the total mean square error over all K regressions. An incremental version of the process was also developed to facilitate adding new observations to the dataset. The Spath process is based on a KM clustering process.
DeSarbo did research on “Clustered Linear Regression” using the same linear mixing of Gaussian density functions. The number of clusters in the work of Hennig is treated as unknown. Gaffney and Smyth's work is also based on an EM clustering process. Gaffney and Smyth showed applications of regression clustering on video stream data to reveal movements in image sequences.
Regression-Clustering finds real-world, practical or industrial application in many situations. In economics, demand curves help people to optimize pricing; see Varian, H. R. (1992), “Microeconomic Analysis,” W. W. Norton & Company, 3rd edition. A better understanding of demand curves also helps companies to design multiple models of a product family that fully exploit the area under the demand curves in different segments of a market. Finding the best market segmentation has to be related to the objective that the regression is trying to optimize. Regression-Clustering can accomplish both tasks in an integrated process.
Designing marketing campaigns and offering purchase incentives both require proper segmentation of customers. Without it, marketing campaigns and purchase incentives are blindly given to all potential customers as a whole, which is wasteful and less effective. Regression analysis on past marketing campaign data seeks to provide a relationship between an effect and a campaign strategy, e.g., an increase of sales, profit, market share, etc., versus the amount, area, form or other attribute of the investment. But without proper customer segmentation, regression results are sub-optimal. Regression-Clustering is again a better mathematical tool because it optimizes both regression and customer segmentation with a common objective.
In measuring-device calibrations, regression is run on sampled data to calibrate the device's parameters. However, the accuracy of a device may depend on many other factors, some of which may not be controllable or even well understood. The data collected using these devices has missing information, which can be handled by Regression-Clustering. These missing variables can be regarded as either missing input variables or missing response variables. Missing input variables may also be handled by Regression-Clustering in certain situations.
Many measuring devices work with single-use measuring agents. The manufacturing variations of the measuring agents from different batches are handled by a code, which selects the best set of parameters among multiple sets pre-calibrated and stored in the device. Such code design is based on many runs of regressions on different batches, a costly and time-consuming process. Regression-Clustering can optimize both the regression and the clustering (code design) in one step without human intervention, which means significant savings in both time and labor.
Static or video images can include regions of continuous changes and boundaries of sudden changes in color. A static image can be treated as a mapping from a two-dimensional space to the three-dimensional RGB color space, image: [a,b]×[c,d] → [0,255]×[0,255]×[0,255]. Similarly, a video image can be treated as a mapping from a three-dimensional space to another three-dimensional space, video: [a,b]×[c,d]×T → [0,255]×[0,255]×[0,255]. Regression-Clustering is capable of automatically identifying the regions of continuous change and assigning to each a regression function, which interpolates that part of the image. Both image segmentation and interpolation can be done by Regression-Clustering.
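To make the mapping view concrete, the following sketch turns a static image into a regression data set whose inputs are pixel coordinates and whose responses are RGB values. It is an illustration under stated assumptions: the image is taken to be an H x W x 3 array, and the helper name image_to_regression_data is hypothetical.

```python
import numpy as np

def image_to_regression_data(img):
    """Flatten an H x W x 3 image into (coords, colors) pairs.

    coords[i] is the (row, column) position of pixel i (the input),
    colors[i] is its RGB value (the three-dimensional response).
    """
    h, w, _ = img.shape
    rows, cols = np.mgrid[0:h, 0:w]        # coordinate grids, each of shape (h, w)
    coords = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
    colors = img.reshape(-1, 3).astype(float)
    return coords, colors
```

A regression-clustering process could then be run on these pairs, with each regression function modeling one smoothly varying region; a video would add the frame index as a third input column.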
Previous work on RC used K-Means (KM) and Expectation Maximization (EM) in RC processes. These RC processes share a well-known problem: they are sensitive to the initialization of the regression functions, just as K-Means and EM are sensitive to the initialization of the centers. Previously, a center-based clustering process using K-Harmonic Means was developed; see Zhang, B., Hsu, M., Dayal, U. (2000), “K-Harmonic Means,” Intl. Workshop on Temporal, Spatial and Spatio-Temporal Data Mining, Lyon, France, Sept. 12; and Zhang, B. (2001), “Generalized K-Harmonic Means - Dynamic Weighting of Data in Unsupervised Learning,” the First SIAM International Conference on Data Mining (SDM'2001), Chicago, USA, Apr. 5-7.
A KHM center-based clustering process, described in U.S. Pat. No. 6,584,433, issued to the present Assignee, is much less sensitive to initialization of centers than either K-Means or EM. U.S. Pat. No. 6,584,433 describes a harmonic average data clustering method and system. First, a plurality of data points for clustering is received. Next, a number K of clusters is also received. Then, K center points are initialized. For each center point, a new center position is then determined by using a K-Harmonic Means performance function.
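The role of the performance function in the steps above can be illustrated with a short sketch. It assumes the squared-distance form of the K-Harmonic Means performance function, in which each data point contributes the harmonic average of its squared distances to the K centers, and it derives the center update by setting the gradient of that function to zero. The function names and the eps guard against zero distances are assumptions made for illustration, not the procedure of the cited patent.

```python
import numpy as np

def khm_performance(X, M, eps=1e-12):
    """Sum over points of K / sum_k (1 / ||x_i - m_k||^2)."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
    d2 = np.maximum(d2, eps)                                  # avoid division by zero
    K = M.shape[0]
    return float((K / (1.0 / d2).sum(axis=1)).sum())

def khm_update(X, M, eps=1e-12):
    """One fixed-point update of all K centers (squared-distance form)."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, eps)
    inv_sum = (1.0 / d2).sum(axis=1, keepdims=True)           # (n, 1)
    # Weight of point i for center k: 1 / (d_ik^4 * (sum_l 1/d_il^2)^2).
    w = 1.0 / (d2 ** 2 * inv_sum ** 2)                        # (n, K)
    return (w.T @ X) / w.sum(axis=0)[:, None]                 # weighted means, (K, d)
```

Because every point carries a nonzero weight for every center, a poorly placed center is still pulled toward data that the other centers cover badly, which is one way to read the reduced sensitivity to initialization. With K = 1 the weights are all equal and the update reduces to the ordinary mean.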
It has been demonstrated through a number of experiments on randomly generated data sets that KHM converges to a better local optimum than K-Means and EM, as measured by a common objective function of K-Means; see Zhang, B. (2003), “Comparison of the Performance of Center-based Clustering Processes,” the proceedings of PAKDD-03, Seoul, South Korea, April.