Machine learning combines techniques from statistics and artificial intelligence to create algorithms that can learn from empirical data and generalize to solve problems in various domains such as natural language processing, financial fraud detection, terrorism threat level detection, human health diagnosis, and the like. In recent years, more and more raw data that can potentially be utilized for machine learning models has been collected from a large variety of sources, such as sensors of various kinds, web server logs, social media services, financial transaction records, security cameras, and the like.
Clustering, or partitioning a set of observation records into multiple homogeneous groups or clusters based on similarities among the observations, is one of the more frequently used machine learning techniques. For example, at web-based retailing organizations, observation records associated with customer purchases or customers' web-page browsing behavior may be clustered to identify targets for customized sales promotions, advertising, recommendations of products likely to be of interest, and so on. Clustering may also be used as one of the steps in generating predictive machine learning models from raw observation records, e.g., to derive features with higher predictive utility than the raw observations, to reduce dimensionality, or simply to compress the raw data. Observation records may sometimes be clustered to help interested parties (e.g., managers or other decision makers at the organizations at which observation records are collected) gain additional insights into relationships among different segments of the data, e.g., to help decide how a given data set can best be utilized for business purposes.
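By way of illustration only, the partitioning of observation records into homogeneous groups described above may be sketched with a basic k-means procedure over numeric records. The description does not name a particular clustering algorithm; k-means is chosen here purely as a familiar example. The function names (`squared_distance`, `mean_point`, `kmeans`), the use of squared Euclidean distance as the similarity measure, and the initialization from the first k records are all assumptions made for this sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of numeric records."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def kmeans(records, k, max_iterations=20):
    """Partition numeric records into k clusters (illustrative sketch only).

    Assumption: the first k records serve as the initial centroids, which
    keeps the example deterministic; practical systems typically use a more
    careful initialization.
    """
    centroids = [tuple(r) for r in records[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assign each record to the index of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(r, centroids[j]))
            for r in records
        ]
        if new_assignments == assignments:
            break  # no record changed cluster, so the partition is stable
        assignments = new_assignments
        # Move each centroid to the mean of the records assigned to it.
        for j in range(k):
            members = [r for r, a in zip(records, assignments) if a == j]
            if members:
                centroids[j] = mean_point(members)
    return assignments, centroids


if __name__ == "__main__":
    # Hypothetical two-attribute observation records forming two groups.
    records = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8),
               (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
    labels, centers = kmeans(records, k=2)
    print(labels)
    print(centers)
```

In this sketch, records that end up with the same label form one cluster, and each centroid summarizes the group to which it belongs. The sketch handles numeric attributes only.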
Observation records of machine learning data sets may include values of a number of different types of attributes, such as numeric attributes, binary or Boolean attributes, categorical attributes and text attributes. The sizes of the data sets used for many machine learning applications, such as deep learning applications, can become quite large. Some machine learning data sets may include values for dozens or hundreds of attributes of different types, and a given data set may contain millions of observation records. For such data sets, it may not be straightforward to determine the relative importance of different attributes with respect to clustering. In general, clustering large data sets whose observation records include values for the different kinds of attributes may present a non-trivial challenge for several reasons—e.g., because of the level of statistical expertise which may be required, and/or because of the high requirements for resources such as computing power, memory, and storage.
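To make the difficulty with heterogeneous attributes concrete, the following sketch computes a single distance between two observation records that mix numeric, binary, categorical and text attributes. It is an assumption-laden illustration and not a method taken from this description: numeric attributes use a range-normalized absolute difference and binary or categorical attributes use simple matching (in the manner of Gower-style distances), text attributes use a Jaccard distance over whitespace-separated tokens, and all attributes are weighted equally. The schema, attribute names and `ranges` mapping are hypothetical.

```python
def numeric_distance(a, b, value_range):
    """Absolute difference scaled by the attribute's observed range."""
    if value_range == 0:
        return 0.0
    return abs(a - b) / value_range


def categorical_distance(a, b):
    """Simple matching: 0 for identical values, 1 otherwise."""
    return 0.0 if a == b else 1.0


def text_distance(a, b):
    """Jaccard distance between the lowercase token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def mixed_distance(rec_a, rec_b, schema, ranges):
    """Equal-weight average of per-attribute distances (an assumption).

    schema maps attribute name -> "numeric", "binary", "categorical" or "text".
    ranges maps each numeric attribute name -> its range in the data set.
    """
    total = 0.0
    for name, kind in schema.items():
        a, b = rec_a[name], rec_b[name]
        if kind == "numeric":
            total += numeric_distance(a, b, ranges[name])
        elif kind in ("binary", "categorical"):
            total += categorical_distance(a, b)
        elif kind == "text":
            total += text_distance(a, b)
        else:
            raise ValueError(f"unknown attribute type: {kind}")
    return total / len(schema)


if __name__ == "__main__":
    # Hypothetical schema and records, for illustration only.
    schema = {"price": "numeric", "is_member": "binary",
              "category": "categorical", "review": "text"}
    ranges = {"price": 100.0}
    a = {"price": 20.0, "is_member": True,
         "category": "books", "review": "great fast delivery"}
    b = {"price": 70.0, "is_member": True,
         "category": "toys", "review": "fast delivery"}
    print(mixed_distance(a, b, schema, ranges))
```

The equal weighting in `mixed_distance` is only a placeholder. Choosing weights that reflect the relative importance of the different attributes is the part that may not be straightforward for large data sets.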
While embodiments are described herein by way of example with reference to several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.