This invention relates to clustering, for example, of seasonalities to better forecast demand for items of commerce.
Clustering means grouping objects, each of which is defined by values of attributes, so that similar objects belong to the same cluster and dissimilar objects belong to different clusters. Clustering has applications in many fields, including medicine, astronomy, marketing, and finance.
Clustering is typically done on the assumption that the attribute values representing each object to be clustered are known deterministically, with no errors. Yet the exact values representing an object to be clustered are often not available. Sometimes statistical methods are used to obtain estimated or average values for a given object.
In general, in one aspect, the invention features a method that includes (a) receiving a set of data containing values associated with respective data points, the values associated with each of the data points being characterized by a distribution, (b) expressing the values for each of the data points in a form that includes information about a distribution of the values for each of the data points, and (c) using the distribution information in clustering the set of data with at least one other set of data containing values associated with data points.
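Step (b) above can be illustrated with a minimal sketch that assumes several raw observations are available at each data point and that a mean and a sample standard deviation are an adequate description of the distribution. The function name `summarize` and the sample figures are hypothetical and serve only as illustration.

```python
from statistics import mean, stdev

def summarize(observations):
    """Express each data point as (mean, standard deviation).

    observations[t] is the list of raw values observed at data point t.
    A data point with a single observation gets a standard deviation of 0.0,
    since no spread can be estimated from one value.
    """
    return [
        (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for values in observations
    ]

# Hypothetical seasonality observations for one item at three time points.
raw = [[0.9, 1.1, 1.0], [1.4, 1.6, 1.5], [0.5, 0.7, 0.6]]
profile = summarize(raw)
```

Each entry of `profile` then carries both the estimated value and a measure of how reliable that estimate is, which is the information used in the clustering of step (c).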
Implementations of the invention may include one or more of the following features. The respective data points are related in a time-sequence. The data points relate to a seasonality of at least one item of commerce. Each of the sets of data relates to seasonalities of items of commerce. The items of commerce comprise retail products, the data points relate to times during a season, and the values associated with each of the data points correspond to respective ones of the retail products. The method also includes determining statistical measures of the variability of the values with respect to the data point. The data is expressed in a form that includes a mean of the values associated with a data point and a statistical measure of the distribution with respect to the mean. The statistical measure comprises a standard deviation. The clustering of data includes measuring a distance between pairs of the sets of data. The distance is measured based on the means and variances at the data points. The distribution of the values is Gaussian. The clustering of data includes merging the data sets belonging to a cluster using a weighted average. The method includes merging the seasonalities of the data sets belonging to a cluster.
In general, in another aspect, the invention features a machine-accessible medium that when accessed results in a machine performing operations that include: (a) receiving a set of data containing values associated with respective data points, the values associated with each of the data points being characterized by a distribution, (b) expressing the values for each of the data points in a form that includes information about a distribution of the values for each of the data points, and (c) using the distribution information in clustering the set of data with at least one other set of data containing values associated with data points.
In general, in another aspect, the invention features a method that includes (a) receiving sets of data, each of the sets containing values associated with respective data points, the values associated with each of the data points being characterized by a distribution, (b) evaluating a distance function that characterizes the similarity or dissimilarity of at least two of the sets of data, the distance function including a factor based on the distributions of the values in the sets, and (c) using the evaluation of the distance function as a basis for clustering of the sets of data.
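One possible form of such a distance function is sketched below. It is an assumption made for illustration, not necessarily the function used by the invention: the squared difference of the means at each data point is divided by the sum of the two variances and then averaged over the data points. The name `distance` and the floor parameter `eps` are hypothetical.

```python
def distance(a, b, eps=1e-12):
    """Variance-scaled distance between two sets of data.

    a and b are equal-length lists of (mean, standard deviation) pairs,
    one pair per data point. eps is a small floor that avoids division
    by zero when both standard deviations are zero.
    """
    if len(a) != len(b):
        raise ValueError("both sets must cover the same data points")
    total = 0.0
    for (mean_a, std_a), (mean_b, std_b) in zip(a, b):
        combined_variance = std_a ** 2 + std_b ** 2
        total += (mean_a - mean_b) ** 2 / max(combined_variance, eps)
    return total / len(a)

# Hypothetical profiles: b and c have the same means, but c is much noisier.
a = [(1.0, 0.1), (1.5, 0.1)]
b = [(1.2, 0.1), (1.5, 0.1)]
c = [(1.2, 0.5), (1.5, 0.5)]
d_ab = distance(a, b)
d_ac = distance(a, c)
```

Under this assumed form, `d_ac` is smaller than `d_ab` even though the means of `b` and `c` are identical: a difference in means counts for less when the values involved are less certain.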
In general, in another aspect, the invention features a method that includes (a) receiving data that represents seasonality time-series for each of a set of retail items, the data also representing error information associated with data values for each of a series of time points for each of the items, and (b) forming composite seasonality time-series based on respective clusters of the retail item seasonality time-series, the composites formed based in part on the error information.
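The formation of a composite from the members of a cluster can be sketched as follows, assuming that the error information is a standard deviation at each time point, that the errors are independent and Gaussian, and that the weighted average uses inverse-variance weights. These are illustrative assumptions, and the name `merge_cluster` is hypothetical.

```python
def merge_cluster(series, eps=1e-12):
    """Merge seasonality time-series belonging to one cluster.

    series is a list of profiles, each a list of (mean, standard deviation)
    pairs over the same time points. At each time point the composite mean
    is the inverse-variance weighted average of the member means, so that
    more reliable values contribute more. The composite standard deviation
    is that of the weighted average under the independence assumption.
    """
    composite = []
    for t in range(len(series[0])):
        weights = [1.0 / max(profile[t][1] ** 2, eps) for profile in series]
        weight_sum = sum(weights)
        merged_mean = sum(
            w * profile[t][0] for w, profile in zip(weights, series)
        ) / weight_sum
        merged_std = (1.0 / weight_sum) ** 0.5
        composite.append((merged_mean, merged_std))
    return composite

# Two hypothetical items observed at a single time point.
items = [[(1.0, 0.1)], [(2.0, 0.3)]]
composite = merge_cluster(items)
```

In this example the composite mean lies close to 1.0 rather than at the simple average of 1.5, because the first item's value has the smaller error.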
Other advantages and features will become apparent from the following description and from the claims.