The present invention generally relates to methods for demand modeling in retail categories, e.g., for use in retail decision-support applications; and, more particularly, the present invention relates to a method for demand modeling in retail categories using retail sales data sets with missing data elements.
The demand models required for decision-support applications widely used by various entities in the retail supply chain such as, for example, consumer product manufacturers, consumer retail chains, and individual retail stores, are typically obtained by fitting regression models to the historic sales data for the relevant products in the retail category. Subsequently, depending on the application, these fitted regression models are used to obtain predictions for the demand and/or the price sensitivity of demand for the relevant products in the retail category, based on the various marketing actions and market conditions.
In a current embodiment of demand modeling, the retail sales data sets that are required for the regression analysis are typically obtained in the form of time-series sequences of unit-prices and corresponding unit-sales for the relevant products in the specified retail category of interest, over a collection of stores in the particular market geography of interest. The typical time period for the sales reporting in these retail data sets is weekly, and the sequence of time series values can range over a period of several months to years.
In essence, therefore, these sales data sets are typically comprised of individual data records containing the unit-price and corresponding unit-sales values for a set of relevant products (whose elements are indexed by “p”), a set of relevant stores (whose elements are indexed by “s”), and the set of consecutive time-periods (whose elements are indexed by “t”). Parenthetically, it is noted that the unit-price field may not be explicitly specified in the sales data sets, but its value can always be readily ascertained, for example, from the total-revenue and unit-sales fields for each (p, s, t) combination.
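By way of a hedged illustration only, and not as part of any described method, the derivation of the unit-price from the total-revenue and unit-sales fields may be sketched in Python as follows. The function name unit_price, the dictionary layout keyed by (p, s, t), and the sample values are assumptions introduced for this sketch. It is noted that the ratio is undefined when the unit-sales value is zero or absent, in which case the sketch returns None.

```python
from typing import Optional


def unit_price(total_revenue: Optional[float],
               unit_sales: Optional[float]) -> Optional[float]:
    """Return revenue per unit sold, or None when it cannot be derived."""
    if total_revenue is None or unit_sales is None or unit_sales == 0:
        return None
    return total_revenue / unit_sales


# Hypothetical records keyed by (p, s, t): (total_revenue, unit_sales).
records = {
    ("p1", "s1", 1): (25.0, 10.0),
    ("p1", "s1", 2): (None, None),  # missing data record
    ("p2", "s1", 1): (0.0, 0.0),    # no sales in that week
}

prices = {key: unit_price(rev, units) for key, (rev, units) in records.items()}
```

In this sketch only the first record yields a unit-price (2.5); the other two remain unspecified and would be candidates for imputation.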
Suitable sales data sets for the demand modeling analysis, of the form described above, may be obtained from a variety of sources, including, for example, retail point-of-sales systems, vendor-managed inventory and billing systems, and databases of commercial aggregators of retail data such as Nielsen (http://en-us.nielsen.com) and SymphonyIRI (http://www.symphonyiri.com).
Furthermore, whenever possible, these sales data sets can also be augmented with other data sources containing information on brand advertising, promotional data, shelf and display data, and product stock and inventory positions, which will improve the accuracy and interpretability of the resulting demand models. However, the use of these additional data sources has often been limited, primarily due to the difficulty of acquiring this data. Even when acquired, this data is often incomplete and may contain missing data elements, which in turn leads to difficulties in using the usual methods for demand modeling analysis in the prior art, since these methods invariably require complete data sets with no missing data elements.
The use of multi-product and multi-store data, as described above, can be of considerable value for demand modeling in a retail category. For instance, in many applications, the specific focus of the demand-modeling analysis is on a single product or on a small set of target products of interest, but nevertheless, it is always advantageous to jointly model the demand for these products in the context of a larger subset of products, even perhaps the entire retail category that contains this product subset, in order to ensure that any relevant cross-elasticity effects due to product substitution or product “drag” are properly incorporated in the demand modeling analysis. Product substitution refers to the substitution of a promoted product for a competitor product to which it is equivalent in consumer functionality which leads to a cannibalization of the competitor product sales, and product drag refers to the ability of a promoted product to increase the sales of the associated non-substitutable products that tend to be jointly purchased with the promoted product. Similarly, in many applications, the specific focus of the demand-modeling analysis is often on a single store of interest, but nevertheless, it is always advantageous to examine the sales data across multiple stores that stock the same product set for the retail category, so that for instance, the data may be pooled across the stores to reduce the estimation errors for the parameters in the demand model.
Alternatively, rather than pooling, the store-level data may be used to identify important store-level effects on the demand models, as described in the prior art by P. Chintagunta, J. P. Dube and V. Singh, “Market structure across stores: An application of a random coefficients logit model with store level data,” in Advances in Econometrics, eds. P. H. Franses and A. Montgomery, Amsterdam N.Y., JAI Press, 2002.
It would be highly desirable to provide multi-product, multi-store and multi-time period data sets for demand modeling in a manner that addresses a pervasive limitation that arises, in this regard, due to the invariable presence of missing data records and missing data elements in the relevant sales data sets for specific combinations of product “p”, store “s” and time-period “t”.
There are now considered some of the limitations of the prior art for the specification and imputation of the missing data elements.
For instance, one approach that is widely used in the prior art for missing data elements in demand modeling analysis is to simply exclude the entire set of related data records for all products in the modeling choice set for any (s, t) combination for which there is even a single product that has missing data elements. This is the so-called “record deletion” approach (which is also often termed the “complete case” approach), and it is necessary in this context since, by default, the methods in the prior art for demand modeling analysis cannot use any sales data records for which even one of the products in the choice set with the same (s, t) combination has missing target values (or in this case, equivalently, missing unit-sales values). That is, the demand modeling analysis must exclude the data records for all products for a particular (s, t) combination, if even a single product in this choice set has a missing data record for that (s, t) combination. It can be readily seen that this “record deletion” approach will significantly reduce the size of the data set, since a large number of data records with valid and non-missing values are also excluded from the demand modeling analysis, in addition to the typically smaller fraction of data elements that actually have missing values.
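As a minimal sketch of the “record deletion” approach described above, and under stated assumptions only, the following Python fragment drops every (s, t) combination in which any product of the choice set lacks a unit-sales value. The function name record_deletion, the dictionary layout keyed by (p, s, t), and the sample data are hypothetical and introduced solely for illustration.

```python
def record_deletion(data, products):
    """Keep only the (s, t) combinations in which every product in the
    choice set has a non-missing unit-sales value."""
    kept = {}
    store_weeks = {(s, t) for (_, s, t) in data}
    for (s, t) in store_weeks:
        rows = {p: data.get((p, s, t)) for p in products}
        if all(v is not None for v in rows.values()):
            for p, v in rows.items():
                kept[(p, s, t)] = v
    return kept


# Hypothetical data: 3 products, 1 store, 4 weeks of unit sales.
products = ["p1", "p2", "p3"]
data = {(p, "s1", t): 10.0 for p in products for t in range(1, 5)}
data[("p2", "s1", 2)] = None   # missing value in week 2
del data[("p3", "s1", 4)]      # missing record in week 4

kept = record_deletion(data, products)
n_valid = sum(v is not None for v in data.values())
n_kept = len(kept)
```

In this illustrative data only 2 of the 12 (p, s, t) combinations are missing, yet only 6 of the 10 valid records survive, because weeks 2 and 4 are discarded in their entirety.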
An alternative approach to “record deletion” that is also widely used in the prior art, which does not require having to discard the entire set of valid data elements for an (s, t) combination for which even a single product in the choice set has missing data, is single imputation, which is based on examining the pattern of the missing data elements in the sales data set in order to obtain estimates for the missing data elements, and in this way to “complete” the data set for the demand modeling analysis. For example, a sequence of missing values in the time series for a given (p, s) combination, at either the beginning or end of the time series data set, strongly suggests that these missing values have a root cause which can be attributed to the late introduction or early withdrawal of the product in the specific store; for this root cause, the corresponding unit-sales can be unambiguously specified to be zero.
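The edge-of-series rule described above may be sketched, purely for illustration, as follows. The function name fill_edge_gaps_with_zero and the representation of a (p, s) time series as a Python list with None for missing weeks are assumptions of this sketch. Missing values that lie between observed values are deliberately left untouched, since the late-introduction or early-withdrawal explanation does not apply to them.

```python
def fill_edge_gaps_with_zero(series):
    """Set leading and trailing runs of missing unit-sales to zero.

    `series` is the weekly unit-sales list for one (p, s) combination,
    with None marking a missing week. Interior gaps are left as None.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        return list(series)  # nothing observed, so no edges to infer
    first, last = observed[0], observed[-1]
    return [
        0.0 if v is None and (i < first or i > last) else v
        for i, v in enumerate(series)
    ]


# Hypothetical series: product introduced in week 3, withdrawn after week 5.
weekly_units = [None, None, 5.0, None, 7.0, None]
filled = fill_edge_gaps_with_zero(weekly_units)
```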
Alternatively, in situations where there are actual non-zero unit sales for a particular (p, s, t) combination, but nevertheless, the relevant data record was omitted from the data set, the missing unit-sales and unit-price values for the data element corresponding to a certain (p, s, t) combination can be imputed by replacing it by the mean (that is, the average) of the corresponding values of unit-sales and unit-price over the other stores in the same retail chain for the same (p, t) combination. The approach of imputing any missing values by their corresponding mean value over the remaining non-missing data values has the disadvantage that it deflates the variance of the unit-sales and unit-price variation, distorts the cross-product and cross-time correlations in the unit-prices and unit-sales data, and biases the relationship between the unit-price and unit-sales in the “complete” data set that contains these mean-imputed values.
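The variance deflation noted above can be seen in a small numerical sketch. The function name mean_impute and the sample unit-sales values, taken to represent six stores of one chain for a single (p, t) combination, are hypothetical and introduced only for this illustration.

```python
from statistics import mean, pvariance


def mean_impute(values):
    """Replace each missing value by the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical unit sales across six stores for one (p, t) combination.
sales_by_store = [12.0, None, 8.0, None, 10.0, 14.0]

observed = [v for v in sales_by_store if v is not None]
completed = mean_impute(sales_by_store)

var_observed = pvariance(observed)    # variance of the observed values
var_completed = pvariance(completed)  # variance after mean imputation
```

Here the variance falls from 5.0 over the four observed stores to about 3.33 over the six stores of the “complete” data set, because each imputed value sits exactly at the mean and contributes no spread.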
In view of this it would be highly desirable to provide an improved system and method for specifying or imputing missing data elements in the retail-sales data sets used for demand modeling.
There are many developments in the prior art for imputing missing values in data sets that are used for statistical analysis, wherein in general terms, the relevant methods are based on classifying the mechanism that is responsible for the pattern of missing values in the data sets, as described by R. Little and D. Rubin, “Statistical Analysis with Missing Data,” 2nd Edition, Wiley and Sons, 2002; J. L. Schafer and J. W. Graham, “Missing Data: Our View of the State of the Art”, Psychological Methods, Vol. 7, No. 2, pages 147-177 (2002). For instance, the missing value patterns could be termed “Missing Completely At Random” (or MCAR) if it is assumed that the probability of a given record having missing values is the same for all records (that is, the pattern of missing values is completely independent of the remaining variables and factors in the data set, and as a result, excluding the data records with these missing data elements from the data set, as in the “record deletion” approach, does not lead to any statistical bias from using this selection mechanism for the retained data records for the demand modeling analysis). Although the MCAR assumption may be tenable for certain root causes in retail sales data sets, it can be readily discerned that in most cases, the pattern of missing values depends on other observed factors within the data set, and the resulting missing value patterns are termed “Missing At Random” (or MAR). If neither the MCAR nor the MAR assumption is valid, then the alternative, wherein the pattern of missing values may depend on unobserved factors, or even on the magnitude of the missing value itself, would be termed “Missing Not At Random” (or MNAR); this alternative is difficult to analyze and requires explicit modeling of the missing data mechanism.
A major methodological development in the prior art for missing data in statistical data sets is the use of multiple imputation, wherein multiple complete data sets are obtained, and wherein the missing values in the original data set take on a range of imputed values across these multiple complete data sets. Unlike the single imputation case in which there is only a single complete data set, the use of multiple imputation allows the randomness and variability of the missing data estimate to be captured for any subsequent statistical analysis; this statistical analysis can be carried out separately for each of the multiple complete data sets in the conventional way, and the results from these separate analyses can be suitably combined, thereby yielding more robust estimates for the model parameters and their standard errors than would be possible from a single complete data set. A description of multiple imputation may be found in D. B. Rubin. “Multiple imputation after 18+ years (with discussion).” Journal of the American Statistical Association, Vol. 91, pages 473-489, 1996. An important aspect of the methodology described in Rubin is that the number of the multiple complete data sets can be small, and typically between 3 and 5 complete data sets are sufficient for the subsequent statistical modeling.
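The combination of the separate analyses may be sketched as follows, using the standard pooling rules for multiple imputation associated with Rubin: the pooled estimate is the mean of the per-data-set estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance, where m is the number of complete data sets. The function name pool_estimates and the numerical inputs, which stand in for a price-sensitivity coefficient fitted on three complete data sets, are hypothetical and introduced only for this sketch.

```python
from math import sqrt
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool one parameter across m imputed data sets.

    `estimates` holds the m point estimates and `variances` holds the
    m squared standard errors from the separate analyses.
    """
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, sqrt(total)


# Hypothetical coefficient estimates from m = 3 complete data sets.
estimates = [-1.8, -2.0, -2.2]
variances = [0.04, 0.05, 0.06]

pooled_estimate, pooled_se = pool_estimates(estimates, variances)
```

In this sketch the pooled standard error (about 0.32) exceeds the roughly 0.22 that any single complete data set would report, which reflects the additional uncertainty attributable to the missing data.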
It would thus be desirable to provide a system and method implementing machine-executable steps that address the missing values in the sales data sets, and that address several specific concerns and characteristics of the retail demand modeling application.
For example, one approach for handling missing values in the data sets for demand modeling analysis that is consistent with the prior art is to use a standard off-the-shelf multiple imputation technique before carrying out the demand modeling analysis. For example, the “chained equation” approach described in T. E. Raghunathan, J. M. Lepkowski, J. Van Hoewyk and P. Solenberger, “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models”, Survey Methodology, Vol. 27, No. 1, pages 85-95, 2001 is an advanced multiple imputation technique in the prior art for the MAR case, wherein a multivariate model is assumed for all the variables in the data set, and in particular, those variables with missing data fields are assumed to have some conditional distribution based on the other variables in the data. Since this dependency assumption can lead to cyclic dependencies between variables having missing values, the imputation procedure can sequentially iterate to compute the required missing values consistent with this assumed multivariate form. There are two difficulties with this approach. The first difficulty is that the form of conditional dependency between the unit-sales and the unit-price variables in the missing data imputation may be inconsistent with the form that is used in the subsequent demand modeling analysis, which may involve a more detailed set of factors and a more complex model function dependency between the response unit-sales variables and the covariate unit-price variables. The second difficulty is that the inclusion of the detailed and complex demand response relationship in the multiple imputation makes it impossible to use existing “off-the-shelf” multiple imputation software (Y. C. Yuan, Multiple Imputation for Missing Data: Concepts and New Developments, Abstract P267-25, Proceedings of the Annual SAS Users Group International Conference 2000), which often supports only very simple multivariate dependence models.
Even when more complex dependencies are supported, the resulting increase in the computational cost of the method steps makes the chained equation approach all but impractical for large and high-dimensional data sets found in retail applications.
It would be further desirable to decouple the imputation steps for the missing unit-price and the missing unit-sales values in the sales data set, which is specifically applicable for the missing data elements wherein both these fields are missing in the same data record corresponding to a specific (s, t) combination.