The present disclosure relates generally to methods for imputing missing data elements or values in data sets, and in retail data sets in particular. Such imputation is an important prerequisite for a variety of decision-support applications in a retail supply chain, which are premised on the availability of complete relevant data with no missing data elements. More particularly, the present disclosure relates to a system and method for multiple imputation of missing data elements in retail data sets based on the multi-dimensional, tensor representation of these data sets.
The imputation of missing data elements in retail data sets is an important prerequisite for using these retail data sets in a variety of decision-support applications of interest to retail supply-chain entities such as consumer-product manufacturers, retail chains and individual retail stores. This prerequisite invariably arises because, in practice, decision-support applications require the relevant data sets to be complete, with no missing values, whereas it is often difficult or even impossible, for various reasons, to obtain such complete retail data sets. Examples of relevant decision-support applications include, but are not limited to, product demand forecasting, inventory optimization, strategic product pricing, product-line rationalization, and promotion planning.
Some retail data sets have a particular multi-dimensional structure and although this structure is common to many decision-support applications, it is often not explicitly specified or exploited in the method steps of the current modeling and analysis.
Two particular limitations of the prior art techniques that may be used for the imputation of missing data elements in retail data sets are as follows. First, in the prior art, the missing data elements are typically replaced by point estimates of their imputed values. The complete data set resulting from this replacement therefore does not capture the natural variability which would have resulted if the missing data elements had actually been recorded instead of being imputed, and as a consequence, a statistical bias is introduced into any subsequent analysis using the complete data set. Second, the imputation procedures that are used in the prior art typically ignore any data correlations along the various data set dimensions, or may only consider these data correlations along a single dimension of the retail data set.
In a prior art embodiment of a retail sales data set that is commonly found in many decision-support applications, there is considered a time-series sequence of various specific quantities such as unit-sales, unit-prices, stock levels, delivery levels, unsold goods, discards, etc., for a specific time-period of interest, over a collection of products in a specified retail category of interest, and simultaneously over a collection of stores in the particular market geography of interest. For instance, in typical retail sales data sets, the typical time period for this reporting may be weekly, and data may be collected in a sequence of several months to several years over hundreds of products and stores.
In essence, therefore, these retail data sets have a multi-dimensional structure, in which the specific quantities of interest mentioned above are measured and reported for a set of relevant products (whose elements are indexed by “p”), a set of relevant stores (whose elements are indexed by “s”), and a set of consecutive time-periods (whose elements are indexed by “t”), or equivalently, over a set of (p,s,t) combinations.
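The (p,s,t) indexing described above can be illustrated with a small sketch. The product, store and week labels, the unit-sales values, and the helper names build_tensor and missing_cells are all hypothetical and chosen only for illustration; the sketch merely shows one possible way of holding such a data set as a three-way array in which missing data elements are marked explicitly.

```python
import math

# Hypothetical dimensions chosen only for illustration.
products = ["p0", "p1"]
stores = ["s0", "s1", "s2"]
weeks = [0, 1, 2, 3]

# Observed unit-sales keyed by (p, s, t); the values are made up.
observed = {
    ("p0", "s0", 0): 12.0,
    ("p0", "s1", 0): 9.0,
    ("p1", "s2", 3): 4.0,
}


def build_tensor(observed, products, stores, weeks):
    """Return a nested list tensor[p][s][t], with NaN marking missing cells."""
    return [
        [
            [observed.get((p, s, t), math.nan) for t in weeks]
            for s in stores
        ]
        for p in products
    ]


def missing_cells(tensor):
    """List the (p, s, t) index triples whose value is missing."""
    return [
        (i, j, k)
        for i, plane in enumerate(tensor)
        for j, row in enumerate(plane)
        for k, value in enumerate(row)
        if math.isnan(value)
    ]


tensor = build_tensor(observed, products, stores, weeks)
print(len(missing_cells(tensor)), "of", 2 * 3 * 4, "cells are missing")
```

In this toy case most cells are missing, which exaggerates the sparsity but makes the role of the (p,s,t) index explicit.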
The use of multi-product and multi-store data, as described above, is of considerable value for any statistical analysis of interest in decision-support applications, even when, as is often the case, the specific focus of the statistical-modeling or decision-support application is confined to a single product, or to a small set of target products of interest. Specifically, even in this case, data may be examined across multiple stores, or across the entire retail category, so that, for instance, while building statistical models, the data may be pooled across the stores to reduce the estimation errors for the model parameters. However, the inherent difficulty in acquiring this multi-dimensional data across the product, store and time-period dimensions invariably leads to these data sets having many missing data elements, which occur for specific combinations (p,s,t) of product “p”, store “s” and time-period “t” in the data set.
In the retail environment, the presence of missing data elements for a particular (p,s,t) combination may be ascribable to a variety of reasons, such as certain privacy and confidentiality issues in acquiring the relevant data elements, or, what is more likely in practice, certain process errors in the data logging, reporting or integration required for compiling and assembling the required retail data set.
It would be highly desirable to provide multi-product, multi-store and multi-time period data sets for demand modeling that address a pervasive limitation that arises, in this regard, due to the invariable presence of missing data records and missing data elements in the relevant sales data sets for specific combinations of product “p”, store “s” and time-period “t”.
Some of the limitations of the prior art for the handling, specification and imputation of the missing data elements are now considered.
Generally, the prior art for missing value imputation in data sets has been developed in the context of statistical analysis in the presence of missing data, as reviewed by R. Little and D. Rubin, “Statistical Analysis with Missing Data,” 2nd Edition, Wiley and Sons, 2002, wherein, in general terms, the approaches are based on classifying the mechanism that is responsible for the pattern of missing values in the data sets. For instance, these missing value patterns would be termed “Missing Completely At Random” (or MCAR) if it is assumed that the probability of a given record having a missing data element is the same for all records (that is, the pattern of missing values is completely independent of the remaining variables and factors in the data set, so that excluding any data records with these missing data elements from the data set, as in the “record deletion” approach described below, does not lead to any statistical bias in the retained data records used for the demand modeling analysis). Although the MCAR assumption may be tenable for certain types of missing values in retail data sets, in most cases, the pattern of missing values will depend on other observed factors within the data set, and the resulting missing value patterns would be termed “Missing At Random” (or MAR). The remaining cases, wherein the pattern of missing values may depend on unobserved factors, or even on the magnitude of the missing value itself, are difficult to analyze and require explicit modeling.
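The practical difference between the MCAR and MAR mechanisms can be illustrated with a small simulation. The linear price-to-sales relationship, the missingness probabilities, the sample size and the random seed below are all assumptions made purely for illustration and are not taken from any retail data set. The sketch compares the mean unit-sales of the retained records with the mean of the full data under each mechanism.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed, an arbitrary choice for repeatability
n = 20000

# Synthetic records: unit-sales fall as price rises (assumed relationship).
prices = [rng.uniform(0.0, 10.0) for _ in range(n)]
sales = [100.0 - 5.0 * x + rng.gauss(0.0, 5.0) for x in prices]


def observed_mean(values, keep_flags):
    """Mean of the values whose records are retained (not missing)."""
    return statistics.mean([v for v, keep in zip(values, keep_flags) if keep])


# MCAR: every record has the same 30% chance of a missing sales value.
mcar_keep = [rng.random() >= 0.3 for _ in range(n)]

# MAR: the chance of a missing sales value depends on the observed price;
# here only records with price above 5 can be missing (80% chance).
mar_keep = [rng.random() >= (0.8 if x > 5.0 else 0.0) for x in prices]

full_mean = statistics.mean(sales)
mcar_mean = observed_mean(sales, mcar_keep)
mar_mean = observed_mean(sales, mar_keep)

print(f"full data mean:      {full_mean:.2f}")
print(f"MCAR, retained mean: {mcar_mean:.2f}")
print(f"MAR, retained mean:  {mar_mean:.2f}")
```

Under these assumptions the MCAR mean stays close to the full-data mean, while the MAR mean is shifted upward because the low-sales, high-price records are disproportionately missing.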
One of the most common approaches in the prior art for handling missing data elements is to simply omit, ignore and exclude the entire set of data elements; however, for many statistical methods that require a complete set of data elements for each data record that is used in the analysis, this approach is equivalent to deleting the entire record, which would even include many data elements that are non-missing. For instance, if the relevant record corresponded to the unit-sales for all the products in a given store, then the entire set of data elements would be excluded if the unit-sales for just a single product is missing; this is often referred to as the “record deletion” approach in statistical analysis (equivalently, this is also referred to as the “complete case” approach). It can be readily seen that this “record deletion” approach leads to a significant reduction in the data set size, including the exclusion of valid and non-missing data elements in the retail data set which may have been acquired at considerable effort and expense. Furthermore, it can also lead to significant statistical bias, as mentioned earlier, when the pattern of missing data elements depends on the values of the other data elements in the same data records, corresponding to the MAR case described earlier.
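The cost of the “record deletion” approach can be illustrated with a minimal sketch. The records below are made-up unit-sales values for three products, one record per store-week, and the helper name delete_incomplete is hypothetical.

```python
import math

# Each record holds unit-sales for three products in one store-week;
# the numbers are made up, and NaN marks a missing data element.
records = [
    [10.0, 7.0, 3.0],
    [11.0, math.nan, 4.0],
    [9.0, 6.0, math.nan],
    [12.0, 8.0, 5.0],
]


def delete_incomplete(records):
    """Keep only records in which every data element is present."""
    return [r for r in records if not any(math.isnan(v) for v in r)]


complete = delete_incomplete(records)

# Count the valid (non-missing) values that were discarded with their records.
valid_before = sum(not math.isnan(v) for r in records for v in r)
valid_after = sum(len(r) for r in complete)
discarded_valid = valid_before - valid_after

print(len(complete), "records kept;", discarded_valid, "valid values discarded")
```

In this toy case only two data elements are missing, yet half of the records, and with them four valid values, are excluded from the analysis.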
An alternative approach to “record deletion” that is also widely used in the prior art, and that does not have the deficiency of having to discard the entire record including the valid data elements, is termed “single imputation,” which in its simplest form consists of replacing the missing data elements in the sales data set by statistical estimates such as the mean value, either taken globally, or taken along some marginal dimension of the data set, and in this way obtaining a “complete” data set with the missing data elements filled in suitably. For example, a missing value for the data element corresponding to a certain (p,s,t) combination can be imputed by averaging the corresponding values over the other stores for the same (p,t) combination, or equivalently, across the store dimension, keeping (p,t) fixed. A similar approach can also be taken across the time dimension, that is, by averaging the corresponding values over time for the same (p,s) combination. However, this simplest approach of imputing the missing value by replacing it with the corresponding mean value over the remaining non-missing data values along one or more dimensions of the data set has the major disadvantage that it deflates the variance and distorts the correlations for the measured quantity in the “complete” data set with these “mean-imputed” values.
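The mean-based replacement described above, and its effect on the variance, can be sketched as follows. The values represent made-up unit-sales for one fixed (p,t) combination across six stores, and the helper name mean_impute is hypothetical.

```python
import math
import statistics

# Unit-sales of one product in one week across six stores (made-up values);
# NaN marks the stores for which the value is missing.
sales_across_stores = [10.0, 14.0, math.nan, 6.0, math.nan, 10.0]


def mean_impute(values):
    """Replace each missing value by the mean of the non-missing values."""
    observed = [v for v in values if not math.isnan(v)]
    fill = statistics.mean(observed)
    return [fill if math.isnan(v) else v for v in values]


filled = mean_impute(sales_across_stores)

observed_only = [v for v in sales_across_stores if not math.isnan(v)]
var_observed = statistics.variance(observed_only)  # sample variance
var_filled = statistics.variance(filled)

print("imputed data:", filled)
print(f"variance of observed values: {var_observed:.2f}")
print(f"variance after imputation:   {var_filled:.2f}")
```

Because every imputed value sits exactly at the mean, the sum of squared deviations is unchanged while the number of values grows, so the sample variance of the filled-in data is smaller than that of the observed values alone.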
More sophisticated methods for missing value imputation attempt to retain the naturally-occurring variance and correlation structures in the “complete” data set with the imputed values. The most widely used approach is based on multiple imputation, as reviewed by J. L. Schafer, “Analysis of Incomplete Multivariate Data,” Chapman and Hall, London (1997), wherein, instead of a single set of imputed values for the missing data elements, multiple complete data sets are created, each of which contains a representative sample for the missing values with any variability or noise “added back in,” and these multiple complete data sets are then used in subsequent analysis or decision-support procedures in suitable ways.
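The general idea of multiple imputation can be illustrated with a deliberately simplified sketch. It assumes that the missing values may be drawn from a normal distribution fitted to the observed values, which is an assumption made here for illustration only and is neither the method of the present disclosure nor the full procedure reviewed by Schafer (which would also account for uncertainty in the fitted parameters). The helper names multiple_impute and pool_mean are hypothetical; the pooling step uses the standard combining rules commonly attributed to Rubin, in which the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import math
import random
import statistics


def multiple_impute(values, m, seed=0):
    """Return m completed copies of `values`, each with noisy draws for NaNs.

    Assumption: draws come from a normal distribution fitted to the
    observed values; this is a simplification for illustration.
    """
    rng = random.Random(seed)
    observed = [v for v in values if not math.isnan(v)]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [
        [rng.gauss(mu, sigma) if math.isnan(v) else v for v in values]
        for _ in range(m)
    ]


def pool_mean(completed_sets):
    """Pool the per-data-set mean estimates across the m completed sets."""
    m = len(completed_sets)
    estimates = [statistics.mean(d) for d in completed_sets]
    # Within-imputation variance: average squared standard error of the mean.
    within = statistics.mean(
        [statistics.variance(d) / len(d) for d in completed_sets]
    )
    # Between-imputation variance: spread of the m estimates.
    between = statistics.variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return statistics.mean(estimates), total_variance


# Made-up unit-sales with two missing values.
sales = [10.0, 14.0, math.nan, 6.0, math.nan, 10.0]
completed = multiple_impute(sales, m=5)
pooled_estimate, pooled_variance = pool_mean(completed)

print(f"pooled mean estimate: {pooled_estimate:.2f}")
print(f"pooled variance of the estimate: {pooled_variance:.2f}")
```

Each completed data set keeps the observed values unchanged and differs only in the imputed entries, so the spread between the data sets reflects the uncertainty that a single point estimate would hide.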
It would be highly desirable to provide an improved method for the specification or imputation of missing data elements in the retail data sets.