In typical machine learning settings, a shortage of initial data available to build or execute a model frequently prevents the data from being used effectively. When the available data is limited, or does not contain enough information to satisfy threshold criteria for use in machine learning or data modeling, the data may be disregarded, at least until additional data is made available. As an example, a recommendation engine may not compute a recommendation for a user until at least a threshold amount of information is collected from the user, such as after the user interacts with a certain amount of content. However, even minimal amounts of data may be useful in making recommendations.
Prior solutions to this data shortage problem often assume that missing values are distributed similarly to the values that are present. Under this assumption, the missing values for a feature may be replaced with the mean of the values that are present for that feature. This approach assumes that feature values are missing completely at random (MCAR), meaning that whether a value is missing is unrelated to any observed or unobserved data.
Another solution that follows from the MCAR assumption is to replace the missing values with the median of the values that are present or, in some cases, with the most commonly occurring value (the mode).
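As a non-limiting illustration, the mean, median, and mode replacement described above may be sketched as follows. The function name `impute`, the `strategy` parameter, the use of `None` to mark a missing value, and the sample `ratings` data are assumptions made for this example only and do not come from any particular prior system.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed entries.

    Hypothetical helper: None marks a missing value, and `strategy`
    selects which statistic of the observed values is used as the fill.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Illustrative feature column with two missing entries.
ratings = [4, None, 5, 4, None, 1]

mean_filled = impute(ratings, "mean")      # missing entries become 3.5
median_filled = impute(ratings, "median")  # missing entries become 4.0
mode_filled = impute(ratings, "mode")      # missing entries become 4
```

In each case every missing entry receives the same fill value, regardless of which record it belongs to.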
Another approach has been to replace missing values with a constant. This approach assumes that values are missing not at random (MNAR), that is, that whether a value is missing depends on what the value would have been. However, for many data features, this is not a viable assumption.
In many cases, these assumptions, and the imputed values that follow from them, adversely affect the performance of the model and can severely distort the distribution of the variable. Furthermore, mean imputation distorts the relationships between variables, underestimates the standard deviation, and ignores any heterogeneity in the data records.
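The underestimation of the standard deviation noted above may be illustrated with a small numeric sketch. The sample values and the number of missing entries are assumptions chosen for the example; the population standard deviation is used so that the arithmetic is easy to check by hand.

```python
from statistics import mean, pstdev

# Illustrative observed values for one feature (assumed data).
observed = [2.0, 4.0, 6.0, 8.0]

# Suppose four further records are missing this feature and each is
# filled with the mean of the observed values (5.0).
n_missing = 4
imputed = observed + [mean(observed)] * n_missing

std_observed = pstdev(observed)  # spread of the observed values only
std_imputed = pstdev(imputed)    # spread after mean imputation

# The filled entries add no squared deviation but increase the count,
# so the variance falls from 5.0 to 2.5 in this example.
print(std_observed, std_imputed)
```

Because every imputed entry sits exactly at the mean, the spread of the completed column shrinks as the fraction of missing values grows, even though nothing new has been learned about the feature.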
While implementations are described herein by way of example, those skilled in the art will recognize that the implementations are not limited to the examples or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit implementations to the particular form disclosed but, on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.