High-quality data is critical to the success of predictive analytics, both in the development of analytic models and in their deployment and production use. Data understanding is the first step in developing predictive analytic models and is critical to their success. This process can be time-consuming and can take longer than the actual model development or the software deployment of the model.
For analytic model development, data understanding involves discovering which data elements, and which detailed relationships among the values of those data elements, have predictive power for the desired analytic decision. Data elements must be correctly collected, target values must be validated, and subpopulations must be understood. It is important that a data scientist be able to quickly inspect data and perform additional analyses to look for patterns and anomalies and to investigate data integrity. Even more important is that the system itself automatically determine patterns in the data that were not prescribed by the analytic scientist.
Because data understanding and data integrity are key components of developing analytics, this stage is critical to the development of meaningful predictive analytics. Typical data sets include examples from many subpopulations, each of which may have very different characteristics. A first look at the statistics of a data element may reveal multi-modality or apparent anomalies, and will motivate further questions. Multivariate analysis can then reveal whether the issues are specific to certain populations or segments. Other questions about the data include “how have these data elements changed between this month and last?” and, in a data consortium, “how does one client's data differ from another's?”, “why is a particular subpopulation accelerating further away from another?”, or “why is a population's behavior diverging from its historical behavior in a short span of time?” The faster such questions can be asked and answered, the more insight the data scientist can gain to build high-quality predictive models and to avoid spurious or non-representative learning in models.
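As an illustration of how a segment-level view can explain an apparent anomaly in pooled statistics, the following minimal sketch compares pooled and per-segment summaries. The segment names, the synthetic Gaussian values, and the helper `summarize_by_segment` are all hypothetical and chosen only for illustration; they are not part of any described system.

```python
import random
import statistics
from collections import defaultdict


def summarize_by_segment(records):
    """Return {segment: (count, mean, stdev)} for (segment, value) pairs."""
    groups = defaultdict(list)
    for segment, value in records:
        groups[segment].append(value)
    return {
        seg: (len(vals), statistics.mean(vals), statistics.stdev(vals))
        for seg, vals in groups.items()
    }


# Synthetic data (an assumption for illustration): two subpopulations
# whose values cluster around very different centers.
random.seed(1)
records = (
    [("segment_a", random.gauss(20.0, 2.0)) for _ in range(500)]
    + [("segment_b", random.gauss(80.0, 2.0)) for _ in range(500)]
)

# Pooled view: the two modes inflate the overall spread.
pooled = [value for _, value in records]
pooled_stdev = statistics.stdev(pooled)

# Segmented view: each subpopulation is tight around its own center.
per_segment = summarize_by_segment(records)

print(f"pooled: mean={statistics.mean(pooled):.1f} stdev={pooled_stdev:.1f}")
for seg, (n, mean, stdev) in sorted(per_segment.items()):
    print(f"{seg}: n={n} mean={mean:.1f} stdev={stdev:.1f}")
```

In this sketch the pooled spread is far larger than the spread within either segment, which is the kind of signal that would prompt a closer look at subpopulations. The same grouping could be keyed on month or client instead of segment to approach the other questions listed above.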
Understanding higher-dimensional data is a challenging problem because it is computationally intensive and because more than three dimensions are difficult to visualize. With the simple technique of binning data element ranges and taking the Cartesian product of the per-element bins, many of the resulting bins have very few counts, which makes it difficult to obtain stable estimates of distributions or to identify outliers. For elements that have many possible bins, the number of bins required for multivariate binning becomes intractable. For example, three variables, each with 100 bins, would require 1 million bins (100 × 100 × 100) for the multi-dimensional analysis, and most of those bins would likely not contain enough values to provide statistically sound estimates.
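The bin-count arithmetic and the resulting sparsity can be checked with a short sketch. It assumes synthetic, uniformly distributed values in [0, 1), a sample of 100,000 rows, equal-width bins, and a threshold of 30 observations per bin as a rough stand-in for "statistically sound"; none of these choices come from the text above, and real data would typically be even less evenly spread.

```python
import random
from collections import Counter


def joint_bin_counts(rows, bins_per_var, lo=0.0, hi=1.0):
    """Count rows per cell of an equal-width grid over [lo, hi) per variable."""
    width = (hi - lo) / bins_per_var
    counts = Counter()
    for row in rows:
        # Clamp the index so floating-point rounding cannot exceed the last bin.
        cell = tuple(
            min(int((x - lo) / width), bins_per_var - 1) for x in row
        )
        counts[cell] += 1
    return counts


# Assumed sizes for illustration: 3 variables, 100 bins each, 100,000 rows.
random.seed(0)
n_vars, bins_per_var, n_rows = 3, 100, 100_000
rows = [tuple(random.random() for _ in range(n_vars)) for _ in range(n_rows)]

total_cells = bins_per_var ** n_vars  # Cartesian product of per-variable bins
counts = joint_bin_counts(rows, bins_per_var)
occupied = len(counts)
empty_fraction = 1 - occupied / total_cells

# Hypothetical threshold: treat a cell as usable only with >= 30 observations.
min_count = 30
well_populated = sum(1 for c in counts.values() if c >= min_count)

print(f"total cells:      {total_cells}")
print(f"occupied cells:   {occupied}")
print(f"empty fraction:   {empty_fraction:.3f}")
print(f"cells with >= {min_count}: {well_populated}")
```

Because the number of occupied cells can never exceed the number of rows, at least 90% of the one million cells are empty under these assumptions, regardless of how the data is distributed.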