The present disclosure pertains to the field of data processing, particularly with regard to data structure design, machine learning, and predictive modeling. The novel aspects of this disclosure are as applicable to techniques used in the fields of academic research and administrative agency studies as they are in the commercial context. The present disclosure represents an improvement to the field of data science.
In recent years, the generation of enormous pools of data has presented an unprecedented opportunity for applying computing power to the task of creating ever more accurate models of the physical world in which we live. However, the enormous depth and complexity of the data sets generated, colloquially referred to as “Big Data”, have rendered traditional data processing applications inadequate.
The challenges presented by Big Data can broadly be separated into two categories: tasks related to data management and maintenance, and tasks related to data analysis and interpretation. These tasks are generally addressed by two separate and distinct groups of specialists: software engineers, primarily addressing data management and maintenance, and data scientists, focused on analysis and interpretation. Software engineers generate applications to curate the lifecycle of data: from receiving the data, to cataloging, addressing and storing the data within data structures, to retrieving data in response to a query or search, and eventually removing data when it is no longer accurate or valuable. Data scientists, on the other hand, can generate predictive models by applying tools, such as predictive modeling and machine learning applications, which can be useful in, for instance, assessing risks of adverse future outcomes based on current conditions. It is anticipated that demand for predictive modeling will only continue to rise.
The predictive models generated by data scientists can take many forms, including algorithms, matrices, and equations. Predictive modeling may include applying a variety of statistical techniques for identifying possible future outcomes. While the precision of predictive models can depend greatly on the complexity of the model and the number of parameters included, the accuracy of these predictive models is highly dependent on the initial sample size of applicable data that the model is applied against. This generally requires access to large stores of data, but it also requires that the data be structured and formatted to be accessible by the software tools utilized by the data scientists, which often require tabular or comma-delimited data sets.
From the software engineering perspective, satisfying the requirements of the data scientist creates additional demands on the software engineer's data curation role. Though the data is unquestionably valuable to the data scientist for generating predictive models, data is often primarily maintained for other record-keeping or warehousing purposes, and the structures in which that data is stored are generally geared toward those other purposes. Data warehousing is generally focused on maintaining data with a high degree of detail and granularity to ensure that the stored data is faithful to its source.
The structures in which data is stored or warehoused may have a significant impact on how effectively that data can be accessed in response to particular types of queries. While certain data structures, such as highly nested, hierarchical structures, are better suited for long-term storage of raw data relating to distinct and separate records, other data structures, such as comma-delimited files and tables, are better at facilitating rapid access to features in the data for research and analysis.
The competing design requirements of the data warehousing and data analysis paradigms tend to create a bottleneck in the continued development of predictive modeling using Big Data sets. Producing a system that both meets the needs data scientists have in testing and deploying their models and satisfies the record-keeping requirements of software engineers generally demands a large collaborative effort between data scientists and software engineers. Such an effort generally results in a bespoke solution which has little applicability, if any, to predictive models other than the one for which it is designed, requiring a new solution to be developed for each new model.
Complicated predictive models typically require leveraging a large engineering effort and deep technical expertise to provide the necessary data to test, verify and host the model. When attempting to do predictive modeling at scale, data scientists encounter problems relating to the data structure in which the data is warehoused.
Predictive models generally must be verified in order to demonstrate their value. Verification involves populating the model with empirical data, running the model to produce predictions, and comparing the predicted outcomes with empirically determined true outcomes. The verification process may be applied in an open-ended manner, wherein a large number of disparate features are analyzed for their comparative predictive power.
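By way of illustration only, the verification loop described above can be sketched in a few lines of Python. The rows, the feature names (`age`, `visits`), the outcome field (`adverse`), and the helper names `accuracy` and `threshold_model` are all hypothetical and are not taken from this disclosure; the simple threshold rule stands in for whatever predictive model is actually under test.

```python
def accuracy(predict, rows, outcome_key):
    """Fraction of rows whose predicted outcome matches the empirical outcome."""
    hits = sum(1 for row in rows if predict(row) == row[outcome_key])
    return hits / len(rows)


def threshold_model(feature, cutoff):
    """Toy stand-in for a predictive model: predicts True when feature >= cutoff."""
    return lambda row: row[feature] >= cutoff


# Hypothetical empirical data with known (true) outcomes.
ROWS = [
    {"age": 30, "visits": 1, "adverse": False},
    {"age": 62, "visits": 5, "adverse": True},
    {"age": 45, "visits": 4, "adverse": True},
    {"age": 58, "visits": 0, "adverse": False},
]

# Open-ended verification: several candidate features are scored side by side.
CANDIDATES = {
    "age>=50": threshold_model("age", 50),
    "visits>=3": threshold_model("visits", 3),
}

scores = {name: accuracy(model, ROWS, "adverse") for name, model in CANDIDATES.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # most predictive first
```

In this toy data the `visits` rule matches every true outcome while the `age` rule matches half of them, so `ranked` places `visits>=3` first. A real verification would use a held-out data set and a metric suited to the model in question.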
Predictive modeling/machine learning protocols, such as Predictive Model Markup Language (PMML), typically receive data library files in the form of a table or comma-delimited file. Data warehoused in a star-schema or a more sophisticated, domain-specific structure is not directly usable by machine learning protocols, not only because representing data to such high fidelity can make analysis of the data unwieldy but also because data warehousing structures, such as star-schema or complex hierarchical structures, are distinctly different from the data structures that these machine learning protocols are generally designed to rely on. More specifically, a number of machine learning protocols commonly used by data scientists to generate predictive models rely on calls to data stored in a tabular format or as a comma-delimited list. As such, data which is warehoused in a complex data structure, such as a star-schema, must first be converted to a table format or must be otherwise extracted from the data structure with a specialized data call. Additionally, as the high degree of fidelity maintained in the warehouse data may add complexity without providing additional, useful details which would affect the predictive model, the extraction process may be designed to be lossy so as to extract only the features which are impactful. Performing such operations for each predictive model on a case-by-case basis can be a time- and resource-intensive project, particularly as reliance on predictive modeling continues to increase. Loading warehoused data into active memory each time a model is run can account for a significant portion of the total processor load of the model, consuming CPU time.
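The kind of lossy, per-model extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the present disclosure: the nested record layout, the field names, and the helpers `get_path`, `extract_rows`, and `to_csv` are hypothetical, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical warehoused records: nested, hierarchical, high fidelity.
RECORDS = [
    {"id": "r1",
     "subject": {"age": 54, "address": {"city": "Akron", "zip": "44301"}},
     "events": [{"code": "A", "cost": 120.0}, {"code": "B", "cost": 80.0}]},
    {"id": "r2",
     "subject": {"age": 37, "address": {"city": "Dayton", "zip": "45402"}},
     "events": [{"code": "A", "cost": 40.0}]},
]


def get_path(record, path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    node = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Only the features assumed to matter to the model are extracted; everything
# else (for example the address) is deliberately dropped, making this lossy.
FEATURES = {
    "id": lambda r: get_path(r, "id"),
    "age": lambda r: get_path(r, "subject.age"),
    "event_count": lambda r: len(r.get("events", [])),
    "total_cost": lambda r: sum(e["cost"] for e in r.get("events", [])),
}


def extract_rows(records, features):
    """Flatten each nested record into one flat row of named features."""
    return [{name: fn(rec) for name, fn in features.items()} for rec in records]


def to_csv(rows, fieldnames):
    """Serialize flat rows as comma-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = extract_rows(RECORDS, FEATURES)
csv_text = to_csv(rows, list(FEATURES))
```

Each new model would typically need its own `FEATURES` mapping written against the warehouse layout, which is the case-by-case effort the paragraph above describes.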
Further, representing warehoused data only as a static table is valuable but limiting, as it does not allow for the models to take advantage of updates in the warehoused data. This is a particular disadvantage for predictive modeling, since comparison of predicted outcomes to updates in the data can be useful in verifying the veracity of the predictive models. Additionally, after the veracity of a model is demonstrated, the greatest value of that model will likely come from applying it to updated information to make future predictions.
Pulling data from the data warehouse or distributed big data systems, such as Apache Hadoop®, into the predictive models generally requires customized extraction procedures, which take an enormous engineering effort to implement. Such tailored solutions require an intimate knowledge of both the data structures and the predictive models. As such, they are generally the result of an exhaustive collaborative effort between the software engineers and the data scientists. Additionally, as the models develop, the demands for features may change, so the collaborative effort continues through all stages of development.