Use of computing devices and software is enabling advanced analytics to be performed on data sets from various sources, such as delimited files, database connectors, or in-memory tables. Data sets from different sources may be structured according to different schemas. For example, first data from a first data source may be structured according to a first schema, and second data from a second data source may be structured according to a second schema. The schemas may describe different logical structures for the respective data. For example, a schema may describe names, ordering, and data types of data fields (e.g., columns). A computing device that performs the advanced analytics may be unable to recognize or to process data structured according to a schema that differs from a schema that is “known” to the computing device.
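As an illustrative sketch (the schemas, field names, and values below are hypothetical, not taken from any particular system), the same logical facts may be structured under two different schemas, and a consumer coded against one schema may silently misread data structured under the other:

```python
# Two hypothetical schemas describing the same underlying facts with
# different field names, ordering, and data types.
schema_a = ("customer_id", "order_total", "order_date")   # total in dollars (float)
schema_b = ("order_date", "total_cents", "customer_id")   # total in cents (int)

row_a = (1001, 59.99, "2024-04-15")    # structured per schema_a
row_b = ("2024-04-15", 5999, "1001")   # same facts, structured per schema_b

def order_total(row):
    # A consumer written against schema_a: it assumes the total is the
    # field at index 1 and that it is expressed in dollars.
    return row[1]

order_total(row_a)  # 59.99, as intended
order_total(row_b)  # 5999 -- wrong field semantics: cents, not dollars
```

The second call does not fail loudly; it returns a value with the wrong meaning, which is one way a schema that is not "known" to the consumer can corrupt downstream analytics.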
To address schema changes, some organizations may dedicate developer resources to program a new data input job (e.g., an extraction, transformation, and loading (ETL) tool) each time the schema used by a data source changes. Because data processing rules and logic are typically written against a source or a target schema, new data input jobs may need to be created any time a source schema or a target schema changes. Rules and logic that are written for a specific source schema or target schema may not be applicable or reusable for other schemas, even when those other schemas represent data describing the same underlying problem. In addition, a change to a source schema may result in the inability to load and analyze old source data that used an earlier schema, which can limit the ability to perform analytics that utilize comparisons to historical data. Custom data loading/translation tools can also be slow to execute and unsuitable for execution on multi-processor or other parallel computing architectures.
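A minimal sketch of the brittleness described above, assuming a hypothetical ETL rule hard-coded to one source schema (the column names here are invented for illustration):

```python
# A loading rule written against a specific source schema: it reads the
# "amt" column by name and assumes the value is in cents.
def load(record: dict) -> dict:
    return {"amount_usd": record["amt"] / 100}

load({"amt": 5999})            # works for the original source schema

try:
    load({"amount": 5999})     # source schema changed: column renamed
except KeyError as err:
    print("job broken by schema change:", err)
```

Because the rule names the column directly, a rename upstream breaks the job outright, and the rule cannot be reused against the new schema (or against historical files using the old one) without being rewritten.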
In some cases, data sources may include inaccurate or old data. The custom data loading/translation tools described above may correct errors by altering the data source, but doing so may result in loss of the original data (and, by extension, data auditing ability) and may also require that the corrections be determined anew each time the incorrect data is loaded. For example, if new data is received at the end of each month for the preceding three months, then an error occurring for data corresponding to April 15th may need to be detected and corrected three times: in the data sources (e.g., files) received April 30th, May 31st, and June 30th.
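One alternative to altering the data source is to keep corrections separate from the source data and apply them at load time. The sketch below is hypothetical (the record shapes and correction table are invented for illustration): the original rows are never mutated, so auditability is preserved, and the same correction applies automatically to each overlapping monthly extract that re-delivers the erroneous record:

```python
# Corrections keyed by (date, source); the source files are never edited.
corrections = {("2024-04-15", "sensor_7"): {"reading": 21.5}}

def load_rows(rows):
    out = []
    for row in rows:
        key = (row["date"], row["source"])
        # Merge any correction over the original values; the original
        # row object and the underlying file remain untouched.
        out.append({**row, **corrections.get(key, {})})
    return out

# The same bad record arrives in two overlapping monthly extracts.
april_extract = [{"date": "2024-04-15", "source": "sensor_7", "reading": -999.0}]
may_extract   = [{"date": "2024-04-15", "source": "sensor_7", "reading": -999.0}]

load_rows(april_extract)[0]["reading"]  # corrected to 21.5 on load
load_rows(may_extract)[0]["reading"]    # same correction, no rework
```

With this arrangement the April 15th error is detected and recorded once, rather than being corrected anew in each of the April 30th, May 31st, and June 30th deliveries.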