Before data can be used in information management programs such as data warehouses, master data management (MDM) or Big Data analysis programs, many steps are necessary for integrating raw data from a plurality of internal and external sources into a consolidated data repository in a format which can be consumed in a meaningful way by end-users. First, the data sources containing all the information necessary for a particular task need to be identified. To this end, a user needs to know the semantic content of the available data sets, e.g. by manual inspection or by manually triggering the execution of semantic data profiling tools on the available data sets. The user may start a data profiling project and incorporate the sources he or she considers relevant. However, said steps already require the user to know which sources should be analyzed, so interesting data sources may be missed. In addition, the user has to spend time and effort becoming familiar with the available data sets and tools, as he or she needs to know which kind of analysis tool requires which kind of data format.
Data integration may further be complicated by the fact that some data sets may comprise confidential information which should not be presented to the end-user or to some groups of end-users. Ensuring and increasing the data quality of the available data sets may also be an issue: data may be stored redundantly in the original data sets, may comprise inconsistent information for some data records, or may be presented in different data formats and standards.
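By way of illustration only, the three data quality issues named above (redundant storage, inconsistent records, differing formats) may be sketched as follows. The record fields, the two date formats and the function names are hypothetical and are not taken from any particular product.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records; the field names and date formats are assumptions.
RECORDS = [
    {"id": "1", "name": "Ann", "joined": "2020-01-31"},
    {"id": "1", "name": "Ann", "joined": "2020-01-31"},     # redundant copy
    {"id": "2", "name": "Bob", "joined": "31.01.2020"},     # different date format
    {"id": "2", "name": "Robert", "joined": "2020-01-31"},  # conflicting name
]

# Date formats assumed to occur in the sources.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y")


def normalize_date(text):
    """Convert a date in any of the assumed formats to ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def profile(records):
    """Count exact duplicates and list ids with conflicting content."""
    variants = defaultdict(set)
    redundant = 0
    for record in records:
        key = (record["name"], normalize_date(record["joined"]))
        if key in variants[record["id"]]:
            redundant += 1
        variants[record["id"]].add(key)
    inconsistent = sorted(i for i, seen in variants.items() if len(seen) > 1)
    return {"redundant": redundant, "inconsistent_ids": inconsistent}


REPORT = profile(RECORDS)
```

In this sketch the dates are normalized before comparison, so that a mere difference in format is not reported as an inconsistency.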
In the prior art, a plurality of products and approaches exist that can fulfill some of the above requirements, but said tools rely either on manual control and configuration by the user or on a predefined and fixed workflow schema. The user or the workflow schema needs to explicitly specify which of the tools has to be applied to which of the data sets at what moment in time in order to solve a particular problem. Manual data pre-processing and profiling approaches can only be used in situations where the amount of data to be integrated is small and of comparatively low complexity. Predefined, workflow-based data processing approaches require a fixed sequence of data sets to be processed, whereby the syntax and content of said data sets are known in advance. Such data is often called structured data, both in connection with workflow-based data processing and otherwise.
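A predefined workflow of the kind described above may, purely as a minimal sketch, look as follows. The schema layout, the data set names and the two tools are hypothetical. The sketch shows that each step fixes in advance which tool is applied to which data set, and that a data set with an unexpected syntax cannot be processed.

```python
import csv
import io


def count_rows(rows):
    """Hypothetical tool: number of data records."""
    return len(rows)


def distinct_customers(rows):
    """Hypothetical tool: number of distinct customer ids."""
    return len({row["customer_id"] for row in rows})


# Hypothetical fixed workflow schema: the order of the steps, the expected
# column layout and the tool per data set are all specified in advance.
WORKFLOW = [
    {"data_set": "orders", "columns": ["order_id", "customer_id"],
     "tool": distinct_customers},
    {"data_set": "customers", "columns": ["customer_id", "name"],
     "tool": count_rows},
]


def run_workflow(workflow, sources):
    """Apply each step in the predefined order; fail on anything unexpected."""
    results = {}
    for step in workflow:
        name = step["data_set"]
        if name not in sources:
            raise KeyError(f"data set {name!r} is not available")
        reader = csv.DictReader(io.StringIO(sources[name]))
        if reader.fieldnames != step["columns"]:
            raise ValueError(f"data set {name!r} does not match the expected syntax")
        results[name] = step["tool"](list(reader))
    return results


SOURCES = {
    "orders": "order_id,customer_id\n1,a\n2,a\n3,b\n",
    "customers": "customer_id,name\na,Ann\nb,Bob\n",
}
RESULTS = run_workflow(WORKFLOW, SOURCES)
```

If, for example, the "orders" data set arrived with a semicolon delimiter or with renamed columns, `run_workflow` would raise an error rather than adapt, which is the limitation of fixed workflow schemas discussed here.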
In a Big Data environment, however, huge amounts of data need to be integrated and processed, and neither the content, nor the syntax, nor the sequence, nor the file format of the data to be integrated may be known in advance. Data that is not limited to data sets whose syntax and content are known in advance is often called unstructured data. It may not be possible to foresee if and when a particular data set will become available. Manual approaches cannot be applied, as humans are not able to cope with the complexity and dynamic nature of the data processing tasks involved. Approaches which rely on predetermined workflows are also not applicable, as it is not possible to foresee the kind and sequence of all the data pre-processing, profiling and analysis steps which may be necessary for integrating and processing dynamically provided new data. Thus, neither manual nor workflow-based approaches are able to cope with the amount, the structural and semantic heterogeneity, and the unpredictability of the data to be handled in a Big Data environment.
US006381556 B1, for example, discloses a method for preparing raw data coming from a manufacturing environment in order to load said data for reporting purposes. The presented approach is rather static, similar to an ETL job. US006643635 B2 describes the automated transformation of data for business analysis, in which data from disparate data sources is read and prepared based on a static data processing schema.