The present invention relates to a system for data processing.
The area of business data has seen rapid growth in the last few years. Increased computing power, combined with the availability of an expanding set of analytics tools and services, has enabled companies to turn large amounts of data into actionable information that allows better decision making. This has benefits in any area that involves the analysis of large quantities of data, such as weather forecasting, financial modelling, complex physical simulations, the modelling of chemical reaction kinetics, human population studies and so on.
Extract, Transform and Load (ETL) is one traditional process used for gathering and storing data in databases, particularly relational databases. Such relational databases are traditionally concerned with content (i.e. the values in the data and what those values mean) and the relationships and structures within that content. As such, relational databases can require considerable up-front design effort and can be slow to react to change. In most traditional ETL methods, the questions to be asked of a data set are specified first and the content-based storage schema is designed for those specific questions. For the originally specified questions, queries can be performed relatively quickly, but the scope of queries is limited by the specific design of the storage schema.
As the amount of data available increases and its structure becomes more complex, the ETL approach described above is proving unable to meet the evolving demands of users, in particular business users who require agility and speed from their data service providers. It is difficult and time-consuming to map new, complex and changing data structures onto pre-existing ETL models. The mappings and queries that were defined at the beginning of the process swiftly become out of date as new complex data has to be processed, and this calls for another new and expensive ETL design process to be undertaken.
So-called ‘Big Data’ or ‘NoSQL’ solutions can be considered as an alternative to the ETL process. Data can be collected as it comes and stored in an unstructured manner. Specific questions are not considered at the start and no structure is added when the data is recorded. Structure may then be added when a question is devised, so that the data is structured according to the question being asked of it. Querying such a big data set is slower because a structure has to be set up first. This process can be made faster through a distributed data processing solution. However, building a distributed, scalable and robust big data processing system is a new and complex field.
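By way of illustration only, the approach of adding structure at query time may be sketched as follows. This is a minimal, single-machine sketch and not a definitive implementation. The record contents, the field names `sensor` and `temp_c`, and the functions `structure_for_question` and `mean_temperature` are hypothetical and are introduced solely for this example. JSON text is assumed as the raw storage format.

```python
import json

# Hypothetical raw records, stored exactly as they arrive; no schema is imposed.
raw_store = [
    '{"sensor": "a", "temp_c": 21.5, "ts": "2020-01-01T00:00:00"}',
    '{"user": "bob", "action": "login"}',
    '{"sensor": "b", "temp_c": 19.0}',
    'not valid json',
]


def structure_for_question(raw_lines, required_fields):
    """Add structure at query time.

    Keeps only the records that carry every field the question needs and
    returns them as tuples ordered like required_fields.
    """
    rows = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable input is skipped, not rejected at load time
        if isinstance(record, dict) and all(f in record for f in required_fields):
            rows.append(tuple(record[f] for f in required_fields))
    return rows


def mean_temperature(raw_lines):
    """Example question: the average temperature over all sensor readings."""
    rows = structure_for_question(raw_lines, ("sensor", "temp_c"))
    temps = [temp for _, temp in rows]
    return sum(temps) / len(temps) if temps else None


print(mean_temperature(raw_store))
```

Every query in this sketch scans and parses the whole raw store before it can answer, which illustrates why such querying is slower than querying a schema designed in advance, and why distributing the scan across machines can recover speed.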
Thus, there is a need for an improved system for processing data.