Many enterprises and organizations store and process big data in data storage devices, such as relational databases, examples of which include Oracle®, Sybase®, and SAP HANA® databases. They may also have data residing in distributed data storage systems, such as Amazon® and Google® cloud storage systems, or in computational clusters such as Hadoop®. These silos of disconnected data clusters are typically called data lakes or data swamps. Moreover, the data can be structured or unstructured and can come from different domains, such as financial data, manufacturing data, product master data, etc.
Businesses analyze data to derive business strategies and to make sound business decisions. To form a more complete set of information, data needs to be correlated and combined across data nodes. A continuous incoming stream of data, correlated as it arrives, allows analysts to monitor business activities and alter business plans when necessary.
Data is typically curated, cleansed, and transformed (collectively referred to in this disclosure as “data processing”) before it can be analyzed or used in a meaningful way. Data processing is generally most effective when performed in close proximity to where the data and the corresponding data processing resources reside. For instance, processing of data in relational databases is performed in the databases themselves with, for example, structured query language (“SQL”) scripts. Similarly, data in distributed data storage systems such as Amazon's S3® and Google Cloud Storage® should be processed in Amazon's EC2® and Google Compute Engine®, respectively.
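By way of illustration only, the following minimal sketch shows data being cleansed and transformed inside a relational database with an SQL script, rather than being extracted and processed elsewhere. It assumes an in-memory SQLite database accessed through Python's standard sqlite3 module. The table name orders, its columns, the sample rows, and the function names are hypothetical and are not taken from this disclosure.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database holding a few untidy rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [(" east ", 10.0), ("EAST", 5.0), ("west", None), ("West", 7.5)],
    )
    return conn


def process_in_database(conn):
    """Cleanse and transform the rows where they are stored, using an SQL script."""
    conn.executescript(
        """
        -- Cleanse: drop rows that have no amount.
        DELETE FROM orders WHERE amount IS NULL;
        -- Transform: normalize region names to trimmed lower case.
        UPDATE orders SET region = lower(trim(region));
        """
    )
    # Only the small aggregated result leaves the database.
    return conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()


conn = build_sample_db()
totals = process_in_database(conn)
print(totals)
```

In this sketch the cleansing and transformation statements run entirely within the database engine, and only the aggregated totals are returned to the caller, which reflects the idea of processing data close to where it is stored.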