At present, data processing technologies are commonly used in a wide range of applications. Typically, a sender saves data into a file in a certain format and then sends the file to a recipient. Upon receiving the file, the recipient analyzes the content of the file and performs logical processing accordingly. In this data processing procedure, if the file is not too big and the recipient does not have a strict processing time requirement, a single server or a single thread can be used for processing. Under these circumstances, the corresponding system may still operate normally, although the time taken by the recipient to process the data in the files may be quite long.
However, in other data processing scenarios, a sender may wish to send a huge file, or the number of files to be sent may be large. In addition, the recipient may have a very strict processing time requirement; for example, the recipient may require the data of the file transmitted from the sender to be processed within a short period. Under these conditions, a processing system using a single server or a single thread may not be able to satisfy the data processing needs.
Under the above-mentioned conditions, the data to be processed may exceed the capacity of conventional processing systems, and/or the required processing speed may be so high that it cannot be met by conventional data processing methods and systems. The data that needs to be processed in these scenarios is conventionally termed big data.
At present, the commonly used frameworks for big data processing are MapReduce® and its open source implementation in Hadoop®. In these frameworks, a computing task can be executed on a large set of nodes, as long as it is expressed as a sequence of Map steps (independent computations on subsets of the input data) and Reduce steps (merging of the Map results).
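By way of illustration only, the Map and Reduce pattern described above may be sketched in a single process as follows. This is a minimal sketch under stated assumptions and is not the API of MapReduce® or Hadoop®: the word-counting task and the names map_chunk, reduce_counts, split_into_chunks, and word_count are hypothetical and chosen only for this example. In an actual cluster, each Map call would typically run on a different node.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, chunk_size):
    """Partition the input into subsets that can be processed independently."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(lines):
    """Map step: count words in one subset of the input, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial Map results into one."""
    return left + right


def word_count(lines, chunk_size=2):
    """Run the Map step over every chunk, then merge the results with the Reduce step."""
    chunks = split_into_chunks(lines, chunk_size)
    # Hypothetical stand-in for distributed execution: each call below is
    # independent and could be dispatched to a separate node.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(reduce_counts, partials, Counter())
```

For example, calling word_count(["big data", "data processing", "big big"]) counts "big" three times, "data" twice, and "processing" once. Because each Map call depends only on its own chunk, the order in which the chunks are processed does not affect the merged result.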
Currently, big data processing takes advantage of distributed computing clusters using Hadoop Ecosystem® components or niche distributed grid-computing components. All of these components expect developers to code MapReduce® programs to develop data transformations using Hadoop®, or to use similar programming models for other systems. Some existing database product companies provide support for connecting to a database from the Hadoop Ecosystem® using some form of native connector. Again, these connectors have to be programmed by developers in order to leverage the connectivity and capability. In addition, some existing data integration products, such as Pentaho®, provide support for integrating Hadoop® data with existing enterprise data. However, Hadoop® is applicable only to batch processing and does not address real-time processing needs.
None of the existing big data processing platforms provides a unified model for real-time and batch processing of data. In addition, none of the existing techniques includes pre-built adapters for performing routine processing, data transformations, machine learning, and analytics on the Hadoop® platform. Additionally, there is a strong need for big data processing systems that can provide one-click big data cluster setup, starting from infrastructure provisioning. Another important need is for big data processing systems to provide certain big data services, such as a recommendation engine. It is also required that the business insight derived from a big data processing system be secured and available only to authorized users.