The present application relates generally to an improved data processing apparatus and method, and more specifically to mechanisms for efficient data retrieval in big-data processing systems.
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy. The term “big data” often refers simply to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.
Analysis of data sets may find new correlations to “spot business trends, prevent diseases, combat crime, and so on”. Scientists, business executives, practitioners of medicine, advertising, and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology, and environmental research. Massive open online courses (MOOCs) also bring big-data challenges as the courses reuse the same data sets for the students' projects.
Big-data processing systems analyze big-data sets at terabyte or even petabyte scale. Offline batch data processing typically runs at full power and full scale, tackling arbitrary time-series fact use cases. Real-time stream processing, by contrast, is performed on the most current slice of data for purposes such as data profiling to pick out outliers, fraudulent transaction detection, and security monitoring. The toughest task, however, is to perform fast (low-latency) or real-time ad-hoc analytics on a complete big-data set, which in practice means that terabytes (or even more) of data have to be scanned within seconds. This is only possible when the data is processed with high parallelism, such as that used in big-data processing systems.
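By way of illustration only, the need for high parallelism can be estimated with simple arithmetic. The following sketch assumes a hypothetical 1 TB data set, a hypothetical sustained scan rate of 200 MB/s per worker, a hypothetical 5-second latency target, and ideal linear scaling with no coordination overhead. None of these figures is taken from any particular system, and the function name is chosen for illustration.

```python
import math


def workers_needed(dataset_bytes: int,
                   per_worker_bytes_per_sec: int,
                   deadline_sec: float) -> int:
    """Minimum number of parallel workers needed to scan the data set
    within the deadline, assuming ideal linear scaling (an assumption)."""
    required_throughput = dataset_bytes / deadline_sec
    return math.ceil(required_throughput / per_worker_bytes_per_sec)


TB = 10**12  # decimal terabyte
MB = 10**6   # decimal megabyte

# Hypothetical figures: 1 TB data set, 200 MB/s per worker, 5 s target.
single_worker_seconds = (1 * TB) / (200 * MB)
print(single_worker_seconds)                   # 5000.0 seconds on one worker
print(workers_needed(1 * TB, 200 * MB, 5.0))   # 1000 workers
```

Under these assumed figures, a single worker would need roughly 83 minutes for one full scan, whereas about a thousand workers operating in parallel would be needed to meet the 5-second target. Real systems would generally need more, since scaling is rarely ideal.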