The Internet, mobile communications, navigation, online gaming, sensing technologies and large-scale computing infrastructures produce large amounts of data every day. Big Data is data that exceeds the processing capacity of conventional database systems and the analytical capacity of traditional analysis methods because of its large volume and the speed at which it moves and grows. More companies now rely on Big Data to make real-time decisions to solve various problems. Current methods consume large amounts of computational resources, which are very costly, yet still may not satisfy the needs of real-time decision making based on the newest information, especially in the financial industry. How to process and analyze Big Data efficiently, promptly and cost-effectively presents a difficult challenge to data analysts and computer scientists.
Processing Big Data may include performing calculations on multiple data elements. When performing statistical calculations on Big Data elements, the number of data elements to be accessed could be quite large. For example, when calculating a kurtosis, a (potentially large) number of data elements may need to be accessed.
The difference between processing a live data stream and processing streamed Big Data is that, with streamed Big Data, all historical data elements remain accessible, and thus a separate buffer for storing newly received data elements may not be needed.
Further, some statistical calculations are recalculated after some data changes in a Big Data set. Thus, the (potentially large) number of data elements may be repeatedly accessed. For example, it may be that a kurtosis is calculated for a computation subset with a fixed size n that includes n data elements of a Big Data set stored in storage media. As such, every time two data elements are accessed or received, one of the accessed or received data elements is removed from the computation subset and the other data element is added to the computation subset. All n data elements in the computation subset are then accessed to recalculate the kurtosis.
As such, each data change in the computation subset might only change a small portion of the computation subset. Using all data elements in the computation subset to recalculate the kurtosis involves redundant data access and computation, and thus is time consuming and an inefficient use of resources.
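The redundancy described above can be illustrated with a minimal sketch of the traditional approach. It assumes the population definition of kurtosis (fourth central moment divided by the squared variance), which the text does not specify, and the names kurtosis, sliding_kurtosis and elements_read are hypothetical and used only for illustration.

```python
from collections import deque


def kurtosis(values):
    """Population kurtosis: fourth central moment / squared variance.

    Assumed definition for illustration; raises ZeroDivisionError
    when all values are identical (zero variance).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / (m2 * m2)


def sliding_kurtosis(stream, n):
    """Recalculate kurtosis from scratch for every fixed-size subset.

    Returns the list of kurtosis values and the number of data elements
    that had to be revisited (n per recalculation, ignoring the fact
    that each recalculation makes several passes over them).
    """
    window = deque(maxlen=n)  # oldest element is dropped when full
    results = []
    elements_read = 0
    for x in stream:
        window.append(x)  # one element added, one removed once full
        if len(window) == n:
            results.append(kurtosis(window))
            elements_read += n  # all n elements are accessed again
    return results, elements_read


if __name__ == "__main__":
    values, reads = sliding_kurtosis([1, 2, 3, 4, 5, 6], n=4)
    print(values, reads)
```

In this sketch only one element enters and one leaves the computation subset per step, yet every step revisits all n elements, so the total number of element accesses grows in proportion to n times the number of subset changes.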
Depending on necessity, the computation subset size n could be extremely large, so the data elements in a computation subset could be distributed over a cloud comprising hundreds of thousands of computing devices. Re-performing a kurtosis calculation on a Big Data set in traditional ways after some data changes results in slow response and significant waste of computing resources.