The Internet, mobile communications, navigation, online gaming, sensing technologies and large-scale computing infrastructures produce large amounts of data every day. Big Data is data that is beyond the processing capacity of conventional database systems and the analytical capacity of traditional analysis methods due to its large volume and the speed at which it moves and grows. More companies now rely on Big Data to make real-time decisions to solve various problems. Current methods consume large amounts of computational resources, which are very costly, yet may still fail to satisfy the needs of real-time decision making based on the newest information, especially in the financial industry. How to process and analyze Big Data efficiently, promptly and cost-effectively presents a difficult challenge to data analysts and computer scientists.
Processing Big Data can include performing calculations on multiple data elements. When performing statistical calculations on Big Data, the number of data elements to be accessed may be quite large. For example, when calculating simple linear regression coefficients, a (potentially large) number of data elements may need to be accessed.
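For reference, the ordinary least-squares estimates of the simple linear regression coefficients can be written as follows. The notation is assumed here for illustration and is not taken from the text above: n is the number of pairs of data elements, x_i and y_i are the independent and dependent data elements of the i-th pair, b_1 is the slope and b_0 is the intercept.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
\]
\[
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
\]
```

Every sum ranges over all n pairs, so a direct evaluation of these expressions accesses all 2n data elements.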
The difference between processing a live data stream and processing streamed Big Data is that, when processing streamed Big Data, all historical data elements remain accessible, and thus a separate buffer may not be needed to store newly received data elements.
Further, some statistical calculations are recalculated after some data changes in a Big Data set. Thus, the (potentially large) number of data elements may be repeatedly accessed. For example, simple linear regression coefficients may be calculated for a computation set that includes n pairs of data elements of a Big Data set stored in storage media. As such, every time a pair of data elements to be removed from the computation set and a pair of data elements to be added to the computation set are accessed or received (each pair consisting of one data element from an independent variable and one from a dependent variable), the to-be-removed pair is removed from the computation set and the to-be-added pair is added to it. All 2n data elements in the computation set are then accessed to re-estimate the simple linear regression coefficients.
As such, each pair of data elements remains in the computation set for n simple linear regression coefficient calculations before it is aged out of the computation set. Accordingly, each pair of data elements is read from the buffer and used n times. Performing statistical calculations on streamed data elements this way is time consuming and is an inefficient use of resources. Each time the simple linear regression coefficients are calculated over the computation set, all 2n data elements in it are visited and used.
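The cost of this traditional approach can be illustrated with a minimal Python sketch. The function names fit_simple_linear_regression and recompute_over_window, and the use of a fixed-size deque as the computation set, are assumptions made for illustration only; the sketch shows the repeated full recalculation described above, not an improved method.

```python
from collections import deque


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept), visiting every data element in xs and ys.

    Assumes xs contains at least two distinct values; otherwise the
    denominator sxx is zero and the slope is undefined.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def recompute_over_window(pairs, n):
    """Recalculate the coefficients from scratch each time the set changes.

    `pairs` is an iterable of (x, y) tuples; `n` is the computation set size.
    Returns the list of (slope, intercept) results and the total number of
    data elements accessed by the regression calculations.
    """
    window = deque(maxlen=n)  # the oldest pair is dropped once the set is full
    results = []
    elements_accessed = 0
    for pair in pairs:
        window.append(pair)  # add the new pair, removing the oldest if needed
        if len(window) == n:
            xs = [p[0] for p in window]
            ys = [p[1] for p in window]
            results.append(fit_simple_linear_regression(xs, ys))
            elements_accessed += 2 * n  # all 2n elements are visited again
    return results, elements_accessed


if __name__ == "__main__":
    # Hypothetical data lying exactly on the line y = 2x + 1.
    data = [(x, 2 * x + 1) for x in range(10)]
    coefficients, accessed = recompute_over_window(data, n=4)
    print(len(coefficients), accessed)
```

Under these assumptions, each change to the computation set costs 2n element accesses, so the total work grows in proportion to n times the number of changes.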
Depending on the application, the computation set size n may be extremely large, so the data elements in a computation set may be distributed over a cloud comprising hundreds of thousands of computing devices. Recalculating simple linear regression coefficients in traditional ways on a Big Data set after some data changes results in slow response and a significant waste of computing resources.