The Internet, mobile communications, navigation, online gaming, sensing technologies and large-scale computing infrastructures produce large amounts of data every day. Big Data is data that is beyond the processing capacity of conventional database systems and the analyzing capacity of traditional analysis methods due to its large volume and its fast moving and growing speed. More companies now rely on Big Data to make real-time decisions to solve various problems. Current methods use large amounts of computational resources, which are very costly, yet may still not satisfy the needs of real-time decision making based on the newest information, especially in the financial industry. How to efficiently, promptly and cost-effectively process and analyze Big Data presents a difficult challenge to data analysts and computer scientists.
Streamed data is data that is constantly being received by a receiver while being delivered by a provider. Streamed data may be real-time data gathered from sensors and continuously transferred to computing devices or electronic devices. Often this includes receiving similarly formatted data elements in succession, separated by some time interval. Streamed data may also be data continuously read from storage devices, e.g., storage devices on multiple computing devices which store a Big Data set. Stream processing has recently become a focus of research for the following reasons. One reason is that the input data arrive too fast to be stored entirely for batch processing, so some analysis has to be performed as the data streams in. The second reason is that immediate responses to any changes in the data are required in some application domains, e.g., mobile-related applications, online gaming, navigation, real-time stock analysis and automated trading, etc. The third reason is that some applications or electronic devices require stream processing due to their nature, e.g., audio, video and digital TV, etc.
Processing streamed data may include performing calculations on multiple data elements. Thus, a computing device receiving a stream of data elements typically includes a buffer so that some number of data elements may be stored. Processing the streamed data elements may include accessing data elements stored in the buffer. When performing statistical calculations on streamed data elements, buffer requirements may be quite large. For example, when calculating a simple linear regression, a (potentially large) number of data elements may need to be accessed.
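The buffering described above can be illustrated with a minimal Python sketch. The buffer capacity, the sample pairs and the function name `regression_over_buffer` are hypothetical and chosen only for illustration; the calculation is the ordinary least-squares formula for a line y = slope * x + intercept, not a method prescribed by this document.

```python
from collections import deque


def regression_over_buffer(pairs):
    """Return (slope, intercept) of the least-squares line through the pairs.

    Every buffered pair is read several times, which is why the buffer
    must hold all of the data elements taking part in the calculation.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    x_mean = sum(x for x, _ in pairs) / n
    y_mean = sum(y for _, y in pairs) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in pairs)
    sxx = sum((x - x_mean) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical buffer holding the most recent 4 pairs of a stream.
stream_buffer = deque(maxlen=4)
for pair in [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]:
    stream_buffer.append(pair)

buffer_slope, buffer_intercept = regression_over_buffer(stream_buffer)
```

In this sketch the memory needed grows with the number of pairs kept in the buffer, since no pair can be discarded before the calculation has finished with it.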
In addition, algorithms for streamed data processing may be extended to Big Data processing, because Big Data sets are accumulated over time and may be considered as data streams with irregular time intervals.
For Big Data set or streamed data processing, some statistical calculations are recalculated whenever a Big Data set is changed or existing streamed data elements are removed. Thus, the (potentially large) number of data elements may be repeatedly accessed. For example, it may be that simple linear regression coefficients are calculated for a computation set with n pairs of data elements, and that an input comprising a pair of data elements indicates which pair of data elements is to be removed from the computation set. As such, every time a pair of data elements (one data element from each variable) is accessed or received, one pair of data elements is removed from the computation set. All 2n−2 data elements remaining in the computation set are then accessed to recalculate the simple linear regression coefficients.
When performing a simple linear regression coefficient calculation on all 2n−2 data elements, all of the 2n−2 data elements in the computation set will be visited and used. As such, each pair of data elements in the computation set needs to be accessed to recalculate the simple linear regression coefficients whenever there is a change in the computation set. Depending on the application, the computation set size n could be extremely large, so the data elements in a computation set could be distributed over a cloud comprising hundreds of thousands of computing devices. Recalculating simple linear regression coefficients on Big Data or streamed data whenever some of the data changes uses time and computing resources inefficiently.
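The cost of this full recalculation can be made concrete with a short Python sketch. It is only an illustration of the conventional approach discussed above: the function name `recalculate_after_removal`, the access counter and the sample data are assumptions introduced here, and the coefficients are obtained with the standard least-squares sums.

```python
def recalculate_after_removal(computation_set, removed_pair):
    """Remove one pair, then recompute the coefficients from scratch.

    Returns (slope, intercept, accessed), where `accessed` counts the
    individual data elements read during the recalculation.
    """
    remaining = list(computation_set)
    remaining.remove(removed_pair)  # raises ValueError if the pair is absent
    n = len(remaining)
    if n < 2:
        raise ValueError("at least two pairs must remain")

    accessed = 0
    x_sum = y_sum = xy_sum = xx_sum = 0.0
    for x, y in remaining:
        accessed += 2  # one element from each variable
        x_sum += x
        y_sum += y
        xy_sum += x * y
        xx_sum += x * x

    denominator = n * xx_sum - x_sum * x_sum
    if denominator == 0:
        raise ValueError("all x values are identical")
    slope = (n * xy_sum - x_sum * y_sum) / denominator
    intercept = (y_sum - slope * x_sum) / n
    return slope, intercept, accessed


# Hypothetical computation set with n = 5 pairs; the last pair is removed.
pairs = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0), (5.0, 20.0)]
slope, intercept, accessed = recalculate_after_removal(pairs, (5.0, 20.0))
```

With n = 5 pairs, removing one pair leaves 2n−2 = 8 data elements, and all 8 are read again even though only a single pair changed. The same pattern repeats for every subsequent removal.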