The Internet, mobile communications, navigation, online gaming, sensing technologies and large-scale computing infrastructures produce large amounts of data every day. Big Data is data that exceeds the processing capacity of conventional database systems and the analytical capacity of traditional analysis methods because of its large volume and the speed at which it moves and grows. More companies now rely on Big Data to make real-time decisions to solve various problems. Current methods consume large amounts of computational resources, which are very costly, yet may still fail to satisfy the needs of real-time decision making based on the newest information, especially in the financial industry. How to process and analyze Big Data efficiently, promptly and cost-effectively presents a difficult challenge to data analysts and computer scientists.
Streamed data is data that is constantly being received by a receiver while being delivered by a provider. Streamed data may be real-time data gathered from sensors and continuously transferred to computing devices or electronic devices. Often this includes receiving similarly formatted data elements in succession, separated by some time interval. Big Data sets are accumulated over time and may be considered a data stream with irregular time intervals. Streamed data may also be data continuously read from storage devices, e.g., storage devices on multiple computing devices that store a Big Data set.
Stream processing has recently become a focus of research for the following reasons. First, the input data may arrive too fast to be stored in its entirety for batch processing, so some analysis has to be performed as the data streams in. Second, some application domains require immediate responses to any change in the data, e.g., mobile applications, online gaming, navigation, real-time stock analysis and automated trading. Third, some applications or electronic devices require stream processing by their nature, e.g., audio, video and digital TV.
Processing streamed data may include performing calculations on multiple data elements. Thus, to process streamed data, a system comprising one or more computing devices typically includes a buffer on one or more storage media for storing some number of streamed data elements received by the system. Processing the streamed data elements may include accessing data elements stored in the buffer.
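The buffering described above can be illustrated with a minimal sketch. It assumes a single in-memory buffer of fixed capacity; the class name `StreamBuffer` and its methods are hypothetical and chosen only for illustration.

```python
from collections import deque


class StreamBuffer:
    """Hypothetical fixed-capacity buffer for the most recent streamed elements."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest element automatically when full.
        self._items = deque(maxlen=capacity)

    def receive(self, element):
        """Store a newly received data element."""
        self._items.append(element)

    def snapshot(self):
        """Return the buffered elements, oldest first, for processing."""
        return list(self._items)


buf = StreamBuffer(capacity=4)
for x in [8, 3, 6, 1, 9, 2]:
    buf.receive(x)
print(buf.snapshot())
```

With a capacity of 4, only the four most recently received elements remain available for processing; a real system might instead spread the buffer over several storage media.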
When performing an autocorrelation function calculation on streamed data elements, buffer requirements may be quite large. For example, when calculating an autocorrelation function, a (potentially large) number of data elements may need to be accessed.
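To make the data access pattern concrete, one common definition of the autocorrelation at lag l over n data elements is given below. The text does not fix a particular formula, so this definition and the symbols x_i, l and \bar{x} are assumptions made for illustration.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\rho_l = \frac{\sum_{i=l+1}^{n} (x_i - \bar{x})(x_{i-l} - \bar{x})}
              {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Under this definition, the mean, the numerator and the denominator each range over the data elements, so every one of the n elements has to be available in the buffer.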
Further, some statistical function calculations are recalculated as new streamed data elements are accessed or received. Thus, the (potentially large) number of data elements may be repeatedly accessed. For example, an autocorrelation function may be calculated for a computation window that includes the last n data elements in a data stream. Every time a new data element is accessed or received, the new element is added to the computation window and the oldest data element (the current nth element) is moved out of the computation window. The n data elements in the computation window are then accessed to recalculate the autocorrelation function.
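The recalculation described above may be sketched as follows. This is a minimal illustration of the traditional approach, not a definitive implementation: it assumes one common autocorrelation definition (lagged products of deviations from the window mean, divided by the sum of squared deviations), and the function names are hypothetical.

```python
from collections import deque


def autocorrelation(window, lag):
    """Autocorrelation of `window` at `lag` under an assumed common definition."""
    n = len(window)
    mean = sum(window) / n
    denom = sum((x - mean) ** 2 for x in window)
    if denom == 0:
        return 0.0  # constant window; returning 0 is a convention assumed here
    numer = sum((window[i] - mean) * (window[i - lag] - mean)
                for i in range(lag, n))
    return numer / denom


def sliding_autocorrelation(stream, n, lag):
    """Recalculate from scratch each time a new element enters the window."""
    window = deque(maxlen=n)  # appending to a full window removes the oldest element
    results = []
    for element in stream:
        window.append(element)
        if len(window) == n:
            # All n elements in the window are accessed again.
            results.append(autocorrelation(list(window), lag))
    return results


results = sliding_autocorrelation([1, 2, 3, 4, 5, 6], n=4, lag=1)
print(results)
```

Each call to `autocorrelation` touches the whole window, so the work per new element grows with n.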
As such, each data element remains in the computation window for n autocorrelation function calculations before it is aged out of the computation window. Accordingly, each data element is read from the buffer n times. When performing an autocorrelation function calculation on n data elements, all n data elements in the computation window are visited and used at least once at a given lag. Performing autocorrelation calculations on streamed data elements in this way is time consuming and inefficient.
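The repeated reads can be counted with a short sketch. It assumes exactly one read per element per recalculation, which is a lower bound, and the function name `count_window_reads` is hypothetical. Elements near the start and end of a finite stream fall into fewer full windows, so only interior elements reach n reads.

```python
from collections import Counter, deque


def count_window_reads(num_elements, n):
    """Count how often each element (identified by its index) is read."""
    reads = Counter()
    window = deque(maxlen=n)
    for index in range(num_elements):
        window.append(index)
        if len(window) == n:
            for i in window:  # a full recalculation visits every element
                reads[i] += 1
    return reads


reads = count_window_reads(num_elements=10, n=4)
print(reads[3], reads[6], sum(reads.values()))
```

For a long stream the total number of reads approaches n times the number of elements, which is the cost that repeated full recalculation incurs.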
Depending on the application, the computation window size n may be extremely large, so the data elements in a computation window may be distributed over a cloud comprising hundreds of thousands of computing devices. Recalculating an autocorrelation function on streamed data in traditional ways results in slow response and a significant waste of computing resources.