The present invention relates generally to data processing, and more particularly to computing time-decayed aggregates in data streams.
Statistical analysis of data is a core process for characterizing and controlling systems. In many applications, large volumes of data are generated from multiple data sources as multiple data streams, in which data is updated frequently. In some instances, the updates may be considered to be continuous, or near-continuous. In an industrial application, for example, sensors may provide real-time measurements of process variables such as position, velocity, acceleration, temperature, pressure, humidity, and chemical concentration to a monitoring and control station. In a financial application, multiple order-entry systems may provide near real-time updates of stock prices to a central transaction system. A major application is transport of data across a packet data network. E-mail, instant messaging, file transfers, streaming audio, and streaming video applications may generate large streams of data from multiple data sources, such as personal computers and web servers, across a packet data network. Network operations, administration, maintenance, and provisioning (OAM&P) require accurate characterization of data streams. Network performance and reliability, for example, depend on the traffic capacity of the network infrastructure equipment (such as routers, switches, and servers), on the traffic capacity of the communication links between network infrastructure equipment, and on the network architecture.
In some applications, data may be captured, statically stored in a database, and post-processed. In other applications, real-time, or near real-time, analysis is required. For example, if data traffic to a specific router is becoming excessive, new data traffic may be dynamically re-directed to another router. As another example, if an excessive number of users are accessing a web server, new users may be dynamically re-directed to a mirror server. In applications such as real-time control, the most recent data may have the highest relevance. Particularly when the data streams are large, selecting only the most recent data for analysis reduces the required computational resources, such as processor speed and memory capacity, and the required computational time.
Commonly, what constitutes the most recent data is determined by the arrival time of the data at the network element (data receiver) that collects the data. The underlying assumption is that the time order in which the data arrives at the data receiver is the same time order in which the data sources generated the data. In applications such as transport of data across a packet data network, however, this assumption may not hold. For example, if data is generated by multiple sensors and the data is transported across a packet data network to a single monitoring and control station, the data from each sensor may be transported across a different route. The delay across one route may differ from the delay across a different route. In general, the delay across a specific route may be a function of overall data traffic across that route. If the overall data traffic is variable, the delay may also be variable. Consider the example in which data from sensor 1 is generated before data from sensor 2. At a particular instant, the data from sensor 1 may arrive at the monitoring and control station ahead of the data from sensor 2. At a later instant, however, under a different set of network conditions, the data from sensor 2 may arrive ahead of the data from sensor 1.
Even if the data is generated by a single data source, the data may arrive at a data receiver out-of-order. In a packet data network, user data may be segmented into multiple data packets. Depending on the configuration of the packet data network, there may be multiple routes between the data source and the data receiver. As discussed above, the delay across one route may differ from the delay across a second route. Consider the example in which data packet 1 is generated before data packet 2. If the two data packets are transmitted across different routes, and if the delay across the route for data packet 1 sufficiently exceeds the delay across the route for data packet 2, then data packet 2 may arrive before data packet 1.
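The reordering described above can be illustrated with a minimal sketch. The packet names, generation times, and route delays below are hypothetical values chosen only for illustration; they are not taken from any measured network.

```python
# Hypothetical packets: (name, generation_time_ms, route_delay_ms).
packets = [
    ("packet 1", 0, 50),   # generated first, sent over the slower route
    ("packet 2", 10, 20),  # generated second, sent over the faster route
]

def generation_order(packets):
    """Return packet names sorted by the time the source generated them."""
    return [name for name, generated, _ in sorted(packets, key=lambda p: p[1])]

def arrival_order(packets):
    """Return packet names sorted by arrival time (generation time + route delay)."""
    return [name for name, _, _ in sorted(packets, key=lambda p: p[1] + p[2])]

print(generation_order(packets))
print(arrival_order(packets))
```

With these assumed values, data packet 1 arrives at 50 ms and data packet 2 arrives at 30 ms, so the arrival order is the reverse of the generation order.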
Statistical properties of data streams are characterized by aggregate statistical values (which are referred to herein simply as aggregates), such as the average number of packets per unit time or the quantile distribution of the number of packets per unit time. Calculating aggregates from large-volume unordered data streams may be computationally intensive. Herein, an unordered stream is a data stream in which the age of the data and the time order of the data are not taken into account. If the age (recency) of the data and the time order of the data are of significance, then, in general, calculating aggregates requires additional computational resources and additional computational time. What is needed is a method and apparatus for efficiently calculating age-dependent aggregates from large-volume data streams in which the data may be received in arbitrary time order.
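As an illustration of what an age-dependent aggregate can look like, the following minimal sketch computes an average in which each value is weighted by its age at query time. The exponential weighting, the function name `decayed_average`, and the sample values are assumptions made for this sketch only. The sketch stores and scans every item, so it illustrates the definition of such an aggregate rather than an efficient way to compute it.

```python
import math

def decayed_average(items, now, decay_rate):
    """Age-weighted average of (generation_timestamp, value) pairs.

    Each value is weighted by exp(-decay_rate * age), where
    age = now - generation_timestamp. The exponential form is an
    assumption of this sketch, not a requirement of the text above.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for timestamp, value in items:
        weight = math.exp(-decay_rate * (now - timestamp))
        weighted_sum += weight * value
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

# The same hypothetical items, received in two different arrival orders.
in_order = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
out_of_order = [(3.0, 30.0), (1.0, 10.0), (2.0, 20.0)]

print(decayed_average(in_order, now=4.0, decay_rate=0.5))
print(decayed_average(out_of_order, now=4.0, decay_rate=0.5))
```

Because each weight depends only on the generation timestamp and the query time, the two calls agree up to floating-point rounding regardless of arrival order, and the result is pulled toward the more recent values compared with the unweighted average of 20.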