The present invention generally relates to stream data processing technology.
As can be seen from the small-lot stock trading in the finance industry, and the widespread use of RFID (Radio Frequency Identification) and sensors in the manufacturing/distribution industry, the quantity of data being handled in various industries has increased dramatically in recent years. In addition, in numerous cases the significance lies in making immediate use of the data that is being handled, as seen in securities trading in the finance industry and in the real-time tracing/monitoring of individual units in the manufacturing/distribution industry. For this reason, data processing systems that are capable of processing large quantities of data at high speed are required.
A stream data processing system has been proposed as a system for processing large quantities of data at high speed (for example, Japanese Patent Application Laid-open No. 2006-338432). The stream data processing system will be explained hereinbelow by comparing it to an ordinary database management system (hereinafter DBMS). Note that individual data comprising the stream data will be called a “data element”.
In a DBMS, data to be analyzed is first stored in a secondary storage apparatus (for example, a disk device such as a hard disk drive), and thereafter the data is collectively processed using a batch process or the like. By contrast, in a stream data process, the nature of the processing, such as totalization or analysis, is registered in the system as a query beforehand, and processing is carried out consecutively at the point in time at which the data arrives at the system's primary storage apparatus (for example, a volatile memory). In accordance with this, the stream data processing system is able to process, at high speed, a large quantity of consecutive time-sequence data that arrives from one moment to the next (hereinafter, stream data).
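The contrast with batch processing can be illustrated by a minimal sketch in which a query is registered beforehand and then evaluated each time a data element arrives in memory. The names StreamEngine, register and on_arrival are hypothetical and are used only for illustration; they are not taken from the cited publication.

```python
class StreamEngine:
    """Hypothetical engine: queries are registered before any data arrives."""

    def __init__(self):
        self.queries = {}  # query name -> step function (state, element) -> new state
        self.state = {}    # query name -> current result held in primary storage

    def register(self, name, step, initial):
        # Register the nature of the processing beforehand.
        self.queries[name] = step
        self.state[name] = initial

    def on_arrival(self, element):
        # Processing is carried out at the point in time the element arrives.
        results = {}
        for name, step in self.queries.items():
            self.state[name] = step(self.state[name], element)
            results[name] = self.state[name]
        return results


engine = StreamEngine()
engine.register("running_total", lambda acc, x: acc + x, 0)
outputs = [engine.on_arrival(x)["running_total"] for x in (5, 7, 1)]
print(outputs)
```

In this sketch nothing is written to secondary storage; each result is produced as soon as the corresponding data element is received.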
As an example of a language for describing a query that is registered in the stream data processing system, CQL (Continuous Query Language) may be cited (for example, Japanese Patent Application Laid-open No. 2006-338432). In addition to a processing definition, a CQL query also specifies a range of data that is targeted for processing. This data range is generally called a “window”.
As methods of specifying the size of the window, there are a count specification and a time specification. For example, in the case of a count specification, a specification like “most recent 10 counts” is made, and in the case of a time specification, a specification like “most recent one hour” is made. That is, in the count specification, the number of data elements being targeted for processing is specified, and in the time specification, the length of the time range covered by the time stamps of the data elements being targeted for processing is specified.
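The two ways of specifying the window size can be sketched as follows. This is only an illustrative sketch: the representation of a data element as a (time stamp, value) pair and the function names count_window and time_window are assumptions, not part of CQL.

```python
from typing import List, Tuple

# Assumed representation: (time stamp in seconds, value), ordered by time stamp.
Element = Tuple[float, int]


def count_window(elements: List[Element], count: int) -> List[Element]:
    """Count specification: the most recent `count` data elements."""
    return elements[-count:] if count > 0 else []


def time_window(elements: List[Element], now: float, span: float) -> List[Element]:
    """Time specification: elements whose time stamp lies within the last `span` seconds."""
    return [e for e in elements if now - span < e[0] <= now]


stream = [(0.0, 1), (1800.0, 2), (3000.0, 3), (4000.0, 4)]
recent_two = count_window(stream, 2)             # "most recent 2 counts"
recent_hour = time_window(stream, 4000.0, 3600)  # "most recent one hour"
print(recent_two, recent_hour)
```

With a count specification the number of elements in the window is fixed, whereas with a time specification it varies with the arrival rate of the stream data.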
The stream data processing system, as described hereinabove, consecutively processes data elements equivalent to the size of the window each time data arrives. It is desirable that a conventional load-leveling technique, such as round robin or the like, be applicable to a stream data processing system like this as well.
However, it is not possible to simply apply a conventional load-leveling method to the stream data processing system. That is, it is not possible to realize load leveling by simply partitioning and distributing the stream data to a plurality of stream data processing servers (computers for executing a program that carries out stream data processing) either uniformly or in accordance with the throughput of the stream data processing server (shortened to “server” hereinafter).
An example in which a conventional load-leveling method cannot simply be applied to the stream data processing system is processing in which the data range is specified by a sliding window, in which neither endpoint is fixed with respect to the stream data, or by a landmark window, in which one endpoint is fixed with respect to the stream data (for example, Lukasz Golab and M. Tamer Ozsu, “Issues in Data Stream Management,” SIGMOD Rec., Vol. 32, No. 2, pp. 5-14, 2003).
Specifically, a case in which SUM processing is performed on the most recent 3 data elements, which data elements possess a value that is a natural number N (that is, N is an integer that is equal to or larger than 1), using a sliding window will be described below.
Processing is performed when a value “1” arrives in one server, and the operation result in the server is the value “1”. When a value “2” arrives in the server, the value “2” is added to the value “1” of the previous result, and the resulting value becomes “3”. When a value “3” arrives in the server, the value “3” is added to the value “3” of the previous result, and the resulting value becomes “6”. When a value “4” arrives in the server, the value “1”, which has moved outside the sliding window range, is subtracted from the value “6” of the previous result and the arrived value “4” is added, making the resulting value “9”. The server repeats the above-described difference processing each time data arrives, and the values of the operation results become “1”, “3”, “6”, “9”, “12”, “15”, . . . in that order.
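The difference processing described above can be sketched as follows. This is a minimal illustration assuming a count-specified window of 3; the class name SlidingSum is hypothetical.

```python
from collections import deque


class SlidingSum:
    """SUM over the most recent `size` data elements, updated by difference processing."""

    def __init__(self, size):
        self.size = size
        self.window = deque()  # data elements currently inside the window
        self.total = 0         # previous operation result

    def push(self, value):
        # Add the arrived value to the previous result.
        self.window.append(value)
        self.total += value
        # Subtract the value that has moved outside the sliding window range.
        if len(self.window) > self.size:
            self.total -= self.window.popleft()
        return self.total


server = SlidingSum(size=3)
single_results = [server.push(v) for v in range(1, 7)]
print(single_results)  # [1, 3, 6, 9, 12, 15]
```

Each arrival requires at most one addition and one subtraction, regardless of the window size.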
By contrast, consider an attempt to perform the SUM processing using two servers by partitioning the stream data into individual data elements and distributing these data elements alternately to the two servers. A value “1” arrives in a first server, processing is performed at this time and the resulting value is “1”. A value “2” arrives in a second server, processing is performed at this time and the resulting value is “2”. A value “3” arrives in the first server, the value “3” is added to the value “1” of the previous result at this time, and the resulting value becomes “4”. This difference processing is repeated alternately in each server, and the values of the operation results become “1”, “2”, “4”, “6”, “9”, . . . in that order, clearly differing from the results obtained via the processing carried out by a single server.
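The discrepancy can be reproduced with a short sketch that compares a single server with two servers fed alternately. The function names are hypothetical, and each simulated server simply keeps its own window of the most recent 3 data elements that it has itself received.

```python
from collections import deque


def sliding_sums(values, size):
    """Results of a single server performing SUM over the most recent `size` elements."""
    window, total, out = deque(), 0, []
    for v in values:
        window.append(v)
        total += v
        if len(window) > size:
            total -= window.popleft()
        out.append(total)
    return out


def round_robin_sums(values, size, servers):
    """Results when data elements are distributed alternately to `servers` servers."""
    windows = [deque() for _ in range(servers)]
    totals = [0] * servers
    out = []
    for i, v in enumerate(values):
        s = i % servers  # round-robin choice of server
        windows[s].append(v)
        totals[s] += v
        if len(windows[s]) > size:
            totals[s] -= windows[s].popleft()
        out.append(totals[s])
    return out


data = [1, 2, 3, 4, 5, 6]
single = sliding_sums(data, 3)
distributed = round_robin_sums(data, 3, 2)
print(single)       # [1, 3, 6, 9, 12, 15]
print(distributed)  # [1, 2, 4, 6, 9, 12]
```

Because each server sees only every other data element, its window never contains the consecutive elements that the query is meant to cover, so the distributed results cannot match those of the single server.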