There is an increasing demand for services that collect and use big data, i.e., large amounts of data provided by, for example, various information sources, apparatuses, and sensors connected to a network. If a large amount of data generated in the real world can be processed sequentially, information can be obtained in almost real time. For example, there is a demand for a technology that sequentially processes a large number of data streams that are constantly provided from various sensors.
An example of such a technology is complex event processing for processing big data. However, the recent spread of smartphones and tablet terminals has drastically increased communication traffic. Also, as more and more people and apparatuses become connected to networks, the communication traffic is expected to increase further. Accordingly, it is necessary to further develop such technologies.
Data (a sequence of events) obtained from a data stream may be stored in a database before extracting information from or processing the data. However, this approach may not always satisfy the need to easily obtain desired information in real time. Accordingly, there is a demand for a technology that can process and analyze a large number of data streams (or a large data stream) in real time (or in almost real time). Also, to meet this demand, a technology for processing data streams in parallel is necessary.
A data stream includes multiple events. Therefore, in the present application, a data stream may also be referred to as an “event sequence”.
FIG. 1 illustrates an example of data stream processing. In the example of FIG. 1, a stream processing system 140 sequentially processes three input streams 110, 120, and 130, and outputs two output streams 150 and 160. For example, in the input stream 110, multiple events 111, 112, and 113 are sequentially input to the stream processing system 140.
The stream processing system 140 includes multiple queries 142, 144, 146, 148, and 149. These queries are similar to queries used for processing a static database. However, queries for a stream processing system differ from queries for a database in two respects. First, they continuously process input information and output desired information. Second, in a stream processing system, an output of a query is used as an input to another query. Accordingly, a “query” in the present application may additionally include a function that is different from a query for a database.
In FIG. 1, queries are connected by arrows. These arrows indicate data flows (data streams). For example, the output stream 150 output from the stream processing system 140 includes multiple processing results 151 and 152. In the present application, a graph indicating connections among the queries in the stream processing system 140 is referred to as a “query graph”. Also in the present application, a program including a group of queries and a relationship among the queries indicated by a query graph is referred to as a “data stream program”.
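As an illustration only, a query graph of this kind may be modeled as a directed graph whose vertices are queries and whose edges are the streams connecting them. The following Python sketch is not taken from the stream processing system 140: the class name QueryGraph, its methods, and the query names are hypothetical, and the edges shown do not reproduce the connections of FIG. 1.

```python
from collections import defaultdict


class QueryGraph:
    """Directed graph: vertices are queries, edges are intermediate streams."""

    def __init__(self):
        self.queries = set()
        self.downstream = defaultdict(list)

    def connect(self, producer, consumer):
        # The output stream of `producer` becomes an input stream of `consumer`.
        self.queries.update((producer, consumer))
        self.downstream[producer].append(consumer)

    def entry_queries(self):
        # Queries that are not fed by any other query (external input only).
        fed = {c for consumers in self.downstream.values() for c in consumers}
        return sorted(self.queries - fed)

    def exit_queries(self):
        # Queries whose output is not consumed by another query.
        return sorted(q for q in self.queries if not self.downstream.get(q))


# Hypothetical query names, chosen only for this example.
graph = QueryGraph()
graph.connect("filter", "aggregate")
graph.connect("aggregate", "alert")
graph.connect("filter", "alert")
```

In this sketch, a data stream program would correspond to the group of query definitions together with such a graph.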
A data stream program is written in a query language similar to a Structured Query Language (SQL) used for static databases. Examples of data stream program languages include a Continuous Query Language (CQL) (see, for example, Arasu, Arvind, Shivnath Babu, and Jennifer Widom. “CQL: A language for continuous queries over streams and relations” Database Programming Languages. Springer Berlin Heidelberg, 2004; http://link.springer.com/chapter/10.1007/978-3-540-24607-7_1) and a Complex Event Processing (CEP) Language (see, for example, Interstage Big Data Complex Event Processing Server V1.0.0 Developer's Reference; http://software.fujitsu.com/jp/manual/manualfiles/m120021/j2u11668/01enz200/j2ul-1668-01enz0-00.pdf). In the present application, the Complex Event Processing (CEP) Language is used for descriptions.
In FIG. 2A, two queries Q1 and Q2 are connected by an intermediate stream 240. The queries Q1 and Q2 process an input stream 210, and the query Q2 outputs an output stream 270. The query Q1 includes partitioning keys A and B, and the query Q2 includes partitioning keys B and C.
Here, a partitioning key is a key to be applied to a hash function used to determine destination nodes of data when an input stream is partitioned for parallel distributed processing. For example, in a query with a group by operator, the key used for the group by operator may be used as the partitioning key for that query. When, for example, a program of the query Q1 includes a clause “group by A,B”, event fields A and B are recognized as the partitioning keys of the query Q1. When multiple fields are recognized as partitioning keys, a set of partitioning keys is referred to as a “partitioning key set”. Similarly, in a query with a join operator, the join key(s) used for the join operator may be used as the partitioning key(s) of that query.
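The recognition of a partitioning key set from a “group by” clause may be sketched as follows. This is a minimal Python illustration under stated assumptions, not the parser of any actual CEP product: the function name partitioning_key_set, the regular expression, and the sample query text are hypothetical, and only a simple comma-separated list of field names is handled.

```python
import re

# Matches "group by" followed by a comma-separated list of field names.
_GROUP_BY = re.compile(r"group\s+by\s+(\w+(?:\s*,\s*\w+)*)", re.IGNORECASE)


def partitioning_key_set(query_text):
    """Return the fields of the group by clause as a tuple, or () if none."""
    match = _GROUP_BY.search(query_text)
    if match is None:
        return ()
    return tuple(field.strip() for field in match.group(1).split(","))


# Hypothetical SQL-like query text, used only for illustration.
q1_text = "select A, B, count(*) from InputStream group by A,B"
q1_keys = partitioning_key_set(q1_text)
print(q1_keys)
```

A join operator could be handled in the same spirit by collecting the join key(s) instead of the group by fields.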
Further, in the present application, a function indicating a relationship between properties of an input event and an output event of a query is also treated as a partitioning key. Here, a property indicates an attribute of data belonging to an event. An event has one or more properties. Also, a property may be used as a partitioning key.
FIG. 2B illustrates an example where the data stream program of FIG. 2A is executed in a parallel distributed manner to process the input stream 210.
The input stream 210 is expressed in a format similar to a table used for a database. The input stream 210 has multiple properties {A,B,C}. Also, the input stream 210 includes multiple events 212, 214, 216, and 218 that are arranged in time series. The query Q1 is assigned to each of a node 232 and a node 234. Here, a node may indicate, for example, a physical machine or a virtual machine. In this example, distributed processing of the query Q1 is performed by two nodes 232 and 234. At a point 220, to distribute the input stream 210 to the node 232 and the node 234, the input stream 210 is partitioned into a stream 221 and a stream 222 by applying a partitioning key set {A, B} to an appropriate hash function. The stream 221 includes an event 212a and an event 214a that sequentially arrive at the node 232 and are processed. The stream 222 includes an event 216a and an event 218a that sequentially arrive at the node 234 and are processed. As the hash function, a technology for a static database may be used. For example, a hash table may be used. In this case, various hash functions for partitioning the input stream 210 into two streams 221 and 222 using the partitioning key set {A, B} may be used.
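The partitioning at the point 220 may be sketched in Python as follows. This is only one possible realization: the CRC32-based hash, the function names node_for and partition, and the sample property values are assumptions made for illustration, and any hash function that maps equal key values to the same node could be substituted.

```python
import zlib


def node_for(event, key_set, num_nodes):
    """Map an event to a node index by hashing the values of its key set."""
    key = "|".join(str(event[name]) for name in key_set)
    # CRC32 is used here only because it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(stream, key_set, num_nodes):
    """Split a stream into per-node streams, preserving arrival order."""
    per_node = [[] for _ in range(num_nodes)]
    for event in stream:
        per_node[node_for(event, key_set, num_nodes)].append(event)
    return per_node


# Hypothetical events with properties {A, B, C}, listed in time order.
input_stream = [
    {"A": 1, "B": "x", "C": 10},
    {"A": 1, "B": "x", "C": 20},
    {"A": 2, "B": "y", "C": 10},
    {"A": 3, "B": "x", "C": 30},
]
streams = partition(input_stream, ("A", "B"), 2)
```

Events that agree on both A and B always reach the same node, which is the property that a query grouping by A and B relies on.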
Also in FIG. 2B, the query Q2 is assigned to each of a node 252 and a node 254. The partitioning key set {B, C} of the query Q2 differs from the partitioning key set {A, B} of the query Q1. Therefore, an event 212b from the node 232 and an event 216b from the node 234 are processed by the query Q2 of the node 252. Similarly, an event 214b from the node 232 and an event 218b from the node 234 are processed by the query Q2 of the node 254.
For this purpose, an output of the node 232 needs to be partitioned into a stream 242 and a stream 244 by using the partitioning key set {B, C} of the query Q2 and an appropriate hash function to send the corresponding events to the node 252 and the node 254. Similarly, an output of the node 234 needs to be partitioned into a stream 246 and a stream 248 by using the partitioning key set {B, C} of the query Q2 and an appropriate hash function to send the corresponding events to the node 252 and the node 254.
Thus, in the example of FIG. 2B, even though the queries Q1 and Q2 are executed in parallel using four nodes 232, 234, 252, and 254, communications via four streams 242, 244, 246, and 248 occur among the four nodes 232, 234, 252, and 254. These communications consume network resources of the nodes.
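The communication pattern described above may be examined with a small Python sketch that records which pairs of upstream and downstream nodes exchange events. The sketch makes several assumptions that are not in the description: the query Q1 is taken to forward the properties A, B, and C unchanged, a CRC32-based hash is used, and the names node_for and links_used and the generated events are hypothetical.

```python
import zlib


def node_for(event, key_set, num_nodes):
    """Map an event to a node index by hashing the values of its key set."""
    key = "|".join(str(event[name]) for name in key_set)
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def links_used(stream, upstream_keys, downstream_keys, n_up, n_down):
    """Return the set of (upstream node, downstream node) pairs that carry events."""
    links = set()
    for event in stream:
        src = node_for(event, upstream_keys, n_up)      # node running Q1
        dst = node_for(event, downstream_keys, n_down)  # node running Q2
        links.add((src, dst))
    return links


# Hypothetical events covering several combinations of A, B, and C.
events = [
    {"A": a, "B": b, "C": c}
    for a in range(4) for b in range(4) for c in range(4)
]
links = links_used(events, ("A", "B"), ("B", "C"), 2, 2)
print(len(links), "of at most", 2 * 2, "inter-node streams carry events")
```

With two nodes per query, at most four such streams can exist, which corresponds to the streams 242, 244, 246, and 248 of FIG. 2B.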
US Patent Application Publication No. 2010/0030741, for example, discloses a method that receives a query plan including multiple queries, classifies the queries, computes an optimal partition set for each of the queries, and reconciles the optimal partition set of each of the queries with at least one subset of queries. The method also selects at least one reconciled optimal partition set to be used by each of the queries, and stores the selected at least one reconciled optimal partition set in a computer readable medium.
Japanese Patent No. 4925143, for example, discloses a technology for analyzing a cause of a result of a stream data processing system, taking into account a process performed by a unique window operator used in the stream data processing system.
Also, Japanese Laid-Open Patent Publication No. 2011-76153, for example, discloses a technology for automatically generating a query for complex event processing based on an event log. In this technology, patterns of combinations of attribute values frequently appearing in the event log are obtained, and frequently-occurring events are automatically generated based on the obtained patterns. Next, a frequently-occurring event sequence where labeled frequently-occurring events are arranged in the order of occurrence is generated. Then, a query for detecting the occurrence of an incident is generated based on the frequently-occurring event sequence.