1. Technical Field
The present disclosure generally relates to relational queries, and more particularly to transforming relational queries to stream processing.
2. Discussion of Related Art
An application may execute one or more relational queries against a database or a data warehouse in response to continuous receipt of live data. Frequently, the queried data is static, or the application can tolerate slightly stale data. In this case, the application does not really use the strengths of the database technology, since the data can be periodically pre-computed and has only loose synchronization and locking needs. When data rates and volumes are small, this approach works well, which is why it is widely employed in industry. However, when data volumes and data rates increase, database operations become a bottleneck due to disk accesses and transactional semantics, and the application slows down.
Since the relational queries go to a traditional disk-based database, one approach is to reduce the bottleneck by optimizing the disk-based database. However, this approach requires a large engineering effort, and may be hampered by synchronization and transactional guarantees that the application does not need. Further, this approach still incurs slow disk accesses.
In another approach, performance may be improved by use of a materialized view, which is a concrete table that caches the result of a frequent query. When this query is issued again, it is rewritten so it can be serviced by the materialized view instead of the backing database. This approach partially addresses the performance concern, in that queries to the materialized view are faster than the original queries. However, this approach still uses traditional disk-based database technologies, and thus suffers from slow disk accesses.
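The materialized-view approach described above can be illustrated with a minimal Python sketch. It is an assumption-laden illustration only: it uses an in-memory SQLite database for brevity (SQLite has no native materialized views, so an ordinary table stands in for one), and the table names, the `REWRITES` lookup, and the `run_query` helper are hypothetical names introduced here, not part of any particular database product.

```python
import sqlite3

# Hypothetical frequent query and its rewritten form against the cached table.
FREQUENT_QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
REWRITES = {FREQUENT_QUERY: "SELECT region, total FROM sales_by_region_mv"}


def build_demo() -> sqlite3.Connection:
    """Create a base table and a concrete table that caches the frequent query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 10), ("west", 5), ("east", 7)],
    )
    # The "materialized view": a plain table holding the query result.
    conn.execute("CREATE TABLE sales_by_region_mv AS " + FREQUENT_QUERY)
    return conn


def run_query(conn: sqlite3.Connection, sql: str) -> list:
    """Rewrite a known frequent query to hit the cached table, then run it."""
    return sorted(conn.execute(REWRITES.get(sql, sql)).fetchall())
```

In this sketch, a row inserted into `sales` after `build_demo()` is not reflected in the rewritten query until the cached table is rebuilt, which corresponds to the slightly stale data that such applications tolerate.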
In yet another approach, which addresses the slow disk issue, one switches to an in-memory database. However, an in-memory database only provides the limited computing power available inside a single computing node. Further, in-memory databases are limited to data volumes that fit in memory on a single computing node.
Streaming workloads and applications gave rise to new data management architectures as well as new principles for application development and evaluation. InfoSphere Streams is a stream processing middleware from IBM that supports structured as well as unstructured data stream processing and the simultaneous execution of multiple applications from a community of users. These applications can be scaled to a large number of computing nodes and can interact at runtime through stream importing and exporting mechanisms.
InfoSphere Streams applications take the form of dataflow processing graphs. A flow graph consists of a set of operators connected by streams, where each stream has a fixed schema and carries a series of tuples. The operators can be distributed on several computing nodes.
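The shape of such a flow graph can be sketched in plain Python, with generators standing in for operators and named tuples standing in for fixed stream schemas. This is only an illustrative sketch and does not use the InfoSphere Streams programming interface; the `Trade` and `Total` schemas, the operator names, and the price threshold are all assumptions made for the example, and everything runs in a single process rather than being distributed across computing nodes.

```python
from typing import Callable, Dict, Iterable, Iterator, NamedTuple, Tuple


class Trade(NamedTuple):
    """Fixed schema of the input stream (hypothetical)."""
    symbol: str
    price: float


class Total(NamedTuple):
    """Fixed schema of the output stream (hypothetical)."""
    symbol: str
    total: float


def source(rows: Iterable[Tuple[str, float]]) -> Iterator[Trade]:
    """Source operator: emits one Trade tuple per input row."""
    for symbol, price in rows:
        yield Trade(symbol, price)


def filter_op(stream: Iterable[Trade],
              predicate: Callable[[Trade], bool]) -> Iterator[Trade]:
    """Filter operator: forwards only the tuples that satisfy the predicate."""
    for t in stream:
        if predicate(t):
            yield t


def running_total(stream: Iterable[Trade]) -> Iterator[Total]:
    """Stateful operator: emits the updated per-symbol sum for each tuple."""
    totals: Dict[str, float] = {}
    for t in stream:
        totals[t.symbol] = totals.get(t.symbol, 0.0) + t.price
        yield Total(t.symbol, totals[t.symbol])


def run_graph(rows: Iterable[Tuple[str, float]]) -> list:
    """Connect the operators into the graph source -> filter -> aggregate."""
    return list(running_total(filter_op(source(rows), lambda t: t.price > 10.0)))
```

Each operator consumes tuples as they arrive and emits results incrementally, rather than waiting for a complete table as a conventional relational query would.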
However, conventional relational queries, such as those written in the structured query language (SQL), cannot be easily applied to streams. Thus, there is a need for methods and systems that can transform relational queries into stream processing that can run on platforms such as InfoSphere Streams.