This specification relates to stream computing.
Stream computing refers to systems and techniques for continually processing unbounded sequences of objects in real time, e.g., receiving incoming streams of data objects, processing the data objects, and producing output streams of modified data objects. For example, the objects may represent page requests received by a web site or microblog messages posted by users. Stream computing systems can provide immediate search results as data objects are found, as well as continually process new data objects as they are received.
Distributed stream computing systems can include multiple computing nodes that process data to generate sequences of data objects, each such sequence being referred to as a stream. An example streaming data object is a list of named values referred to as a tuple. The computing nodes can perform various operations on the streams in a particular order.
The operations performed by nodes in a stream computing system can be defined by a topology. A topology is a computing graph of computing nodes and the respective stream transformations performed by the computing nodes. Computing nodes that read raw data and generate streams in the first instance may be referred to as source nodes, or “spouts,” and computing nodes that subscribe to streams, perform operations on the data, and pass on transformed streams may be referred to as processing nodes, or “bolts.” Edges in the topology indicate which bolts subscribe to which streams. Nodes in a stream computing topology are typically configured to process streams indefinitely.
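As an illustration only, the relationship between source nodes, processing nodes, and streams can be sketched in a few lines of Python. This is a minimal, single-process sketch and not the API of any particular stream computing system: the names page_request_spout, search_filter_bolt, add_length_bolt, and run_chain are hypothetical, a tuple is modeled as a dictionary of named values, and the topology is assumed to be a simple linear chain in which each bolt subscribes to the stream produced by the node before it.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# A tuple is a list of named values; a dict is used here as an assumed stand-in.
StreamTuple = Dict[str, object]
Stream = Iterator[StreamTuple]


def page_request_spout(raw_lines: Iterable[str]) -> Stream:
    """Source node ("spout"): reads raw data and generates a stream of tuples."""
    for line in raw_lines:
        user, url = line.strip().split(",")
        yield {"user": user, "url": url}


def search_filter_bolt(stream: Stream) -> Stream:
    """Processing node ("bolt"): passes on only requests for search pages."""
    for item in stream:
        if str(item["url"]).startswith("/search"):
            yield item


def add_length_bolt(stream: Stream) -> Stream:
    """Processing node ("bolt"): passes on each tuple with an added named value."""
    for item in stream:
        yield {**item, "url_length": len(str(item["url"]))}


def run_chain(source: Stream, bolts: List[Callable[[Stream], Stream]]) -> Stream:
    """Connect each bolt to the stream emitted by the node before it."""
    stream = source
    for bolt in bolts:
        stream = bolt(stream)
    return stream
```

In this sketch the input is a finite list so that the example terminates; nodes in an actual topology would typically keep consuming their input streams indefinitely, and a general topology may be a graph rather than a chain.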
FIG. 1A is a diagram of an example prior art system. The system includes a stream computing subsystem 130a that processes streams of data objects from a key-value storage subsystem 150, e.g., a Hadoop Database (HBase) that stores data as key-value pairs in distinct column families. The storage subsystem 150 can alternatively be a relational database or any other appropriate storage subsystem.
The stream computing subsystem 130a generates streams using source nodes and processing nodes of a topology 126a according to a topology definition 125a received from a management node 120a. An example stream computing subsystem is the Storm distributed real-time computation system. (Storm is described at http://storm-project.net/ and documentation identified there.)
A user of user device 110 can query the key-value storage subsystem 150 to obtain matching data objects 145. The user device 110 can be a personal computer, smartphone, or any other kind of computer-based device with which a user can interact. The user device 110 issues a query 105 to the management node 120a. The management node 120a parses the query 105 and generates one or more processes required to identify matching data objects 145 that satisfy the query 105. The management node 120a generates a corresponding topology definition 125a of the processes required to satisfy the query 105; the topology definition 125a maps the processes to source nodes and processing nodes. For example, satisfying the query 105 generally requires read processes that read data objects from the key-value storage subsystem 150 and filter processes that filter the resulting streams of data objects. The topology definition 125a is then used to generate a topology that is run on a cluster of computers.
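By way of a hedged illustration, the mapping from a parsed query to a topology definition might look like the following Python sketch. Everything in it is an assumption made for the example: the TopologyDefinition class, the define_topology function, the node names, and the rule that each whitespace-separated query term yields one filter process are hypothetical and are not taken from any actual management node.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TopologyDefinition:
    """Hypothetical container mapping processes to nodes and subscriptions."""
    source_nodes: List[str] = field(default_factory=list)
    processing_nodes: List[str] = field(default_factory=list)
    # Each edge (upstream, downstream) means downstream subscribes to upstream.
    edges: List[Tuple[str, str]] = field(default_factory=list)


def define_topology(query: str) -> TopologyDefinition:
    """Map the processes assumed necessary for a query onto nodes."""
    definition = TopologyDefinition()
    # Assumption: one read process, mapped to a single source node.
    definition.source_nodes.append("read")
    upstream = "read"
    # Assumption: one filter process per query term, chained in order.
    for index, term in enumerate(query.split()):
        name = f"filter_{index}:{term}"
        definition.processing_nodes.append(name)
        definition.edges.append((upstream, name))
        upstream = name
    return definition
```

Under these assumptions, define_topology("error timeout") yields one source node and two chained filter nodes; the resulting definition is fixed once generated, which mirrors the lack of runtime control described for this kind of system.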
A search subsystem 140 can index data objects in the key-value storage subsystem 150 for more efficient retrieval of matching data objects. An example search subsystem 140 is the Apache Solr™ search platform. (“Apache Solr” is a trademark of The Apache Software Foundation.) One or more source nodes, e.g., source node “Get IDs” 132, in the topology 126a will communicate with the search subsystem 140 to obtain matching identifiers 135 of data objects that satisfy the query 105. The source nodes will then generate matching identifier streams 145 that are received by a processing node “Read Data Objects” 134 in the topology 126a. The processing node 134 receives the matching identifier streams from the source nodes and requests the data objects from the key-value storage subsystem 150. The processing node 134 can use batch processing techniques to improve the performance of reading the data objects from the key-value storage subsystem 150. For example, the processing node 134 can wait to request data objects from the key-value storage subsystem 150 until at least a minimum number of identifier tuples have been received from the matching identifier streams 145. The processing node 134 can then return the matching data objects 145 to the user device 110.
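The batch processing technique described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the behavior of any actual processing node: the function read_in_batches, the fetch_many callback standing in for a bulk read from the storage subsystem, the "id" field name, and the default minimum batch size are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

DataObject = Dict[str, object]


def read_in_batches(
    identifier_stream: Iterable[Dict[str, str]],
    fetch_many: Callable[[List[str]], List[DataObject]],
    min_batch_size: int = 3,
) -> Iterator[DataObject]:
    """Buffer identifier tuples and request data objects one batch at a time."""
    pending: List[str] = []
    for identifier_tuple in identifier_stream:
        pending.append(identifier_tuple["id"])
        # Wait until at least min_batch_size identifiers have been received.
        if len(pending) >= min_batch_size:
            yield from fetch_many(pending)
            pending = []
    # Assumption: the stream is finite here, so leftover identifiers are
    # flushed at the end; an unbounded stream would need another trigger,
    # such as a timeout.
    if pending:
        yield from fetch_many(pending)
```

With a minimum batch size of three, four incoming identifier tuples result in two storage requests instead of four, which is the kind of saving the batching is intended to provide.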
The topology definition 125a that defines the structure of the topology 126a is typically generated automatically according to logic in the management node 120a. Alternatively, the topology definition 125a can be programmed by a developer in advance. A user typically has no runtime control over the structure of the topology 126a in the stream computing subsystem 130a. 