The present invention relates to data processing systems, and more specifically, to a method and system for implementing a stream processing computer architecture.
The impact of communication on the performance of computer systems continues to grow both at the macro-level (e.g., blade servers and clusters of computers) and at the micro-level (e.g., within a single processor chip having many cores). The traditional approach to computing, which relies on reducing the access time to main memory through a hierarchy of cache memories, is reaching a point of diminishing returns. This is true, in part, because of the increasing latency of I/O data transmission with respect to the speed of the processing cores, as well as the increasing fraction of the (limited) on-chip power dissipation budget that is demanded by cache memories and global communication wires. Meanwhile, the tight on-chip power dissipation constraints have caused many major semiconductor companies to move to multi-core or chip multiprocessor (CMP) architectures. The emergence of CMPs has, in turn, placed increased challenges on the communications infrastructure in two major areas. In particular, the growing number of processing cores in CMPs exacerbates the bandwidth requirements for both intra-chip and inter-chip communication. Additionally, CMP architectures vastly increase programming complexity, and thereby reduce ultimate productivity, as compared with traditional single-core processor chips.
Stream processing has recently emerged as an alternative computation model for systems that are based on CMP architectures and software-managed cache memory organization. Many classes of important applications, e.g., digital signal processing and multimedia applications, present fairly regular access to long sequences of regular data structures that can be processed in parallel, as opposed to the more randomized access to complex data records that is typical in databases. For these applications, the combination of stream processing with specialty processors such as the nVidia® and AMD/ATI graphics processing units (GPUs) or the IBM® Cell Broadband Engine has the potential to offer higher performance and lower power dissipation than the traditional computing paradigm applied to general-purpose CMP architectures.
A sample stream computation graph is shown in FIG. 1. The graph 100 is made of computation nodes, called kernels (102A, 102B, and 102C), which are connected by edges 104A/104B that represent streams of data going from one kernel to another. A kernel refers to software code elements that perform the computation on the streams of data. In graph 100 of FIG. 1, these data streams are unidirectional; that is, the data moves (streams) from the left hand side to the right hand side of the Figure, as shown by the arrow. Kernels may be one of three types: source 102A (representing the origin of a stream of data generated as input to a computation graph); sink 102B (representing the end results in the form of a stream or streams); and regular kernels 102C. A kernel (102A-102C) can have one or more input streams 104A and generate, as a result of its specific computation, one or more output streams 104B.
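By way of illustration only, the kernel types described above can be sketched in Python. The sketch assumes that a source is a kernel with no incoming stream and a sink is a kernel with no outgoing stream; the function name classify_kernels and the kernel labels are hypothetical and do not appear in FIG. 1.

```python
from collections import defaultdict

def classify_kernels(edges):
    """Label each kernel 'source', 'sink', or 'regular' from its stream edges.

    edges: iterable of (producer_kernel, consumer_kernel) pairs, one per stream.
    Assumption: a source has no input stream and a sink has no output stream.
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    kernels = set()
    for producer, consumer in edges:
        out_degree[producer] += 1
        in_degree[consumer] += 1
        kernels.update((producer, consumer))

    roles = {}
    for kernel in sorted(kernels):
        if in_degree[kernel] == 0:
            roles[kernel] = "source"
        elif out_degree[kernel] == 0:
            roles[kernel] = "sink"
        else:
            roles[kernel] = "regular"
    return roles

# Hypothetical five-kernel graph with unidirectional streams.
example_edges = [("src", "k1"), ("src", "k2"), ("k1", "k3"), ("k2", "k3"), ("k3", "out")]
example_roles = classify_kernels(example_edges)
```

In this hypothetical graph, "src" is classified as the source, "out" as the sink, and the remaining kernels as regular kernels.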
Typically, a stream computation graph (e.g., graph 100) represents a solution to a computer processing problem (e.g., detecting events or finding patterns and complex relationships among the input data streams, as in financial trading of stocks, sensory data correlations, and more). The graph persists for as long as the data streams are being processed by the computation kernels, and typically this is a very long time (hours or more, or indefinitely). Thus, the topology of this graph is considered to be fixed.
One challenge in dealing with such a stream computation graph is determining how to group the computation nodes (e.g., kernels 102A-102C) such that the groups can be assigned to physical computation nodes of a computer processing system. There are many possible ways to perform such grouping (also known as scheduling, embedding, or, in graph theory, as a transformation called graph contraction). As shown in FIG. 1, shaded groups (110A-110C) represent groupings of kernels such that the kernels assigned to one group (such as group 110B) will be located within one physical computation node or within a cluster of nodes tightly coupled by a fast local communication network. Then, the total aggregated streams passing from one such group of kernels to another may be viewed as one connection among the groups. In graph theory terms, each such group can be viewed as a super node into which the regular computation nodes (kernels) have been collapsed. This type of grouping may be done for all the computation nodes in a stream computation graph. The streams, represented by edges between the kernels of the stream computation graph, can similarly be collapsed into a super edge representing the sum of all streams of data passing between the super nodes.
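The collapsing of kernels into super nodes and of streams into super edges can be illustrated with a minimal Python sketch. The function name contract, the kernel labels, the group assignment, and the per-stream bandwidth figures are all hypothetical; the sketch assumes the grouping has already been chosen and only shows how the streams between two groups sum into one super edge.

```python
from collections import Counter

def contract(edges, group_of):
    """Collapse a kernel graph into super nodes connected by super edges.

    edges: iterable of (producer_kernel, consumer_kernel, bandwidth) triples.
    group_of: dict mapping each kernel to the label of its super node.
    Returns a dict mapping (producer_group, consumer_group) to the summed
    bandwidth of all streams passing between those two super nodes.
    """
    super_edges = Counter()
    for producer, consumer, bandwidth in edges:
        src_group = group_of[producer]
        dst_group = group_of[consumer]
        if src_group != dst_group:
            # Streams between kernels of the same group stay local
            # and do not contribute to any super edge.
            super_edges[(src_group, dst_group)] += bandwidth
    return dict(super_edges)

# Hypothetical graph: three streams cross from group "B" to group "C".
kernel_edges = [
    ("a1", "b1", 4),
    ("b1", "b2", 2),  # local to group B
    ("b1", "c1", 1),
    ("b2", "c1", 3),
    ("b2", "c2", 5),
    ("c1", "c2", 7),  # local to group C
]
grouping = {"a1": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
super_graph = contract(kernel_edges, grouping)
```

Here the three streams crossing from group "B" to group "C" collapse into a single super edge whose weight is their sum, while the streams internal to a group are omitted from the contracted graph.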
As an example, as shown in FIG. 1, three streams pass (from left to right) between super nodes 110B and 110C. These can now be viewed as one stream that connects super nodes 110B and 110C. In practice, the original streams of data are aggregated by the physical communication fabric of the stream computing system, such that the ingress point at super node 110B will multiplex the three streams from a group of kernels (e.g., those within super node 110B) into one stream and, at the other end, the group of kernels within super node 110C will demultiplex this stream back into the three original streams and locally connect them to the proper kernels as mapped in one physical computation node or cluster of such nodes.
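The multiplexing and demultiplexing described above can be sketched as follows. This is an illustrative sketch only: the round-robin interleaving, the tagging of each item with a stream identifier, and the names multiplex and demultiplex are assumptions made for the example, not a description of how the physical communication fabric operates.

```python
from itertools import zip_longest

_MISSING = object()  # marks positions where a shorter stream has run out

def multiplex(streams):
    """Merge several named streams into one sequence of (stream_id, item) frames.

    streams: dict mapping a stream identifier to a list of items.
    Items are interleaved round-robin (an assumption of this sketch).
    """
    stream_ids = list(streams)
    frames = []
    for row in zip_longest(*(streams[sid] for sid in stream_ids), fillvalue=_MISSING):
        for sid, item in zip(stream_ids, row):
            if item is not _MISSING:
                frames.append((sid, item))
    return frames

def demultiplex(frames):
    """Split a sequence of (stream_id, item) frames back into per-stream lists."""
    streams = {}
    for sid, item in frames:
        streams.setdefault(sid, []).append(item)
    return streams

# Three hypothetical streams leaving one super node for another.
outgoing = {"s1": [1, 2, 3], "s2": ["a", "b"], "s3": [10]}
link_frames = multiplex(outgoing)      # single aggregated stream on the link
recovered = demultiplex(link_frames)   # per-kernel streams at the far end
```

Because each frame carries its stream identifier, the receiving side can restore the original streams, with item order preserved within each stream, and hand each one to the proper kernel.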
There has been a growing interest in extending this stream processing paradigm to certain large-scale applications in different fields such as finance, data mining, and computational biology. This extension requires going beyond running a stream application on a single GPU-like processor and, instead, involves building large, scalable Stream Processing Systems (SPSs) where many of these processors are interconnected by high-speed interconnection networks. However, building large, scalable stream processing systems suffers from various drawbacks, such as increased transmission bandwidth challenges, as well as increased access times to large data sets in memory from processing nodes.
It would be desirable, therefore, to provide an enhanced stream processing architecture that overcomes the aforementioned drawbacks.