A common limitation on processing performance in a digital system is the efficiency and speed with which instructions, data and other information are transferred among the different components and subsystems of the system. For example, the bus speed in a general-purpose von Neumann architecture dictates how fast data can be transferred between the processor and memory and, as a result, places a limit on computing performance (e.g., millions of instructions per second (MIPS), floating-point operations per second (FLOPS), etc.).
Other types of computer architecture, such as multi-processor or parallel processor designs, require complex communication, or interconnection, capabilities so that each processor can communicate with the other processors, with multiple memory devices, with input/output (I/O) ports, etc. With today's complex processor system designs, the importance of an efficient and fast interconnection facility rises dramatically. However, such facilities are difficult to design in a way that jointly optimizes speed, flexibility and simplicity of design.
Currently, parallel programming is based on threads as the central organizing principle of computing. However, threads are seriously flawed as a computation model because they are wildly nondeterministic and rely on programming style to constrain that nondeterminism to achieve deterministic aims. Test and verification become difficult in the presence of this wild nondeterminism. One solution, suggested by GPU (Graphics Processing Unit) vendors, is to narrow the forms of parallelism expressible in the programming model. Their focus on data parallelism, however, ties the hands of programmers and prevents them from exploiting the full potential of multi-core processors.
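The nondeterminism described above can be illustrated with a minimal sketch. It assumes two hypothetical threads, A and B, that each increment a shared variable as a separate read step and write step, and it enumerates every possible interleaving explicitly rather than relying on a real scheduler. The names `run`, `all_schedules` and `outcomes` are illustrative only and do not come from any particular system.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving of two threads that each do x = x + 1.

    Each thread performs two steps: read the shared value into a private
    temporary, then write the temporary plus one back to the shared value.
    """
    shared = 0
    local = {}                   # per-thread private copy of the value read
    step = {"A": 0, "B": 0}      # how many steps each thread has completed
    for tid in schedule:
        if step[tid] == 0:       # first step of this thread: read
            local[tid] = shared
        else:                    # second step of this thread: write
            shared = local[tid] + 1
        step[tid] += 1
    return shared

def all_schedules():
    """All distinct orderings of two steps of A and two steps of B."""
    return sorted(set(permutations("AABB")))

# Map each interleaving to the final value it produces.
outcomes = {"".join(s): run(s) for s in all_schedules()}
possible_finals = sorted(set(outcomes.values()))

for name, value in outcomes.items():
    print(name, "->", value)
print("possible final values:", possible_finals)
```

In this sketch the same program yields different final values depending only on the order in which the steps happen to be scheduled, which is why a test that passes on one run gives little assurance about the next.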
Further, threads do not simply run on a bank of identical cores. A modern computer (supercomputer, workstation, desktop or laptop) contains a bewildering array of heterogeneous cores, each requiring a separate programming model. For example, a motherboard may have one to four main CPUs (central processing units, e.g., a Pentium processor), each having one to six on-die CPU cores together with an on-die or on-package GPU (Graphics Processing Unit, e.g., an NVIDIA GPU), which itself contains 16 to 256 GPU cores along with several discrete video and audio encode and decode cores (for encoding and decoding a multiplicity of video standards, e.g., MPEG2, MPEG4, VC-1, H.264, etc.). Also on the motherboard are one to four discrete high-end GPUs, each containing 16 to 1024 GPU cores along with several discrete high-end configurable video/audio encode and decode cores (configurable meaning that the core can be selected to encode/decode a variety of pre-existing standards, e.g., MPEG2, MPEG4, VC-1, H.264, etc., at very high resolutions and with multiple channels of sound). Additional subsystems composed of processing cores are added to the motherboard in the form of communications cores, e.g., TCP/IP offload cores, which are themselves typically built from one or more CPU cores and one or more packet processing cores, and WiFi cores, Bluetooth cores, WiMAX cores, 3G cores and 4G cores, which are built from one or more CPU cores and one or more broadband/baseband processing cores.
Devices at the current high end of the spectrum, such as supercomputers, add a further processor in the form of one to four FPGAs (field-programmable gate arrays) per motherboard. Each FPGA is itself composed of hundreds of thousands to tens of millions of very simple CLB processing cores along with multiple hard IP or soft IP CPU cores and multiple DSP cores. These motherboards are then replicated and interconnected in the hundreds to thousands to produce a modern supercomputer. These systems (desktops, workstations, laptops and/or supercomputers) are in turn interconnected via the Internet to provide national and global computing capabilities.
The complexity of “managing” and “programming” such a diverse collection of cores is a severe problem. Most programmers do not even attempt it and settle for programming a single CPU core, ignoring the rest. There are a certain number of algorithms known in the industry as “embarrassingly parallel problems” (e.g., the Google Search algorithm is simple to spread across multiple CPUs because there is little to no interactivity across the parallel threads). Unfortunately, the vast majority of problems do not have these characteristics; they require a high degree of interactivity and synchronization across the multiple threads.
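As a minimal sketch of what “embarrassingly parallel” means in practice, the following assumes a toy search over independent document shards. The function names, the shard contents and the worker count are all hypothetical and chosen only for illustration; the sketch does not represent any real search engine. Each shard is searched without reference to any other shard, so no synchronization is needed until the results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(shard, term):
    """Count documents in one shard containing the term as a whole word.

    The function needs no data from any other shard.
    """
    return sum(1 for doc in shard if term in doc.split())

def parallel_search(shards, term, workers=4):
    """Search every shard concurrently; results come back in shard order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: count_matches(s, term), shards))

# Hypothetical data: three independent shards of short documents.
shards = [
    ["fast bus design", "slow memory bus"],
    ["thread model", "stream model"],
    ["bus speed limits", "gpu cores"],
]

per_shard = parallel_search(shards, "bus")
total = sum(per_shard)   # the only point where results are combined
print(per_shard, total)
```

A problem in which each worker needed intermediate results from the others while still running could not be written this way, which is the harder and more common case the paragraph above refers to.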
It would therefore be desirable to combine multithreading, unrestricted parallelism and deterministic behavior, as is possible with the streams found in modern programming languages. Streams date at least to the introduction of the C programming language in 1978, and have been incorporated into such languages as C++, Java, Visual Basic and F#. However, in these languages, streams are relegated to the rather narrow role of providing a framework for I/O and file access. It is therefore desirable to expand the role of streams in parallel programming to that of first-class objects, a status roughly comparable to that of variables.
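One way to picture streams as first-class objects is the following minimal sketch. It is an illustrative assumption, not the mechanism of any particular system: the `Stream` class, the stage functions and `run_pipeline` are hypothetical names. Each stream is an ordinary object that can be created, passed as an argument and connected between concurrently running stages. Because every stream here has exactly one writer and one reader, and reads block until data arrives, the output does not depend on how the threads are scheduled.

```python
import queue
import threading

_END = object()  # sentinel marking the end of a stream

class Stream:
    """Hypothetical first-class stream: a FIFO with one writer and one reader."""

    def __init__(self):
        self._q = queue.Queue()

    def put(self, item):
        self._q.put(item)

    def close(self):
        self._q.put(_END)

    def __iter__(self):
        while True:
            item = self._q.get()      # blocks until the writer supplies data
            if item is _END:
                return
            yield item

def source(out, values):
    """Write each value to the output stream, then close it."""
    for v in values:
        out.put(v)
    out.close()

def square(inp, out):
    """Read values from one stream and write their squares to another."""
    for v in inp:
        out.put(v * v)
    out.close()

def running_sum(inp, out):
    """Write the cumulative total after each value read."""
    total = 0
    for v in inp:
        total += v
        out.put(total)
    out.close()

def run_pipeline(values):
    """Connect three concurrent stages with streams and collect the output."""
    a, b, c = Stream(), Stream(), Stream()
    threads = [
        threading.Thread(target=source, args=(a, values)),
        threading.Thread(target=square, args=(a, b)),
        threading.Thread(target=running_sum, args=(b, c)),
    ]
    for t in threads:
        t.start()
    result = list(c)                  # drain the final stream
    for t in threads:
        t.join()
    return result

print(run_pipeline([1, 2, 3, 4]))
```

The three stages run as separate threads, yet repeated runs of this sketch produce the same sequence, since the only communication between stages is through the ordered streams rather than through shared variables.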