A common limitation to processing performance in a digital system is the efficiency and speed of transferring instruction, data and other information among different components and subsystems within the digital system. For example, the bus speed in a general-purpose Von Neumann architecture dictates how fast data can be transferred between the processor and memory and, as a result, places a limit on the computing performance (e.g., million instructions per second (MIPS), floating-point operations per second (FLOPS), etc.).
Other types of computer architecture design, such as multi-processor or parallel processor designs require complex communication, or interconnection capabilities so that each of the different processors can communicate with other processors, multiple memory devices, input/output (I/O) ports, etc. With today's complex processor system designs, the importance of an efficient and fast interconnection facility rises dramatically. However, such facilities are difficult to design to optimize goals of speed, flexibility and simplicity.
Currently, parallel programming is based on threads as the central, organizing principle of computing. However, threads are flawed as a computation model because they are wildly non-deterministic and rely on programming style to constrain non-determinism to achieve deterministic aims. Test and verification become difficult in the presence of this wild non-determinism. One solution has been suggested to narrow the forms of parallelism expressible in the programming model, which is what the GPU (Graphics Processing Unit) vendors have done. Their focus on “data parallelism,” however, ties the hands of programmers and prevents them from exploiting the full potential of multi-core processors.
Further, threads do not just run on a bank of identical cores. A modern computer (supercomputer, workstation, desktop and laptops) contains a bewildering array of different heterogeneous cores all requiring separate programming models to program. For example, one to four main CPUs (central processing unit—e.g. Pentium Processor) on a motherboard each having 1 to 6 CPU cores on die with an on-die or on-package GPU (Graphics Processing Unit—e.g. NVIDIA GPU) which itself contains 16 to 256 GPU cores along with several discrete video & audio encode & decode cores (for the encoding and decoding of a multiplicity of video standards—e.g. MPEG2, MPEG4, VC-1, H.264 etc.). Also on the motherboard are from 1 to 4 discrete high end GPUs each containing 16 to 1024 GPU cores along with several discrete high-end configurable (meaning the core can be selected to encode/decode a variety of pre-existing standards) video/audio encode & decode cores (for the encoding and decoding of a multiplicity of video standards—e.g. MPEG2, MPEG4, VC-1, H.264 etc., at very high resolutions and with multiple channels of sound). Additional subsystems composed of processing cores are added to the motherboard in the form of communications cores (e.g. TCP/IP offload cores which themselves are typically built from one or more CPU cores and one or more packet processing cores. WiFi cores, Blue Tooth cores, WiMax cores, 3G cores, 4G cores which are from one or more CPU cores and one or more broadband/baseband processing cores).
Some high end devices such as supercomputers add an additional processor in the form of one to four FPGAs (field programmable gate array) per motherboard. Each FPGA is itself composed of hundreds of thousand to tens of millions of very simplistic CLB processing cores along with multiple hard IP or soft IP CPU core and multiple DSP cores). Then these motherboards themselves are then replicated and interconnected in the hundreds to thousands to produce a modern supercomputer. These systems (either the desktops/workstations/laptops and/or the supercomputers) and then interconnected via the Internet to provide national and global computing capabilities.
The complexity of “managing” and “programming” such a diverse series of cores is a severe problem. Most programmers do not even attempt this and just settle for programming just one CPU core ignoring the rest of the cores. There are a certain number of algorithms know in the industry as “embarrassingly parallel problems” (e.g. the Google Search algorithm for example is simple to spread across multiple CPUs due to the fact that there is very little to no interactivity across the parallel threads). Unfortunately the vast majority of problems do not have these characteristics, they require a high degree of interactivity and synchronization across the multiple threads.
It would therefore be desirable to incorporate multithreading, unrestricted parallelism and deterministic behavior such as in modern programming languages to streams. Streams date at least to the introduction of the C programming language in 1978, and have been incorporated into such languages as C++, Java, Visual Basic and F#. However, in these languages streams are relegated to a rather narrow role of providing a framework for I/O and file access. It is therefore desirable to expand the role of streams in parallel programming to first-class objects, a status roughly comparable to that of variables. A compiler and related components are needed to convert source code Stream programs to object code adapted to multi-core systems including configurable hardware cores.