Currently, architects and designers of application processors (e.g., mobile, desktop, server, graphics, network) spend a great deal of money, time, and effort acquiring state-of-the-art knowledge of low-level, peripheral, and infrastructural components and their design. When designing multiprocessor and other chips, up to 75% of the effort is independent of the processor's target application and is instead spent on low-level, peripheral, and infrastructural problems. These relate to, for example, voltage regulation, frequency synthesis, power and clock distribution, power-clock gating, voltage-frequency scaling, the interconnection network, the parallel programming memory consistency model, synchronization and coherence protocols, and facilitating program execution that maximizes parallelism gain while minimizing communication and power. The solutions to these problems, or the processes used to design them, may individually fall short of the state of the art and may not form components of an optimal and complete framework.
Current approaches to chip design and interconnection networks are inefficient in many ways. Different low-level, peripheral, and infrastructural tasks may duplicate sensors, statistics collection, control elements, and the communication pathways between them. The same information, when required by different mechanisms, may be inconsistent. The interconnection network topology, the communication mode, and the schemes employed by the low-level, peripheral, and infrastructural functions may be poorly matched to requirements. Controls may be open loop, in which case they do not adapt and may adopt overly conservative values for design parameters; they may use closed-loop feedback, which adapts only reactively; or they may use predictive control, which also adapts reactively, albeit to a forward-predicted state. Each of these can result in loss of function, performance, or power efficiency. Internal memories and functional blocks of cores are either used at some power cost or left unused at some silicon area opportunity cost. Transaction order enforcement, required by the programming memory consistency model, may be duplicated, being carried out both at a ‘high level’ by Processing Elements (PEs) and at a ‘low level’ by network transport message order enforcement. With virtualization becoming accepted computing practice, a server may host one virtual machine (VM) running an OS with one cache management and coherence protocol and a second VM running an OS with a second cache management and coherence protocol. With datacenters becoming integral to large-scale computing, great latency and power reductions are possible for large applications by correctly matching the parallel computing paradigm and the supporting cache coherence management to the application. Coherence and pre-fetch traffic intensity may be too low to be effective or so high that it congests the network.
The network may waste bandwidth carrying transactions along complete input-to-output paths when those transactions will be aborted because they conflict with higher-priority in-progress transactions, or when they are pre-fetch traffic invalidated by branch misprediction. Workload characteristics vary too widely to be satisfied by static allocation, scheduling, and control policies. It is a challenge to write programs that run efficiently on all parallel architectures. Even when the sub-tasks of a task are executed in parallel, the time cost of loading and moving sub-task data and code is still incurred.
Field Programmable Gate Arrays (FPGAs) provide a flexible interconnect. At present, their interconnect design is based almost entirely on space division, which contributes to the excessive chip area usage and power consumption of these devices.