Integrated circuit complexity is growing at an accelerating pace, driven by increases in density, such as the number of transistors that can be fabricated in a given area. This has led to chips of enormous complexity, known in the industry as Systems on a Chip (SoCs). In this type of device, many functional units are incorporated onto one chip. The possible functional configurations are virtually limitless, but such chips are most commonly used for computing systems, embedded systems, and networking. For example, a traditional computer uses many separate chips for I/O control, instruction processing, and many other functions; in an SoC, these functions become endpoints on a single chip. These endpoints are then connected by a network for communication between the endpoints. Historically, bus-type systems were used for this communication. However, with a bus, only one pair of endpoints can communicate at a time. In addition, with very narrow wires and very long distances between endpoints, capacitive and resistive loading creates timing and power consumption problems.
A more detailed explanation of these issues is found in Keckler, et al. “GPUs and the Future of Parallel Computing,” IEEE Micro (September/October 2011) (http://www.cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/ieee-micro-echelon.pdf), which is hereby incorporated by reference into this specification in its entirety. Keckler et al. note that:
        From 2010 to 2017, active components will scale by about 1/16 in area as line widths scale from 40 nm to 10 nm, while wire energy will scale by only ½.
        With the scaling projections to 10 nm, the ratios between DFMA [double-precision fused-multiply add], on-chip SRAM, and off-chip DRAM access energy stay relatively constant.
        However, the relative energy cost of 10-mm global wires goes up to 23 times the DFMA energy because wire capacitance per square mm remains approximately constant across process generations.
        Because communication dominates energy, both within the chip and across the external memory interface, energy-efficient architectures must decrease the amount of state changes per instruction and must exploit locality to reduce the distance data must move.
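The 1/16 area figure in the quoted passage follows from simple geometry, since area scales with the square of the line width. The short calculation below is only that arithmetic, set next to the quoted 1/2 wire-energy factor. The symbols A and E_wire are labels introduced here for illustration and do not appear in Keckler et al.

```latex
% Area scaling when line width shrinks from 40 nm to 10 nm
\frac{A_{10\,\mathrm{nm}}}{A_{40\,\mathrm{nm}}}
  = \left(\frac{10\ \mathrm{nm}}{40\ \mathrm{nm}}\right)^{2}
  = \frac{1}{16}

% Wire energy scaling over the same period, as quoted
\frac{E_{\mathrm{wire},\,10\,\mathrm{nm}}}{E_{\mathrm{wire},\,40\,\mathrm{nm}}}
  \approx \frac{1}{2}
```

Active components therefore shrink far faster than wire energy does, which is why the cost of moving data comes to dominate.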
To address this problem, SoC designers have developed sophisticated networks-on-chip (NoCs) to connect these components more efficiently than point-to-point busses and simple routers. A key advantage of NoCs is that they facilitate quality of service (QoS) guarantees. With a QoS guarantee, the NoC is configured such that each data flow receives at least the bandwidth provisioned for it.
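The idea of a bandwidth guarantee can be illustrated with a minimal sketch. It assumes a single shared link and a simple two-pass policy: each flow is first granted up to its provisioned bandwidth, and any spare capacity is then handed out in name order. The function names, the flow names, and the policy itself are hypothetical and chosen only for illustration; they do not describe any particular NoC.

```python
from typing import Dict


def can_admit(link_capacity: int, provisioned: Dict[str, int]) -> bool:
    """Return True if all provisioned bandwidths fit on the link together."""
    return sum(provisioned.values()) <= link_capacity


def allocate(link_capacity: int,
             provisioned: Dict[str, int],
             demand: Dict[str, int]) -> Dict[str, int]:
    """Grant bandwidth so that no flow gets less than min(demand, provisioned).

    Hypothetical policy: guaranteed shares first, then spare capacity
    is distributed to flows that still want more, in name order.
    """
    if not can_admit(link_capacity, provisioned):
        raise ValueError("provisioned bandwidth exceeds link capacity")

    # Pass 1: every flow receives its demand, capped at its provisioned share.
    grant = {f: min(demand.get(f, 0), provisioned[f]) for f in provisioned}
    spare = link_capacity - sum(grant.values())

    # Pass 2: leftover capacity goes to flows whose demand is not yet met.
    for f in sorted(provisioned):
        extra = min(spare, demand.get(f, 0) - grant[f])
        if extra > 0:
            grant[f] += extra
            spare -= extra
    return grant


if __name__ == "__main__":
    provisioned = {"cpu": 4, "dma": 3, "video": 3}   # illustrative units
    demand = {"cpu": 8, "dma": 1, "video": 3}
    print(allocate(10, provisioned, demand))
```

In this sketch a flow that asks for less than its provisioned share frees capacity for others, but a flow that asks for its full share is never squeezed below it, which is the property the QoS guarantee describes.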
However, designing NoCs presents several challenges, among them:
        a. NoC interconnect energy does not scale well compared to compute energy and will increasingly become a bottleneck for SoC cost.
        b. NoC interconnect bandwidth must be controlled effectively in order to support real-time constraints without over-engineering.
        c. Traditional techniques waste bandwidth with over-engineering, without the ability to support isolation and meet real-time constraints.
Therefore, it is important to develop an Interconnect that can be incorporated onto an SoC and between SoCs, and that does not create QoS bottlenecks, is energy efficient, and exploits locality as much as possible.
Another issue with bus-type systems involves managing data transport, such as for message-passing. For an SoC having multiple cores, conventional inter-process communication (IPC) approaches require a core to manage data movement and synchronization, which is inefficient because core cycles are expensive in terms of gate count and power. For example, a core may need to configure a one-sided direct memory access (DMA) transfer to move data, which incurs long latency because the core needs to have knowledge of the entire SoC address space. Such a one-sided DMA can read from any memory location in the entire SoC space and write to any memory location in the entire SoC space, which is complicated and error-prone, leading to long test times and low reliability. Conventionally, between ten and twenty percent of a core's cycles may be used for moving data, allocating buffers, and synchronization. To address this problem, an Autonomic Transport Block can be developed that offloads this work from the cores and speeds up message-passing, synchronization, and task scheduling.