The challenge of modern computing is to build economically efficient chips that incorporate more transistors to meet the goal of achieving Moore's law of doubling performance every two years. The limits of semiconductor technology are affecting this ability to grow in the next few years, as transistors become smaller and chips become bigger and hotter. The semiconductor industry has developed the system on a chip (SoC) as a way to continue high performance chip evolution.
So far, there have been four main ways to construct a high performance semiconductor. First, chips have multiple cores. Second, chips optimize software scheduling. Third, chips utilize efficient memory management. Fourth, chips employ polymorphic computing. To some degree, all of these models evolve from the Von Neumann computer architecture developed after WWII in which a microprocessor's logic component fetches instructions from memory.
The simplest model for increasing chip performance employs multiple processing cores. By multiplying the number of cores by eighty, Intel has created a prototype teraflop chip design. In essence, this architecture uses a parallel computing approach similar to supercomputing parallel computing models. Like some supercomputing applications, this approach is limited to optimizing arithmetic-intensive applications such as modeling.
The Tera-op, Reliable, Intelligently Adaptive Processing System (TRIPS), developed at the University of Texas with funding from DARPA, focuses on software scheduling optimization to produce high performance computing. This model's “push” system uses data availability to fetch instructions, thereby putting additional pressure on the compiler to organize the parallelism in the high speed operating system. There are three levels of concurrency in the TRIPS architecture, including instruction-level parallelism (ILP), thread-level parallelism (TLP) and data-level parallelism (DLP). The TRIPS processor will process numerous instructions simultaneously and map them onto a grid for execution in specific nodes. The grid of execution nodes is reconfigurable to optimize specific applications. Unlike the multi-core model, TRIPS is a uniprocessor model, yet it includes numerous components for parallelization.
The third model is represented by the Cell microprocessor architecture developed jointly by the Sony, Toshiba and IBM (STI) consortium. The Cell architecture uses a novel memory “coherence” architecture in which latency is overcome with a bandwidth priority and in which power usage is balanced with peak computational usage. This model integrates a microprocessor design with coprocessor elements; these eight elements are called “synergistic processor elements” (SPEs). The Cell uses an interconnection bus with four unidirectional data flow rings to connect each of four processors with their SPEs, thereby meeting a teraflop performance objective. Each SPE is capable of 32 GFLOPS of performance in the 65 nm version, which was introduced in 2007.
The MOrphable Networked Micro-ARCHitecture (MONARCH) uses six reduced instruction set computing (RISC) microprocessors, twelve arithmetic clusters and thirty-one memory clusters to achieve 64 GFLOPS of performance with 60 gigabytes per second of memory bandwidth. Designed by Raytheon and USC/ISI with DARPA funding, the MONARCH differs distinctly from other high performance SoCs in that it uses evolvable hardware (EHW) components such as field programmable compute array (FPCA) and smart memory architectures to produce an efficient polymorphic computing platform.
MONARCH combines key elements in the high performance processing system (HPPS) with Data Intensive Architecture (DIVA) Processor in Memory (PIM) technologies to create a unified, flexible, very large scale integrated (VLSI) system. The advantage of this model is that reprogrammability of hardware from one application-specific integrated circuit (ASIC) position to another produces faster response to uncertain changes in the environment. The chip is optimized to be flexible to changing conditions and to maximize power efficiency (3-6 GFLOPS per watt). Specific applications of MONARCH involve embedded computing, such as sensor networks.
These four main high performance SoC models have specific applications for which they are suited. For instance, the multi-core model is optimized for arithmetic applications, while MONARCH is optimized for sensor data analysis. However, all four also have limits.
The multi-core architecture has a problem of synchronization of the parallel micro-processors that conform to a single clocking model. This problem limits their responsiveness to specific types of applications, particularly those that require rapid environmental change. Further, the multi-core architecture requires “thread-aware” software to exploit its parallelism, which is cumbersome and produces quality of service (QoS) problems and inefficiencies.
By emphasizing its compiler, the TRIPS architecture has the problem of optimizing the coordination of scheduling. This bottleneck prevents peak performance over a prolonged period.
The Cell architecture requires constant optimization of its memory management system, which leads to QoS problems.
Finally, MONARCH depends on static intellectual property (IP) cores that are limited to combinations of specified pre-determined ASICs to program its evolvable hardware components. This restriction limits the extent of its flexibility, which was precisely its chief design advantage.
In addition to SoC models, there is a network on a chip (NoC) model, introduced by Arteris in 2007. Targeted to the communications industry, the 45 nm NoC is a form of SoC that uses IP cores in FPGAs for reprogrammable functions and that features low power consumption for embedded computing applications. The chip is optimized for on-chip communications processing. Though targeted at the communications industry, particularly wireless communications, the chip has limits of flexibility that it was designed to overcome, primarily in its deterministic IP core application software.
Various implementations of FPGAs represent reconfigurable computing. The most prominent examples are the Xilinx Virtex-II Pro and Virtex-4 devices that combine one or more microprocessor cores in an FPGA logic fabric. Similarly, the Atmel FPSLIC processor combines an AVR processor with programmable logic architecture. The Atmel microcontroller has the FPGA fabric on the same die to produce a fine-grained reconfigurable device. These hybrid FPGAs and embedded microprocessors represent a generation of system on a programmable chip (SOPC). While these hybrids are architecturally interesting, they possess the limits of each type of design paradigm, with restricted microprocessor performance and restricted deterministic IP core application software. Though they have higher performance than a typical single core microprocessor, they are less flexible than a pure FPGA model.
All of these chip types are two dimensional planar micro system devices. A new generation of three dimensional integrated circuits and components is emerging that is noteworthy as well. The idea of stacking two dimensional chips by sandwiching two or more ICs in a fabrication process required a solution to the problem of creating vertical connections between the layers. IBM solved this problem by developing “through silicon vias” (TSVs), which are vertical connections “etched through the silicon wafer and filled with metal.” This approach of using TSVs to create 3D connections allows the addition of many more pathways between 2D layers. However, this 3D chip approach of stacking existing 2D planar IC layers is generally limited to three or four layers. While TSVs substantially reduce the distance that information traverses, this stacking approach merely evolves the 2D approach to create a static 3D model.
In U.S. Pat. No. 5,111,278, Echelberger describes a 3D multi-chip module system in which layers in an integrated circuit are stacked by using aligned TSVs. This early 3D circuit model represents a simple stacking approach. U.S. Pat. No. 5,426,072 provides a method to manufacture a 3D IC from stacked silicon on insulation (SOI) wafers. U.S. Pat. No. 5,657,537 presents a method of stacking two dimensional circuit modules and U.S. Pat. No. 6,355,501 describes a 3D IC stacking assembly technique.
Recently, 3D stacking models have been developed on chip in which several layers are constructed on a single complementary metal oxide semiconductor (CMOS) die. Some models have combined eight or nine contiguous layers in a single CMOS chip, though this model lacks integrated vertical planes. MIT's microsystems group has created 3D ICs that contain multiple layers and TSVs on a single chip.
3D FPGAs have been created at the University of Minnesota by stacking layers of single planar FPGAs. However, these chips have only adjacent layer connectivity.
3D memory has been developed by Samsung and by BeSang. The Samsung approach stacks eight 2-Gb wafer level processed stack packages (WSPs) using TSVs in order to minimize interconnects between layers and increase information access efficiency. The Samsung TSV method uses tiny lasers to create etching that is later filled in with copper. BeSang combines 3D package level stacking of memory with a logic layer of a chip device using metal bonding.
See also U.S. Pat. No. 5,915,167 for a description of a 3D DRAM stacking technique, U.S. Pat. No. 6,717,222 for a description of a 3D memory IC, U.S. Pat. No. 7,160,761 for a description of a vertically stacked field programmable nonvolatile memory and U.S. Pat. No. 6,501,111 for a description of a 3D programmable memory device.
Finally, in the supercomputing sphere, Cray developed the T3D, a three dimensional supercomputer consisting of 2048 DEC Alpha chips in a torus networking configuration.
In general, all of the 3D chip models merely combine two or more 2D layers. They all represent a simple bonding of current technologies. While planar design chips are easier to make, they are not generally high performance.
Prior systems demonstrate performance limits, programmability limits, multi-functionality limits and logic and memory bottlenecks. There are typically trade-offs of performance and power.
The present invention views the system on a chip as an ecosystem consisting of significant intelligent components. The prior art for intelligence in computing consists of two main paradigms. On the one hand, the view of evolvable hardware (EHW) uses FPGAs as examples. On the other hand, software elements consist of intelligent software agents that exhibit collective behaviors. Both of these hardware and software aspects take inspiration from biological domains.
First, the intelligent SoC borrows from biological concepts of post-initialized reprogrammability that resembles a protein network that responds to its changing environmental conditions. The interoperation of protein networks in cells is a key behavioral paradigm for the SoC. The slowly evolving DNA root structure produces the protein network elements, yet the dynamics of the protein network are interactive with both itself and its environment.
Second, the elements of the SoC resemble the subsystems of a human body. The circulatory system represents the routers, the endocrine system is the memory, the skeletal system is comparable to the interconnects, the nervous system is the autonomic process, the immune system provides defense and security as it does in a body, the eyes and ears are the sensor network and the muscular system is the bandwidth. In this analogy, the brain is the central controller.
For the most part, SoCs require three dimensionality in order to achieve high performance objectives. In addition, SoCs require multiple cores that are reprogrammable so as to maintain flexibility for multiple applications. Such reprogrammability allows the chip to be implemented cost effectively. Reprogrammability, moreover, allows the chip to be updatable and future proof. In some versions, SoCs need to be power efficient for use in embedded mobile devices. Because they will be prominent in embedded devices, they also need to be fault tolerant. By combining the best aspects of deterministic microprocessor elements with indeterministic EHW elements, an intelligent SoC efficiently delivers superior performance.
While the design criteria are necessary, economic efficiency is also required. Computational economics reveals a comparative cost analysis that includes efficiency maximization of (a) power, (b) interconnect metrics, (c) transistor per memory metrics and (d) transistor per logic metrics.
Problems that the System Solves
Optimization problems that the system solves can be divided into two classes: bi-objective optimization problems (BOOPs) and multi-objective optimization problems (MOOPs).
BOOPs consist of trade-offs in semiconductor factors such as (a) energy consumption versus performance, (b) number of transistors versus heat dissipation, (c) interconnect area versus performance and (d) high performance versus low cost.
Regarding MOOPs, the multiple factors include: (a) thermal performance (energy/heat dissipation), (b) energy optimization (low power use), (c) timing performance (various metrics), (d) reconfiguration time (for FPGAs and CPLDs), (e) interconnect length optimization (for energy delay), (f) use of space, (g) bandwidth optimization and (h) cost (manufacture and usability) efficiency. The combination of solutions to trade-offs of multiple problems determines the design of specific semiconductors. The present system presents a set of solutions to these complex optimization problems.
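The trade-offs listed above can be illustrated with a Pareto filter, which keeps only those candidate designs that no other candidate beats on every objective at once. The following is a minimal sketch, not the method of the present system; the objective tuple (power, latency, cost) and the candidate values are hypothetical.

```python
from typing import List, Tuple

Design = Tuple[float, ...]  # every entry is a cost to minimize (hypothetical units)

def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep only the designs that no other design dominates."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other != d)]

# Hypothetical candidates: (power in W, latency in ns, relative cost)
candidates = [(2.0, 10.0, 5.0), (3.0, 8.0, 5.0), (3.0, 12.0, 6.0), (1.5, 15.0, 4.0)]
front = pareto_front(candidates)
```

In this example the third candidate is dropped because the first one is better or equal on all three objectives; the remaining three each win on at least one objective and so represent genuine trade-offs.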
One of the chief problems is to identify ways to limit latency. Latency represents a bottleneck in an integrated circuit when the wait to complete a task slows down the efficiency of the system. Examples of causes of latency include interconnect routing architectures, memory configuration and interface design. Limiting latency problems requires the development of methods for scheduling, anticipation, parallelization, pipeline efficiency and locality-priority processing.
Summary
The present invention features a network on a chip (NoC) in the form of a dynamic 3D intelligent system on a chip (iSoC). Chip network topology, routing architecture and flow dynamics are critical to the performance of the 3D iSoC. The 3D iSoC features novel networking features involving the structure and function of interconnects that markedly improve efficiency relative to other models.
The present invention describes a hybrid network that shares direct and indirect network architectures. In direct networks, each node is connected by a router to each other node, while in indirect networks, each node is connected to a switch which is then connected to other switches which connect to other nodes.
The network model used in the present invention is a hierarchical synthesis of direct and indirect networks. Each octagonal neighborhood has direct point-to-point connections via embedded routers between nodes, while each neighborhood has a switch that connects to both other neighborhoods and to the central core. This configuration benefits the independent operation of each neighborhood as well as the overall operation of the whole chip.
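The hierarchical synthesis of direct and indirect networks can be sketched as a small graph. This is an illustrative model only: the node names, the default of four nodes per neighborhood and the full mesh between switches are assumptions made for the example, not details taken from the disclosure.

```python
from collections import defaultdict, deque
from itertools import combinations

def build_hybrid_network(neighborhoods: int = 8, nodes_per_neighborhood: int = 4):
    """Return an undirected adjacency map for a hypothetical hybrid topology."""
    adj = defaultdict(set)

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    for n in range(neighborhoods):
        members = [f"n{n}.node{i}" for i in range(nodes_per_neighborhood)]
        # Direct part: point-to-point links inside the neighborhood.
        for a, b in combinations(members, 2):
            link(a, b)
        # Indirect part: members reach the rest of the chip through one switch.
        switch = f"n{n}.switch"
        for m in members:
            link(m, switch)
        link(switch, "core")
    # Assumed here: every neighborhood switch links to every other switch.
    for a, b in combinations(range(neighborhoods), 2):
        link(f"n{a}.switch", f"n{b}.switch")
    return adj

def hops(adj, src, dst):
    """Fewest-hop distance between two vertices (breadth-first search)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

net = build_hybrid_network()
```

Under these assumptions, traffic inside a neighborhood takes one hop, traffic to the central core takes two, and traffic between neighborhoods takes three, which is why local work does not load the upper levels of the hierarchy.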
Analogous to the network configuration of the 3D iSoC is the structure and operation of a city. In the center core is a bigger set of larger buildings, while the periphery has multiple independent neighborhoods with smaller buildings. The city has different districts, such as a warehouse district (memory), an industrial district, a shopping district, a wholesale district and so on, that perform specific functions. Overall, the city combines multiple different functions into a complex whole.
Further employing the analogy, the transportation system of the city is critical to the overall operation. Regarding the structure of the transportation system, a large highway will generally encircle a downtown center, with major arteries leading to the suburbs. Workers will travel from their homes to their offices in cars, buses and trains. The people are like individual data sets distributed in data packets of differentiated size (car versus a train). Organization, and reorganization, of traffic flows in the transportation subsystems determine the healthy functioning of the city. If a road is blocked, traffic backs up and a bottleneck is created. In this event, traffic is rerouted around the disruption. Once the roadway is cleared, traffic will resume its ordinary operations.
In the case of the 3D architecture of an intelligent SoC, multiple dimensions of symmetry extend the analogy. The chip's interconnects are the highways, yet in the 3D context they are even more analogous to the symmetrical functioning of the circulatory system of the body.
The present system introduces a hybrid 3D network in a SoC. The network consists of (a) an arc Benes network for rearrangeable intra-neighborhood structure, (b) a 3D Clos network with medium bandwidth to connect neighborhoods with hybrid synchronization, (c) globally asynchronous locally asynchronous (GALA) connections using crossbars to connect the central node to the neighborhoods, (d) a multi-layer mesochronous communications matrix and (e) a double wishbone 2D torus model with highest bandwidth connecting the main quadrants.
The present disclosure describes solutions to problems involving interconnect structure and dynamics and routing constraints in the 3D environment of a ULSI circuit.
Novelties
The SoC is structured as a dynamic network in which clusters of connection nodes produce a variable configuration contingent on system demands. Continuous network optimization is performed by adaptive routing mechanisms. The hybrid networking system is also scalable.
Advantages of the Present System
The system uses efficient interconnect configurations for maximum energy conservation and energy leakage loss minimization. The increased number of symmetrically configured interconnections in a complex SoC with multiple multi-layer nodes also enables faster throughput, which leads to high performance.
The system makes possible polymorphous computing by employing a hybrid network control model that allows for a globally asynchronous locally asynchronous (GALA) hierarchical computing architecture. The clock speeds of the individual octahedron neighborhoods are variable, which leads to modulated, and efficient, power consumption.
Description of the Invention
(I) Intra-SoC Network Architecture
(1) Hybrid Network Fabric Integrating 3D Geodesic Interconnect Topology
The 3D NoC is fundamentally a network of interconnects distributed between logic and memory components. Interconnects embedded in the circuits link the logic circuitry and the memory circuitry to each layer, while the vias connect the circuitry of one layer of a 3D chip to other layers. Interconnects between chips provide the communications capacity for the SoC to operate as a network with a common switching fabric. Because the SoC is three dimensional, it uses a geodesic configuration to connect the various 3D nodes in Euclidean space using x, y and z dimensions.
Like a cube, the 3D aspect of the chip has six facades with eight corners. The eight corners correspond to a neighborhood cluster of circuit nodes, though the composition of each neighborhood cluster is variable. This configuration into neighborhood clusters allows each cluster to behave autonomously while also interacting with other clusters. The cluster configuration allows multiple nodes to operate independently and in parallel with other nodes.
The interconnects that link the nodes in each neighborhood are structured as a 3D geodesic architecture, with xy, yz and xz (top-down, right-left and front-back) directionality in each neighborhood cluster. These interconnects are like multi-lane roads with two-way traffic. The advantage of using two-way interconnects is to maintain efficiency; employing two one-way connections is not an efficient use of space.
In another embodiment of the invention, each node contains an RF wireless transmitter and receiver for broadcasting data to and receiving data from other nodes. Each node uses a separate bandwidth frequency for identification.
(2) 3D Torus Interconnect and Via Network for Layer to Layer Intranode Connection in 3D NoC and Method for Routing Therein
The use of through silicon vias (TSVs) in a 3D circuit allows the layers of each node to be connected. The present invention uses multiple TSV connections between each adjacent layer and multiple TSVs between non-adjacent layers. In particular, TSVs connect tiles of circuitry on a specific layer to tiles on other layers in the multi-layer integrated circuit. One way to organize this model is to use a planar controller on the side of the multi-layer chip that has access to each layer, much like a bank of elevator shafts.
The present system uses a 3D torus interconnect and via network to connect different layers within a multi-layer circuit and to route information from point to point. This model links the top and bottom layers by using an intermediary layer in the middle. The system's use of inter-layer TSVs keeps the vertical paths short and therefore efficient.
In one configuration, alternating memory layers are sandwiched between logic layers. The memory layers have controllers that manage the memory features plus a routing mechanism that routes data to top and bottom layers.
The present system uses a TSV model of inverted broad-based pyramid structures within specific layers. The pyramid structures etch the TSVs in a configuration to connect tiles of specific layers to tiles of adjacent layers. This pattern is reproduced to connect tiles of multiple layers beyond the adjacent layers.
In this model, the interconnects and TSVs connecting the layers in the center of the 3D circuit are more used than peripheral layers. Consequently, the interconnects and TSVs in the central layers have a higher bandwidth than the interconnects and TSVs on the periphery of the chip. Their central location is more strategic and will require increased throughput capability.
(3) Arc Benes Network for Rearrangeable Inter-node Connection in 3D NoC
There are eight neighborhood clusters in an NoC organized in a 3D arc Benes network configuration. The 3D Benes network is a form of fat tree communications architecture that connects nodes with vertical and horizontal interconnects in a geometrical configuration similar to the corner of a box. In the present system, the precise constitution of the set of nodes comprising a neighborhood cluster is variable. While the exterior node in each corner and the interior node in each corner are always included in the network cluster configuration, the addition of adjoining nodes will vary contingent upon a specific application. Generally, a neighborhood cluster configuration will have at least four nodes but may have as many as eight. This ad hoc, flexible and on-demand configuration of a neighborhood cluster provides maximum reprogrammability functioning of the overall NoC so as to optimize operations for various applications.
The 3D arc Benes network model uses the interior node and the exterior node of each SoC cubic configuration corner as shared router nodes. The peripheral nodes in the neighborhood will congregate and readjust into specific clusters around these two nodes.
The nodes in a neighborhood cluster have point to point connections. Since the constitution of neighborhood cluster configurations periodically varies, the point to point connections extend to the potential nodes in adjacent neighborhoods. The 3D arc Benes network represents a hybrid connection architecture that optimizes this point to point interconnection scheme between individual nodes as well as the connection architecture between the individual nodes in a neighborhood and the two corner nodes.
Because the interconnects are two-way, they route data in either direction between nodes simultaneously, thereby maximizing throughput capabilities.
The variable neighborhood configuration is critical in order to maintain adaptability and plasticity in 3D NoC reprogramming behaviors.
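The membership rule described in this section (two fixed corner nodes, a variable set of adjoining nodes, and a total of four to eight nodes) can be expressed as a small data structure. This is a sketch under those stated constraints; the class name, method names and node labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

MIN_NODES, MAX_NODES = 4, 8  # cluster size bounds stated in the text

@dataclass
class NeighborhoodCluster:
    exterior: str                      # exterior corner node, always a member
    interior: str                      # interior corner node, always a member
    adjoining: Set[str] = field(default_factory=set)

    def members(self) -> Set[str]:
        """All nodes currently in the cluster, corner nodes included."""
        return {self.exterior, self.interior} | self.adjoining

    def reconfigure(self, adjoining: Set[str]) -> None:
        """Swap the adjoining nodes while keeping both corner nodes."""
        extra = set(adjoining) - {self.exterior, self.interior}
        size = 2 + len(extra)
        if not MIN_NODES <= size <= MAX_NODES:
            raise ValueError(f"cluster size {size} outside {MIN_NODES}..{MAX_NODES}")
        self.adjoining = extra

cluster = NeighborhoodCluster(exterior="ext0", interior="int0")
cluster.reconfigure({"a", "b"})         # four nodes: the smallest valid cluster
```

A rejected reconfiguration leaves the previous membership untouched, so a cluster never passes through an invalid intermediate state in this model.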
(4) 3D Clos 8-point Internode Interconnect Structure Using Globally Asynchronous Locally Asynchronous (GALA) Method for Hybrid Hierarchical Network in 3D NoC
While the neighborhood cluster architecture is useful to divide functions in a reprogrammable SoC, the challenge of connecting the eight clusters remains. The present system connects the eight neighborhood clusters organized into the corners of a cubic configuration by using a 3D Clos network configuration. The inner nodes of each corner connect each neighborhood cluster to an interior connection matrix that connects both the central master core and the other neighborhood clusters.
Though connected to other parts of the 3D NoC network, each neighborhood cluster operates independently and uses its own adjustable clocking regulatory mechanism. This timing schema provides a locally asynchronous process in which each neighborhood cluster's node clocks are adjustable. This asynchronicity is necessary to accommodate the variable composition of each cluster. Once the set of nodes in each cluster is arranged for a particular application, the clocking for the nodes is harmonized for that application.
The linkage of clocking speeds with the other neighborhoods is also asynchronous because other neighborhoods are continuously modifying their clocking structures and thus their variable clocking synchronicity. The linkage of the neighborhoods with the central master core produces a clocking mechanism that is in perpetual disequilibrium.
The advantage of utilizing the GALA network architecture is the construction of a push-pull functional model. From the “top”, data are pushed to the eight neighborhood clusters. From the “bottom”, data are pulled from the neighborhoods.
Each neighborhood has a multilevel switch used to route traffic flows between neighborhoods and between the neighborhoods and the central core. These switches and their connections in a 3D environment reveal a double butterfly network configuration, a layer of switches that stand in a hierarchy between the neighborhoods and the central core. There is more bandwidth upstream, that is, towards the central core, and, correlatively, relatively less bandwidth downstream in the neighborhoods.
(5) Double Wishbone 2D Torus Ring Network Structure in 3D NoC
The eight switches in the interior of each neighborhood cluster are connected to a two dimensional torus ring that is on the outside of the central master node. The torus ring is structured in a double wishbone configuration that loops around the central node and connects to each neighborhood switch. There are four switches in the 2D torus ring that connect to the eight neighborhood switches that correspond to the facade of each of the four sides of the cube.
In another embodiment of the present invention, two 2D torus rings are organized around an axis. One ring is ordered at the axis of the plane corresponding to the sides of the 3D SoC cube, while the other ring is structured at the axis of a plane at ninety degrees. The two rings meet at the edges in order to exchange information at their conjoining two switches. This model provides fault tolerance capabilities because if one of the rings is disabled, the system is still operational.
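The fault tolerance claim for the two-ring embodiment can be checked with a simple reachability test. The sketch below assumes four switches per ring and two shared switches (named S0 and S1 here); those counts and names are illustrative choices, not figures from the disclosure.

```python
def ring_links(switches):
    """Links of a closed ring over the given switch names."""
    count = len(switches)
    return {frozenset((switches[i], switches[(i + 1) % count])) for i in range(count)}

def reachable(links, start):
    """Set of switches reachable from start over the given links."""
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        for link in links:
            if cur in link:
                (other,) = link - {cur}
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen

# Hypothetical layout: the rings cross at the shared switches S0 and S1.
ring_a = ["S0", "A1", "S1", "A2"]
ring_b = ["S0", "B1", "S1", "B2"]
links_a, links_b = ring_links(ring_a), ring_links(ring_b)

both_up = reachable(links_a | links_b, "A1")   # every switch on both rings
ring_a_down = reachable(links_b, "S0")         # ring A disabled entirely
```

With ring A disabled, every switch on ring B still reaches every other switch on ring B, including the two shared switches, so neighborhoods attached to the surviving ring keep their path to the central node.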
In a further embodiment of the invention, the 2D torus rings are connected to opto-electronic integrated circuits in order to process very high bandwidth communications. The advantage of the opto-electronic circuits integrated into the 3D NoC is maintenance of high bandwidth at the top of the network hierarchy that connects the master node to the neighborhood clusters.
(6) QoS Optimization of 3D NoC Interconnects with Priority Based Model
Quality of service (QoS) techniques are stochastic processes used to ensure high quality solutions to complex networking problems. Specific algorithms, including shortest path and traveling salesman problem (TSP) algorithms, are employed for continuous load balancing. Solving network optimization problems that allocate resources based on changing priorities is a particular challenge of the present system.
Since the present system constantly readjusts its priorities, the speed of data flows is variable. The recalibration of processes according to varied criteria stimulates discontinuous change. The QoS algorithms optimize the modulating network architecture to accommodate plasticity behaviors.
(7) Intranode Multi-Way Router in 3D NoC
In order to maximize efficiency, the present system integrates a router into a layer of each 3D circuit node. Data from each 3D node are sent to the router from interconnects attached to each layer. Data are also received at the intra-nodal router and sent to multiple layers of the 3D IC. In order to accommodate the traffic flows at peak times, the router circuitry has built-in buffers that queue the data packets for orderly traffic flow within the network.
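The buffering behavior described above can be modeled with one bounded queue per destination layer. This is a minimal sketch; the class name, the per-layer queue organization and the one-packet-per-layer-per-cycle forwarding rule are assumptions made for illustration.

```python
from collections import deque

class IntranodeRouter:
    """Hypothetical router that queues packets per destination layer."""

    def __init__(self, layers: int, buffer_depth: int):
        self.buffers = [deque() for _ in range(layers)]
        self.buffer_depth = buffer_depth

    def receive(self, dest_layer: int, packet) -> bool:
        """Queue a packet; return False (backpressure) when that buffer is full."""
        buf = self.buffers[dest_layer]
        if len(buf) >= self.buffer_depth:
            return False
        buf.append(packet)
        return True

    def forward(self):
        """Send at most one packet per layer per cycle, oldest first."""
        return [(layer, buf.popleft())
                for layer, buf in enumerate(self.buffers) if buf]

router = IntranodeRouter(layers=3, buffer_depth=2)
accepted = [router.receive(1, p) for p in ("a", "b", "c")]  # third packet is refused
first_cycle = router.forward()
```

Returning False instead of dropping the packet lets the upstream interconnect hold the data until a slot frees up, which is one simple way to keep traffic orderly at peak times.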
The position of the router within a node depends on the position of that node in the NoC. The routers in opposite corners of the 3D NoC cube therefore appear in different locations of their respective nodes.
Digital routers appear in the center layer of 3D ICs. However, analog components in routers appear on the periphery of 3D ICs because of their noise interference. In these cases, the peripheral layers have shielding to separate these layers from other layers in the 3D node.
(8) Differentiated Bandwidth Interconnects in Hierarchical 3D NoC
The network architecture in the 3D NoC is hierarchical, with higher bandwidth at the top, connecting the neighborhood clusters and the central master node, and relatively lower bandwidth at the bottom, connecting the individual nodes within each neighborhood. The middle layer consists of the eight switch subsystem that connects the top and bottom layers.
The top layer has high bandwidth capabilities, including implementations with opto-electronic switch circuitry. However, the neighborhood clusters generally have far less bandwidth, with interconnects of less width and less capacity among inter-nodal interconnects. The interconnects connecting the nodes to the switch have intermediate capacity, while the interconnects connecting the nodes to each other have smaller capacity. The interconnects and TSVs within a node are far smaller than internodal interconnects.
This network architecture model corresponds to the function of capillaries in natural circulatory systems, or the function of nerves in neural systems, in which the furthest periphery from the core has the narrowest capillaries or neurons.
The logic of the network architecture described here reveals a fail-safe functional advantage because if the pathways of one node are damaged, the system reroutes the network to optimize functionality of the remaining nodes.
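The rerouting behavior can be sketched as a path search that skips damaged nodes. The breadth-first search and the four-node example graph below are illustrative assumptions; the disclosure does not specify which search procedure is used.

```python
from collections import deque

def route(adj, src, dst, failed=frozenset()):
    """Breadth-first search for a fewest-hop path that avoids failed nodes."""
    if src in failed or dst in failed:
        return None
    parents = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        for nxt in adj[cur]:
            if nxt not in parents and nxt not in failed:
                parents[nxt] = cur
                queue.append(nxt)
    return None  # no surviving path

# Hypothetical four-node neighborhood wired as a square: A-B-D and A-C-D.
adj = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
normal = route(adj, "A", "D")
rerouted = route(adj, "A", "D", failed={"B"})
```

When node B is marked as failed, the search finds the alternative path through C; only when both intermediate nodes fail does the function report that no route remains.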
(9) Adaptive Routing in Hierarchical 3D SoC Using Shortest Path Optimization for Load Balancing
The present invention uses optimization algorithms to solve MOOPs involving data routing within the neighborhoods and between nodes in the whole network. The system uses hybrid adaptive routing algorithms to accommodate the two main levels of network control. For the intra-neighborhood routing protocols, the network uses minimally adaptive routing techniques, while the system uses fully adaptive techniques for global routing.
Adaptive routing approaches develop strategies that seek to avoid bottlenecks. In order to perform this network flow routing objective, the pathways used by the flow control process are optimized according to shifting priorities.
In the context of the 3D NoC, the system simultaneously optimizes the global routing with the local routing within the neighborhoods in order to maximize load balancing in a continuously recalibrating mechanism.
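One way to combine shortest path optimization with load balancing is to run Dijkstra's algorithm over link costs that grow with current load, so that a congested link looks longer. The sketch below assumes a cost of one hop plus the reported load on each link; that cost function and the example loads are hypothetical, and the sketch does not distinguish between the minimally and fully adaptive modes.

```python
import heapq

def least_loaded_path(links, src, dst):
    """Dijkstra's algorithm where each link costs 1 + its current load.

    links maps node -> {neighbor: load}, with load >= 0.
    """
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, load in links[node].items():
            new_cost = cost + 1.0 + load
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical loads: the A->B link is congested, so traffic goes through C.
links = {"A": {"B": 3.0, "C": 0.0}, "B": {"D": 0.0}, "C": {"D": 0.5}, "D": {}}
before = least_loaded_path(links, "A", "D")
links["C"]["D"] = 5.0   # congestion shifts to the C->D link
after = least_loaded_path(links, "A", "D")
```

Re-running the search after the loads change moves the route from the C branch to the B branch, which mirrors the continuously recalibrating behavior described above.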
(10) Scheduled Routing in 3D NoC Using Buffer Modulation, Metadata and Time Multiplexing
A master scheduler in the central core controls data traffic. Other schedulers are positioned in the eight neighborhood switches in order to control intra-neighborhood data traffic flows. The schedulers assign routing priorities in the queues of the switches. The scheduling mechanisms use time-multiplexing to organize the logic of data flows.
A request for a flit is sent by a receiving router to a sending router. The receiving router then schedules traffic flow between the two routers. The two routers exchange credits and debits with the various requests for flit flows. Since the data flow exchange rate is typically variable, the data are buffered in the routers to modulate the flow control process.
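The credit and debit exchange between two routers can be sketched as a conventional credit-based flow control loop, in which the sender is debited one credit per flit and the receiver returns a credit each time a buffer slot is freed. The class names, the buffer size of two flits and the method names below are assumptions for illustration only.

```python
from collections import deque

class ReceivingRouter:
    """Receiver with a bounded buffer; each free slot corresponds to one credit."""
    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots
        self.buffer = deque()

    def accept(self, flit):
        if len(self.buffer) >= self.buffer_slots:
            raise OverflowError("flit sent without a credit")
        self.buffer.append(flit)

    def drain(self):
        """Forward one buffered flit downstream; returns it, or None if empty."""
        return self.buffer.popleft() if self.buffer else None

class SendingRouter:
    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = receiver.buffer_slots  # initial credits = free slots
        self.pending = deque()

    def send_ready(self):
        """Send pending flits while credits remain; returns the number sent."""
        sent = 0
        while self.pending and self.credits > 0:
            self.receiver.accept(self.pending.popleft())
            self.credits -= 1                 # debit one credit per flit
            sent += 1
        return sent

    def return_credit(self):
        self.credits += 1                     # receiver freed one slot

receiver = ReceivingRouter(buffer_slots=2)
sender = SendingRouter(receiver)
sender.pending.extend(["f0", "f1", "f2", "f3"])

first = sender.send_ready()    # fills the two buffer slots
drained = receiver.drain()     # receiver forwards one flit downstream
sender.return_credit()         # and hands a credit back to the sender
second = sender.send_ready()   # exactly one more flit can now move
```

The buffer absorbs the difference between the sending and draining rates, which is the modulation role the routers' buffers play in the description above.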
Meta-tags are used to mark each data packet. Metadata for each task are read at each location in the flow control path and routed to the appropriate destination. Use of meta-tags is an efficient method of directing traffic to available nodes.
Arbitration scheduling in each switch is used to minimize latency by employing modulating queuing buffers. The scheduled routing of data flows is constantly modified as the demands on the system change and as the available resource supplies vary. In effect, each reservation of scheduling is a temporary reservation that is constantly updated to accommodate the load balancing of the global network.
Scheduling flow control with meta-tags prevents resource conflicts and thereby maximizes network flow efficiency.
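As a minimal sketch of meta-tagged, time-multiplexed scheduling in a single switch, the example below reads a priority and destination from each packet's meta-tag and assigns at most one packet to each time slot of a frame, so two packets never contend for the same slot. The `Packet` fields, the `SwitchScheduler` class and the frame length of two slots are hypothetical choices, not details from the specification.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

@dataclass(order=True)
class Packet:
    priority: int                         # meta-tag: lower value is served first
    sequence: int                         # arrival order, breaks priority ties
    destination: str = field(compare=False)
    payload: str = field(compare=False)

class SwitchScheduler:
    def __init__(self, slots_per_frame):
        self.slots_per_frame = slots_per_frame
        self.queue = []                   # heap ordered by (priority, sequence)
        self._arrivals = count()

    def enqueue(self, priority, destination, payload):
        packet = Packet(priority, next(self._arrivals), destination, payload)
        heapq.heappush(self.queue, packet)

    def schedule_frame(self):
        """Reserve one time slot per packet for the next frame, by priority."""
        frame = []
        for slot in range(self.slots_per_frame):
            if not self.queue:
                break
            packet = heapq.heappop(self.queue)
            frame.append((slot, packet.destination, packet.payload))
        return frame

scheduler = SwitchScheduler(slots_per_frame=2)
scheduler.enqueue(2, "node3", "bulk")
scheduler.enqueue(0, "node1", "control")
scheduler.enqueue(1, "node2", "stream")
first_frame = scheduler.schedule_frame()

scheduler.enqueue(0, "node4", "urgent")   # demand changes between frames
second_frame = scheduler.schedule_frame()
```

Because the reservations are recomputed for every frame, the late-arriving `urgent` packet is served ahead of the older `bulk` packet, which mirrors the temporary, constantly updated reservations described above.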
(11) Opto-electronic Switch for Variable Mesochronous Clock Synchronization in 3D NoC
Mesochronous clock synchronization refers to intermediation between the top and bottom layers in a hierarchical network. In the context of the 3D NoC, the clocks in the neighborhood node clusters use variable timing to modulate their operations to accommodate on-demand functions. The operation of the globally asynchronous locally asynchronous (GALA) model that connects the elements in the network hierarchy occurs in the middle layer. This mesochronous clock synchronization is constantly modulating between the top and bottom layers.
The multilayer mesochronous communication matrix controls the clocking between the top and bottom layers in the 3D NoC by using opto-electronic switches in the 2D torus ring(s). The integrated optical circuit uses an opto-coupler to interact with other optical circuits at the middle layer. The optical network connector transmits and receives data using a light emitting device and a light receiving device. Data are converted into pulses of light in the opto-electronic switch, while a tiny laser-on-chip transmits the light pulses to other opto-electronic switches. The optical signal is adjusted by using on-board amplifiers and attenuators, while a micro-mirror assembly is used to deflect the optical signal.
Though they are high bandwidth devices, the optical switches represent a bottleneck in the network at the input and output sections. The conversion and de-conversion of photonic signals are made by sections of the opto-electronic switch that contain substantial caching.
The combined opto-electronic switches in the 2D torus ring(s) comprise a photonic grid that connects high bandwidth switches with low power interconnects.
In another embodiment of the system, electronic circuitry is used in the switches in the 2D torus ring(s).
Performance Specifications
Bandwidth Specifications in the 3D NoC 
| Specification | Economy | Standard | High Performance |
|---|---|---|---|
| Address Space for each Node | 128 Gb | 256 Gb | 256 Gb |
| Channel Width at Connecting Nodes | 256 bit wide channels | 512 bit wide channels | 512 bit wide channels |
| Bandwidth Throughput per Node | 4 Gb/s | 8 Gb/s | 16 Gb/s |
| Throughput at 2D torus | 80 Gb/s | 160 Gb/s | 320 Gb/s |
| Total System Bandwidth | 140 Gb/s | 280 Gb/s | 560 Gb/s |
(II) 3D SoC Multi-chip Network Architecture
The 3D SoC networks with other chips to create a scalable high performance computing system. In this sense, the SoC is treated as a node in a computer network. The chip is designed to easily network with other SoCs for macro parallelization.
The present invention organizes networked SoCs to have external access to internal neighborhood clusters and nodes for inter-operational behaviors in a multi-extensive processing environment. In this high performance environment, rather than have eight neighborhoods and a central master node, the system has many neighborhood clusters operating autonomously yet interactively in the larger scalable system.
The present invention describes solutions to problems involving the integration and optimization of cubic SoCs in a parallel scalable computing environment.
(1) Method for Stacking 3D SoCs Using a Cubic Junction Network Connection
One of the advantages of using the cubic configuration of the 3D SoC package is that it may be stacked in larger computing systems.
The 3D SoC employs a cubic junction for networking with other SoCs. The junction connects nine lanes in a parallel switch within an SoC to other SoCs. These nine lanes connect to the eight neighborhood clusters and the central master node. This junction switch splits high bandwidth into nine pipelines in order to directly connect with each neighborhood and the master node. The system uses a fat tree interconnection configuration.
This model allows the direct access of various SoC neighborhoods to other SoC neighborhoods.
The high performance networking system uses optical transceiver circuitry at each neighborhood switch to control external traffic.
In another embodiment of the invention, each SoC neighborhood connects wirelessly to other SoCs using high-bandwidth RF technology.
(2) System and Method for Linking Multiple 3D SoC Nodes in Point-to-point Internodal Network
The system is further organized in a network configuration that allows internodal communications. With hundreds of SoCs, thousands of nodes are organized as autonomous units. By numbering the nodes and organizing reconfigurable clusters of 3D circuits, the system accesses various nodes in a point-to-point configuration.
The nine-pipeline junction connecting the chips allows access to each chip's double-wishbone 2D torus ring(s), which feed directly to specific neighborhood clusters. These pipelines then access individual nodes in each neighborhood. This point-to-point network connection configuration allows a more accessible strategy than nearest neighbor connection configurations.
The advantage of this connection configuration model is that it allows the direct interoperation of nodes beyond the neighborhood level. This is important particularly because the neighborhood configurations periodically change.
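The node numbering that makes point-to-point access possible can be pictured as a three-part address (chip, junction lane, node) flattened into a single system-wide number. The sketch below is only one possible scheme: the value of `NODES_PER_LANE`, the convention that lane 8 addresses the master node, and the names `NodeAddress`, `encode`, `decode` and `delivery_path` are assumptions, since the specification does not fix them.

```python
from typing import NamedTuple

LANES_PER_SOC = 9    # from the text: eight neighborhood clusters plus the master node
NODES_PER_LANE = 4   # hypothetical value; the text does not fix nodes per neighborhood
MASTER_LANE = 8      # assumed numbering: lanes 0-7 are clusters, lane 8 is the master

class NodeAddress(NamedTuple):
    soc: int         # which chip in the multi-chip system
    lane: int        # junction lane leading to a cluster or to the master node
    node: int        # node index within that cluster

def encode(address):
    """Flatten a (soc, lane, node) address into one system-wide node number."""
    if not 0 <= address.lane < LANES_PER_SOC:
        raise ValueError("lane out of range")
    if not 0 <= address.node < NODES_PER_LANE:
        raise ValueError("node out of range")
    return (address.soc * LANES_PER_SOC + address.lane) * NODES_PER_LANE + address.node

def decode(number):
    """Recover the (soc, lane, node) address from a system-wide node number."""
    rest, node = divmod(number, NODES_PER_LANE)
    soc, lane = divmod(rest, LANES_PER_SOC)
    return NodeAddress(soc, lane, node)

def delivery_path(target):
    """Stages an incoming transfer passes through on the destination chip."""
    if target.lane == MASTER_LANE:
        cluster = "master node"
    else:
        cluster = f"neighborhood {target.lane}"
    return [f"junction lane {target.lane}", "2D torus ring", cluster, f"node {target.node}"]

target = NodeAddress(soc=2, lane=5, node=3)
number = encode(target)
stages = delivery_path(decode(number))
```

Because the number identifies a physical position rather than a neighborhood membership, it stays valid when neighborhoods are reconfigured, provided a separate table (not shown) maps logical clusters to physical nodes.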
(3) System and Method for Autonomous Organization of 3D SoC in Scalable Parallel Computer Networks
The 3D SoC is an intelligent chip because it contains reconfigurable, reprogrammable and auto-programmable features. When combined in highly parallel network systems, the SoCs' programmability features produce a highly adaptive computer system capable of multi-petaflop performance.
Performance Specifications: High Performance Computing
| Specification | Economy | Standard | High Performance |
|---|---|---|---|
| Number of iSoCs in cube | 8 × 8 × 8 = 512 | 12 × 12 × 12 = 1728 | 24 × 24 × 24 = 13,824 |
| Number of cubes | 8 | 8 | 8 |
| Total number of iSoCs in System | 4096 | 13,824 | 110,592 |
| Total number of Nodes | 143,360 | 967,680 | 7,741,440 |
| 1.35 TFlop/s per Chip | 5.53 PFlop/s | 18.66 PFlop/s | 149.3 PFlop/s |
| 2 TFlop/s per Chip | 8.192 PFlop/s | 27.648 PFlop/s | 221 PFlop/s |
| 10 TFlop/s per Chip | 40.96 PFlop/s | 138.24 PFlop/s | 1 ExaFlop/s |
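The arithmetic behind the high performance computing figures can be checked with a short calculation: the number of iSoCs is the cube edge raised to the third power times the number of cubes, and the aggregate peak rate is the number of iSoCs times the per-chip rate. The nodes-per-iSoC values used below (35 for Economy, 70 for Standard and High Performance) are not stated in the specification; they are inferred here by dividing the tabulated node totals by the iSoC totals, and the function names are hypothetical.

```python
CUBES = 8  # number of cubes in every configuration, per the table

# Cube edge length is from the table; nodes_per_isoc is inferred by division
# (total nodes / total iSoCs) and is an assumption of this sketch.
CONFIGS = {
    "Economy": {"edge": 8, "nodes_per_isoc": 35},
    "Standard": {"edge": 12, "nodes_per_isoc": 70},
    "High Performance": {"edge": 24, "nodes_per_isoc": 70},
}

def total_isocs(edge, cubes=CUBES):
    """iSoCs in the system: edge^3 chips per cube times the number of cubes."""
    return edge ** 3 * cubes

def total_nodes(edge, nodes_per_isoc, cubes=CUBES):
    return total_isocs(edge, cubes) * nodes_per_isoc

def peak_pflops(edge, tflops_per_chip, cubes=CUBES):
    """Aggregate peak rate in PFlop/s, assuming perfect scaling across chips."""
    return total_isocs(edge, cubes) * tflops_per_chip / 1000.0

for name, cfg in CONFIGS.items():
    isocs = total_isocs(cfg["edge"])
    nodes = total_nodes(cfg["edge"], cfg["nodes_per_isoc"])
    rates = [peak_pflops(cfg["edge"], t) for t in (1.35, 2.0, 10.0)]
    print(name, isocs, nodes, [round(r, 3) for r in rates])
```

These are peak figures that assume every chip sustains its rated throughput with no communication overhead; the High Performance configuration at 10 TFlop/s per chip works out to roughly 1,106 PFlop/s, which the table rounds to 1 ExaFlop/s.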
Although the invention has been shown and described with respect to a certain embodiment or embodiments, it is obvious that equivalent alterations and modifications will occur to others skilled in the art upon the reading and understanding of this specification and the annexed drawings. In particular regard to the various functions performed by the above described elements (components, assemblies, devices, compositions, etc.) the terms (including a reference to a “means”) used to describe such elements are intended to correspond, unless otherwise indicated, to any element that performs the specified function of the described element (i.e., that is functionally equivalent), even though not structurally equivalent to the disclosed structure that performs the function in the herein illustrated exemplary embodiment or embodiments of the invention. In addition, while a particular feature of the invention may have been described above with respect to only one or more of several illustrated embodiments, such feature may be combined with one or more other features of the other embodiments, as may be desired and advantageous for any given or particular application.
Acronyms
3D, three dimensional
ASIC, application specific integrated circuit
BOOP, bi-objective optimization problem
CMOS, complementary metal oxide semiconductor
CPLD, complex programmable logic device
D-EDA, dynamic electronic design automation
DIVA, data intensive architecture
DLP, data level parallelism
EDA, electronic design automation
EHW, evolvable hardware
eMOOP, evolvable multi-objective optimization problem
Flops, floating point operations per second
FPCA, field programmable compute array
FPGA, field programmable gate array
GALA, globally asynchronous locally asynchronous
HPPS, high performance processing system
ILP, instruction level parallelism
IP, intellectual property
iSoC, intelligent system on a chip
MEMS, micro electro mechanical system
MONARCH, morphable networked micro-architecture
MOOP, multi-objective optimization problem
MPSOC, multi-processor system on a chip
NEMS, nano electro mechanical system
NoC, network on a chip
PCA, polymorphous computing architecture
PIM, processor in memory
QoS, quality of service
RISC, reduced instruction set computing
SCOC, supercomputer on a chip
SoC, system on a chip
SOI, silicon on insulator
SOPC, system on a programmable chip
SPE, synergistic processor element
TLP, thread level parallelism
TRIPS, Tera-op reliable intelligently adaptive processing system
TSV, through silicon via
ULSI, ultra large scale integration
VLSI, very large scale integration
WSPS, wafer level processed stack packages