The task of an IC (Integrated Circuit) designer is to translate a specification of an integrated circuit into an implementation, such that all requirements are satisfied and all design objectives are optimised.
IC design can also be described more formally as follows. The specification of a system is described in a language Lspecification, which expresses the system's functionality, requirements and design objectives. Typically, this language is a combination of plain English, high-level programming languages and mathematical formulae. Further, a design language Ldesign is provided, whose primitive design elements correspond to existing (or automatically generated) implementations and whose constructs correspond to well-defined interactions between design elements. Examples of design languages are Register Transfer Level (RTL) languages such as VHDL or Verilog. Some aspects of VHDL are described, for instance, in "VHDL: Coding and Logic Synthesis with Synopsys", Weng Fook Lee, Academic Press, 2000. A distinctive feature of a design language is that descriptions written in that language can be translated by a highly automated design flow into an implementation, e.g. into a netlist. In this sense, VHDL per se does not qualify as a design language; only the synthesizable subset of VHDL does. IC design can thus be defined as the process of describing an implementation, using Ldesign, such that this description is consistent with the description of the system specification in Lspecification:

    Ldesign(implementation) = Lspecification(system)
The cost of designing is primarily determined by the semantic content of the specification (also referred to as the complexity of the system) and the semantic gap between the specification language Lspecification and the design language Ldesign. Because of the progress in VLSI (Very Large Scale Integration) technology, there are strong economic arguments to integrate more functionality onto a single device. As a result, the semantic content of the specification grows continuously. However, due to the limitations of the designer and design tools, there is a limit to the content of the specification for which the semantic gap can be bridged at reasonable cost. Consequently, if the semantics of the design language remain constant, then progress in VLSI technology will inevitably lead to a design crisis. Design crises have occurred several times, and history has shown that the proper response to a design crisis is to increase the semantics of the design language, such that the gap narrows.
Each new design language has led to a reduction of the design cost and enabled a further growth of the complexity of the system that could be designed at reasonable cost.
At present, IC designers are again confronted with a design crisis. The state-of-the-art design methodology is rapidly becoming inadequate to handle the design challenges of System-on-Chip (SoC) products. SoC products are integrated circuits dedicated to a specific application, which contain a computing engine (such as a microprocessor core, a DSP core, an MPEG core, etc.), memory and logic on a single chip. SoCs drive the growth of applications such as digital cell phones, digital set-top boxes, video games, DVD players, disk drives and workstations, to name but a few.
A current design flow is shown in FIG. 1. A hardware (HW) specification of a system is translated, e.g. using VHDL, into an RT Level model, which is then simulated or co-simulated, e.g. again using VHDL, to verify the functional and structural correctness thereof, so as to obtain a verified RTL model. This verified RTL model is used to generate a netlist, which contains all devices, analysis commands and options, and test vectors, which are used by an ASIC foundry to create an ASIC. Measurements can then be carried out on the implemented ASICs, and if errors are noticed, a device re-spin has to be done.
The shortcomings of the current design flow are the following:
- The design productivity of an RTL based design flow incurs unacceptable design cost and time-to-market. For example, present state-of-the-art VLSI technology (e.g. TSMC 0.18 μm) has an integration density of 80,000 gates/mm2. A die of 100 mm2 thus has a capacity of 8 million gates. Even if it is assumed that the design productivity is 1000 gates/person-day (which is very competitive), the design would require 8000 person-days, or more than 36 man-years.
- Simulations at the RT level are too slow for adequate pre-manufacturing verification. The number of cycles that can be simulated per second (the simulation speed) decreases because the system complexity increases the amount of computation per cycle. In addition, the number of cycles that must be simulated for sufficient verification coverage also increases because of the increased system complexity. These two factors make it virtually impossible to achieve first-time-right designs with an RTL based approach to SoC design, leading to expensive and time-consuming device re-spins.
- To boost design productivity, previously designed units need to be reused. However, reuse of design units is seldom possible as-is. Often modifications are required (e.g. because of clocking or test schemes, because the architecture is not appropriate for the latest VLSI technology, because the interface has to be modified, etc.), implying that the complete verification has to be repeated.
- SoC architectures are increasingly dominated by RISC (Reduced Instruction Set Computer) and DSP (Digital Signal Processor) cores, with embedded software representing perhaps 50-90% of the functionality. However, the RTL-based design flow does not address this issue. Hardware and software developments are decoupled activities. The only link is a co-simulation of the software at the Instruction Set Simulation (ISS) level and the hardware at the RT level. Both levels are too low to enable the simulation speed required for sufficient verification coverage.
- Logic synthesis performs netlist optimisations based on area and performance estimates of design options. However, with deep sub-micron technologies, these estimates are becoming less accurate because the actual performance depends to a large extent on the detailed placement and routing, which is not yet available during synthesis. This means that the actual performance after placement and routing can differ substantially from the estimates made by logic synthesis. A large number of synthesis/placement & routing iterations may be needed before an implementation is found that matches the performance requirements.
A design crisis as mentioned above is often attributed to the growing gap between design complexity and design productivity. This is, however, an oversimplification of the problem. The gap between design complexity and design productivity is not the cause of the design crisis, but merely a symptom of the semantic gap between the specification and design language. A solution can be found in raising the semantic level of the design language.
The basic idea in raising the semantic level of the design language is to use threads as the primitive design element of Ldesign. Threads use a von Neumann computational model: their behavior is described as a sequence of instructions that modify variables. Variables correspond to addresses in memory, according to a mapping defined by a compiler. A thread is a sequence of instructions with a single locus of control; i.e., when executing a single thread, only one program counter is required, which points to the currently active instruction. Multi-threaded programs have multiple loci of control, implying parallelism.
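The computational model just described can be made concrete with a toy interpreter: variables live at named locations in a memory, and a single program counter steps through instructions that modify them. This is only an illustrative sketch; the instruction format and variable names are invented for the example.

```python
# A toy von Neumann thread: one program counter stepping through
# instructions that modify variables held in a memory.
memory = {"a": 2, "b": 3, "c": 0}

# Each instruction is (dest, src1, src2), meaning dest = src1 + src2.
program = [("c", "a", "b"),   # c = a + b
           ("a", "c", "c")]   # a = c + c

pc = 0  # the single locus of control
while pc < len(program):
    dest, s1, s2 = program[pc]
    memory[dest] = memory[s1] + memory[s2]
    pc += 1  # sequential execution: exactly one active instruction

print(memory)  # {'a': 10, 'b': 3, 'c': 5}
```

A multi-threaded program would simply hold several such program counters over a shared memory, which is where the potential parallelism comes from.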
With threads as primitive design elements, the design process is equivalent to the creation of a multi-threaded description that contains sufficient parallelism, such that the specified functionality can be implemented with the required performance at minimal cost.
The semantic level of design languages based on threads is considered higher than the level of RTL design languages for the following reasons:

Firstly, the primitive design element of RTL languages, such as VHDL or Verilog, is a clocked process. A clocked process describes behavior as a sequence of instructions that modify signals. Signals correspond to registers. The signals contain the state of the system. As the size of the system grows, its state grows. With current VLSI technology, large amounts of state are preferably stored in memory and not in registers. RTL languages are not well suited to describe operations on a state that is stored in memory; because of their computational model, threads are better suited. For example, adding two variables can be done with a single instruction, whereas an RTL description requires a Finite-State Machine (FSM) that first fetches the operands, then performs the addition and finally stores the result in a memory.

Secondly, threads are better suited to control the parallelism of a design. RTL descriptions imply a maximally parallel implementation. For example, the statements

    if (Clk'event and Clk='1') then
      c <= a + b;
      f <= d + e;
    end if;

inside a clocked process imply two additions executing in parallel. This property makes it difficult to trade performance for cost. Suppose results c and f are not required simultaneously (e.g. because they are stored in a memory); a single adder would then be sufficient to implement the equations above. However, this is not easily described in RTL design languages. Threads do not imply maximal parallelism. For example, the statements

    thread_1:
      c = a + b;
      f = d + e;

mean that first c is calculated and then f. Since there are no data dependencies, the compiler may decide to execute these statements in parallel anyway (e.g. by using an Arithmetic Logic Unit (ALU) and an Address Calculation Unit (ACU)). A thread does not imply parallelism, but may still contain fine-grain parallelism that can be exploited by a clever compiler. Moreover, a designer can create parallelism by forking a single thread into multiple threads:

    thread_1:
      c = a + b;

    thread_2:
      f = d + e;

Depending on the performance requirements, the compiler may decide to execute the threads in parallel on two separate CPUs (Central Processing Units), or concurrently, one thread after the other on a single CPU.
So, while RTL descriptions imply parallelism, multi-threaded descriptions contain parallelism that can be, but need not be, exploited by the compiler. Multi-threaded descriptions are therefore to a large extent architecture independent, while RT level descriptions are not.

Thirdly, on-chip performance outpaces off-chip performance. For example, in 1989 the Intel 486 was clocked at 25 MHz and in 1995 the Intel Pentium Pro was clocked at 150 MHz, while the performance of PCB (Printed Circuit Board) technology remained basically unchanged. Although off-chip bandwidth can be bought (by increasing the number of pins), external data access latency will eventually become the bottleneck. This means that eventually the multiplexing factor of hardware units can increase. RTL languages do not handle this type of reduced parallelism very well.
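The forked two-thread form discussed above can be sketched with ordinary OS threads. The snippet below is a minimal illustration (the variable values are invented): whether the two threads actually run in parallel on two CPUs or one after the other on a single CPU is left entirely to the runtime scheduler, which is exactly the architecture independence claimed for multi-threaded descriptions.

```python
import threading

a, b, d, e = 1, 2, 3, 4
results = {}

def thread_1():
    results["c"] = a + b  # no dependency on thread_2

def thread_2():
    results["f"] = d + e  # no dependency on thread_1

# Forking creates a second locus of control; the scheduler is free to
# run the threads in parallel (two CPUs) or concurrently (one CPU).
t1 = threading.Thread(target=thread_1)
t2 = threading.Thread(target=thread_2)
t1.start(); t2.start()
t1.join(); t2.join()

print(results)  # {'c': 3, 'f': 7} (key order may vary)
```

Note that the description itself fixes only the data dependencies, not the degree of parallelism, in contrast with the clocked RTL process.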
RTL languages are well suited for descriptions of implementations with maximal parallelism, while multi-threaded descriptions cover the remaining part of the spectrum, as shown in FIG. 2. In this respect, both languages are complementary. Systems with high bandwidth requirements are likely to use both. Front-end processing is preferably described with RTL, while the remaining functionality can be described with threads. As VLSI technology improves, functions will gradually shift from right to left in FIG. 2: threads can be merged because the processors get faster and RTL functions can be moved to threads. In that respect, multi-threaded descriptions move the design process completely into the software domain for all but very high speed front-end processing.
Traditional approaches to ASIC (Application Specific Integrated Circuit) architecture are based on dedicated hardware, connected through dedicated busses. The dedicated hardware is implemented as a set of registers, with combinational logic in between, as shown in FIG. 3. A hardware specification is converted into an architecture. This architecture is translated, by RTL coding and logic synthesis, into a netlist. The netlist is then converted, by place and route algorithms, into a layout configuration. The advantages of this architecture are:
- It achieves high performance at low silicon cost, because dedicated solutions tend to be more efficient than non-dedicated ones.
- It offers excellent product differentiation.
- RTL descriptions can be mapped onto this architecture by means of logic synthesis.
However, the traditional approach suffers from high design cost and long time-to-market, resulting from the design of application specific solutions. For example, the use of dedicated busses tends to create routing problems that complicate the deep-sub-micron ASIC back-end design flow. Moreover, the architecture lacks flexibility to deal with design or specification errors, changing product requirements due to market dynamics or standard upgrades. Product re-spins are required to compensate for this lack of flexibility. However, re-spins are becoming less and less attractive because of increasing costs of masks, because they absorb scarce design resources and because they introduce slips in the development schedule that could delay product roll-out beyond the market opportunity window.
An interconnection network based on busses, such as the one shown in FIG. 4, requires the use of a shared medium for exchanging messages and has several drawbacks:
- A network based on a single shared medium does not scale well with the number of clients, because the shared medium saturates and becomes the bottleneck when new clients are added.
- Long busses create several technological problems, such as excessive capacitive load, which is a potential source of ramp-time errors, and spreading of the clock skew problem over the entire chip. These problems are expected to become even worse in deep sub-micron VLSI technology.
- With deep sub-micron technology, the main source of delay is interconnection delay, and long busses will be the main source of performance degradation. The wire delay can be approximated by

    tw = Rd·Cw + (Rw·Cw)/2

where Cw is the wire capacitance, Rw is the wire resistance and Rd is the resistance of the driver. This model is quite accurate if the time of flight along the wire is smaller than the signal rise time. Taking a signal velocity v = 10^8 m/s, the time of flight along a 1 cm wire is tf = 0.1 ns. This is still below the rise times of the buffers that drive large busses. Note that, since both Rw and Cw grow with the wire length, the distributed part of the wire delay scales with the square of the wire length; therefore, long busses are not recommended. Moreover, consider ideal scaling of CMOS dimensions with a factor S, i.e. all horizontal and vertical dimensions are reduced by the same factor, while keeping the electrical field strength constant. The latter implies that the power supply voltage must also be reduced by the same factor. Under ideal scaling, the product Rw·Cw for global wires increases with S^2, while gate delays decrease with 1/S. Therefore, wire delays become dominant. Consequently, a high performance architecture must not use long lines.
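The wire-delay model above can be evaluated numerically. In the sketch below, the per-unit-length resistance and capacitance and the driver resistance are illustrative assumptions, not values from any particular process; the point is only that the distributed term Rw·Cw/2 grows with the square of the wire length, while the driver term Rd·Cw grows only linearly.

```python
# Lumped wire-delay sketch: t_w = Rd*Cw + (Rw*Cw)/2.
# All numeric values are assumptions for illustration only.
r = 50e3     # wire resistance per unit length, ohm/m (assumed)
c = 200e-12  # wire capacitance per unit length, F/m (assumed)
Rd = 1e3     # driver output resistance, ohm (assumed)

def wire_delay(length_m):
    Rw = r * length_m            # total wire resistance
    Cw = c * length_m            # total wire capacitance
    return Rd * Cw + (Rw * Cw) / 2

d1 = wire_delay(0.001)  # 1 mm wire
d2 = wire_delay(0.002)  # 2 mm wire

# Doubling the length doubles the Rd*Cw term but quadruples the
# distributed Rw*Cw/2 term, so delay grows superlinearly with length.
print(d1, d2)
```

This quadratic growth of the distributed term is the quantitative reason the text gives for avoiding long busses in a high performance architecture.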
There is a need for a new architecture that:
- offers flexibility to deal with errors and changing requirements, without expensive re-spins;
- offers an acceptable price/performance ratio;
- can be customised to offer product differentiation;
- is a convenient target for mapping multi-threaded descriptions.