As each generation of silicon process technology has provided increasing integration density using smaller geometry transistors, central processing unit architects have continually debated how to use the additional device area to increase application performance. With smaller, lower-capacitance transistors, operating frequency has increased proportionally, yielding a direct performance gain. However, the access time of the memory that holds the application program has not kept pace with the speed increases in the central processing unit. This is illustrated in FIG. 1. Memory speed improvement 101 has been gradual. Central processing unit speed improvement 102 has been more marked.
As a result, the performance gain that should be realizable from central processing unit operating frequency advances cannot be achieved without corresponding architectural enhancements in the path between the central processing unit and program memory. As noted in FIG. 1, the speed difference between memory and processors has greatly increased in the past few years. As this gap continues to grow, the interface between memory and the central processing unit will have an even greater effect on overall system performance. The traditional solution to reduce the effect of this interface bottleneck is to use some form of memory hierarchy. In a general-purpose application processor, a cache system is employed that allows the hardware at run time to keep copies of the most commonly used program elements in faster, internal RAM. In a more deeply embedded, performance sensitive application (such as a DSP), a form of tightly coupled memory is used that allows the software to copy either a part of or all of the application program into on-chip RAM. In both of these techniques, the hardware architect gains system performance by the direct, brute force method of simply increasing clock frequency. This solution has proven successful because the performance gains from process technology alone have been enough for current embedded applications, and application developers bear no burden in migrating to a faster, higher performance system.
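The benefit of a memory hierarchy can be illustrated with the standard average memory access time relation for a single cache level: hit time plus miss rate multiplied by miss penalty. The following is a minimal sketch only. The function name and all cycle counts are hypothetical and are not taken from FIG. 1 or from any particular device.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in CPU cycles, for one cache level.

    hit_time     -- cycles to access the fast on-chip memory
    miss_rate    -- fraction of accesses that miss (0.0 to 1.0)
    miss_penalty -- extra cycles to fetch from the slower memory on a miss
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical figures, chosen only for illustration.
# Without a hierarchy, every access pays the full slow-memory latency.
no_hierarchy = average_access_time(hit_time=20, miss_rate=0.0, miss_penalty=0)

# With a 1-cycle on-chip memory that satisfies 95% of accesses.
with_hierarchy = average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=20)

print(no_hierarchy, with_hierarchy)
```

Under these assumed numbers the average access cost drops by roughly a factor of ten, which is why a higher clock frequency can translate into application performance only when most accesses are served on chip.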
It is important, for the clear exposition of processor techniques that follow, to define first the term embedded processor system (EPS) as employed here and as differentiated from a conventional non-embedded multi-chip processor system (MCPS). An embedded processor system includes a processor system integrated on a single chip having one or more central processing units plus a full complement of functional features and functional elements. This full complement of features is not normally included in conventional non-embedded multi-chip processor systems (MCPS). The MCPS is formed from one or more single chip central processing units and additional packaged devices performing memory, interface and peripheral functions, and these are assembled on a printed-wire board (PWB).
Additionally we define the embedded multiprocessor system (EMPS) as having multiple central processing units, complex memory architectures and a wide range of peripheral devices all fully integrated on a single chip. Such a system normally includes another special peripheral, an external memory interface (EMIF) coupled to a large amount of external memory. Central processing unit interactions and cache interactions on an embedded processor clearly involve more complex functionality when compared to a non-embedded processor device. Further, the embedded multiprocessor is typically used in a real-time environment leading to additional requirements for the coherent handling of interrupt operations and power consumption control.
The design methodologies used to support existing processors create a bottleneck in the ability of central processing unit designers to maximize frequency gain without extraordinary effort. At the same time, the types of applications being considered for next generation embedded processors are growing significantly in complexity. Application performance demand outpaces the ability of designers to efficiently provide performance through operating frequency alone at a reasonable development cost.
The disparity between embedded processor application performance requirements and performance gain through operating frequency alone has not gone unnoticed. In many new digital signal processors, two distinct paths have been used to effect increased system performance. The first technique is the use of enhanced central processing unit architectures having instruction level parallelism, and the second technique is the use of system task specialization among different types of simpler but more specialized processors. These two paths are outlined below.
The Texas Instruments TMS320C6000 family of digital signal processors provides an example demonstrating the use of an effective central processing unit architecture to gain performance. Many of these devices use a form of instruction level parallelism (ILP) called very long instruction word (VLIW) to extract a performance gain by analyzing the code behavior at the most basic instruction level. The compiler schedules unrelated instructions to be executed in two or more parallel processing units. This allows the processor to do work on more than one instruction per cycle. Since the instruction scheduling and analysis is done by the compiler, the hardware architecture can be simplified somewhat relative to other forms of instruction level parallelism, such as super-scalar architectures.
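The compile-time scheduling described above can be sketched as a greedy packer that places each instruction in the earliest issue cycle where its operands are ready and a slot is free. This is an illustrative simplification, not the TMS320C6000 compiler's actual algorithm. It assumes unit latency, identical functional units, and that each register is written only once. The function name, instruction names and register names are hypothetical.

```python
def schedule_bundles(instructions, width):
    """Pack instructions into issue bundles of at most `width` slots.

    instructions -- list of (name, dest_register, source_registers) in
                    program order; each register is assumed written once
    width        -- number of parallel processing units (assumed identical)

    Returns a list of bundles; bundle i holds the names issued in cycle i.
    """
    bundles = []
    produced_in = {}  # register -> cycle in which its producer issues

    for name, dest, sources in instructions:
        # An instruction must issue after every instruction it depends on.
        earliest = 0
        for src in sources:
            if src in produced_in:
                earliest = max(earliest, produced_in[src] + 1)

        # Take the first cycle at or after `earliest` with a free slot.
        cycle = earliest
        while cycle < len(bundles) and len(bundles[cycle]) >= width:
            cycle += 1
        while len(bundles) <= cycle:
            bundles.append([])

        bundles[cycle].append(name)
        produced_in[dest] = cycle

    return bundles


# Hypothetical five-instruction sequence computing (a * b) + c.
program = [
    ("load_a", "r1", []),
    ("load_b", "r2", []),
    ("mul",    "r3", ["r1", "r2"]),
    ("load_c", "r4", []),
    ("add",    "r5", ["r3", "r4"]),
]

bundles = schedule_bundles(program, width=2)
print(bundles)
# [['load_a', 'load_b'], ['mul', 'load_c'], ['add']]
```

With two units, the five instructions issue in three cycles rather than five. The independent load of `c` is hoisted alongside the multiply, which is the kind of unrelated-instruction pairing the compiler is responsible for finding.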
Due to this emphasis on compiler-based performance extraction, there is little impact on the task of application programmers. Application development can be done in a high-level language and compiled normally, as is done in a non-ILP based system. This ease of application development, coupled with a performance gain without an operating frequency increase, has resulted in the success of this form of enhancement. However, these benefits do not come without cost. Both the development effort in creating a new instruction set architecture (ISA) and the compiler optimizations required are significant. In the future, once the underlying architecture is fixed, the only means of gaining additional performance is by increasing operating frequency.
Other Texas Instruments digital signal processors, the so-called OMAP devices and the TMS320C5441, provide examples of the technique of breaking the target application into fundamental domains and targeting a simpler processor to each domain. Based on system analysis, the system architect breaks the total application into smaller parts and puts together a separate programming plan for each central processing unit in place. In the past, this could have been done only at the board level, where a specialized processor would be targeted for a specific application task. However, the integration density offered by current process enhancements allows these specialized central processing units to be placed on a single die. This enables a tighter coupling between the processors. Fundamentally, the application developer writes code as if he or she were dealing with each processor as an independent platform.
The programmer must be cognizant of the hardware architecture and program each processor independently. Greater coupling between the integrated processors allows for a more efficient passing of data than at the board level. However, the application is primarily written with the focus on the separate processors in the system. Code reuse and porting are difficult even among the processors in the same system, because each processor is really the centerpiece of its own subsystem. Each processor may have a different memory map, a different peripheral set and perhaps even a different instruction set (as in OMAP). In applications that have very distinct boundaries, such as a cell phone, this method of extracting performance is unparalleled. Each part of the application can be targeted to an optimized processor and programmed independently.
Development efforts are reduced somewhat since a new instruction set is not required to gain performance. However, from an application development and road map perspective, this technique does not offer the ease of use that instruction level parallelism offers. In many applications, there is no clear line along which to divide the work. Even when the division is made, the system cannot easily use all the performance of each central processing unit. If one central processing unit is idle while another is very busy, it is difficult to readjust central processing unit loading once the code has been written. If tighter coupling between the system processors is desired, significant software overhead must be added to ensure data integrity.
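The cost of a fixed division of work can be quantified with a simple calculation: when tasks are statically bound to processors, the system finishes only when the most heavily loaded processor finishes, and the others sit idle in the meantime. The following is a minimal sketch under that assumption. The function name and the workload figures are hypothetical, and communication overhead between processors is ignored.

```python
def static_partition_stats(loads):
    """Summarize a fixed assignment of work to processors.

    loads -- cycles of work statically assigned to each processor

    Returns (finish_time, utilization, balanced_time), where
    finish_time   is set by the busiest processor,
    utilization   is the fraction of available processor cycles doing work,
    balanced_time is the finish time if work could be spread evenly.
    """
    if not loads or max(loads) <= 0:
        raise ValueError("loads must contain at least one positive value")
    finish_time = max(loads)
    utilization = sum(loads) / (len(loads) * finish_time)
    balanced_time = sum(loads) / len(loads)
    return finish_time, utilization, balanced_time


# Hypothetical two-processor split: one domain needs three times the
# cycles of the other.
finish_time, utilization, balanced_time = static_partition_stats([900, 300])
print(finish_time, round(utilization, 3), balanced_time)
```

For this assumed split, the system takes 900 cycles while a perfectly balanced division would take 600, and one third of the available processor cycles go unused. Recovering that capacity would require moving code between processors, which is the readjustment described above as difficult once the code has been written.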