Generally, system-on-a-chip designs (SoCs) are based on a combination of programmable processors (central processing units (CPUs), microcontrollers (MCUs), or digital signal processors (DSPs)), application-specific integrated circuit (ASIC) functions, and hardware peripherals and interfaces. Typically, processors implement software operating environments, user interfaces, user applications, and hardware-control functions (e.g., drivers). ASICs implement complex, high-level functionality such as baseband physical-layer processing and video encode/decode. In theory, ASIC functionality (unlike physical-layer interfaces) can be implemented by a programmable processor; in practice, ASIC hardware is used for functionality that is generally beyond the capabilities of any actual processor implementation.
Compared to ASIC implementations, programmable processors provide a great deal of flexibility and development productivity, but with a large amount of implementation overhead. The advantages of processors, relative to ASICs, are:
Re-use. An application developed once can be implemented on other processors that are binary compatible or, often, only source-level compatible.
Verification leverage. Interfaces are standard, and hardware verification can use relatively standard infrastructure for processor verification from one implementation to the next.
Overlapped development. Software development can be done in parallel with hardware development, or even afterwards.
Track evolving requirements. Since the implementation is based on software, a single hardware platform can satisfy different performance and/or feature requirements.
The disadvantages of processors, relative to ASICs, are:
Inefficient algorithm mapping. Processors implement specific sets of native datatypes, such as characters, short integers, and integers, and these often do not map well to the actual datatypes required by a set of applications, particularly for signal and media processing.
Area inefficiency. To provide flexibility, processor features are normally a union of the requirements of a set of applications, but not optimized for any particular one. Moreover, the requirement to execute existing applications implies that legacy features have to be carried forward to new designs regardless of their fundamental value.
Power inefficiency. This is related to area inefficiency, but there are additional causes, particularly in high-performance implementations. It is common for the hardware devoted to fundamental algorithm operations to be a small subset of the overall implementation, with the remainder devoted to pipelining, branch prediction, caches, etc. As a result, the power dissipated is much larger than the power required by fundamental operations.
Energy inefficiency. To support code generation, processors normally spend approximately 30% of execution time performing fundamental operations; the remaining cycles are spent on load, store, flow control (branch), and procedure linkage. If the application executes in a conventional operating environment (RTOS or HLOS), this percentage can be significantly smaller because of the cycles spent in the operating environment. So the power inefficiency, combined with the number of overhead cycles not directly related to the fundamental application, results in a relatively large energy dissipation compared to what is actually required by the application.
Poor performance scalability. There are two reasons for this. Deep sub-micron process technology, particularly interconnect and transistor scaling effects, leads to performance scaling that is much lower than the “historical” factor of roughly doubling performance every two years. However, even if scaling could keep this pace, the algorithm requirements have grown at a much steeper rate; for example, video processing grows quadratically with resolution.
Not surprisingly, a motivation for ASICs (other than hardware interfaces or physical layers) is to overcome the weaknesses of processor-based solutions. However, ASIC-based designs also have weaknesses that mirror the advantages of processor-based designs. The advantages of ASICs, relative to processors, are:
Efficient algorithm mapping. ASIC hardware is customized to the data types, formats, and operations required by the application.
Power efficiency. Active area can be near the minimum required, because this area is customized to what the application can require and no more.
Energy efficiency. Not only is active area minimized, but operational (non-control) hardware can be utilized at close to 100%, so cycle count is minimized. Hardware is controlled by state machines, adding little or no cycle overhead.
Performance scalability. Functions can be pipelined or performed in parallel, to the level of throughput required. Communication mostly uses short, local interconnect and is not as sensitive to interconnect scaling as the interconnect involved in controlling and clocking a large processor.
The disadvantages of ASICs, relative to processors, are:
Low re-use. The large amount of customization accomplished with ASICs implies that very little of a particular design has applicability elsewhere.
No verification leverage. Verification is tied to the blocks and interfaces specific to the design, and each design has a custom verification environment.
Serial development. Algorithms and requirements are defined before the design can begin, and little change is possible after design begins.
Poor adaptability. Algorithms and requirements should remain mostly “frozen” throughout development, or very nearly so. There is little opportunity to trade off performance and area for multiple cost-performance targets.
Area inefficiency. To provide any sort of flexibility, for example targeting multiple video codecs, hardware is replicated, since the potential for re-use is limited. This is analogous to the area overhead in processors required to provide generality.
Parallel processing, though very simple in concept, is very difficult to use effectively. It is easy to draw analogies to real-world examples of parallelism, but computing does not share the same underlying characteristics, even though superficially it might appear to. There are many obstacles to executing programs in parallel, particularly on a large number of cores.
Turning to FIG. 1, an example of a conversion of a conventional serial program 102 to a functionally equivalent parallel program 104 can be seen. As shown, the serial program 102 (and the corresponding parallel program 104) are generally comprised of code sequences or subroutines 120 and 122 that each include a number of instructions. In particular for code sequence 120, a value for a variable x is defined by function 106, and this variable x is used to define a value for a variable z in function 108 of code sequence 122. When executed as serial program 102 on a single processor, the value for variable x is transmitted from definition (by function 106) to use (in function 108) in a processor register or memory (cache) location, taking no more than a few cycles.
However, when code sequences 120 and 122 are converted from serial program 102 to parallel program 104 so as to be executed on two processors, several issues arise. First, sequences 120 and 122 are controlled by two separate program counters so that if the sequences 120 and 122 are left “as is” there is generally no way to ensure that the value for variable x is valid on the attempted read in sequence 122. In fact, in the simplest case, assuming both code sequences 120 and 122 execute sequentially starting at the same time, the value for variable x is not defined in time, because there are many more instructions to the definition of variable x in sequence 120 than there are to the use of variable x in sequence 122. Second, the value for variable x cannot be transmitted through a register or local cache because, although code sequences 120 and 122 have a common view of the address for variable x, the local caches map these addresses to two physically distinct memory locations. Third, although not shown directly in FIG. 1, there can be a second update of the value in variable x in sequence 120, but this subsequent update of variable x by sequence 120 should not occur until the previous value has been read by sequence 122.
For at least these reasons, the serial program 102 should be extensively modified to achieve correct parallel execution. First, sequence 122 should wait until sequence 120 signals that variable x has been written, which causes code sequence 122 to incur delay 112. Delay 112 is generally a combination of the cycles that sequence 120 takes to write variable x and delay 110 (the cycles to generate and transmit the signal). This signal is usually a semaphore or similar mechanism using shared memory that incurs the delay of writing and reading shared memory along with delays incurred for exclusive access to the semaphore. The write of variable x in sequence 120 also is subject to a barrier in that sequence 122 cannot be enabled to read variable x until sequence 122 can obtain the correct value for variable x. Generally, there can be no ordering hazards between writing the value and signaling that it has been written, caused by buffering, caching, and so forth, which usually delays execution in sequence 120 some number of cycles (represented by delay 114) compared to writes of unshared data directly into a local cache.
Second, sequence 122 generally cannot read its local cache directly to obtain variable x because the write of variable x by sequence 120 would have caused an invalidation of the cache line containing variable x. Sequence 122 incurs additional delay 116 to obtain the correct value from level-2 (L2) cache for sequence 120 or from shared memory. Third, sequence 122 generally imposes additional delays (due in part to delay 118) on sequence 120 before any subsequent write by sequence 120 so that all reads in sequence 122 are complete before sequence 120 changes the value of variable x. This not only can stall the progress of sequence 120 but can also delay the new value of variable x such that sequence 122 has to wait again for the new value. Because of the number of cycles that sequence 122 spends obtaining the value for variable x, sequence 120 could potentially be ahead in subsequent iterations even though it was behind in the first iteration, but synchronization between sequences 120 and 122 tends to serialize both programs so there is little, if any, overlap.
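The modifications described above can be illustrated with a minimal sketch in C using POSIX threads. It is not taken from FIG. 1: the names `handoff`, `write_x`, `read_x`, `run_parallel_program`, the iteration count, and the arithmetic standing in for functions 106 and 108 are all hypothetical. The sketch shows the reader waiting for the write to be signaled, and the writer waiting for the previous value to be read before updating it.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical single-slot channel that carries variable x from
 * sequence 120 (writer) to sequence 122 (reader). */
struct handoff {
    pthread_mutex_t lock;     /* exclusive access to the fields below */
    pthread_cond_t  changed;  /* signaled whenever `full` changes */
    int x;                    /* the shared variable */
    int full;                 /* 1 while x holds a value not yet read */
};

/* Writer side: wait until the previous value has been read, then
 * publish the new value and signal the reader. */
static void write_x(struct handoff *h, int value)
{
    pthread_mutex_lock(&h->lock);
    while (h->full)
        pthread_cond_wait(&h->changed, &h->lock);
    h->x = value;
    h->full = 1;
    pthread_cond_signal(&h->changed);
    pthread_mutex_unlock(&h->lock);
}

/* Reader side: wait until a value has been written, then consume it
 * and let the writer proceed. */
static int read_x(struct handoff *h)
{
    int value;
    pthread_mutex_lock(&h->lock);
    while (!h->full)
        pthread_cond_wait(&h->changed, &h->lock);
    value = h->x;
    h->full = 0;
    pthread_cond_signal(&h->changed);
    pthread_mutex_unlock(&h->lock);
    return value;
}

enum { ITERATIONS = 4 };  /* assumed iteration count */

struct reader_state { struct handoff *h; int z_sum; };

static void *sequence_120(void *arg)
{
    struct handoff *h = arg;
    for (int i = 1; i <= ITERATIONS; i++)
        write_x(h, 10 * i);            /* stand-in for function 106 */
    return NULL;
}

static void *sequence_122(void *arg)
{
    struct reader_state *r = arg;
    for (int i = 0; i < ITERATIONS; i++)
        r->z_sum += read_x(r->h) + 1;  /* stand-in for function 108 */
    return NULL;
}

/* Runs both sequences on separate threads and returns the sum of all
 * z values computed by the reader. */
int run_parallel_program(void)
{
    struct handoff h;
    struct reader_state r = { &h, 0 };
    pthread_t writer, reader;
    int rc;

    h.x = 0;
    h.full = 0;
    pthread_mutex_init(&h.lock, NULL);
    pthread_cond_init(&h.changed, NULL);

    rc = pthread_create(&writer, NULL, sequence_120, &h);
    assert(rc == 0);
    rc = pthread_create(&reader, NULL, sequence_122, &r);
    assert(rc == 0);
    (void)rc;

    pthread_join(writer, NULL);
    pthread_join(reader, NULL);
    pthread_cond_destroy(&h.changed);
    pthread_mutex_destroy(&h.lock);
    return r.z_sum;
}
```

In the serial program each transfer of x is a register or cache access; in this sketch each transfer costs two lock acquisitions plus a possible wait on each side, and the two threads advance in lockstep, which is the serialization described above.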
The operations used to synchronize and ensure exclusive access to shared variables normally are not safe to implement directly in application code because of the hazards that can be introduced (e.g., timing-dependent deadlock). Thus, these operations are usually implemented by system calls, which causes delays due to procedure call and return and, possibly, context switching. The net effect is that a simple operation in sequential code (i.e., serial program 102) can be transformed into a much more complex set of operations in the “parallel” code (i.e., parallel program 104), and have a much longer execution time. The result is that parallel programming is limited to applications that do not incur significant overhead for parallel execution. This implies that: 1) there is essentially no data interaction between programs (e.g., web servers); 2) the amount of data shared is a small portion of the datasets used in computing (e.g., finite-element analysis); or 3) the number of computing cycles is very large in proportion to the amount of data shared (e.g., graphics).
Even if the overhead of parallel execution is small enough to make it worthwhile, overhead can significantly limit the benefit. This is especially true for parallel execution on more than two cores. This limitation is captured in a simplified equation for the effect, known as Amdahl's Law, which compares the performance of single-core execution to that of multiple-core execution. According to Amdahl's Law, a certain percentage of single-core execution cannot feasibly be executed in parallel because the overhead is too high. Namely, the overhead incurred is the sum of the percentage of time spent without parallel execution and the percentage of time spent for synchronization and communication.
Turning to FIG. 2, a graph can be seen that depicts speedup in execution rate versus parallel overhead for multi-core systems (ranging from 2 to 16 cores), where speedup is the single-processor execution time divided by the parallel-processor execution time. As can be seen, the parallel overhead has to be close to zero to obtain a significant benefit from a large number of cores. But, since the overhead tends to be very high if there is any interaction between parallel programs, it is normally very difficult to efficiently use more than one or two processors for anything but completely decoupled programs.
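The relationship between overhead and speedup can be sketched in C. The exact model behind FIG. 2 is not given in the text, so this assumes the standard form of Amdahl's Law, speedup = 1 / (f + (1 - f) / N), where f is the overhead fraction (time that cannot run in parallel plus synchronization and communication time) and N is the number of cores. The function name `amdahl_speedup` is hypothetical.

```c
#include <assert.h>

/* Speedup of an N-core execution over a single-core execution under
 * the standard form of Amdahl's Law (an assumption; see lead-in).
 *   overhead: fraction of single-core time, in [0, 1], that gains
 *             nothing from additional cores
 *   cores:    number of cores, at least 1 */
double amdahl_speedup(double overhead, int cores)
{
    assert(overhead >= 0.0 && overhead <= 1.0);
    assert(cores >= 1);
    return 1.0 / (overhead + (1.0 - overhead) / (double)cores);
}
```

Under this assumption, an overhead of 10% limits 16 cores to a speedup of about 6.4, and the full factor of 16 is reached only when the overhead is zero.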
Further limiting the applicability of parallel processing is the cost of multiple cores. In FIG. 3, the die areas of processors 302, 306, and 310 are compared. Processor 310 has 16 high-performance general-purpose cores 312, processor 306 has 16 moderate-performance general-purpose cores 308, and processor 302 has 16 high-performance custom cores 304. As can be seen, the high-performance general-purpose processor 310 uses the largest amount of area, and the application-specific processor 302 uses the least amount of area.
Turning to FIG. 4, the throughput of processors 302, 306, and 310 can be seen. The block for processor 302 illustrates die area assuming that throughput (results 402) is determined only by the basic operations required by an application, that is, assuming that only the functional units determine throughput, thus maximizing the operations per cycle per mm² (comparable to what could be accomplished with a hard-wired ASIC). The block for processor 306 illustrates the effect of including loads, stores, branches, and procedure calls in the mix of operations, where it can be assumed that these operations (in sum) represent roughly two-thirds of the cycles taken, reducing throughput by a factor of 3. To achieve the same throughput as that determined by the basic functions, the number of cores should be increased by a factor of 3 to compensate. The block for processor 310 illustrates the effect of adding system calls, synchronization, context switches, and so forth, which reduces throughput by another factor of 3, requiring a further factor of 3 increase in the number of cores to compensate.
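The core-count arithmetic above can be worked through in a short C sketch. The function names and the baseline of 16 cores used in the usage note are illustrative assumptions; the two-thirds overhead fraction and the second factor of 3 are the figures stated in the text.

```c
#include <assert.h>

/* Factor by which throughput drops when overhead_num/overhead_den of
 * all cycles are overhead, rounded up to a whole number.
 * Example: 2/3 overhead leaves 1/3 useful cycles, a factor of 3. */
unsigned slowdown_factor(unsigned overhead_num, unsigned overhead_den)
{
    unsigned useful;
    assert(overhead_den > 0 && overhead_num < overhead_den);
    useful = overhead_den - overhead_num;
    return (overhead_den + useful - 1) / useful;  /* ceil(den / useful) */
}

/* Cores needed to restore the baseline throughput after a slowdown. */
unsigned cores_for_equal_throughput(unsigned base_cores, unsigned slowdown)
{
    assert(slowdown >= 1);
    return base_cores * slowdown;
}
```

Starting from a hypothetical 16 cores limited only by functional units, the first factor of 3 requires 48 cores and the second requires 144, a ninefold increase for the same throughput.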
There is another dimension to the difficulty of parallel computing; namely, the question of how the potential parallelism in an application is expressed by a programmer. Programming languages are inherently serial and text-based. Transforming a serial language into a large number of parallel processes is a well-studied problem that has yielded very little in actual results.
Turning to FIG. 5, an example of a conversion of serial source code 502 to parallel implementation 504 with conventional symmetric multiprocessing (SMP) using OPENMP® (which is a registered trademark of OpenMP Architecture Review Board Corp., 1906 Fox Drive Champaign, Ill. 61820) can be seen. OPENMP® programming involves using a set of pre-defined “pragmas” or compiler directives that allow the programmer to aid the compiler in locating opportunities for parallel execution. These “pragmas” are ignored by compilers that do not implement OPENMP®, so the source code can be compiled to execute serially, with equivalent results to the parallel implementation (though the parallel implementation can introduce errors that do not appear in the serial implementation).
As shown, this example illustrates the use of several directives, which are embedded in the text following the headers (“#pragma omp”). Specifically, these directives include loops 506 and 508 and function 510, and each of loops 506 and 508 respectively employs functions 512 and 514. This source code 502 is shown as a parallel implementation 504 and is executed on four threads over four processors. Since these threads are created by serial operating-system code 502, the threads are not generally created at exactly the same time, and this lack of overlap increases the overall execution time. Also, the input and result vectors are shared. Reading the input vectors generally can require synchronization periods 516-1, 516-3, 516-5, and 516-7 to ensure there are no writers updating the data (a relatively short operation). Writing the results in write periods 518, 520, 522, 524, 526, 528, 530, and 532 can require synchronization periods 516-2, 516-4, 516-6, and 516-8 because only one thread can be updating the result vectors at any given time (even though in this case the vector elements being updated are independent, serializing writes is a general operation that applies to shared variables). After further synchronization and communication periods 516-9, 516-10, 516-11, and 516-12, the threads obtain multiple copies of the result vectors and compute function 510.
As shown, there can be significant overhead to parallel execution and a lack of parallel overlap, which is why parallel execution is made conditional on the vector length. Depending on the system and the average vector length, it might be uncommon for the compiler to choose to implement the code in parallel. However, when the code is implemented in parallel, there are a couple of subtle issues related to the way the code is written. To improve efficiency, the programmer should recognize that the expression for function 510 can be executed by multiple threads and obtain the same value and should explicitly declare function 510 as a private variable even though the expression that assigns function 510 contains only shared variables. Declaring function 510 as shared would result in four threads serializing to perform the same, lengthy computation to update the shared variable function 510 with the same value. This serialization time is on the order of four times the amount of time taken to complete the earlier, parallel vector adds, making it impossible to benefit from parallel execution and making vector length the wrong criterion for implementing the code in parallel since this serialization time is directly proportional to vector length. Furthermore, whether or not function 510 can be private is a function of the expression that assigns the value. For example, assume that function 510 is later changed to include a shared variable “offset” as follows:
(1) scale=sum(a,0,n)+sum(z,0,n)+offset++;
In this case, function 510 should be declared as shared, but that declaration alone is insufficient. This change implies that the code should not be allowed to execute in parallel because of serialization overhead. Code development and maintenance not only includes the target functionality, but also how changes in the functionality affect and interact with the surrounding parallelism constructs.
There is another issue with the code 502 in this example, namely, an error introduced for the purpose of illustration. The loop termination variable n is declared as private, which is correct because variable n is effectively a constant in each thread. However, private variables are not initialized by default, so variable n should be declared as shared so that the compiler initializes the value for all threads. This example works well when the compiler chooses a serial implementation but fails for a parallel one. Since this code 502 is conditionally parallel, the error is not easy to test for.
This example is a very simple error because it will usually fail, assuming that the code can be forced to execute in parallel (depending on how uninitialized variables are handled). However, there are an almost infinite number of synchronization and communication errors that can be introduced with OpenMP directives (this example is a communication error), and many of these can result in intermittent failures depending on the relative timing and performance of the parallel code, as well as the execution order chosen by the scheduler.
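The data-sharing points discussed above can be illustrated with a hypothetical C sketch. It is not the source code 502 of FIG. 5, which is not reproduced in this text: the function `add_and_scale`, the helper `sum`, the vectors `b`, `c`, `d`, and `y`, the final scaling loop, and the threshold value are assumptions made for illustration. The sketch shows parallel execution made conditional on vector length, the loop bound n declared as shared, and the scale value declared as private.

```c
#include <assert.h>

enum { PARALLEL_THRESHOLD = 1000 };  /* assumed cutoff for parallel execution */

/* Hypothetical helper: sum of v[lo] through v[hi - 1]. */
static double sum(const double *v, int lo, int hi)
{
    double s = 0.0;
    for (int i = lo; i < hi; i++)
        s += v[i];
    return s;
}

/* Two vector adds followed by a scale computed from the results. */
void add_and_scale(const double *a, const double *b,
                   const double *c, const double *d,
                   double *y, double *z, int n)
{
    double scale;
    assert(n >= 0);

    /* shared(n): every thread sees the initialized loop bound.
     * private(n) would give each thread an uninitialized copy.
     * private(scale): each thread computes its own identical value
     * instead of serializing on one shared variable. */
    #pragma omp parallel if(n > PARALLEL_THRESHOLD) shared(n) private(scale)
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            y[i] = a[i] + b[i];

        #pragma omp for
        for (int i = 0; i < n; i++)
            z[i] = c[i] + d[i];

        /* The implicit barrier after the loop above guarantees that
         * z is complete before any thread reads it here. */
        scale = sum(a, 0, n) + sum(z, 0, n);

        #pragma omp for
        for (int i = 0; i < n; i++)
            y[i] *= scale;
    }
}
```

A compiler without OpenMP support ignores the pragmas and runs the body serially with the same results. If the scale expression were later changed to include `offset++`, each thread would modify a shared variable and the private declaration would no longer be valid, which is the maintenance hazard described above.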
Thus, there is a need for an improved processing cluster and associated tool chain.