Field
The present invention relates to the field of computer processors and methods of computing. More specifically, embodiments herein relate to computer processor architecture implementing multiple control layers.
Description of the Related Art
At one time, computer performance grew proportionally to transistor density. In the mainframe era, for example, the major limitation to performance using single transistors or MSI/LSI chips was physical size, limited by the number of transistors per cubic foot of mainframe cabinets. The larger the physical size, the slower the cycle time.
In the era of the processor on a chip, yield became the major limiting factor. With finer silicon feature sizes, more transistors per die became available at acceptable chip yield, enabling first wider word sizes, then simple pipelining (one instruction per cycle), and then Instruction Level Parallelism (ILP, up to 4 instructions per cycle). The smaller feature size also enabled higher clock frequency and lower power consumption per transistor junction, all of which combined to offer much higher performance as the level of integration increased.
Since about 2005 the picture has changed, as evidenced by the arrival of multi core chips. Instead of getting a four times (or more) faster processor for quadrupling the transistors per die (along with doubling cycle time), as was the case in moving from 16 bit to 32 bit processors, presently, when a higher transistor count is used to implement ILP register architecture processors, the design brings diminishing returns on performance (an issue known in the industry as Pollack's Rule). Thus economics has pushed the industry toward multi core in order to take advantage of the available transistor count per die. It is also noted that for floating point (FP) programs, ILP machines may achieve 0.5 FP operations per cycle, whereas the theoretical limit using one adder and one multiplier is 2.0; to go further, multiple copies of functional units (FUs) are required. Increasing performance by including multiple copies of functional units may require shadow register structures whose complexity may far exceed the complexity of the systems described herein. Limits to improvements of scalar performance of computer hardware have been characterized as “walls”, including, for example, a power wall, a memory wall, and an instruction level parallelism wall (“ILP wall”).
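As a rough illustration of the diminishing returns noted above, Pollack's Rule is commonly stated as single-core performance growing approximately with the square root of the increase in transistor count. The Python sketch below assumes that square-root form, and assumes a perfectly parallel workload for the multi core comparison. The function names are illustrative only and are not part of any embodiment described herein.

```python
import math


def pollack_speedup(transistor_ratio: float) -> float:
    """Estimated single-core speedup under the square-root form of Pollack's Rule."""
    if transistor_ratio <= 0:
        raise ValueError("transistor_ratio must be positive")
    return math.sqrt(transistor_ratio)


def ideal_multicore_speedup(transistor_ratio: float) -> float:
    """Upper bound if the extra transistors become identical cores and the
    workload is perfectly parallel (an idealizing assumption)."""
    if transistor_ratio <= 0:
        raise ValueError("transistor_ratio must be positive")
    return float(transistor_ratio)


def fp_utilization(achieved_per_cycle: float, peak_per_cycle: float) -> float:
    """Fraction of the theoretical FP throughput that is actually achieved."""
    return achieved_per_cycle / peak_per_cycle


if __name__ == "__main__":
    for ratio in (2, 4, 16):
        print(ratio, round(pollack_speedup(ratio), 2), ideal_multicore_speedup(ratio))
    # Figures quoted in the text: 0.5 achieved against a limit of 2.0 per cycle.
    print(fp_utilization(0.5, 2.0))
```

Under these assumptions, quadrupling the transistor count yields about a twofold gain for a single larger core, while the 0.5 versus 2.0 figures quoted above correspond to one quarter of the theoretical FP throughput.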
While the approach presented herein overcomes disadvantages associated with both the “Memory wall” and the “ILP wall”, we will concentrate our presentation first on the “ILP wall” effects. Some improvements to the “Memory wall” issues will also be noted.
The inability of processor architecture to take advantage of the increased transistor count per die to gain a performance advantage (the “ILP Wall”) is linked to the register machine namespace interfering with both micro-parallelism and macro-parallelism. Regarding micro-parallelism, having “registers” as part of the processor's namespace serializes processes that are “embarrassingly parallel”.
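The serializing effect of register names can be sketched with a small dependency analysis. The Python example below is a minimal sketch under stated assumptions: instructions are modeled as hypothetical (destination, sources) pairs over register names only, and memory operands are assumed to be distinct and are ignored. It separates true read-after-write dependencies from the write-after-read and write-after-write dependencies that arise only because a register name is reused.

```python
def dependencies(program):
    """Return (raw, war, waw) sets of (earlier, later) instruction index pairs.

    program: list of (dest, sources); dest is a register name or None,
    sources is a list of register names read by the instruction.
    """
    last_writer = {}          # register -> index of its most recent writer
    readers_since_write = {}  # register -> readers since that most recent write
    raw, war, waw = set(), set(), set()
    for j, (dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                raw.add((last_writer[reg], j))
            readers_since_write.setdefault(reg, []).append(j)
        if dest is not None:
            for i in readers_since_write.get(dest, []):
                if i != j:
                    war.add((i, j))
            if dest in last_writer:
                waw.add((last_writer[dest], j))
            last_writer[dest] = j
            readers_since_write[dest] = []
    return raw, war, waw


def false_only(program):
    """Orderings imposed purely by register naming, not by data flow."""
    raw, war, waw = dependencies(program)
    return (war | waw) - raw


# Two independent computations, A = B + C and D = E + F, both staged through R1.
REUSED = [
    ("R1", []),      # 0: R1 = load B
    ("R1", ["R1"]),  # 1: R1 = R1 + C
    (None, ["R1"]),  # 2: store A = R1
    ("R1", []),      # 3: R1 = load E
    ("R1", ["R1"]),  # 4: R1 = R1 + F
    (None, ["R1"]),  # 5: store D = R1
]

# The same computations with the second chain given its own register name.
RENAMED = REUSED[:3] + [("R2", []), ("R2", ["R2"]), (None, ["R2"])]

if __name__ == "__main__":
    print(sorted(false_only(REUSED)))
    print(sorted(false_only(RENAMED)))
```

In this sketch the two computations share no data, yet reusing R1 forces instruction 3 to wait for instructions 1 and 2. Giving the second chain its own name removes those orderings.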
Present macro-parallelism is limited in some respects by the processors' need to share information and the lack of efficient mechanisms responsible for the integrity of shared variables, an issue addressed herein by the Mentors.
The larger the system, the larger the memory one wants to access. To access more memory, one has to go off chip. Thus limits on the speed of the clock cycle (power wall) and on pin to pin interconnect (memory wall) are at play due to multi chip interconnects and the disparity in cycle times among the different layers of the memory hierarchy (memory wall). Moreover, existing computer architecture may not be expandable to efficiently take advantage of the larger available transistor count.
Existing ILP register architecture may be effectively limited by what has been referred to as an “ILP wall”. In register machines, the unavailability of operands is mostly caused not by the intrinsic data dependency relations in the source HLL program but by the effects of the register architecture's namespace management. The more one attempts to speed up performance by using the parallel operations intrinsic in the original HLL algorithm, the more the “register” namespace mechanism interferes in the process. Techniques like shadow registers provide some relief, but they soon become too complex to provide true solutions.
In the Von Neumann model the namespace is [Memory]+PC; the only “named” entities are operand and instruction addresses in memory and the program counter. There are two basic problems with the original Von Neumann architecture (the “three address machine architecture”, A<=B+C; for an example see SWAC). The first is that the architecture requires four memory access delays per instruction: one memory access for the instruction fetch, two for fetching the two data operands, and one for storing the result. The second problem is that as memory address size increases, instruction size increases threefold, as each instruction contains three addresses.
Typical register architectures reduced the number of memory accesses per instruction to two, one for the instruction and one for data, and each instruction contains only a single memory address, keeping instruction size manageable. In RISC machines memory accesses are even fewer, as most instructions do not access memory. However, register architecture significantly complicated the namespace. The namespace in a register machine is: [Memory]+Registers+PC+PSDW+CC (condition codes).
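The access counts and instruction sizes discussed above for the three address model and the register model can be compared with a short calculation. The Python sketch below is illustrative only: the opcode width and register field width are assumed values chosen for the example, not figures from any particular machine.

```python
def three_address_cost(address_bits, opcode_bits=8):
    """Per-instruction cost of a memory-to-memory A <= B + C instruction.

    Returns (memory_accesses, instruction_bits). opcode_bits is an assumed width.
    """
    memory_accesses = 1 + 2 + 1  # instruction fetch, two operand fetches, one store
    instruction_bits = opcode_bits + 3 * address_bits
    return memory_accesses, instruction_bits


def register_memory_cost(address_bits, opcode_bits=8, register_field_bits=4):
    """Per-instruction cost of a register machine instruction with one memory address.

    Returns (memory_accesses, instruction_bits). Field widths are assumed values.
    """
    memory_accesses = 1 + 1  # instruction fetch, one data access
    instruction_bits = opcode_bits + register_field_bits + address_bits
    return memory_accesses, instruction_bits


if __name__ == "__main__":
    for bits in (16, 32, 64):
        print(bits, three_address_cost(bits), register_memory_cost(bits))
```

With these assumed field widths, widening addresses from 16 to 32 bits adds 48 bits to each three address instruction but only 16 bits to each register machine instruction.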
The introduction of vector registers improved performance in programs that exhibited micro parallelism. However, in the long run vector registers further complicated the namespace and the software mobility issues. The namespace in vector machines is: [Memory]+PC+Registers+Vector Registers+PSDW+CC. The namespace mechanism for register architectures and registers+vector architectures is a major factor in the creation of the “ILP Wall”. Once cache is introduced into the picture, cache solves the same operand access delay (staging) problem that registers and vector registers originally solved. Operands can be used within one or two cycles in a one-operand-per-cycle stream from either the cache, the registers or the vector registers. From that point on, registers and vector registers may further complicate the namespace and coherency issues. Coherency may be lost, for example, when the program changes operand values already staged in cache, a register or a vector register. Therefore, once one introduces caches into the architecture, the real advantage of register architecture over Von Neumann architecture is in smaller instruction size and thus possibly smaller program size, advantages that can be overcome by namespace mapping methods.
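The loss of coherency mentioned above can be shown with a toy model. The Python sketch below uses a hypothetical StagedMachine class with a dictionary as memory and a dictionary as the register file. It models only the staging of a value into a register followed by a later change to the memory copy, and is not a model of any real cache protocol.

```python
class StagedMachine:
    """Toy model: memory plus registers that hold staged copies of operands."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.registers = {}

    def load(self, reg, addr):
        """Stage the operand at addr into a register (a second copy now exists)."""
        self.registers[reg] = self.memory[addr]

    def store_direct(self, addr, value):
        """Change the memory copy without updating any staged copy."""
        self.memory[addr] = value

    def is_coherent(self, reg, addr):
        """True when the staged copy still matches the memory copy."""
        return self.registers[reg] == self.memory[addr]


if __name__ == "__main__":
    machine = StagedMachine({"X": 1})
    machine.load("R1", "X")
    print(machine.is_coherent("R1", "X"))
    machine.store_direct("X", 2)  # another agent or alias updates X
    print(machine.is_coherent("R1", "X"))
```

Each additional staging level (cache, registers, vector registers) adds another place where a copy of the same operand can become stale in this way.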
For historical and other reasons, both computer machine languages and HLLs do not include the concept and semantics of a “plural form” as part of the language for expressing algorithms. For insight into proposed HLL extensions (FORTRAN) that do recognize this subject, please see “FORALL in Parallel” and “FORALL In Synch” in Modula 2.
In simple algorithms (some micro-parallelism type codes) the existence of parallelism may be deduced by the compiler from the “N times singular” form of the DO or FOR loops.
For insight, “Company about face” is linguistically a plural language form of an instruction in English, while “DO I=1, N; Soldier (I) about face; END DO” is an “N times singular” linguistic form. A characteristic effect of the use of the “N times singular” form is that it typically transforms a parallel process into a serial process.
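The contrast between the two forms can be approximated in a present day HLL. The Python sketch below is an analogy only: Python has no true plural form, so a single map over the whole collection stands in for “Company about face”, and the function names and heading representation are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def about_face(heading_degrees):
    """Turn one soldier around: rotate the heading by 180 degrees."""
    return (heading_degrees + 180) % 360


def n_times_singular(headings):
    """'DO I=1, N; Soldier(I) about face; END DO': an explicit, ordered loop."""
    result = list(headings)
    for i in range(len(result)):
        result[i] = about_face(result[i])
    return result


def plural(headings):
    """'Company about face': one request over the whole collection.

    No iteration order is stated by the caller, so the runtime is free to
    schedule the element operations in any order; results are returned in
    input order.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(about_face, headings))


if __name__ == "__main__":
    company = [0, 90, 270]
    print(n_times_singular(company))
    print(plural(company))
```

Both functions produce the same result. The difference is that the loop states an order that the algorithm does not require, whereas the single map leaves the order unstated.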
The lack of an explicit “plural form” in both machine languages and most HLLs blocks (1) having dialogs between programmer and compiler regarding the parallel properties of the algorithm, as well as (2) addressing parallel properties of complex codes (midlevel and macro parallelism) whose parallel properties cannot be deduced by the compiler but need to be explicitly implemented by the programmer. Presently parallel operations may be done, for example, by assigning parallel tasks to different code threads; see CC++ PARAFOR, where each iteration of a “PARAFOR” creates a new thread which executes in parallel with all other iteration bodies. Existing ILP register machines and their predecessors may either convert micro-parallel actions into thread structures, which are appropriate for macro parallel operations but cumbersome for micro parallelism, as is the case with PARAFOR, or remove micro parallelism information in the compiling process and convert the information into a strictly singular (sequential) machine language form. In the case of Vector and VLIW machines, the parallelism information is used strictly in the compiler to directly control very specific vector or VLIW hardware structure(s). Those structures may be a good fit for processing micro parallel applications, but they also may produce clumsy code that is hard to debug and very hard to transport.
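The thread per iteration approach described above for PARAFOR can be sketched as follows. This is a Python analogue and not CC++ syntax: the parafor and squares functions are hypothetical helpers written for this illustration, using the standard threading module.

```python
import threading


def parafor(n, body):
    """Run body(i) for each i in range(n), one new thread per iteration,
    then wait for every iteration to finish."""
    threads = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def squares(n):
    """Example use: each iteration writes only its own slot, so the
    iterations share no variables and need no locking."""
    results = [None] * n

    def body(i):
        results[i] = i * i

    parafor(n, body)
    return results


if __name__ == "__main__":
    print(squares(5))
```

The sketch also suggests why this structure is cumbersome for micro parallelism: creating and joining a thread costs far more than the single multiplication each iteration performs.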