Computer speed may be increased using two general approaches: increase instruction execution speed or do more instruction execution in parallel. As instruction execution speed approaches the limits of electron mobility in silicon, parallelism becomes the best remaining way to increase computer speed.
Previous attempts at parallelism have included:
1. Overlapping next instruction fetching with current instruction execution.
2. Instruction pipelining. An instruction pipeline breaks each instruction into as many pieces as possible and then attempts to map sequential instructions into parallel execution units. Theoretical maximum improvement is seldom achieved due to the inefficiencies of multi-step instructions, inability of many software programs to provide enough sequential instructions to keep the parallel execution units filled, and the large time penalty paid when a branch, loop, or case construct is encountered requiring the refilling of the execution units.
3. Single instruction multiple data or SIMD. This type of technique is found in the Intel SSE instruction set, as implemented in the Intel Pentium III and other processors. In this technique, a single instruction executes on multiple data sets. This technique is useful only for special applications such as video graphics rendering.
4. Hypercube. This technique employs large two-dimensional arrays and sometimes three-dimensional arrays of processors and local memory. The communications and interconnects necessary to support these arrays of processors inherently limit them to very specialized applications.
A pipeline is an instruction execution unit consisting of multiple sequential stages that successively perform a piece of an instruction's execution, such as fetch, decode, execute, store, etc. Several pipelines may be placed in parallel, such that program instructions are fed to each pipeline one after another until all pipelines are executing an instruction. Then the instruction filling repeats with the original pipeline. When N pipelines are filled with instructions and executing, the performance effect is theoretically the same as an N times increase in execution speed for a single execution unit.
Successful pipelining depends upon the following:
1. An instruction's execution must be able to be defined as several successive states.
2. Each instruction must have the same number of states.
3. The number of states per instruction determines the maximum number of parallel execution units.
Since pipelining can achieve performance increases based on the number of parallel pipelines, and since the number of parallel pipelines is determined by the number of states in an instruction, pipelines encourage complex multi-state instructions.
Heavily pipelined computers very seldom achieve performance anywhere near the theoretical performance improvement expected from the parallel pipeline execution units.
Several reasons for this pipeline penalty include:
1. Software programs are not made up of only sequential instructions. Various studies indicate changes of execution flow occur every 8-10 instructions. Any branch that changes program flow upsets the pipeline. Attempts to minimize the pipeline upset tend to be complex and incomplete in their mitigation.
2. Forcing all instructions to have the same number of states often leads to execution pipelines that satisfy the requirements of the lowest common denominator (i.e., the slowest and most complex) instructions. Because of the pipeline, all instructions are forced into the same number of states, regardless of whether they need them or not. For example, logic operations (such as AND or OR) execute an order of magnitude faster than an ADD, but often both are allocated the same amount of time for execution.
3. Pipelines encourage multi-state complex instructions. Instructions that might require two states are typically stretched to fill 20 states because that is the depth of the pipeline. (The Intel Pentium 4 uses a 20 state pipeline.)
4. The time required for each pipeline state must account for propagation delays through the logic circuitry and associated transistors, in addition to the design margins or tolerances for the particular state.
5. Arbitration for pipeline register and other resource access often reduces performance due to the propagation delays of the transistors in the arbitration logic.
6. There is an upper limit on the number of states into which an instruction may be split before an additional state actually slows down execution, rather than speeding it up. Some studies have suggested that the pipeline architecture in the last generation of Digital Equipment Corporation's Alpha processor exceeded that point and actually performed slower than the previous, shorter pipelined version of the processor.
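The first reason in the list above can be made concrete with a rough back-of-the-envelope model. The sketch below is illustrative only: it assumes an idealized pipeline that retires one instruction per cycle when full and loses a complete refill (depth minus one cycles) at every change of execution flow. The function name and the flush model are assumptions introduced here, and real processors recover part of this loss through branch prediction.

```python
def effective_speedup(depth: int, instructions_per_branch: float) -> float:
    """Estimate the speedup of a `depth`-stage pipeline over an unpipelined unit.

    Assumptions (illustrative, not taken from any specific processor):
      * an unpipelined unit needs `depth` cycles per instruction;
      * a full pipeline retires one instruction per cycle;
      * every change of flow costs `depth - 1` cycles to refill the pipeline.
    """
    if depth < 1 or instructions_per_branch <= 0:
        raise ValueError("depth must be >= 1 and the run length must be positive")
    unpipelined_cycles = depth * instructions_per_branch
    pipelined_cycles = instructions_per_branch + (depth - 1)
    return unpipelined_cycles / pipelined_cycles


if __name__ == "__main__":
    for depth in (5, 10, 20):
        for run_length in (8, 10):
            gain = effective_speedup(depth, run_length)
            print(f"depth={depth:2d} run={run_length:2d} "
                  f"speedup={gain:.2f}x (ideal {depth}x)")
```

Under these assumptions, a 20-state pipeline with a change of flow every 8 instructions yields roughly a 6x gain rather than the ideal 20x.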
Splitting Apart the Pipelines
One perspective on refactoring CPU design is to think of pipelined execution units that are then split into multiple (N) simplified processors. (Registers and some other logic may need to be duplicated in such a design.) Each of the N simplified processors would have the following advantages over the above-discussed pipelined architectures:
1. No pipeline stalls. No branch prediction necessity.
2. Instructions could take as much or as little time as they need, rather than all being allocated the same execution time as the slowest instruction.
3. Instructions could be simplified by reducing the necessary execution states, thereby reducing the pipeline penalty.
4. Each state eliminated from the pipeline could eliminate propagation delays and remove design margins necessary for the state.
5. Register arbitration could be eliminated.
Furthermore, a system with N simplified processors could have the following advantages over a pipelined CPU:
1. The limit of maximum pipeline parallelism would be eliminated.
2. Unlike a pipelined processor, multiple standalone processors could be selectively powered down to reduce power consumption when not in use.
Other Problems with Current Approaches to Parallelism
Many implementations of parallelism succumb to the limits of Amdahl's Law.
Acceleration through parallelism is limited by the portions of the problem that cannot be parallelized. In practice, as the amount of parallelism increases, the communications necessary to support it overwhelm the gains due to the parallelism.
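As a minimal sketch of the limit, the standard form of Amdahl's Law gives the speedup on N processors as 1 / ((1 - p) + p / N), where p is the fraction of the work that can run in parallel. The function and parameter names below are chosen for illustration. Note that this classical form models only the serial fraction; it does not include the communication overhead mentioned above, which would reduce the result further.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup predicted by Amdahl's Law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # With 90% of the work parallelizable, the speedup can never reach 10x.
    for n in (1, 2, 10, 100, 1000):
        print(f"N={n:5d} speedup={amdahl_speedup(0.9, n):.2f}x")
```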
Stoplight Sitting at Redline
Another inefficiency of current processors is the inability to scale computing power to meet the immediate computing demand. Most computers spend most of their time waiting for something to happen. They wait for I/O, for the next instruction, for memory access, or sometimes for human input. This waiting is an inefficient waste of computing power. Furthermore, the computer time spent waiting often results in increased power consumption and heat generation.
The exceptions to the waiting rule are applications like engine controllers, signal processors, and firewall routers. These applications are excellent candidates for parallelism acceleration due to the predefined nature of the problem sets and solution sets. A problem that requires the product of N independent multiplications may be solved faster using N multipliers.
The perceived performance of a general purpose computer is really its peak performance. The closest a general purpose computer gets to being busy is running a video game with a rapid screen refresh, compiling a large source file, or searching a database. In an optimal world, the video rendering would be factored into special purpose shading, transforming, and rendering hardware. One method of factoring the programming to such special purpose hardware is the use of “threads.”
Threads are independent programs that are self contained and infrequently communicate data with other threads. A common use of threads is to collect data from slow realtime activity and provide the assembled results. A thread might also be used to render a change on a display. A thread may transition through thousands or millions of states before requiring further interaction with another thread. Independent threads present an opportunity for increased performance through parallelism.
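The idea of independent threads that rarely exchange data can be sketched with ordinary software threads. The example below is a sketch under stated assumptions, not the TOMI mechanism itself: it uses Python's standard `concurrent.futures.ThreadPoolExecutor`, and the helper names `scale_chunk` and `scale_in_threads` are hypothetical. Each thread works on its own slice of the data and communicates only once, when it hands back its result. It shows the structure of the factoring rather than any performance gain, since the standard Python interpreter does not run such threads on separate processors.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(chunk, factor):
    """Work done by one independent thread: no shared state is touched."""
    return [value * factor for value in chunk]


def scale_in_threads(data, factor, n_threads):
    """Split `data` into chunks, process each in its own thread, and combine."""
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    # Ceiling division so that every element lands in exactly one chunk.
    chunk_size = max(1, -(-len(data) // n_threads))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() returns results in submission order, so the output order is stable.
        partial_results = pool.map(scale_chunk, chunks, [factor] * len(chunks))
        return [value for part in partial_results for value in part]


if __name__ == "__main__":
    print(scale_in_threads(list(range(10)), 15, n_threads=4))
```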
Many software compilers support the generation and management of threads for the purposes of factoring the software design process. The same factoring will support multiple CPU parallel processing via the technique of Thread Level Parallelism implemented in a Thread Optimized Microprocessor (TOMI) of the preferred embodiment.
Thread Level Parallelism
Threading is a well understood technique for factoring software programs on a single CPU. Thread level parallelism can achieve program acceleration through use of a TOMI processor.
One significant advantage of a TOMI processor over other parallel approaches is that a TOMI processor requires minimal changes to current software programming techniques. New algorithms do not need to be developed. Many existing programs may need to be recompiled, but not substantially rewritten.
An efficient TOMI computer architecture should be built around a large number of simplified processors. Different architectures may be used for different types of computing problems.
Fundamental Computer Operations
For a general purpose computer, the most common operations in order of declining frequency are: Loads and stores; Sequencing; and Math and logic.
Load and Store
The parameters of LOAD and STORE are the source and destination. The power of the LOAD and STORE is the range of source and destination (for example, 4 Gbytes is a more powerful range than 256 bytes). Locality relative to the current source and destination is important for many data sets. Offsets of plus 1 and minus 1 are the most useful.
Increasing offsets from the current source and destination are progressively less useful.
LOAD and STORE may also be affected by the memory hierarchy. A LOAD from storage may be the slowest operation a CPU can perform.
Sequencing
Branches and loops are the fundamental sequencing instructions. Changing the instruction sequence based on a test is the way computers make decisions.
Math and Logic
Math and logic operations are the least used of the three operations. Logic operations are the fastest operations a CPU can perform and can require as little as a single logic gate delay. Math operations are more complex since higher order bits depend on the results of lower order bit operations. A 32-bit ADD can require at least 32 gate delays, even with carry lookahead. MULTIPLY using a shift and add technique can require the equivalent of 32 ADDs.
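The cost of a shift-and-add MULTIPLY can be illustrated with a short sketch. The function below is an illustration, not a description of any particular hardware multiplier: it walks the multiplier one bit at a time and performs one ADD of the shifted multiplicand for every set bit, so a 32-bit multiplier with all bits set needs 32 ADDs. The function name and the returned add count are assumptions introduced for the example.

```python
def shift_and_add_multiply(a: int, b: int, bits: int = 32):
    """Multiply two unsigned `bits`-wide integers using only shifts and adds.

    Returns a tuple of (product, number of ADD operations performed).
    """
    mask = (1 << bits) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError(f"operands must be unsigned {bits}-bit values")
    product = 0
    adds = 0
    for bit in range(bits):
        if (b >> bit) & 1:          # test one multiplier bit per step
            product += a << bit     # add the shifted multiplicand
            adds += 1
    return product, adds


if __name__ == "__main__":
    print(shift_and_add_multiply(7, 15))          # few set bits, few ADDs
    print(shift_and_add_multiply(3, 0xFFFFFFFF))  # worst case for 32 bits
```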
Tradeoffs of Instruction Size
The perfect instruction set would consist of op-codes that are large enough to select infinite possible sources, destinations, operations, and next instructions. Unfortunately the perfect instruction set op-codes would be infinitely wide and the instruction bandwidth would therefore be zero.
Computer design for high-instruction bandwidth involves the creation of an instruction set with op-codes able to efficiently define the most common sources, destinations, operations, and next instructions with the fewest op-code bits.
Wide op-codes lead to high instruction bus bandwidth requirements, and the resulting architecture will be quickly limited by the von Neumann bottleneck, wherein the computer's performance is limited by the speed with which it fetches instructions from memory.
If a memory bus is 64 bits wide, one could fetch a single 64-bit instruction, two 32-bit instructions, four 16-bit instructions, or eight 8-bit instructions in each memory cycle. A 32-bit instruction had better be twice as useful as a 16-bit instruction, since it cuts the instruction bandwidth in half.
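The arithmetic in the preceding paragraph can be written out as a small sketch. It assumes, for illustration, that instructions are packed back to back with no padding and never straddle a fetch boundary; the function name is hypothetical.

```python
def instructions_per_fetch(bus_width_bits: int, instruction_width_bits: int) -> int:
    """Number of whole instructions delivered by one memory cycle."""
    if bus_width_bits <= 0 or instruction_width_bits <= 0:
        raise ValueError("widths must be positive")
    return bus_width_bits // instruction_width_bits


if __name__ == "__main__":
    for width in (64, 32, 16, 8):
        count = instructions_per_fetch(64, width)
        print(f"{width:2d}-bit instructions: {count} per 64-bit fetch")
```

Halving the instruction width doubles the number of instructions per fetch, which is why a wider instruction must do proportionally more work to pay for itself.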
A major objective of instruction set design is to reduce instruction redundancy. In general an optimized efficient instruction set takes advantage of the locality of both instructions and data. The easiest instruction optimizations have long since been done. For most computer programs, the most likely next instruction is the sequentially next instruction in memory. Therefore instead of every instruction having a next instruction field, most instructions assume the next instruction is the current instruction +1. It is possible to create an architecture with zero bits for source and zero bits for destination.
Stack Architectures
Stack architecture computers are also called zero operand architectures. A stack architecture performs all operations based on the contents of a push down stack. A two operand operation would require both operands be present on the stack. When the operation executes, both operands would be POP'd from the stack, the operation would be performed, and the result would be PUSH'd back on the stack. Stack architecture computers can have very short op-codes since the source and destination are implied as being on the stack.
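The POP, operate, PUSH cycle described above can be sketched as a very small interpreter. This is a minimal illustration with a hypothetical three-operation instruction set (PUSH, ADD, MUL), not the instruction set of any real stack machine. Only PUSH carries an operand; ADD and MUL name no source or destination because both are implied by the stack.

```python
def run_stack_program(program):
    """Execute a list of (op, *args) tuples on a push-down stack."""
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op in ("ADD", "MUL"):
            right = stack.pop()     # both operands are POP'd from the stack
            left = stack.pop()
            result = left + right if op == "ADD" else left * right
            stack.append(result)    # the result is PUSH'd back
        else:
            raise ValueError(f"unknown operation {op!r}")
    return stack


if __name__ == "__main__":
    # Computes (2 + 3) * 15 with zero-operand ADD and MUL.
    program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 15), ("MUL",)]
    print(run_stack_program(program))
```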
Most programs require the contents of global registers that may not always be available on the stack when needed. Attempts to minimize this occurrence have included stack indexing that allows accessing operands other than those on the top of the stack. Stack indexing requires either additional op-code bits resulting in larger instructions or additional operations to place the stack index value on the stack itself. Sometimes one or more additional stacks are defined. A better but not optimal solution is a combination stack/register architecture.
Stack architecture operation is also often redundant in ways that defy obvious optimizations. For example, each POP and PUSH operation has the potential to cause a time-wasting memory operation as the stack is manipulated in memory. Furthermore, the stack operation may consume an operand that may be immediately needed for the next operation, thereby requiring operand duplication with the potential of yet another memory operation. Take, for example, the operation of multiplying all the elements of a one dimensional array by 15.
On a stack architecture, this is implemented by:
1. PUSH start address of array
2. DUPLICATE address (So we have the address to store the result to the array.)
3. DUPLICATE address (So we have the address to read from the array.)
4. PUSH INDIRECT (PUSH the contents of the array location pointed to by the top of stack)
5. PUSH 15
6. MULTIPLY (15 times the array contents we read in step 4)
7. SWAP (Get the array address on the top of the stack for the next instruction.)
8. POP INDIRECT (POPs the multiplication result and stores it back to the array.)
9. INCREMENT (Point to the next array item.)
10. Go to step 2 until the array is done.
The loop counter in step 10 would require an additional parameter. In some architectures, this parameter is stored on another stack.
On a hypothetical register/accumulator architecture, the example is implemented by:
1. STORE POINTER start address of array
2. READ POINTER (Read the contents of the address pointed to into an accumulator.)
3. MULTIPLY 15
4. STORE POINTER (Store the result into the address pointed to.)
5. INCREMENT POINTER
6. Go to step 2 until the array is done.
Compare the nine steps per loop iteration for the stack architecture (steps 2 through 10) with the five steps per iteration for the register architecture (steps 2 through 6) in the above example. Furthermore, the stack version has at least three opportunities for an extra memory access due to stack operation. The loop control of the hypothetical register/accumulator architecture could easily be handled in a register.
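The step counts can be checked by simulating both listings. The sketch below is illustrative only: memory is modeled as a Python list, each listed step is counted as one operation regardless of its real cost, and loop termination is handled by a Python `for` loop standing in for the loop counter that the listings leave unspecified. The function names are hypothetical.

```python
def scale_on_stack_machine(memory, start, length, factor=15):
    """Follow the ten-step stack listing; return the number of steps executed."""
    stack = [start]                                    # 1. PUSH start address
    steps = 1
    for _ in range(length):
        stack.append(stack[-1])                        # 2. DUPLICATE (store address)
        stack.append(stack[-1])                        # 3. DUPLICATE (read address)
        stack.append(memory[stack.pop()])              # 4. PUSH INDIRECT
        stack.append(factor)                           # 5. PUSH 15
        stack.append(stack.pop() * stack.pop())        # 6. MULTIPLY
        stack[-1], stack[-2] = stack[-2], stack[-1]    # 7. SWAP
        address = stack.pop()                          # 8. POP INDIRECT:
        memory[address] = stack.pop()                  #    store result to array
        stack[-1] += 1                                 # 9. INCREMENT
        steps += 9                                     # steps 2-9 plus 10 (loop)
    return steps


def scale_on_register_machine(memory, start, length, factor=15):
    """Follow the six-step register/accumulator listing; return steps executed."""
    pointer = start                                    # 1. STORE POINTER
    steps = 1
    for _ in range(length):
        accumulator = memory[pointer]                  # 2. READ POINTER
        accumulator *= factor                          # 3. MULTIPLY 15
        memory[pointer] = accumulator                  # 4. STORE POINTER
        pointer += 1                                   # 5. INCREMENT POINTER
        steps += 5                                     # steps 2-5 plus 6 (loop)
    return steps


if __name__ == "__main__":
    stack_memory, register_memory = [1, 2, 3], [1, 2, 3]
    print(scale_on_stack_machine(stack_memory, 0, 3), stack_memory)
    print(scale_on_register_machine(register_memory, 0, 3), register_memory)
```

Both versions produce the same array contents; only the number of steps per element differs.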
Stacks are useful for evaluating expressions and are used as such in most compilers. Stacks are also useful for nested operations such as function calls. Most C compilers implement function calls with a stack. However, unless supplemented by general purpose storage, a stack architecture requires a great deal of extra data movement and manipulation. For optimization purposes, stack PUSH and POP operations should also be separated from math and logic operations. But as can be seen from the example above, stacks are particularly inefficient when loading and storing data repeatedly, since the array addresses are consumed by the PUSH INDIRECT and POP INDIRECT.
In one aspect, the invention comprises a system comprising: (a) a plurality of parallel processors on a single chip; and (b) computer memory located on the chip and accessible by each of the processors; wherein each of the processors is operable to process a de minimis instruction set, and wherein each of the processors comprises local caches dedicated to each of at least three specific registers in the processor.
In various embodiments: (1) the size of each of the local caches is equivalent to one row of random access memory on the chip; (2) the at least three specific registers with an associated cache include an instruction register, source register, and destination register; (3) the de minimis instruction set comprises seven basic instructions; (4) each of the processors is operable to process a single thread; (5) an accumulator is an operand for every instruction, except an increment instruction; (6) a destination for each basic instruction is always an operand register; (7) three registers auto-increment and three registers auto-decrement; (8) the instruction set comprises no BRANCH instruction and no JUMP instruction; (9) each instruction is at most 8 bits in length; and (10) a single master processor is responsible for managing each of the parallel processors.
In another aspect, the invention comprises a system comprising: (a) a plurality of parallel processors on a single chip; and (b) computer memory located on the chip and accessible by each of the processors, wherein each of the processors is operable to process an instruction set optimized for thread-level parallel processing.
In various embodiments: (1) each of the processors is operable to process a de minimis instruction set; (2) each of the processors comprises local caches dedicated to each of at least three specific registers in the processor; (3) the size of each of the local caches is equivalent to one row of random access memory on the chip; (4) the at least three specific registers include an instruction register, source register, and destination register; (5) the de minimis instruction set comprises seven basic instructions; (6) each of the processors is operable to process a single thread; (7) a single master processor is responsible for managing each of the parallel processors; and (8) the de minimis instruction set includes a minimal set of instruction extensions to optimize processor operation and facilitate software compiler efficiency.
In another embodiment, the invention comprises a method of thread-level parallel processing utilizing a plurality of parallel processors, a master processor, and a computer memory on a single chip, wherein each of the plurality of processors is operable to process a de minimis instruction set and to process a single thread, comprising: (a) allocating local caches to each of three specific registers in each of the plurality of processors; (b) allocating one of the plurality of processors to process a single thread; (c) processing each allocated thread by the processors; (d) processing the results from each thread processed by the processors; and (e) de-allocating one of the plurality of processors after a thread has been processed; wherein the de minimis instruction set includes a minimal set of instructions to optimize processor management.
In various embodiments the de minimis instruction set comprises seven basic instructions and the instructions in the de minimis instruction set are at most 8 bits in length. The de minimis instruction set may also include a set of extension instructions, beyond the seven basic instructions, that optimize the internal operation of the TOMI CPU and help optimize the execution of software program instructions being executed by a TOMI CPU and optimize the operation of software compilers for the TOMI CPU.
Embodiments of the invention with multiple TOMI CPU cores may also include a limited set of processor management instructions used for managing the multiple CPU cores.
In another aspect, the invention comprises a system comprising: (a) a plurality of parallel processors mounted on a memory module; (b) an external memory controller; and (c) a general purpose central processing unit; wherein each of the parallel processors is operable to process an instruction set optimized for thread-level parallel processing.
In various embodiments: (1) each of the parallel processors is operable to process a de minimis instruction set; (2) one or more bits allocated in a memory mode register is operable to enable or disable one or more of the parallel processors; (3) the memory module is a dual inline memory module; (4) each of the processors is operable to process a single thread; (5) a plurality of threads share data through shared memory; (6) a plurality of threads share data through one or more shared variables; (7) the memory module is one or more of: DRAM, SRAM, and FLASH memory; (8) at least one of the parallel processors is treated as a master processor and other of the parallel processors are treated as slave processors; (9) each processor has a clock speed, and each processor other than the master processor is operable to have the processor's clock speed adjusted to optimize either performance or power consumption; (10) each processor is operable to be treated as either a master processor or a slave processor; (11) the master processor requests processing by several slave processors, waits for output from the several slave processors, and combines the output; (12) the master processor combines output from the several processors as the output is received from each of the several processors; (13) low power dissipation is provided by enabling one or more of the parallel processors to be stopped; and (14) each of the parallel processors is associated with a program counter and is operable to be stopped by writing all ones (1's) to a program counter associated with the parallel processor.
In another aspect, the invention comprises a system comprising a plurality of parallel processors embedded into a dynamic random access memory (DRAM) die, the plurality of parallel processors in communication with an external memory controller and an external processor, and wherein each of the parallel processors is operable to process an instruction set optimized for thread-level parallel processing.
In various other embodiments: (1) the die is packaged with a DRAM pinout; (2) the parallel processors are mounted on a dual inline memory module; (3) the system operates as DRAM except when the processors are enabled through a DRAM mode register; (4) the external processor is operable to transfer data and instructions from an associated permanent storage device to the DRAM; (5) the permanent storage device is FLASH memory; and (6) the external processor is operable to provide an input/output interface between the parallel processors and external devices.
In another aspect, the invention comprises a system comprising: (a) a plurality of processors on a single chip; and (b) computer memory located on the chip and accessible by each of the processors, wherein each of the processors is operable to process a de minimis instruction set, and wherein each of the processors comprises local caches dedicated to each of at least three specific registers in the processor.
In various other embodiments: (1) the size of each of the local caches is equivalent to one row of random access memory on the chip; (2) each processor accesses an internal data bus of random access memory on the chip and the internal data bus has a width of one row of the random access memory; (3) the width of the internal data bus is 1024, 2048, 4096, 8192, 16384, or 32768 bits; (4) the width of the internal data bus is an integer multiple of 1024 bits; (5) the local caches dedicated to each of at least three specific registers in the processor are operable to be filled or flushed in one memory read or write cycle; (6) the de minimis instruction set consists essentially of seven basic instructions; (7) the basic instruction set includes ADD, XOR, INC, AND, STOREACC, LOADACC, and LOADI instructions; (8) each instruction in the de minimis instruction set is at most 8 bits in length; (9) the de minimis instruction set comprises a plurality of instruction extensions to optimize execution of instruction sequences on a processor, further wherein such instruction extensions consist essentially of less than 20 instructions; (10) each instruction extension is at most 8 bits in length; (11) the de minimis instruction set comprises a set of instructions to selectively control the plurality of processors on the chip; (12) each processor control instruction is at most 8 bits in length; (13) the plurality of processors are manufactured on the chip with the computer memory located on the chip using a semiconductor manufacturing process designed for monolithic memory devices; (14) the semiconductor manufacturing process uses less than 4 layers of metal interconnect; (15) the semiconductor manufacturing process uses less than 3 layers of metal interconnect; (16) integration of the plurality of processors into the computer memory circuit results in less than 30% increase in chip die size; (17) integration of the plurality of processors into the computer memory circuit results in less than 20% increase in chip die size; (18) integration of the plurality of processors into the computer memory circuit results in less than 10% increase in chip die size; (19) integration of the plurality of processors into the computer memory circuit results in less than 5% increase in chip die size; (20) less than 250,000 transistors are used to create each processor on the chip; (21) the chip is manufactured using a semiconductor manufacturing process using less than 4 layers of metal interconnect; (22) each of the processors is operable to process a single thread; (23) an accumulator is an operand for every basic instruction, except an increment instruction; (24) a destination for each basic instruction is always an operand register; (25) three registers auto-increment and three registers auto-decrement; (26) each basic instruction requires only one clock cycle to complete; (27) the instruction set comprises no BRANCH instruction and no JUMP instruction; and (28) a single master processor is responsible for managing each of the parallel processors.
In another aspect, the invention comprises a system comprising: (a) a plurality of parallel processors on a single chip; and (b) computer memory located on the chip and accessible by each of the processors, wherein each of the processors is operable to process an instruction set optimized for thread-level parallel processing; and wherein each processor accesses the internal data bus of the computer memory on the chip, and the internal data bus is no wider than one row of the memory.
In various embodiments: (1) each of the processors is operable to process a de minimis instruction set; (2) each of the processors comprises local caches dedicated to each of at least three specific registers in the processor; (3) the size of each of the local caches is equivalent to one row of computer memory on the chip; (4) the at least three specific registers include an instruction register, source register, and destination register; (5) the de minimis instruction set consists essentially of seven basic instructions; (6) the basic instruction set includes ADD, XOR, INC, AND, STOREACC, LOADACC, and LOADI instructions; (7) each instruction in the instruction set is at most 8 bits in length; (8) each of the processors is operable to process a single thread; (9) a single master processor is responsible for managing each of the parallel processors; (10) the de minimis instruction set comprises a plurality of instruction extensions to optimize execution of instruction sequences on a processor, further wherein such instruction extensions comprise less than 20 instructions; (11) each instruction extension is at most 8 bits in length; (12) the de minimis instruction set comprises a set of instructions to selectively control the plurality of processors on the chip; (13) each processor control instruction is at most 8 bits in length; and (14) the plurality of processors are capable of being manufactured on the chip with the computer memory located on the chip using a semiconductor manufacturing process designed for monolithic memory devices.
In another aspect, the invention comprises a method of thread-level parallel processing utilizing a plurality of parallel processors, a master processor, and a computer memory on a single chip, wherein each of the plurality of processors is operable to process a de minimis instruction set and to process a single thread, comprising: (a) allocating local caches to each of three specific registers in each of the plurality of processors; (b) allocating one of the plurality of processors to process a single thread; (c) processing each allocated thread by the processors; (d) processing the results from each thread processed by the processors; and (e) de-allocating one of the plurality of processors after a thread has been processed.
In various embodiments: (1) the de minimis instruction set consists essentially of seven basic instructions; (2) the basic instructions comprise ADD, XOR, INC, AND, STOREACC, LOADACC, and LOADI instructions; (3) the de minimis instruction set comprises a set of instructions to selectively control the plurality of processors; (4) each processor control instruction is at most 8 bits in length; (5) the method further comprises the step of each processor accessing the computer memory using an internal data bus of the memory, wherein the internal data bus is the width of one row of memory on the chip; and (6) each instruction in the de minimis instruction set is at most 8 bits in length.
In another aspect, the invention comprises a system comprising: (a) a plurality of processors embedded in a memory chip that is compatible with electronics industry standard device packaging and pin layout for such memory devices; and (b) one or more of the processors may be activated through information transmitted to a memory mode register of the memory chip, wherein the memory chip is functionally compatible with the operation of industry standard memory devices except when one or more of the processors are activated through the memory mode register.