The single-chip embodiment of a shared memory multiprocessor has been well known in the art for several years. Chips containing multiple processors have been built, but to date none with symmetric multiprocessors, i.e., where all processors have identical capability. There are several reasons for the lack of commercial use of this concept. These relate to the availability of software for multiprocessors, as well as the difficulty of fitting more than one processor with a shared memory interface on a single die of reasonable size.
FIG. 1 shows a simple superscalar processor 10 with a single instruction queue and a single, unified cache. This invention can be applied to much more complicated processor structures. Instructions are fetched from the single-ported cache array 12 and loaded into the instruction queue 14 via the result bus 16. The width of the data read from the cache array and the width of the result bus are much greater than a single 32 bit instruction, e.g. 128 bits. This would allow the dispatch of up to 4 instructions per cache read cycle, which for this illustration will be deemed a single machine cycle. Three execution units receive ready instructions from the dispatch logic, at a rate of up to one instruction per execution unit per machine cycle. These units are a branch unit 18, an integer execution unit with integer register file 20, and a floating point unit with floating point register file 22. Memory instructions which require the calculation of an effective address are dispatched to the integer unit, and the resulting address is then sent to the memory management unit (MMU) 24. Similarly, instruction addresses are generated in the Instruction Fetch Unit 26. Both the MMU and the Instruction Fetch Unit send their addresses to the cache tags 28 and from there to the array addressing logic. A hit in the tags results in data from the array being placed on the result bus 16. A miss in the cache tags results in a memory request being put on the memory queue 30 and transmitted via the Bus Interface Unit to the external Address and Data bus 32. Each of these units is capable of operating simultaneously, resulting in a pipelined superscalar processor with a peak execution rate of 3 instructions per cycle. If the floating point unit 22 has a latency of 2 cycles, the pipeline sequence for a single floating point instruction would be fetch, dispatch, decode, execute 1, execute 2, and writeback.
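The dispatch rule and the floating point pipeline sequence described for FIG. 1 can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function names `dispatch_cycle`, `cycles_to_drain` and `fp_pipeline` are hypothetical, and the model assumes every queued instruction is ready, has no data dependencies, and may be selected from anywhere in the queue.

```python
from collections import deque

# The three execution units of FIG. 1 (names are illustrative labels only).
UNITS = ("branch", "integer", "float")


def dispatch_cycle(queue):
    """Dispatch at most one instruction per execution unit in one machine cycle.

    `queue` is a deque of (unit, name) pairs. Dispatched entries are removed
    from the queue and returned as a {unit: name} mapping.
    """
    issued = {}
    remaining = deque()
    for unit, name in queue:
        if unit in UNITS and unit not in issued:
            issued[unit] = name          # this unit is free in this cycle
        else:
            remaining.append((unit, name))  # unit busy (or unknown): wait
    queue.clear()
    queue.extend(remaining)
    return issued


def cycles_to_drain(instructions):
    """Count the machine cycles needed to dispatch every instruction."""
    queue = deque(instructions)
    cycles = 0
    while queue:
        if not dispatch_cycle(queue):
            raise ValueError("instruction targets an unknown execution unit")
        cycles += 1
    return cycles


def fp_pipeline(latency):
    """Stage sequence for one floating point instruction with the given latency."""
    execute = [f"execute {i}" for i in range(1, latency + 1)]
    return ["fetch", "dispatch", "decode"] + execute + ["writeback"]


program = [
    ("integer", "add"),
    ("float", "fmul"),
    ("branch", "bne"),
    ("integer", "lw"),
    ("integer", "sub"),
]
print(cycles_to_drain(program))  # 3 cycles: the integer unit is the bottleneck
print(fp_pipeline(2))
```

In this sketch the peak of 3 instructions per cycle is reached only when one instruction for each unit is available in the same cycle; a run of integer instructions falls back to one per cycle.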
FIG. 2 shows this simple processor 10a extended with the addition of a SIMD style saturating arithmetic multimedia instruction execution unit 34 with its register file. Operation of this unit is similar to the operation of the floating point unit 22. Typically, instructions for this unit do not occur close to floating point operations in the execution sequence of ordinary programs. In fact, in the Intel (tm) MMX architecture, the MMX registers are mapped into the floating point register space and an explicit context switch is needed to enable MMX instructions after floating point instructions have been executed and vice versa.
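As an illustration of what "SIMD style saturating arithmetic" means, the following Python sketch adds two packed vectors of unsigned 8 bit lanes, once with saturation and once with ordinary wraparound. The function names and the choice of unsigned 8 bit lanes are assumptions made for this example; the sketch does not model any particular instruction set or the execution unit 34 itself.

```python
U8_MAX = 255  # largest value representable in an unsigned 8 bit lane


def saturating_add_u8(a, b):
    """Lane-wise add that clamps each result to U8_MAX instead of wrapping."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [min(x + y, U8_MAX) for x, y in zip(a, b)]


def wrapping_add_u8(a, b):
    """Lane-wise add with ordinary modulo-256 wraparound, for comparison."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [(x + y) & 0xFF for x, y in zip(a, b)]


pixels = [250, 10, 128, 200]
boost = [10, 20, 128, 100]
print(saturating_add_u8(pixels, boost))  # [255, 30, 255, 255]
print(wrapping_add_u8(pixels, boost))    # [4, 30, 0, 44]
```

Saturation is useful for multimedia data such as pixel intensities, where an overflowing bright value should stay at maximum brightness rather than wrap around to a dark one.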
Building a multiprocessor on a single die would result in a chip similar to those illustrated in FIGS. 3 or 4. Obvious common elements, such as the interface 36 to the external bus, are shared. Conventional design would also have only one instance on the chip of clock generation, the test processor or boundary scan controller, etc. Note that the tag logic and MMU are assumed to include the correct data sharing protocol logic (e.g. MESI) for a multiprocessor. More complex changes are possible when putting more than one processor on a die, such as adding additional levels of caching or more external buses. Such a chip would be large and would not have twice the performance of a single processor, due to well known multiprocessor effects. However, if the area could be reduced significantly, then the instructions executed per second per unit of silicon area, i.e. cost/performance, would become equal to or better than that of a uniprocessor. A need exists for a method of splitting these less used functions, such as the floating point and multimedia execution units, away from the microprocessor for shared use with other symmetric processors.
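The area argument above can be made concrete with a short Python sketch. All numbers are hypothetical values chosen only to illustrate the arithmetic; they are not measurements or figures from this disclosure. The function name `throughput_per_area` is likewise an assumption, and performance and area are expressed relative to a single processor.

```python
def throughput_per_area(relative_performance, relative_area):
    """Instructions per second per unit of silicon area, relative to a uniprocessor."""
    if relative_area <= 0:
        raise ValueError("area must be positive")
    return relative_performance / relative_area


# Hypothetical inputs (assumptions for illustration only):
#   two processors deliver 1.8x, not 2x, because of multiprocessor effects;
#   duplicating every unit doubles the area; sharing less used units gives 1.6x.
uniprocessor = throughput_per_area(1.0, 1.0)
dual_private = throughput_per_area(1.8, 2.0)
dual_shared = throughput_per_area(1.8, 1.6)

print(f"uniprocessor:            {uniprocessor:.3f}")
print(f"dual, all units private: {dual_private:.3f}")
print(f"dual, shared units:      {dual_shared:.3f}")
```

Under these assumed numbers the fully duplicated chip falls below the uniprocessor on this metric, while the reduced-area chip exceeds it; the crossover depends entirely on how much area the sharing actually saves.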