The invention relates to an arrangement of coherent cache structures and techniques in a multiprocessor system that results in high cache hit ratios and high bus bandwidth.
Many manufacturers of computer-based products need to supply a range of related products or upgrade the size and performance of the earliest products in a series to remain competitive in the market. It is highly desirable that the development costs of upgraded products be minimized. To this end, it is highly desirable to be able to use the same software developed for the earlier products in the upgraded products, since software development usually is a major part of the development cost of a new computer-based product. Many computer-based products use the Motorola 68000 family of microprocessors, which execute a CISC (Complex Instruction Set Computer) instruction set. Upgrading the performance, speed, and size of a product based on the Motorola 68000 microprocessors can be approached in several ways. One approach involves use of "scalable" multiprocessing techniques in which additional microprocessors are connected to the main bus to share the data processing workload. However, bus contention increases rapidly as the number of processors connected to the bus increases, so bus bandwidth rapidly becomes the limiting factor in system performance. Various ways of reducing "bus traffic" have been proposed and/or used. One way to reduce bus traffic is to increase the so-called "cache hit ratio" by increasing the sizes of various cache memories. Another way is to use various improved cache coherency schemes that reduce bus traffic and especially minimize the number of accesses to the slow main memory.
Another approach to increasing computer system performance has been to design systems which execute so-called RISC (Reduced Instruction Set Computer) instruction sets rather than CISC instruction sets. In RISC instruction sets all instructions have a single fixed length, and all use a so-called load-store architecture in which memory can be read or written only by certain dedicated read and write instructions, whereas CISC instruction sets may include complex instructions that automatically perform read and write memory accesses. Although RISC instruction sets at the present state of the art can be executed with Average Instruction Times (AITs) of only about 1.5 machine cycles per instruction, the "inflexibility" of RISC instruction sets often means that a much larger number of instructions must be included in a program to accomplish a particular task. Furthermore, a system that uses a RISC instruction set is likely to substantially increase bus traffic, due to the much larger number of instructions required. This must be compensated for in various ways, for example, by increasing the size, and hence the cost, of the instruction cache. In contrast, CISC instruction sets typically can be executed with an AIT of 10-15 machine cycles, but the number of CISC instructions required to accomplish a particular task may be much smaller than if RISC instructions are used. While each approach offers distinct advantages, at the present time it is unclear which approach will ultimately prevail. However, it is clear that it would be highly desirable if the AIT of CISC instruction sets could be substantially reduced.
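The trade-off between instruction count and AIT described above can be illustrated with simple arithmetic. The following is a minimal sketch, not part of the disclosure: the AIT figures (1.5 and 10 machine cycles) are taken from the text, while the instruction counts, the 3x expansion factor, and the function names are hypothetical values chosen only for illustration.

```python
def total_cycles(instruction_count: int, ait_cycles: float) -> float:
    """Machine cycles for a program: instruction count times average cycles each."""
    return instruction_count * ait_cycles


def break_even_expansion(cisc_ait: float, risc_ait: float) -> float:
    """Factor by which the RISC instruction count may exceed the CISC count
    before the RISC program needs as many machine cycles as the CISC one."""
    return cisc_ait / risc_ait


# Hypothetical task: 1,000 CISC instructions at the low end of the 10-15 cycle range.
cisc_cycles = total_cycles(1_000, 10.0)

# Assumed (not from the text): the same task needs 3x as many RISC instructions.
risc_cycles = total_cycles(3_000, 1.5)

print(cisc_cycles, risc_cycles)           # cycles for each version of the task
print(break_even_expansion(10.0, 1.5))    # expansion factor at which the two tie
```

Under these assumed numbers the RISC version still needs fewer cycles, but it fetches three times as many instructions, which is the source of the additional bus traffic noted above. Lowering the CISC AIT shrinks the break-even factor directly.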
In so-called "tightly coupled" multiprocessor systems which are designed to decrease bus contention problems by decreasing bus traffic, it is necessary to ensure "cache coherency". That is, it is necessary to ensure that any processor access to any address always results in access to the most up-to-date copy of the line of data corresponding to that address. The state of the art is generally indicated in U.S. Pat. No. 4,622,631, "Data Processing System Having a Data Coherence Solution"; "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model" by James Archibald and Jean-Loup Baer, ACM Transactions on Computer Systems, Volume 4, No. 4, November 1986, pages 273-298; "A New Solution to Coherence Problems in Multicache Systems" by Lucien M. Censier and Paul Feautrier, IEEE Transactions on Computers, Volume C-27, No. 12, December 1978, pages 1112-1118; "Effects of Cache Coherency in Multiprocessors" by Michael DuBois and Faye A. Briggs, IEEE Transactions on Computers, Volume C-31, No. 11, November 1982, pages 1083-1099; "Data Coherence Problems in a Multicache System" by Ywei C. Yen, David W. Lyen, and King-Sun Foo, IEEE Transactions on Computers, Volume C-34, No. 1, January 1985, pages 56-65; and "Using Cache Memory to Reduce Processor-Memory Traffic" by James R. Goodman, Association for Computing Machinery, 10th Annual Symposium on Computer Architecture, June 1983.
In U.S. Pat. No. 4,622,631 a "floating ownership" scheme is described under which the ownership of a line of data is passed along between processors and the Global Memory according to a set of rules. The Global Memory "owns" the line of data until it is written into. Then ownership passes to the writing processor. The owner processor is the only one allowed to modify the line of data, and is responsible for furnishing the latest copy of a line of data to all other requesters. The system disclosed in U.S. Pat. No. 4,622,631 requires that a status bit be stored in the Global Memory for each line of data in the Global Memory. That disclosed system requires access to the slow Global Memory and to the status bit of the addressed line of data therein to enable the system to determine whether the present line of data in the Global Memory is to be read or written into. The system disclosed in U.S. Pat. No. 4,622,631 also requires that "bus cycles" be utilized before a processor modifies a line of valid data in its associated cache in order to ensure that any other copy of the present line of data in any other cache is destroyed before it is modified in the cache being accessed. It would be very desirable to avoid the delays caused by the need to access stored status bits in the Global Memory to preserve cache coherency and to avoid the delays caused by the need to use bus cycles to ensure all other copies of the present line of data are destroyed before that line of data in the cache accessed by the requesting processor is modified.
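The ownership rules summarized above can be sketched as a small model. This is only an illustrative simplification written for this discussion, not the implementation of U.S. Pat. No. 4,622,631: the class name `OwnedLine`, the `bus_cycles` counter, and the rule that exactly one bus cycle is charged whenever a non-owner reads or writes are all assumptions introduced here.

```python
GLOBAL_MEMORY = "global_memory"  # hypothetical label for the initial owner


class OwnedLine:
    """One line of data together with the identity of its current owner."""

    def __init__(self, data):
        self.owner = GLOBAL_MEMORY  # Global Memory owns the line until it is written into
        self.memory_copy = data     # copy held in Global Memory; may become stale
        self.latest = data          # most up-to-date copy, held by the owner
        self.bus_cycles = 0         # assumed cost counter for coherency traffic

    def write(self, processor, data):
        if self.owner != processor:
            # Assumption: one bus cycle is spent so that other copies are
            # destroyed before the line is modified; ownership then passes
            # to the writing processor.
            self.bus_cycles += 1
            self.owner = processor
        self.latest = data

    def read(self, processor):
        if self.owner != processor:
            # Assumption: a non-owner's request goes over the bus, and the
            # owner furnishes the latest copy.
            self.bus_cycles += 1
        return self.latest


line = OwnedLine(0)
line.write("P1", 42)       # ownership passes from Global Memory to P1
value = line.read("P2")    # P1, as owner, supplies the latest copy to P2
print(line.owner, value, line.memory_copy, line.bus_cycles)
```

In this model the copy in Global Memory is left stale after the write, which is why every requester must be routed to the owner, and the bus-cycle counter makes visible the overhead that the passage above identifies as undesirable.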
In prior "virtual" instruction caches, every time there is a "context switch" of CPU operation from one user to another it is necessary to clear the entire instruction cache and load the instructions of the new program into the instruction cache. Such clearing and reloading is very time-consuming and causes considerable degradation of overall system performance in a time-sharing system in which there are many such context switches. It would be desirable to find a cost-effective way of reducing the number of cache misses caused by context switches in a virtual machine.
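The cost of clearing a virtual instruction cache on every context switch can be seen in a toy simulation. The sketch below is an assumption-laden illustration rather than a model of any particular prior system: the class `VirtualInstructionCache`, the unbounded dictionary used as storage, the four-instruction loop, and the two passes per time slice are all hypothetical.

```python
class VirtualInstructionCache:
    """Toy instruction cache indexed by virtual address, cleared on context switch."""

    def __init__(self):
        self.lines = {}   # virtual address -> cached instruction
        self.hits = 0
        self.misses = 0

    def fetch(self, vaddr):
        if vaddr in self.lines:
            self.hits += 1
        else:
            self.misses += 1
            # Placeholder standing in for a fill from the slow Global Memory.
            self.lines[vaddr] = f"instr@{vaddr:#x}"
        return self.lines[vaddr]

    def context_switch(self):
        # The same virtual address can name different instructions for
        # different users, so the whole cache is cleared.
        self.lines.clear()


def run(slices, loop, passes=2):
    """Run `slices` time slices, each preceded by a context switch."""
    cache = VirtualInstructionCache()
    for _ in range(slices):
        cache.context_switch()
        for _ in range(passes):
            for vaddr in loop:
                cache.fetch(vaddr)
    return cache.hits, cache.misses


hits, misses = run(slices=6, loop=[0x100, 0x104, 0x108, 0x10C])
print(hits, misses)
```

With these assumed parameters every time slice begins with a full set of misses, so half of all fetches miss even though the working set is only four lines. Shorter time slices make the proportion worse.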
In a virtual machine, if a large percentage of operand accesses are stack accesses, there can be a large number of operand cache misses that occur when the accessed line is not present in the operand cache because it is being loaded from the slow Global Memory into the operand cache. Up to now, the expedient of simply increasing the size of the operand cache so that it can hold all of the instructions and data needed by the CPU for a long period of time has been unacceptably costly. It would be highly desirable to find an efficient, cost-effective way of significantly reducing the number of operand cache misses caused by stack accesses.
In a virtual computer the I/O controller uses I/O caches which need to conform to the overall cache coherency protocol of the virtual computer. I/O controllers and I/O caches need to interface both with slow, sequential multiple data streams from various I/O channels and with a high speed parallel system bus. In prior systems, a large number of I/O cache misses may occur, and each such I/O cache miss results in substantial degradation of system performance. It would be highly desirable to find a cost-effective technique for greatly reducing the number of I/O cache misses in a virtual system having a large number of I/O channels.