Previous efforts to supply additional processing power by adding CPUs to a common bus resulted in a simple master-slave relationship between CPUs, called “non-coherent asymmetrical multiprocessing.” Though this architecture was simple, it soon reached a “premature” bottleneck because of poor task distribution among processors. The architectural limitations occur in both software and hardware. Early operating system (OS) software could neither run on multiprocessor (MP) systems nor take full advantage of the increased processing power.
Additionally, most I/O drivers were “single-threaded,” which limited their execution to a single dedicated I/O processor. Initially, this was not a major problem because “non-coherent asymmetrical hardware” typically did not allow all processors access to all system resources. In general, non-coherent asymmetric hardware dedicates a single processor to I/O functions. The performance of this single I/O processor can, and often does, become the system bottleneck as it reaches its performance limits. Both the non-coherent asymmetric hardware and single-threaded software pose barriers to system performance. Thus, non-coherent asymmetric machines exhibit limited scalability because adding processors does not benefit a system that is limited by the performance of one of its processors.
The solution to this bottleneck was to redesign both the hardware and the software, which led to today's symmetric multiprocessors (SMPs) coupled with multithreaded software. An “SMP” is a system in which all processors are identical and all resources, specifically all the memory, I/O space, and interrupts, are equally accessible. While the symmetrical nature of SMP hardware eliminates architectural barriers, the software must still efficiently divide the tasks among processors.
For performance reasons, most multiprocessor systems employ caches to reduce the latency of accessing the shared resources. Since caches are local copies of data, a hardware coherency protocol is used, in standard practice, to keep the data in these caches consistent. Several multiprocessor systems offer dual buses to provide for both communication with I/O resources and coherency maintenance. The bus used to maintain coherency is the “MP bus.” Whether or not you choose a multibus architecture, you must optimize this interface for performance. When speaking of performance and buses, the operative word is “coherency.” Coherency can take many forms in any multiprocessor. For example, an SMP can transmit coherency information across the MP bus for each change in cache state. Cache state is maintained on blocks of data referred to as cache-lines. Cache-line buffers offer many benefits. First, they reduce the need for coherency information for each byte transferred. Second, they allow data to be transferred in a burst over buses that usually have a smaller data width. Third, they reduce the size of the caching structure by reducing the amount of state information required for each byte in the cache (i.e., the cache tag). On an MP bus, the amount of data requested by a single command is limited to a cache-line. The limit is required to maintain coherency between system caches.
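As a rough illustration of the one-cache-line limit on a single MP bus command, the following sketch splits an arbitrary byte range into requests that never cross a cache-line boundary. The 128-byte line size and the function name `split_into_line_requests` are assumptions chosen for illustration only; they are not taken from this description.

```python
CACHE_LINE_BYTES = 128  # assumed line size, for illustration only


def split_into_line_requests(addr, length, line=CACHE_LINE_BYTES):
    """Split [addr, addr + length) into (start, size) requests.

    No returned request crosses a cache-line boundary, mirroring the rule
    that a single bus command may request at most one cache-line.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    requests = []
    end = addr + length
    while addr < end:
        # Address of the first byte of the next cache line.
        next_boundary = (addr // line + 1) * line
        chunk_end = min(end, next_boundary)
        requests.append((addr, chunk_end - addr))
        addr = chunk_end
    return requests
```

Under these assumptions, a 20-byte transfer starting at byte 120 becomes two requests, because it straddles the boundary at byte 128.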
This invention applies to both the I/O bus and the coherent MP bus (i.e., the bus that maintains coherency between processors). The above description is not intended to limit the concept to SMPs. All MP types may use this or a similar means of achieving coherency.
Cache-line buffers allow transfer of a full cache line over the bus bridge, which raises two issues. First, what if the I/O device needs to transfer only a few bytes of data? Second, what if the transfer starts or ends in the middle of a cache line? You solve these problems in one of two ways. Some processor architectures allow only full cache-line transfers. In this case, you have no choice except to let the bus bridge perform a read-for-ownership cycle and then write the new data into the cache line. When the I/O device proceeds to the next cache line, the buffer must cast out the first line and read in the subsequent line. This approach consumes valuable bus bandwidth because a “worthless” cache-line read accompanies each burst write, which is needless when an I/O device is updating the entire cache line.
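The bandwidth cost of the full-cache-line-only approach can be estimated with a simple count. The sketch below is a simplified model, not a description of any particular bus bridge: it assumes one read-for-ownership and one cast-out per cache line touched, and it counts a read as needless when the I/O device overwrites the whole line. The 128-byte line size and the function name `rfo_bus_operations` are hypothetical.

```python
def rfo_bus_operations(addr, length, line=128):
    """Count bus operations when every touched line needs a read-for-ownership.

    Returns a dict with the number of line reads, line writes (cast-outs),
    and the reads that were needless because the line is fully overwritten.
    """
    if length <= 0:
        return {"line_reads": 0, "line_writes": 0, "needless_reads": 0}
    end = addr + length
    first_line = addr // line
    last_line = (end - 1) // line
    touched = last_line - first_line + 1
    # Lines lying entirely inside [addr, end) are fully overwritten.
    first_full = -(-addr // line)  # ceiling division
    end_full = end // line
    fully_overwritten = max(0, end_full - first_full)
    return {
        "line_reads": touched,
        "line_writes": touched,
        "needless_reads": fully_overwritten,
    }
```

In this model, a 300-byte write starting at byte 100 touches four lines, and two of the four reads fetch data that is immediately overwritten in full.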
Additionally, the cache-line read causes each processor in the system to snoop its cache, potentially decreasing performance if the cache-line is modified. The snoop is still required to invalidate the line, even after a “Write with Kill” operation is performed. If the cache-line is modified, the cache must write the modified data to memory before allowing the bus bridge to proceed with the read. These reads and writes are needless when new data from the bus bridge will overwrite the entire cache line. Prior art MP buses typically avoid the needless reads and writes by supporting “Write with Kill” and “Read with Intent to Modify” operations. The “Write with Kill” operation informs the cache that the full cache-line will be written, thus allowing the cache to simply invalidate the line even though the line contains data. The bus bridge can then perform partial-word transfers until a cache-line boundary occurs. The bus bridge can then burst-write the cache lines without performing needless reads and writes. It would be desirable, therefore, to be able to expand coherency actions for all cache-line requests by a burst command. A further goal would be to separate the burst command to allow the caches to be snooped. Lastly, it is preferable to address a cache design in which an indication that multiple cache-lines are requested is asserted to the processor.
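The prior-art sequence described above, partial transfers up to a cache-line boundary followed by full-line burst writes, can be sketched as a simple planning routine. This is a minimal model under stated assumptions: the operation labels `"partial_write"` and `"write_with_kill"`, the function name `plan_bridge_writes`, and the 128-byte line size are hypothetical and do not correspond to the signals of any specific bus.

```python
def plan_bridge_writes(addr, length, line=128):
    """Plan the bus operations for an I/O write of `length` bytes at `addr`.

    Full, aligned cache lines are tagged "write_with_kill" (no prior read is
    needed because the whole line is replaced). Any leading or trailing
    fragment is tagged "partial_write".
    """
    ops = []
    end = addr + length
    while addr < end:
        line_start = (addr // line) * line
        chunk_end = min(end, line_start + line)
        size = chunk_end - addr
        if addr == line_start and size == line:
            ops.append(("write_with_kill", addr, size))
        else:
            ops.append(("partial_write", addr, size))
        addr = chunk_end
    return ops
```

For a 300-byte write starting at byte 100, this plan contains one leading partial write, two full-line writes that need no accompanying read, and one trailing partial write.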