In a buffer control element (BCE), also commonly called a "cache," data is stored in quanta called "lines." A line contains a plurality of sequential "words"; a word is the quantum on which the central processor (CP) operates. Lines are aligned on "line boundaries." That is to say, the 0th word in a line has a relative address of 0 within that line.
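The line and word relationship above can be illustrated with a minimal sketch. It assumes word-granular addresses and a fixed number of words per line; the function names and the 8-word line in the example are hypothetical and are not taken from the figures.

```python
def split_address(word_address, words_per_line):
    """Return (line_number, offset_in_line) for a word address.

    Assumes addresses count words, not bytes, and that every line
    holds the same number of words (illustrative assumptions).
    """
    if words_per_line <= 0:
        raise ValueError("words_per_line must be positive")
    return divmod(word_address, words_per_line)


def line_base(word_address, words_per_line):
    """Address of the 0th word of the line containing word_address."""
    return word_address - (word_address % words_per_line)


# With 8-word lines, word 27 is the word at offset 3 of line 3,
# and that line starts on the line boundary at address 24.
print(split_address(27, 8))  # (3, 3)
print(line_base(27, 8))      # 24
```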
If the CP references a word that is not in the BCE, this event is called a "cache miss." When a cache miss occurs, the line that contains the word that generated the miss is fetched from the memory system, and is stored in the BCE for use by the processor.
FIG. 1 shows a prior art uniprocessor system 11 connected to a memory system 15. The uniprocessor system comprises a CP 10, a BCE 12, and a CP miss facility (CPMF) 14. The memory system 15 comprises a memory 16 and a memory miss facility (MMF) 18. Each miss facility (MMF and CPMF) contains a line buffer (20 and 24, respectively) that is used to buffer a line of data while it is being transferred from one system to the other. The miss facilities, preferred embodiments of which are shown in FIGS. 8 and 9, are preferably logic circuits on the memory or processor chips, each circuit including a line buffer and logic suitable for carrying out the functions to be described in greater detail below.
Note that within the memory system 15, the MMF 18 moves lines between itself and the memory 16 en masse via bus 26. And within the uniprocessor system 11, the CPMF 14 moves lines between itself and the BCE 12 en masse via bus 28. But the connection between the two miss facilities is generally much smaller than a line, e.g., it could be a single word.
In an ideal world, we could remove both miss facilities from the picture, connect the memory 16 directly to the BCE 12, and move lines back and forth between the memory and the BCE en masse. The memory and the BCE, however, reside on different chips, which are coupled by word-wide buses. Thus, the reasons that the miss facilities are required are:
1. A buffering mechanism and some minimal control is required to disassemble lines into words, and reassemble the words back into lines when the line is moved across the word-wide interface.
2. Were there multiple processors or multiple memory systems in the picture, there would be a requirement to do some arbitration and buffering to handle situations in which multiple processors were requesting the use of the memory system.
Although a seemingly obvious way to improve speed might be to increase interface size from one word to one line, the width of the interface is limited by one of two things:
1. There must be one physical pin per bit in the interface. Thus, the physical packaging will determine the maximum number of pins that can be implemented.
2. All of the pins that are implemented may switch simultaneously. Thus, the power system must be able to provide enough peak current to the chips to allow this many signals to switch at once.
The physical limits of either of the above are typically much smaller than the desirable size of a cache line.
Note that when a cache miss occurs, the words in the line that are moved are not moved across the interface in arbitrary order. Rather, the CPMF 14 issues the miss to the memory system by using the address of the word that caused the miss. This address is passed on the address/control bus 22 shown in FIG. 1. The MMF 18 then fetches the appropriate line (corresponding to this address) from the memory 16, and buffers it in the MMF line buffer 20. The MMF 18 then transfers the words of the line (stored in line buffer 20) to the CPMF 14, via data bus 23, beginning with the word that generated the miss, and continuing in sequence, wrapping the address around to the beginning of the line, until all words have been transferred. The words are buffered in the CPMF line buffer 24 as they arrive. After all words have been transferred to the line buffer 24 in the CPMF, the line is moved to the BCE 12 en masse.
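The wrap-around transfer order described above can be sketched as follows. This is only an illustration of the ordering; the function name and the 8-word line size are assumptions, and the sketch does not model the buses or the line buffers.

```python
def transfer_order(miss_offset, words_per_line):
    """Offsets of the words in the order they cross the data bus.

    The transfer starts at the word that generated the miss, continues
    in sequence, and wraps around to the beginning of the line.
    """
    if not 0 <= miss_offset < words_per_line:
        raise ValueError("miss_offset must lie within the line")
    return [(miss_offset + i) % words_per_line
            for i in range(words_per_line)]


# A miss on word 5 of an 8-word line (hypothetical sizes):
print(transfer_order(5, 8))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

Every word of the line appears exactly once in the returned order, so the line buffer in the CPMF holds a complete line when the transfer ends.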
If the CP were stopped for the entire duration of every miss, the BCE would be called a "blocking cache." Blocking caches cause all action in their CP to stop until all data has been moved. Were it the case that the system had a blocking cache, then the order in which words were moved would make no difference to performance. There would also not be any need for a line buffer in the CPMF; instead, the data words could be put directly into the BCE as they arrived.
In a "nonblocking cache," the CP can continue to run while the miss is in progress. In this case, the line buffer in the CPMF serves the purpose of not taking BCE bandwidth away from the CP while the miss is in progress. The reason that the words of the miss are returned starting with the word that generated the miss is that the CP needs that word as soon as possible. Typically, the first few words that arrive are bypassed through the CPMF directly to the CP as soon as they arrive. This will usually allow the CP to perform work while the miss is still in progress.
With a blocking cache, the penalty for a miss is the sum of the following factors:
1. The time it takes for the CPMF to issue the miss to the MMF.
2. The time for the MMF to access the memory, and to put the line into the MMF line buffer. This is usually called the "memory access time."
3. The time for the MMF to move the line to the CPMF. This is equal to the number of words in a line times the bus cycle time.
4. The time for the requested word to be moved from the line buffer in the CPMF to the CP.
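As a worked example of the sum above, the following sketch adds the four factors. The field names and the cycle counts are illustrative assumptions and do not describe any particular machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissTiming:
    """Hypothetical timing parameters, all in processor cycles."""
    issue_cycles: int          # factor 1: CPMF issues the miss to the MMF
    memory_access_cycles: int  # factor 2: memory access time
    words_per_line: int        # number of words moved per miss
    bus_cycle: int             # cycles to move one word across the bus
    deliver_cycles: int        # factor 4: CPMF line buffer to the CP


def blocking_miss_penalty(t):
    """Sum of the four factors for a blocking cache."""
    transfer_cycles = t.words_per_line * t.bus_cycle  # factor 3
    return (t.issue_cycles + t.memory_access_cycles
            + transfer_cycles + t.deliver_cycles)


example = MissTiming(issue_cycles=1, memory_access_cycles=10,
                     words_per_line=8, bus_cycle=1, deliver_cycles=1)
print(blocking_miss_penalty(example))  # 20
```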
With a nonblocking cache, if it is the case that the miss has stopped CP action for logical reasons (i.e., the CP needs the word that missed before it can do anything else), then the CP can restart work as soon as the first word arrives at the CPMF line buffer.
Therefore, it is easy to see that the miss penalty can be separated into two independent terms:
1. A term that depends on the memory access time. This is called the "leading edge" delay. Roughly speaking, the leading edge delay is equal to the miss penalty for the same system if the line size were a single word.
2. A term that depends on the line size. This term is called the "trailing edge" delay. Roughly speaking, this term is the difference between the actual miss penalty and the leading edge delay (where "miss penalty" is defined as being the difference between the time it takes to access data when there is a miss and the time it takes to access data when there is not a miss). This term accounts for the fact that multiple words are moved.
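The two terms can be separated numerically in a rough sketch. It assumes, purely for illustration, that the miss penalty is the simple sum of issue time, memory access time, transfer time, and delivery time; the function names and cycle counts are hypothetical.

```python
def miss_penalty(issue, access, words_per_line, bus_cycle, deliver):
    """Assumed penalty model: all times in cycles, simple sum."""
    return issue + access + words_per_line * bus_cycle + deliver


def leading_edge(issue, access, bus_cycle, deliver):
    """Penalty of the same system if the line were a single word."""
    return miss_penalty(issue, access, 1, bus_cycle, deliver)


def trailing_edge(issue, access, words_per_line, bus_cycle, deliver):
    """Difference between the actual penalty and the leading edge."""
    return (miss_penalty(issue, access, words_per_line, bus_cycle, deliver)
            - leading_edge(issue, access, bus_cycle, deliver))


# Illustrative numbers: 1-cycle issue, 10-cycle access, 1-cycle bus,
# 1-cycle delivery.
print(leading_edge(1, 10, 1, 1))       # 13
print(trailing_edge(1, 10, 8, 1, 1))   # 7
print(trailing_edge(1, 10, 16, 1, 1))  # 15
```

Under this assumed model the leading edge does not change with the line size, while the trailing edge grows by one bus cycle for every additional word in the line.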
If it were the case that misses always happened far apart in time, then there would be no interaction between misses. The miss penalty for a system having a nonblocking cache would be equal to the leading edge delay, i.e., there would be no negative effects of moving lines across a word-wide interface.
In fact, misses do cluster in time; it is frequently the case that upon receiving the first word back from the CPMF interface, the CP will immediately miss again. When this happens, the miss cannot be issued because the CPMF and the MMF are busy finishing the previous miss; the amount of time that they will remain busy is proportional to the number of words in a line. This is the principal contributor to trailing edge delay in systems like the one shown in FIG. 1.
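The interaction between clustered misses can be sketched with a small timing model. The model is an assumption made for illustration: the first miss is issued at time 0, the issue time is folded into the access time, and the miss facilities stay busy until the last word of the line has crossed the bus. The function name and the cycle counts are hypothetical.

```python
def second_miss_wait(access, words_per_line, bus_cycle, gap):
    """Cycles a second miss waits before it can be issued.

    access:  cycles until the MMF has the line in its buffer
    gap:     cycles of useful CP work between the arrival of the
             first word and the second miss
    """
    first_word_arrives = access + bus_cycle
    facilities_free = access + words_per_line * bus_cycle
    second_miss_time = first_word_arrives + gap
    return max(0, facilities_free - second_miss_time)


# The CP misses again immediately after the first word arrives.
print(second_miss_wait(access=10, words_per_line=8, bus_cycle=1, gap=0))   # 7
print(second_miss_wait(access=10, words_per_line=16, bus_cycle=1, gap=0))  # 15
# Misses far apart in time do not interact.
print(second_miss_wait(access=10, words_per_line=8, bus_cycle=1, gap=20))  # 0
```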
FIG. 2 shows a prior art directory-based multiprocessor system (MP) that is a straightforward extrapolation of FIG. 1. In this system, there are a number of uniprocessor systems 50 each of which has a dedicated interface to the memory system. Each uniprocessor system is identical to the uniprocessor system in FIG. 1, but the memory system now contains a multiplicity of MMFs; one per uniprocessor system. Each MMF interacts with the uniprocessor system to which it is connected.
The memory 54 is accessed by all MMFs. Since more than one MMF can attempt to fetch a line from memory at the same time, some arbitration control is necessary, and is provided by arbitration control 56. This control allows only one MMF to access memory at any given time, and it chooses among the MMFs so that accesses from all processors are eventually serviced.
For the purposes of this discussion, FIG. 2 is merely a generalization of FIG. 1. As seen by any processor in the system, the system in FIG. 2 behaves identically to the system in FIG. 1 except that there is another factor that contributes to the leading edge delay. This factor is a queuing delay that depends on the number of processors in the system, and on the memory access time. Very simply, adding more processors to the system increases the aggregate traffic to the memory, hence its utilization, hence the probability that it is busy when an MMF requests service.
The trailing edge effect is not further exacerbated by a processor's being placed into a directory-based MP system like the one in FIG. 2.
Another type of known MP system is the shared-bus multiprocessor as shown in FIG. 3. This system differs from the one in FIG. 2 principally in that the processors do not have dedicated ports to the memory system. Instead, there is a single port to the memory system, and all processors must share this port. The arbitration control that was shown in FIG. 2 is still present in FIG. 3. In FIG. 2, that control arbitrated between the MMFs that vied for the memory; in FIG. 3, that control arbitrates between the CPMFs that vie for the bus.
Recall that the queuing delay in FIG. 2 was proportional to the memory access time, but was independent of the number of words in a line. This was because the shared resource in FIG. 2 was the memory, and the memory is accessed on the basis of lines. In essence, a miss could only be delayed by the leading-edge portion of the previous miss.
In FIG. 3, the queuing delay is proportional to the entire miss penalty. This is because the shared resource that is being contended for is the bus, and the bus is used for the duration of the entire miss. Therefore, the number of words in a line not only contributes to the trailing edge of the processor that generates a miss, but it also adds a factor to the leading edges of the other processors' misses.
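The difference between the two queuing delays can be made concrete with a deliberately simple sketch. It assumes that several processors miss at the same instant and are served one after another, so a requester waits for every requester ahead of it to release the shared resource. In the FIG. 2 arrangement the resource (the memory) is held for the access time only; in the FIG. 3 arrangement the resource (the bus) is held for the whole miss. The function names and cycle counts are hypothetical.

```python
def directory_hold(access, words_per_line, bus_cycle):
    """FIG. 2 style: the memory is held only for the access time."""
    return access


def shared_bus_hold(access, words_per_line, bus_cycle):
    """FIG. 3 style: the bus is held for the entire miss."""
    return access + words_per_line * bus_cycle


def queue_wait(position, hold_time):
    """Wait for the requester served at `position` (0 = served first)."""
    return position * hold_time


# Third requester in line, 10-cycle access, 1-cycle bus.
for words in (8, 16):
    d = queue_wait(2, directory_hold(10, words, 1))
    s = queue_wait(2, shared_bus_hold(10, words, 1))
    print(words, d, s)
# 8 20 36
# 16 20 52
```

Under these assumptions, doubling the line size leaves the directory-based wait unchanged but lengthens the shared-bus wait, which matches the qualitative point made above.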
And finally, in the context of all three systems mentioned above, it should be apparent that speculative prefetching is not an easy feat. A "speculative prefetch" is a fetch that the hardware or software of a processor issues to the memory hierarchy in anticipation of needing the referenced data at a later time. This is done in the hope that if the need for data is anticipated far enough ahead, the miss for that data will be completed prior to the time that the data is actually needed. Therefore, the processor will not suffer a miss penalty if it can prefetch effectively.
But note that a speculative prefetch utilizes the miss facilities, the memory, and the bus. For all intents and purposes, a speculative prefetch "looks like" any other miss insofar as it impacts the other miss-traffic in the system. Therefore, speculative prefetching can actually delay the real misses in a system, and can degrade the overall performance of the system even when it is done correctly.