A microprocessor typically communicates with a computer system via a shared computer system bus known as a “front-side bus” (FSB). However, as microprocessor performance has improved and as computer systems have come to use multiple processors interconnected along the same FSB, the FSB has become a performance bottleneck.
One approach to this problem is the use of point-to-point (PtP) links between the various processors in a multiple processor system. PtP links are typically implemented as dedicated bus traces for each processor within the multi-processor network. Although typical PtP links provide more throughput than an FSB, the latency of PtP links can be worse than the latency of an FSB.
The latency of PtP links can particularly impact the performance of store operations performed by a microprocessor, especially in microprocessor architectures requiring strong ordering among the store operations. Because of the strong ordering requirements, for example, previously issued store operations must typically be accessible, or at least detectable, to other bus agents within the system before later store operations may be issued by the processor. The detectability of an operation, such as a store, load, or other operation, to other bus agents within a computer system is often referred to as “global observation” of the operation. Typically, microprocessor operations or instructions only become globally observable after they have been stored to a cache or other memory in which other agents in the system may detect the presence of the operation or instruction.
In the case of store operations within a strongly ordered microprocessor architecture, typical microprocessors will not issue a store operation from a store buffer, or other store queuing structure, or, in some cases, from the processor execution unit, until the previous store operation has been globally observed. The issuance of a store operation in typical microprocessor architectures is preceded by an operation, such as a read-for-ownership (RFO) operation, to gain exclusive control of a line of the cache or other storage area in which the store operation is to be stored so that it may be globally observed. However, in typical microprocessor architectures, RFO operations are not issued until preceding store operations are globally observed.
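The cost of this serialization can be illustrated with a rough timing model. The following is only a sketch under stated assumptions: the function name, the parameters, and the cycle counts are hypothetical and are not taken from any particular processor. It assumes that each store's RFO begins only after the preceding store has been globally observed, so that the RFO latency of every store is paid in full and nothing overlaps.

```python
def serialized_store_latency(num_stores, rfo_latency, commit_latency):
    """Cycles until the last of num_stores is globally observed.

    Assumption (hypothetical model): a store's RFO may start only after
    the previous store has been globally observed, so per-store costs
    simply add up with no overlap.
    """
    total = 0
    for _ in range(num_stores):
        total += rfo_latency     # gain exclusive ownership of the line
        total += commit_latency  # write the data so it becomes observable
    return total


# Illustrative numbers only: a longer link latency is paid once per store.
short_link = serialized_store_latency(num_stores=4, rfo_latency=50, commit_latency=2)
long_link = serialized_store_latency(num_stores=4, rfo_latency=200, commit_latency=2)
```

In this model the total grows linearly with both the number of stores and the RFO latency, which is why a higher-latency link weighs heavily on strongly ordered stores.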
FIG. 1 illustrates a prior art cache architecture for handling issued store operations within a strongly ordered microprocessor architecture. The store buffer contains data X1 and Y1 that are to be stored in addresses X and Y, respectively, of the level-1 (L1) cache via the cache line fill buffer (LFB). However, in typical prior art architectures, neither the store data, X1 and Y1, nor their corresponding RFO operations may be issued until the data X0 and address X in the L1 cache have been globally observed.
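The ordering constraint described for FIG. 1 can be traced with a small simulation. This is a minimal sketch, not a description of any actual hardware: the function name, the event strings, and the use of a dictionary for the L1 cache are illustrative assumptions. It models only the order of events, not their timing.

```python
from collections import deque


def drain_in_order(store_buffer, l1_cache, in_flight):
    """Return the event order for a strictly serialized store buffer.

    store_buffer: deque of (address, data) pairs, oldest first.
    l1_cache:     dict mapping address -> data (hypothetical L1 model).
    in_flight:    (address, data) already in the L1 cache but not yet
                  globally observed.
    """
    events = []
    # Nothing in the store buffer may issue before the in-flight store
    # has been globally observed.
    addr, data = in_flight
    events.append(f"global observation of {data} at {addr}")
    while store_buffer:
        addr, data = store_buffer.popleft()
        events.append(f"RFO for line {addr}")              # gain ownership
        l1_cache[addr] = data                              # fill via the LFB
        events.append(f"store {data} to {addr} via LFB")
        events.append(f"global observation of {data} at {addr}")
    return events


l1 = {"X": "X0"}
events = drain_in_order(deque([("X", "X1"), ("Y", "Y1")]), l1, ("X", "X0"))
```

In the resulting trace, the RFO for line Y appears only after X1 has been globally observed at X, which mirrors the serialization described above.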
Due to latency in the issuance, and ultimately the retiring, of store operations within prior art architectures, the overall performance of a microprocessor and the system in which it exists may be compromised. Furthermore, as PtP multiple processor systems become more pervasive, the problem may be exacerbated, as each processor in the system may be dependent upon data being stored by other processors within the system.