1. Field of the Invention
The present invention relates to an improved read line buffer for cache systems of processors and to a communication protocol in support of such a read line buffer.
2. Related Art
In the electronic arts, processors are being integrated into multiprocessor designs with increasing frequency. A block diagram of such a system is illustrated in FIG. 1. There, a plurality of agents 10-40 are provided in communication with each other over an external bus 50. The agents may be processors, cache memories or input/output devices. Data is exchanged among the agents in a bus transaction.
A transaction is a set of bus activities related to a single bus request. For example, in the known Pentium Pro processor, commercially available from Intel Corporation, a transaction proceeds through six phases:
Arbitration, in which an agent becomes the bus owner,
Request, in which a request is made identifying an address,
Error, in which errors in the request phase are identified,
Snoop, in which cache coherency checks are made,
Response, in which the failure or success of the transaction is indicated, and
Data, in which data may be transferred.
Other processors may support transactions in other ways.
In multiple agent systems, the external bus 50 may be a pipelined bus. In a pipelined bus, several transactions may progress simultaneously provided the transactions are in mutually different phases. Thus, a first transaction may be started at the arbitration phase while a snoop response of a second transaction is being generated and data is transferred according to a third transaction. However, a given transaction generally does not "pass" another in the pipeline.
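By way of illustration only, the pipelining constraint described above may be sketched in Python. The sketch assumes, for simplicity, that every phase occupies exactly one step and that a transaction leaves the bus after its data phase; actual phase durations vary. The names Phase, is_valid_pipeline and advance are hypothetical and are not taken from any actual bus implementation.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six transaction phases, in bus order."""
    ARBITRATION = 0
    REQUEST = 1
    ERROR = 2
    SNOOP = 3
    RESPONSE = 4
    DATA = 5


def is_valid_pipeline(in_flight):
    """Check a list of phases ordered from oldest to newest transaction.

    Each older transaction must be strictly ahead of the next newer one.
    This implies that all phases are mutually different and that no
    transaction has passed another.
    """
    return all(older > newer for older, newer in zip(in_flight, in_flight[1:]))


def advance(in_flight, start_new=False):
    """Move every transaction forward one phase (assumed one step per phase).

    A transaction in the data phase completes and is dropped. Optionally a
    new transaction enters at the arbitration phase.
    """
    moved = [Phase(p + 1) for p in in_flight if p != Phase.DATA]
    if start_new:
        moved.append(Phase.ARBITRATION)
    return moved
```

In this sketch, the state described in the text (a third transaction in its data phase, a second in its snoop phase and a first starting arbitration) is the list [Phase.DATA, Phase.SNOOP, Phase.ARBITRATION], which is_valid_pipeline accepts. A list such as [Phase.REQUEST, Phase.SNOOP] is rejected because the newer transaction would have passed the older one.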
Cache coherency is an important feature of a multiple agent system. If an agent is to operate on data, it must confirm that the data it will read is the most current copy of the data that is available. In such multiple agent systems, several agents may operate on data from a single address. Oftentimes when a first agent 10 desires to operate on data at an address, a second agent 30 may have cached a copy of the data that is more current than the copy resident in an external cache 40. The first agent 10 should read the data from the second agent 30 rather than from the external cache 40. Without a means to coordinate among agents, an agent 10 may perform a data operation on stale data.
In a snoop phase, the agents coordinate to maintain cache coherency. In the snoop phase, each of the other agents 20-40 reports whether it possesses a copy of the data or whether it possesses a modified ("dirty") copy of the data at the requested address. In the Pentium Pro, an agent indicates that it possesses a copy of the data by asserting a HIT# pin in a snoop response. It indicates that it possesses a dirty copy of the requested data by asserting a HITM# pin. If dirty data exists, it is more current than the copy in memory. Thus, dirty data will be read by an agent 10 from the agent 20 possessing the dirty copy. Non-dirty data is read by an agent 10 from memory. Only an agent that possesses a copy of data at the requested address drives a snoop response; if an agent does not possess such a copy, it generates no response.
A snoop response is expected from all agents 10-40 within a predetermined period of time. Occasionally, an agent 30 cannot respond to another agent's request before the period closes. When this occurs, the agent 30 may generate a "snoop stall response" that indicates that the requesting agent 10 must wait beyond the period for snoop results. In the Pentium Pro processor, the snoop stall signal occurs when an agent 30 toggles outputs HIT# and HITM# from high to low in unison.
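For illustration, the snoop signalling described above may be sketched in Python. The sketch represents each pin as a boolean that is True when the pin is asserted, and it treats simultaneous assertion of HIT# and HITM# as the snoop stall. The names Snoop, decode and resolve, and the tuple values returned by resolve, are hypothetical conveniences rather than part of any actual processor interface.

```python
from enum import Enum


class Snoop(Enum):
    MISS = "miss"    # agent holds no copy and drives no response
    CLEAN = "clean"  # HIT# asserted: agent holds an unmodified copy
    DIRTY = "dirty"  # HITM# asserted: agent holds a modified copy
    STALL = "stall"  # both asserted together: agent needs more time


def decode(hit_asserted, hitm_asserted):
    """Map one agent's HIT#/HITM# pin states to a snoop result."""
    if hit_asserted and hitm_asserted:
        return Snoop.STALL
    if hitm_asserted:
        return Snoop.DIRTY
    if hit_asserted:
        return Snoop.CLEAN
    return Snoop.MISS


def resolve(responses):
    """Decide where the requesting agent reads the data from.

    responses maps an agent number to its (hit_asserted, hitm_asserted) pair.
    Returns ("wait", None) if any agent stalls, ("agent", n) if agent n holds
    a dirty copy, and ("memory", None) otherwise.
    """
    results = {agent: decode(*pins) for agent, pins in responses.items()}
    if Snoop.STALL in results.values():
        return ("wait", None)
    for agent, result in results.items():
        if result is Snoop.DIRTY:
            return ("agent", agent)
    return ("memory", None)
```

Under these assumptions, resolve({20: (False, True), 30: (False, False)}) selects agent 20 as the data source, while a response of (True, True) from any agent causes the requesting agent to wait.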
FIG. 2 illustrates components of a bus sequencing unit ("BSU") 100 and a core 200 within a processor 10 as are known in the art. The BSU 100 manages transaction requests generated within the processor 10 and interfaces the processor 10 to the external bus 50. The core 200 executes micro operations ("UOPs"), such as the processing operations that are required to execute software programs.
The BSU 100 is populated by a bus sequencing queue 140 ("BSQ"), an external bus controller 150 ("EBC"), a read line buffer 160 and a snoop queue 170. The BSQ 140 processes requests generated within the processor 10 that must be referred to the external bus 50 for completion. The EBC 150 drives the bus to implement requests. It also monitors transactions initiated by other agents on the external bus 50. The snoop queue 170 monitors snoop requests made on the external bus 50, polls various components within the processor 10 regarding the snoop request and generates snoop results therefrom. The snoop results indicate whether the responding agent possesses non-dirty data, dirty data or is snoop stalling. Responsive to the snoop results, the EBC 150 asserts the result on the external bus.
As noted, the BSQ 140 monitors requests generated from within the processor 10 to be referred to the external bus 50 for execution. An example of one such request is a read of data from external memory to the core 200. "Data" may represent either an instruction to be executed by the core or variable data representing data input to such an instruction. The BSQ 140 passes the request to the EBC 150 to begin a transaction on the external bus 50. The BSQ 140 includes a buffer memory 142 that stores the requests tracked by the BSQ 140. The number of registers 142a-h in memory 142 determines how many transactions the BSQ 140 may track simultaneously.
The EBC 150 tracks activity on the external bus 50. It includes a pin controller 152 that may drive data on the external bus 50. It includes an in-order queue 154 that stores data that is asserted on the bus at certain events. For example, snoop results to be asserted on the bus during a snoop phase may be stored in the in-order queue 154. The EBC 150 interfaces with the snoop queue 170 and BSQ 140 to accumulate data to be asserted on the external bus 50.
During the data phase of a transaction, data is read from the external bus 50 into the read line buffer 160. The read line buffer 160 is an intermediate storage buffer, having a memory 162 populated by its own number of registers 162a-h. The read line buffer 160 provides for storage of data read from the external bus 50. The read line buffer 160 stores the data only temporarily; it is routed to another destination such as a cache 180 in the BSU 100, a data cache 210 in the core or an instruction cache 220 in the core. Data read into a read line buffer storage entry 162a is cleared when its destination becomes available.
There is a one-to-one correspondence between read line buffer entries 162a-h and BSQ buffer entries 142a-h. Thus, data from a request buffered in BSQ entry 142a will be read into buffer entry 162a. For each request buffered in BSQ buffer 142, data associated with the request is buffered in the buffer memory 162 in the read line buffer 160.
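The fixed pairing of entries may be illustrated with a minimal Python sketch. It assumes a depth of eight, matching entries a-h, and uses the entry index as the only link between the two buffers. The class BusSequencingUnit and its methods allocate, data_phase and drain are hypothetical names introduced for this example only.

```python
class BusSequencingUnit:
    """Toy model of the known arrangement: BSQ entry i owns read line buffer entry i."""

    def __init__(self, depth=8):
        self.bsq = [None] * depth  # stands in for buffer memory 142
        self.rlb = [None] * depth  # stands in for buffer memory 162

    def allocate(self, request):
        """Store a request in the first free BSQ entry and return its index.

        Returns None when every BSQ entry is in use.
        """
        for index, entry in enumerate(self.bsq):
            if entry is None:
                self.bsq[index] = request
                return index
        return None

    def data_phase(self, index, data):
        """Data for the request in BSQ entry `index` lands in the same-numbered entry."""
        self.rlb[index] = data

    def drain(self, index):
        """Forward the data to its destination and free both paired entries."""
        data = self.rlb[index]
        self.rlb[index] = None
        self.bsq[index] = None
        return data
```

Because the index is shared, the sketch makes the cost of the pairing visible: a read line buffer entry is reserved for the whole life of its BSQ entry, even though it holds data only between data_phase and drain.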
The one-to-one correspondence between the depth of the BSQ buffer 142 and the depth of the read line buffer 160 is inefficient. Read line buffer utilization is very low. The read line buffer 160 operates at a data rate associated with the BSU 100 and the core 200, which is much higher than the data rate of the external bus 50. Thus, data is likely to be read out of the read line buffer 160 faster than the bus 50 can provide data to it. The one-to-one correspondence of BSQ buffer entries to read line buffer entries is therefore unnecessary. Also, the read line buffer storage entries 162a-h consume a significant amount of area when the processor is fabricated as an integrated circuit.
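The utilization argument may be made concrete with a rough Python sketch. It assumes that the bus delivers one line every arrival_interval core cycles and that each line occupies a read line buffer entry for drain_latency core cycles before it is forwarded. Both parameters, the function name peak_occupancy and the numbers used with it are hypothetical and are not measurements of any processor.

```python
def peak_occupancy(num_lines, arrival_interval, drain_latency):
    """Return the largest number of buffer entries holding data at one time.

    Line k arrives at cycle k * arrival_interval and frees its entry at
    arrival time + drain_latency (all values in core cycles, assumed).
    """
    release_times = []
    peak = 0
    for k in range(num_lines):
        now = k * arrival_interval
        # Entries whose data has already been forwarded are free again.
        release_times = [t for t in release_times if t > now]
        release_times.append(now + drain_latency)
        peak = max(peak, len(release_times))
    return peak
```

With a slow bus and a fast drain, for example peak_occupancy(8, 4, 1), only one entry is ever occupied at a time, so seven of eight entries would sit idle under these assumed figures. Occupancy rises only when the drain is slower than the arrival rate, as in peak_occupancy(8, 1, 3), which yields three.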
It is desired to increase the depth of buffers in the BSQ 140. In the future, latency between the request phase and the data phase of transactions on the external bus 50 is expected to increase. External buses 50 will become more deeply pipelined. Consequently, a greater number of transactions will progress on the external bus 50 at once. Accordingly, greater depth of the BSQ buffer 142 will be necessary to track these transactions. However, because of the one-to-one correspondence, increasing the depth of the BSQ buffer 142 requires a corresponding increase in the depth of the read line buffer 160 and therefore incurs substantial area costs. It would also further decrease the already low utilization of the read line buffer 160. Accordingly, there is a need in the art for a processor architecture that severs the relationship between the read line buffer depth and the BSQ buffer depth.