1. Field of the Invention
The present invention generally relates to computer systems, and more specifically to a method of reducing memory access latency when a processor of a computer system requests a value which is not currently located in the processor's cache memory, and particularly to such a method adapted for use with a computing system wherein a higher level bus (such as a processor bus) operates at a different frequency from that of a lower level bus (such as the system bus).
2. Description of Related Art
A typical structure for a conventional computer system includes one or more processing units connected to a system memory device (random access memory or RAM) and to various peripheral, or input/output (I/O), devices such as a display monitor, a keyboard, a graphical pointer (mouse), and a permanent storage device (hard disk). The system memory device is used by a processing unit in carrying out program instructions, and stores those instructions as well as data values that are fed to or generated by the programs. A processing unit communicates with the peripheral devices by various means, including a generalized interconnect or bus, or direct memory-access channels. A computer system may have many additional components, such as serial and parallel ports for connection to, e.g., modems, printers, and network adapters. Other components might further be used in conjunction with the foregoing; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access the system memory, etc.
A conventional processing unit includes a processor core having various execution units and registers, as well as branch and dispatch units which forward instructions to the appropriate execution units. Caches are commonly provided for both instructions and data, to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory (RAM). These caches are referred to as "on-board" when they are integrally packaged with the processor core on a single integrated chip. Each cache is associated with a cache controller or bus interface unit that manages the transfer of values between the processor core and the cache memory.
A processing unit can include additional caches, such as a level 2 (L2) cache which supports the on-board (level 1) caches. In other words, the L2 cache acts as an intermediary between system memory and the on-board caches, and can store a much larger amount of information (both instructions and data) than the on-board caches can, but at a longer access penalty. Multi-level cache hierarchies can be provided where there are many levels of interconnected caches.
A typical system architecture is shown in FIG. 1, and is exemplary of the PowerPC™ processors marketed by International Business Machines Corporation (IBM, assignee of the present invention). Computer system 10 includes a processing unit 12a, various I/O devices 14, RAM 16, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals whenever the computer is first turned on. Processing unit 12a communicates with the peripheral devices using a bus 20. Processing unit 12a includes a processor core 22, and an instruction cache 24 and a data cache 26, which are implemented using high speed memory devices, and are integrally packaged with the processor core on a single integrated chip 28. Cache 30 (L2) supports caches 24 and 26. For example, cache 30 may be a chip having a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 30 is connected to system bus 20 and a cache or processor bus 32, and all loading of information from memory 16 into processor core 22 must come through cache 30. More than one processor may be provided, as indicated by processing unit 12b.
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit field and inclusivity bit fields are used to maintain cache coherency in a multi-processor computer system (indicating the validity of the value stored in the cache). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache "hit." The collection of all of the address tags in a cache is referred to as a directory (and sometimes includes the state bit and inclusivity bit fields), and the collection of all of the value fields is the cache entry array.
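By way of illustration only, the tag comparison described above can be sketched in Python as follows. The sketch assumes a direct-mapped organization, a 32-byte line, and 1024 lines, none of which is specified in the foregoing description; the names `CacheLine`, `split_address`, `lookup`, and `fill` are hypothetical, and the state and inclusivity fields are reduced to single booleans.

```python
from dataclasses import dataclass

LINE_BYTES = 32    # assumed line (block) size, for illustration only
NUM_LINES = 1024   # assumed number of lines in a direct-mapped cache


@dataclass
class CacheLine:
    tag: int = 0                      # address tag field
    valid: bool = False               # simplified stand-in for the state bit field
    inclusive: bool = False           # simplified stand-in for the inclusivity bit field
    value: bytes = bytes(LINE_BYTES)  # value field (instruction or data)


def split_address(addr):
    """Split a byte address into (tag, index) for the assumed geometry."""
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    return tag, index


def lookup(cache, addr):
    """Return the stored value on a hit, or None on a miss."""
    tag, index = split_address(addr)
    line = cache[index]
    if line.valid and line.tag == tag:  # compare match on a valid line is a hit
        return line.value
    return None


def fill(cache, addr, value):
    """Install a value retrieved from lower in the memory hierarchy."""
    tag, index = split_address(addr)
    cache[index] = CacheLine(tag=tag, valid=True, value=value)
```

In this sketch the list of `tag` and `valid` fields plays the role of the directory, and the list of `value` fields plays the role of the cache entry array.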
When a cache receives a request from a processor core, whether a read or write operation, to access a memory location, and the cache does not have a current valid copy of the value (data or instruction) corresponding to that memory location (a cache "miss"), the cache must wait to fulfill the request until the value can be retrieved from a location lower in the memory hierarchy. Cache misses thus introduce a memory access latency penalty, and can occur at every level of the cache architecture. When a cache miss occurs at the lowest level (e.g., the L2 cache in FIG. 1), the system bus latency can seriously degrade performance.
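The accumulation of latency across levels can be illustrated with the following minimal Python sketch. Each cache level is modeled simply as a set of resident addresses together with an access time in cycles; the function name `access_latency` and all cycle counts are hypothetical and are not taken from any particular system.

```python
def access_latency(levels, addr, memory_latency):
    """Return the cycles needed to obtain `addr`.

    `levels` is a list of (resident_addresses, access_cycles) pairs ordered
    from the highest level (closest to the processor core) to the lowest.
    Every level that is probed adds its access time; a miss at the lowest
    level additionally pays `memory_latency`.
    """
    total = 0
    for resident, cycles in levels:
        total += cycles
        if addr in resident:  # hit at this level
            return total
    return total + memory_latency  # missed at every level


# Hypothetical two-level hierarchy: L1 takes 1 cycle, L2 takes 8 cycles.
l1 = ({0x100}, 1)
l2 = ({0x100, 0x200}, 8)
hit_l1 = access_latency([l1, l2], 0x100, 60)    # 1 cycle
hit_l2 = access_latency([l1, l2], 0x200, 60)    # 1 + 8 = 9 cycles
miss_all = access_latency([l1, l2], 0x300, 60)  # 1 + 8 + 60 = 69 cycles
```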
One technique for reducing bus latency involves "pipelining." An address of a memory block is passed using an address bus which is separate from the data bus used to transmit the actual value associated with the memory block. In a cache that allows pipelining, the address tenure of a subsequent (second) bus operation can overlap with the data tenure of a current (first) operation. This feature can improve bus throughput because most of the bus traffic involves burst transfers in which a lot of data is transferred with one address; for example, a microprocessor might transfer eight 32-bit words for each 32-bit address in a burst transfer. Overlapping of subsequent address phases with the lengthy data phase achieves a pipeline effect which can reduce the idle time on the data bus. With multi-processor systems in particular, there can be a large amount of inter-processor communications, many of which are address-only bus operations that do not require the data bus; by pipelining, these operations have a negligible impact on data bus bandwidth. In addition to allowing pipelining, a bus can also be "split," meaning that other bus activity, including activity from other processing units (masters), can start between the address and data tenures of a previous transaction.
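The benefit of overlapping address and data tenures can be estimated with a simplified cycle count, sketched below in Python. The model assumes that every transaction has the same address tenure and data tenure lengths, that a data tenure begins as soon as its own address tenure and the preceding data tenure have both ended, and that there are no retries or arbitration delays; the function names and the cycle counts in the example are hypothetical.

```python
def serial_cycles(n, addr_cycles, data_cycles):
    """Total cycles when each address tenure waits for the prior data tenure."""
    return n * (addr_cycles + data_cycles)


def pipelined_cycles(n, addr_cycles, data_cycles):
    """Total cycles when the next address tenure overlaps the current data tenure."""
    if n == 0:
        return 0
    # After the first address tenure, the longer of the two tenures sets the pace.
    return addr_cycles + (n - 1) * max(addr_cycles, data_cycles) + data_cycles


# Four burst transfers, each with an assumed 2-cycle address tenure and an
# 8-cycle data tenure (one cycle per word of an eight-word burst).
without_overlap = serial_cycles(4, 2, 8)   # 4 * (2 + 8) = 40
with_overlap = pipelined_cycles(4, 2, 8)   # 2 + 3 * 8 + 8 = 34
```

Under these assumptions only the first address tenure is exposed; every later address tenure is hidden behind a data tenure, so the data bus is never idle between bursts.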
In a system where the address and data buses are pipelined and split, additional design complexities are introduced in the bus interface logic. For example, in order to support a range of transaction types and maintain cache coherency states (which are used to ensure that only one processor has permission to write to a given memory location at any point in time and that all copies of the memory location are consistent), bus interface designs tend to couple the address bus operations with various aspects of the data bus operations. This coupling of the address bus and data bus operations, however, restricts expandability of the bus interface.
One method for decoupling the address and data buses is discussed in the article "Separating the Interaction of Address and Data State During Bus Data Transfers," IBM Technical Disclosure Bulletin, vol. 37, no. 5 (May 1994). According to that method, the address tenure begins with the TS_ signal (Transfer Start) asserted by the processor. A single signal AACK_ (Address ACKnowledge) is used to terminate the address tenure. No data bus signals can cause the address tenure to terminate. Two status lines are sampled during the cycle following address termination, including an address retry signal ARTRY_. The address tenure can be forced to re-execute via assertion of ARTRY_ by some other bus device which cannot immediately determine the appropriate response to the request (for example, if its snoop queue is already full). Data is generally not committed until the address tenure has completed successfully and can no longer be retried by other bus devices. In order to be able to commit data quickly, the specification constraints for this method require that any signaling of a retry must occur at the same time that the data is asserted. Retry cannot be signaled after this "retry window" even though the address tenure is still open.
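The address tenure sequence described above can be pictured as a small state machine, sketched below in Python. This is a simplified illustration rather than the protocol as specified in the cited article: each bus cycle is represented by a dictionary of booleans in which True means the corresponding active-low signal is asserted, the second status line and the retry window are not modeled, and the function name, state names, and event names are hypothetical.

```python
def run_address_tenure(cycles):
    """Trace address tenures over a list of per-cycle signal samples.

    Each element of `cycles` is a dict with optional boolean keys
    'ts', 'aack', and 'artry' (True = asserted). Data bus signals are
    deliberately absent, since they cannot terminate the address tenure.
    Returns a list of (cycle_number, event) tuples.
    """
    state = "IDLE"
    events = []
    for n, sig in enumerate(cycles):
        if state == "IDLE":
            if sig.get("ts"):          # TS_ begins the address tenure
                state = "ADDRESS"
                events.append((n, "start"))
        elif state == "ADDRESS":
            if sig.get("aack"):        # AACK_ alone terminates the tenure
                state = "SAMPLE"
                events.append((n, "terminated"))
        elif state == "SAMPLE":
            # ARTRY_ is sampled in the cycle following address termination.
            events.append((n, "retry" if sig.get("artry") else "complete"))
            state = "IDLE"
    return events


# A first tenure that is retried, followed by a re-execution that completes.
trace = run_address_tenure([
    {"ts": True}, {}, {"aack": True}, {"artry": True},
    {"ts": True}, {"aack": True}, {},
])
```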
The foregoing protocol for decoupling the address and data buses has the unfortunate effect of contributing to bus latency in a multi-level cache, that is, where two buses are connected together serially such that their address and data tenures are connected, and particularly when the clock frequencies of the two buses are different (e.g., a ratio such as 2:1 or 3:1). In this construction, a tenure (address or data) in the higher level bus (such as a CPU bus) is closed only after the same tenure is closed on the lower level bus (such as the system bus), as shown in FIG. 2.
In FIG. 2 (which assumes a 2:1 clock ratio), the address tenure on the processor bus begins with the assertion of the CPU_TS_ signal. When an L2 miss occurs, the request is asserted on the system bus using the SYS_TS_ signal. Sometime thereafter, the address tenure on the system bus is terminated by asserting SYS_AACK_, and the signal SYS_ARTRY_ is checked. In the next cycle, CPU_AACK_ is asserted to terminate the address tenure on the CPU bus. Because the CPU address tenure cannot close before the system address tenure, the L2 controller is also prevented from returning data to the CPU bus until SYS_AACK_ is asserted. If, however, a retry is not actually issued, then the bridge has waited unnecessarily for the assertion of SYS_AACK_ to close the CPU address tenure. It would, therefore, be desirable and advantageous to devise a bus interface for a bus bridge, such as an L2 controller, which reduces bus latency by returning data from the system bus to the CPU bus in a more efficient manner.