1. Field of the Invention
The invention relates to computer systems in which a host processor and a bus master can access the same address space, and more particularly, to techniques for facilitating burst accesses by such a master.
2. Description of Related Art
In a typical IBM PC/AT-compatible computer system, a host processing unit is coupled to a host bus and most I/O peripheral devices are coupled to a separate I/O bus. The host processing unit typically comprises an Intel i386, i486 or Pentium™ microprocessor, and the I/O bus typically conforms to a standard known as ISA (Industry Standard Architecture). I/O interface circuitry, which usually comprises one or more chips in a "core logic chipset", provides an interface between the two buses. A typical system also includes a memory subsystem, which usually comprises a large array of DRAM and perhaps a cache memory.
General information on the various forms of IBM PC AT-compatible computers can be found in IBM, "Technical Reference, Personal Computer AT" (1985), in Sanchez, "IBM Microcomputers: A Programmer's Handbook" (McGraw-Hill: 1990), in MicroDesign Resources, "PC Chip Sets" (1992), and in Solari, "AT Bus Design" (San Diego: Annabooks, 1990). See also the various data books and data sheets published by Intel Corporation concerning the structure and use of the 80×86 family of microprocessors, including Intel Corp., "Pentium™ Processor", Preliminary Data Sheet (1993); Intel Corp., "Pentium™ Processor User's Manual" (1994); "i486 Microprocessor Hardware Reference Manual", published by Intel Corporation, copyright date 1990; "386 SX Microprocessor", data sheet, published by Intel Corporation (1990); and "386 DX Microprocessor", data sheet, published by Intel Corporation (1990). In addition, a typical core logic chipset includes the OPTi 82C802G and either the 82C601 or 82C602, all incorporated herein by reference. The 82C802G is described in OPTi, Inc., "OPTi PC/AT Single Chip 82C802G Data Book", Version 1.2a (Dec. 1, 1993), and the 82C601 and 82C602 are described in OPTi, Inc., "PC/AT Data Buffer Chips, Preliminary, 82C601/82C602 Data Book", Version 1.0e (Oct. 13, 1993). All the above references are incorporated herein by reference.
Many IBM PC AT-compatible computers today include one, and usually two, levels of cache memory. A cache memory is a high-speed memory that is positioned between a microprocessor and main memory in a computer system in order to improve system performance. Cache memories (or caches) store copies of portions of main memory data that are actively being used by the central processing unit (CPU) while a program is running. Since the access time of a cache can be faster than that of main memory, the overall access time can be reduced. Descriptions of various uses of and methods of employing caches appear in the following articles: Kaplan, "Cache-based Computer Systems," Computer, 3/73 at 30-36; Rhodes, "Caches Keep Main Memories From Slowing Down Fast CPUs," Electronic Design, Jan. 21, 1982, at 179; Strecker, "Cache Memories for PDP-11 Family Computers," in Bell, "Computer Engineering" (Digital Press), at 263-67, all incorporated herein by reference. See also the description at pp. 6-1 through 6-11 of the "i486 Processor Hardware Reference Manual" incorporated above.
Many microprocessor-based systems implement a "direct mapped" cache memory. In general, a direct mapped cache memory comprises a high speed data Random Access Memory (RAM) and a parallel high speed tag RAM. The RAM address of each line in the data cache is the same as the low-order portion of the main memory line address to which the entry corresponds, the high-order portion of the main memory address being stored in the tag RAM. Thus, if main memory is thought of as 2^m blocks of 2^n "lines" of one or more bytes each, the i'th line in the cache data RAM will be a copy of the i'th line of one of the 2^m blocks in main memory. The identity of the main memory block that the line came from is stored in the i'th location in the tag RAM.
When a CPU requests data from memory, the low-order portion of the line address is supplied as an address to both the cache data and cache tag RAMs. The tag for the selected cache entry is compared with the high-order portion of the CPU's address and, if it matches, then a "cache hit" is indicated and the data from the cache data RAM is enabled onto a data bus of the system. If the tag does not match the high-order portion of the CPU's address, or the tag data is invalid, then a "cache miss" is indicated and the data is fetched from main memory. It is also placed in the cache for potential future use, overwriting the previous entry. Typically, an entire line is read from main memory and placed in the cache on a cache miss, even if only a byte is requested. On a data write from the CPU, either the cache RAM or main memory or both may be updated, it being understood that flags may be necessary to indicate to one that a write has occurred in the other.
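By way of illustration only, the read path just described can be sketched in software. The class name, the use of a Python list as main memory, and the choice of None to mark an invalid tag are assumptions made for this sketch and do not appear in the references.

```python
class DirectMappedCache:
    """Toy model of a direct mapped cache holding 2**index_bits lines."""

    def __init__(self, index_bits, memory):
        self.index_bits = index_bits
        self.num_lines = 1 << index_bits
        self.tags = [None] * self.num_lines   # tag RAM; None marks an invalid entry
        self.data = [None] * self.num_lines   # data RAM
        self.memory = memory                  # main memory, indexed by line address

    def read(self, line_addr):
        """Return (line_data, hit) for a read of the given line address."""
        index = line_addr & (self.num_lines - 1)   # low-order portion selects the entry
        tag = line_addr >> self.index_bits         # high-order portion is the tag
        if self.tags[index] == tag:
            return self.data[index], True          # cache hit
        # Cache miss: fetch the line and overwrite the previous entry.
        line = self.memory[line_addr]
        self.tags[index] = tag
        self.data[index] = line
        return line, False
```

With eight cache lines, line addresses 5 and 13 select the same entry, so reading one evicts the other.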
Accordingly, in a direct mapped cache, each "line" of secondary memory can be mapped to one and only one line in the cache. In a "fully associative" cache, a particular line of secondary memory may be mapped to any of the lines in the cache; in this case, in a cacheable access, all of the tags must be compared to the address in order to determine whether a cache hit or miss has occurred. "k-way set associative" cache architectures also exist which represent a compromise between direct mapped caches and fully associative caches. In a k-way set associative cache architecture, each line of secondary memory may be mapped to any of k lines in the cache. In this case, k tags must be compared to the address during a cacheable secondary memory access in order to determine whether a cache hit or miss has occurred. Caches may also be "sector buffered" or "sub-block" type caches, in which several cache data lines, each with its own valid bit, correspond to a single cache tag RAM entry.
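The three organizations differ only in how many cache lines are candidates for a given line of secondary memory, and hence in how many tags must be compared. A minimal sketch follows; the function name and the particular numbering of slots within a set are assumptions of this sketch.

```python
def candidate_lines(line_addr, num_lines, ways):
    """Return the cache slots that may hold memory line `line_addr`.

    ways == 1 models a direct mapped cache, ways == num_lines a fully
    associative cache, and anything in between a k-way set associative cache.
    The slot numbering (set_index * ways + way) is a hypothetical layout.
    """
    if ways < 1 or num_lines % ways != 0:
        raise ValueError("ways must divide num_lines")
    num_sets = num_lines // ways
    set_index = line_addr % num_sets
    return [set_index * ways + way for way in range(ways)]
```

The length of the returned list equals the number of tag comparisons needed for one access.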
When the CPU executes instructions that modify the contents of the cache, these modifications must also be made in the main memory or the data in main memory will become "stale." There are two conventional techniques for keeping the contents of the main memory consistent with that of the cache: (1) the write-through method and (2) the write-back or copy-back method. In the write-through method, on a cache write hit, data is written to the main memory immediately after or while data is written into the cache. This enables the contents of the main memory always to be valid and consistent with that of the cache. In the write-back method, on a cache write hit, the system writes data into the cache and sets a "dirty bit" which indicates that a data word has been written into the cache but not into the main memory. A cache controller checks for a dirty bit before overwriting any line of data in the cache, and if set, writes the line of data out to main memory before loading the cache with new data.
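The difference between the two policies can be illustrated with a deliberately small model: a cache holding a single line. The class and method names are hypothetical, and main memory is represented by a Python dict.

```python
class OneLineCache:
    """Single-line cache illustrating write-through versus write-back."""

    def __init__(self, memory, write_back):
        self.memory = memory          # dict: address -> value
        self.write_back = write_back  # True selects the write-back policy
        self.addr = None
        self.value = None
        self.dirty = False

    def load(self, addr):
        """Load a new line, first writing out the old one if it is dirty."""
        if self.dirty:
            self.memory[self.addr] = self.value
            self.dirty = False
        self.addr = addr
        self.value = self.memory[addr]

    def write_hit(self, value):
        """Handle a write hit to the line currently held."""
        self.value = value
        if self.write_back:
            self.dirty = True                # main memory is now stale
        else:
            self.memory[self.addr] = value   # write-through keeps memory current
```

Under write-back, main memory is updated only when the dirty line is about to be overwritten.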
A computer system can have more than one level of cache memory for a given address space. For example, in a two-level cache system, the "level one" (L1) cache is logically adjacent to the host processor. The second level (L2) cache is logically behind the first level cache, and DRAM memory (which in this case can be referred to as tertiary memory) is located logically behind the second level cache. When the host processor performs an access to an address in the memory address space, the first level cache responds if possible. If the first level cache cannot respond (for example, because of an L1 cache miss), then the second level cache responds if possible. If the second level cache also cannot respond, then the access is made to DRAM itself. The host processor does not need to know how many levels of caching are present in the system or indeed that any caching exists at all. Similarly, the first level cache does not need to know whether a second level of caching exists prior to the DRAM. Thus, to the host processing unit, the combination of both caches and DRAM is considered merely as a single main memory structure. Similarly, to the L1 cache, the combination of the L2 cache and DRAM is considered simply as a single main memory structure. In fact, a third level of caching could be included between the L2 cache and the actual DRAM, and the L2 cache would still consider the combination of L3 and DRAM as a single main memory structure.
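The layering can be sketched as a chain in which each level knows only the level behind it. The class name and the dict-based storage are assumptions of this sketch; capacity limits and replacement are deliberately omitted.

```python
class Level:
    """One level of the memory hierarchy; `backing` is whatever lies behind it."""

    def __init__(self, name, contents, backing=None):
        self.name = name
        self.contents = dict(contents)  # address -> value held at this level
        self.backing = backing          # next level, or None for DRAM itself

    def read(self, addr):
        """Return (value, name of the level that supplied it)."""
        if addr in self.contents:
            return self.contents[addr], self.name
        if self.backing is None:
            raise KeyError(addr)
        # This level cannot respond, so the access passes to the level behind it.
        value, source = self.backing.read(addr)
        self.contents[addr] = value     # keep a copy for future accesses
        return value, source
```

The caller of read never learns how many levels exist behind the one it addresses, which is the point made above.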
As the x86 family of microprocessors has advanced, additional functions have been included on the microprocessor chip itself. For example, while i386-compatible microprocessors did not include any cache memory on-chip, the i486-compatible microprocessors did. Specifically, these microprocessors included a level one, "write-through" cache memory.
Pentium-compatible microprocessors also include a level one cache on-chip. This cache is divided into a data cache and a separate code cache. Unlike the cache included on the i486-compatible microprocessor chips, the data cache on a Pentium chip follows a write-back policy. The cache is actually programmable on a line-by-line basis to follow a write-through or a write-back policy, but special precautions must be taken externally to the chip as long as even one line is to follow a write-back policy as further explained below. Thus, as used herein, a "write-back cache" is a cache memory, any part of which can hold data which is inconsistent with that in the external memory subsystem while an access takes place to the same memory address space by another bus master.
The data cache on a Pentium chip implements a "modified/exclusive/shared/invalid" (MESI) write-back cache consistency protocol, whereas the code cache only supports the "shared" and "invalid" states of the MESI protocol. The MESI protocol is described in Intel, "Pentium Processor User's Manual, Vol. 1: Pentium Processor Databook" (1993), incorporated herein by reference, especially at pp. 3-20 through 3-21. In the MESI protocol, each cache data line is accompanied by a pair of bits which indicate the status of the line. Specifically, if a line is in state M, then it is "modified" (different from main memory). In multiprocessor systems in which more than one of the processors has a cache, state M also indicates that the line is available in only one cache. An M-state line can be accessed (read or written) by the host processor unit without sending a cycle out on an external bus to higher levels of the memory subsystem.
If a cache line is in state E ("exclusive"), then it is not "modified" (i.e. it contains the same data as subsequent levels of the memory subsystem). In shared cache systems, state E also indicates that the cache line is available in only one of the caches. The host processor unit can access (read or write) an E-state line without generating a bus cycle to higher levels of the memory subsystem, but when the host processor performs a write access to an E-state line, the line then becomes "modified" (state M).
A line in state S ("shared") may exist in more than one cache. A read access by the host processor to an S-state line will not generate bus activity, but a write access to an S-state line will cause a write-through cycle to higher levels of the memory subsystem in order to permit the sharing cache to potentially invalidate its own corresponding line. The write will also update the data in the data cache line.
A line in state I is invalid. It is not available in the cache. A read access by the host processor unit to an I-state line will generate a "cache miss" and may cause the cache to execute a line fill (fetch the entire line into the cache from higher levels of the memory subsystem). A write access by the host processor unit to an I-state line will cause the cache to execute a write-through cycle to higher levels of the memory subsystem.
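The host-side behavior described for the four states can be collected into a lookup table. This is only a summary of the preceding description, not a full model of the protocol: where the description does not state the resulting line state, the table records None rather than guessing. The names HOST_ACCESS and host_access are hypothetical.

```python
# (state, operation) -> (external bus cycle or None, next state or None if not stated)
HOST_ACCESS = {
    ("M", "read"):  (None, "M"),
    ("M", "write"): (None, "M"),
    ("E", "read"):  (None, "E"),
    ("E", "write"): (None, "M"),             # E-state line becomes modified
    ("S", "read"):  (None, "S"),
    ("S", "write"): ("write-through", None),
    ("I", "read"):  ("line fill", None),     # a miss; the line fill is optional
    ("I", "write"): ("write-through", None),
}


def host_access(state, operation):
    """Look up the bus activity and resulting state for a host processor access."""
    return HOST_ACCESS[(state, operation)]


def generates_bus_cycle(state, operation):
    """True if the access is visible to higher levels of the memory subsystem."""
    return host_access(state, operation)[0] is not None
```

Only accesses to M-state and E-state lines, and reads of S-state lines, complete without any external bus cycle.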
Computer system cache memories typically cache main memory data for the CPU. If the cache uses a write-back protocol, then frequently the cache memory will contain more current data than the corresponding lines in main memory. This poses a problem for other bus masters (and for other CPUs in a multiprocessor system) desiring to access a line of main memory, because it is not known whether the main memory version is the most current version of the data. Write-back cache controllers, therefore, typically support inquire cycles (also known as snoop cycles), in which a bus master asks the cache memory to indicate whether it has a more current copy of the data.
In Pentium-based systems, a bus master initiates an inquire cycle by driving the inquire address onto the CPU address leads and asserting EADS#. The processor responds by asserting its HIT# output if the specified data line is present in the L1 cache. The processor also asserts its HITM# output if the specified L1 cache line is in the M (modified) state. Thus, HITM#, when asserted, indicates that the L1 cache contains a more current copy of the data than is in main memory. The processor then automatically conducts a write-back cycle while the external bus master waits. By this process, therefore, the external bus master will be able to access the desired line in main memory without any further concern that the processor's L1 cache contains a more current copy of the data.
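The inquire sequence can be sketched as follows. The function names and the dict representation of the L1 cache are hypothetical. As a simplifying assumption, the sketch drops the line from the L1 cache after the write-back; the state in which a real processor leaves the line is not modeled here.

```python
def inquire(l1_cache, line_addr):
    """Return (hit, hitm), modeling the HIT# and HITM# responses.

    l1_cache maps line address -> (MESI state, data); absent lines are invalid.
    """
    entry = l1_cache.get(line_addr)
    state = entry[0] if entry else "I"
    hit = state in ("M", "E", "S")   # line is present in the L1 cache
    hitm = state == "M"              # L1 holds a more current copy than memory
    return hit, hitm


def bus_master_read(l1_cache, memory, line_addr):
    """Read a line of main memory on behalf of an external bus master."""
    hit, hitm = inquire(l1_cache, line_addr)
    if hitm:
        # The master waits while the modified line is written back.
        # Dropping the line afterwards is an assumption of this sketch.
        _, data = l1_cache.pop(line_addr)
        memory[line_addr] = data
    return memory[line_addr]
```

After the write-back, the master reads main memory with no further concern about the L1 copy.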
One of the bottlenecks that has limited the performance of personal computers in the past has been the maximum specified speed of the ISA bus. In the original IBM PC AT computers manufactured by IBM Corp., the I/O bus operated with a data rate of 8 MHz (BCLK=8 MHz). This was an appropriate data rate at that time since it was approximately equivalent to the highest data rates which the CPUs of that era could operate with on the host bus. CPU data rates are many times faster today, however, so the slow speed of the I/O bus severely limits the throughput of systems today. One solution for this problem has been the development of a local bus standard, by which certain devices which were traditionally located on the I/O bus can now be located on the host bus. This standard, referred to herein as the VESA (Video Electronics Standards Association) or VL-Bus standard, is defined in VESA, "VESA VL-Bus Local Bus Standard", Revision 1.0 (1992), and in VESA, "VESA VL-Bus Proposal", Version 2.0p, Revision 0.8p (May 17, 1993), both incorporated herein by reference.
Another solution to the problem has been the development of another standard, referred to herein as the PCI standard, defined in PCI Special Interest Group, "PCI Local Bus Specification Revision 2.0" (Apr. 30, 1993), incorporated herein by reference. As used herein, the term "PCI bus" refers to a bus which adheres to this specification, whether or not it also adheres to subsequent revisions of the specification. The PCI bus achieves very high performance, in part because its basic data transfer mode is by burst. That is, data is always transferred to or from a PCI device in a known sequence of data units defined by a known sequence of data unit addresses in an address space. In the "cache line" burst mode, exactly four transfers take place. In the "linear" burst mode, any number of transfers (including 1) can take place to/from linearly sequential addresses until either the initiator or the target terminates the transaction. In either mode, the initiator need only specify the starting address because both parties know the sequence of addresses which follow.
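Because the address sequence of a linear burst is implied by the starting address, either party can reconstruct it locally. A minimal sketch follows, assuming a data unit of 4 bytes; the function name and that default are assumptions of this sketch.

```python
def linear_burst_addresses(start, count, unit_bytes=4):
    """Return the addresses of a linear burst of `count` data units.

    Only `start` is ever placed on the bus; the remaining addresses follow
    from the linearly sequential ordering. unit_bytes=4 is an assumed default.
    """
    if count < 1:
        raise ValueError("a burst transfers at least one data unit")
    return [start + i * unit_bytes for i in range(count)]
```

For example, a three-unit burst starting at 0x100 touches 0x100, 0x104 and 0x108.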
Because of the burst mode of PCI masters, the problem of performing inquire cycles is somewhat more difficult when the bus master is a PCI-bus master than when it is a CPU bus master or ISA-bus master. According to the Pentium databooks, every data transfer to or from the memory address space which is cached by the L1 cache should be preceded by an inquire cycle. This would severely hamper the performance of PCI masters performing burst cycles to or from secondary memory. Many PCI-bus controller chipsets speed up these transfers by performing an inquire cycle only once per cache line instead of on each data transfer. These controllers simply assume that no change will be made to the cache line contents during the remainder of the PCI-bus master burst transfer with the corresponding line of secondary memory. The Intel 82433LX local bus accelerator, for example, maintains a PCI-to-memory read prefetch buffer equal in depth to the length of one cache line, so that if the Pentium processor performs a write-back cycle in response to the inquire cycle, the local bus accelerator chip can capture the remaining words of the cache line for easy completion of further PCI-bus master read accesses within the burst. The 82433LX is described in Intel, "82340 PCIset Cache/Memory Subsystem" (April 1994), incorporated herein by reference.
Even with inquire cycles limited to one per cache line, a problem still exists if the desired burst length proceeds past a cache line boundary. Conventional chipsets determine when a new access in the burst is in a new cache line, and they withhold the PCI-bus TRDY# signal while they perform the necessary inquire cycle for the new cache line. If the Pentium processor asserts HITM#, then the chipset stops the PCI-bus transaction (using a target disconnect termination), allows the L1 cache to perform a write-back operation, and resumes with a new inquire when the PCI master restarts the transaction where it left off. Some chipsets do not stop the PCI-bus transaction, but rather merely withhold TRDY# until the write-back cycle and new inquire cycle are complete, but this violates the PCI-bus specification, which calls for a maximum delay of eight PCI-bus clock cycles before a target asserts TRDY# within a burst. If the inquire cycle for the new line of cache does not produce HITM#, then there is no need to stop the PCI transaction. Instead, conventional chipsets merely withhold TRDY# for the time required to perform the inquire cycle, and then assert TRDY# when the inquire cycle has completed without HITM#.
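The conventional handling of boundary crossings can be sketched as a simple loop. The sketch models only the crossings and assumes the inquire cycle for the first cache line has already completed. The 32-byte line, the 4-byte data unit and the 8-clock inquire latency are illustrative assumptions, as are all names.

```python
LINE_BYTES = 32   # assumed L1 cache line size
UNIT_BYTES = 4    # assumed PCI data unit size


def conventional_burst(start, count, modified_lines, snoop_clocks=8):
    """Return (units_transferred, stall_clocks, disconnected).

    modified_lines is the set of line numbers that would produce HITM#.
    """
    stall = 0
    snooped_line = start // LINE_BYTES   # initial inquire assumed already done
    for i in range(count):
        line = (start + i * UNIT_BYTES) // LINE_BYTES
        if line != snooped_line:
            stall += snoop_clocks        # TRDY# withheld during the inquire cycle
            if line in modified_lines:
                return i, stall, True    # HITM#: target disconnect
            snooped_line = line
        # the data unit transfers here
    return count, stall, False
```

Every crossing costs the full inquire latency, even when the new line turns out not to be modified.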
The time required to perform the inquire cycle, however, is significant. On the PCI-bus, a delay of eight PCI-bus clock cycles may be incurred each time that a linear burst transaction crosses a cache line boundary. A definite need, therefore, exists for a mechanism which allows PCI-bus bursts to proceed past a cache line boundary whenever possible. Such a mechanism can help PCI-bus masters achieve the full promise of high-speed data transfers afforded by the PCI-bus burst transfer protocol.
According to the invention, roughly described, when a PCI-bus controller receives a request from a PCI-bus master to transfer data with an address in secondary memory, the controller performs an initial inquire cycle and withholds TRDY# to the PCI-bus master until any write-back cycle completes. The controller then allows the burst access to take place between secondary memory and the PCI-bus master, and simultaneously and predictively, performs an inquire cycle of the L1 cache for the next cache line. In this manner, if the PCI burst does in fact continue past the cache line boundary, the new inquire cycle will already have taken place (or will already be in progress), thereby allowing the burst to proceed with at most a short delay absent a hit-modified condition. This avoids the need to incur the penalty of stopping the transfer on the PCI bus and restarting it anew at a later time, every time a linear burst transaction crosses a cache line boundary.
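The benefit of overlapping the next-line inquire cycle with the current transfers can be estimated with a rough timing model. All figures are assumptions of this sketch rather than values taken from the description: a 32-byte cache line, 4-byte data units, one clock per data unit, an 8-clock inquire latency, and a predictive inquire that starts with the first transfer of the access. The hit-modified case is not modeled.

```python
LINE_BYTES = 32   # assumed L1 cache line size
UNIT_BYTES = 4    # assumed PCI data unit size


def units_until_boundary(start):
    """Data units from `start` up to the next cache line boundary."""
    return (LINE_BYTES - start % LINE_BYTES) // UNIT_BYTES


def boundary_stall(start, snoop_clocks=8, predictive=True):
    """Estimate clocks stalled at the first cache line boundary (no HITM#)."""
    if not predictive:
        return snoop_clocks              # inquire begins only at the boundary
    # The predictive inquire overlaps the transfers that precede the boundary,
    # assuming one clock per data unit.
    overlap = units_until_boundary(start)
    return max(0, snoop_clocks - overlap)
```

Under these assumptions a burst that starts on a line boundary crosses into the next line with no stall, while one that starts near the end of a line hides only part of the inquire latency.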
In one embodiment, predictive snoop cycles are not performed if the first transfer of a PCI-bus master access would be the last transfer before a cache line boundary is reached, since no advantage would be obtained. In another embodiment, predictive snoop cycles are performed if the first transfer of a PCI-bus master access would be the second-to-last transfer before a cache line boundary is reached, even though some delay will be experienced before the transfer of the first data unit of the next cache line due to the predictive snoop cycle and synchronization delays.
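The decision rule of the first embodiment reduces to a comparison on the starting address. The sketch below assumes a 32-byte cache line and 4-byte data units; the function name and defaults are hypothetical.

```python
def should_snoop_ahead(start, line_bytes=32, unit_bytes=4):
    """Decide whether to launch a predictive snoop of the next cache line.

    Returns False only when the first transfer of the access is the last
    one before the cache line boundary, where no advantage would be gained.
    """
    remaining = (line_bytes - start % line_bytes) // unit_bytes
    return remaining > 1
```

An access whose first transfer is second-to-last in the line still returns True, matching the second embodiment described above.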
Although the invention is described herein with respect to a PCI-bus Pentium system, its usefulness is not limited to such systems. The invention is useful whenever an L1 cache is present which can use a write-back protocol, and which supports inquire cycles, and whenever an I/O bus is present which has a linear-incrementing capability or mode which can continue beyond an L1 cache line boundary.