The present invention relates to cache memory systems and more particularly to a hierarchical caching protocol suitable for use with distributed caches (e.g., in Very Large-Scale Integration (VLSI) devices), and may be utilized within a caching input/output (I/O) hub.
As is known in the art, a system cache serves to enhance the performance of modern computer systems. For example, a cache can hold data between a processor and relatively slower system memory, retaining recently accessed memory locations in case they are needed again. The presence of the cache allows the processor to continuously perform operations utilizing the data in the faster-accessing cache.
Architecturally, a system cache is designed as a “monolithic” unit. To give a processor core simultaneous read and write access from multiple pipelines, multiple ports can be added to the monolithic cache device for external I/O devices. However, using a monolithic cache device with several read/write ports (for example, a dual-ported monolithic cache) has several detrimental architectural and implementation impacts. Current monolithic cache devices are not optimized for multiple ports and are not the most efficient implementation available.
Computer systems are designed to accommodate one or more central processing units (CPUs), coupled via a common system bus or switch to a memory and a number of external input/output devices. The purpose of providing multiple CPUs is to increase the performance of operations by sharing tasks between the processors. Such an arrangement allows the computer to simultaneously support a number of different applications while supporting I/O components that are, for example, communicating over a network and displaying images on attached display devices. Multi-processor computer systems are typically utilized for enterprise and network server systems.
To enhance performance, all of the devices coupled to the bus must communicate efficiently. Idle cycles on the system bus represent time periods in which an application is not being supported, and therefore represent reduced performance.
A number of situations arise in multi-processor computer system designs in which the bus, although not idle, is not being used efficiently by the processors coupled to the bus. Some of these situations arise due to the differing nature of the devices that are coupled to the bus. For example, processors typically include cache logic for temporary storage of data from the memory. A coherency protocol is implemented to ensure that each central processing unit only retrieves the most up-to-date version of data from the cache. MESI (Modified-Exclusive-Shared-Invalid) coherency protocol data can be added to cached data in order to arbitrate and synchronize multiple copies of the same data within various caches. Therefore, processors are commonly referred to as “cacheable” devices.
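By way of illustration only, the MESI state tracking described above can be sketched as follows. This is a simplified model under stated assumptions, not the protocol of any particular processor: the names State, Cache, and Bus are hypothetical, each cache line holds a single value, and write-backs to memory are modeled as immediate.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Cache:
    """Hypothetical cache holding (state, value) pairs keyed by address."""

    def __init__(self, name):
        self.name = name
        self.lines = {}

    def lookup(self, addr):
        # An address that was never cached is treated as INVALID.
        return self.lines.get(addr, (State.INVALID, None))

    def state(self, addr):
        return self.lookup(addr)[0]


class Bus:
    """Hypothetical snooping bus that keeps the attached caches coherent."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, requester, addr):
        st, val = requester.lookup(addr)
        if st is not State.INVALID:
            return val  # hit: M, E, and S lines are all readable locally
        shared = False
        for other in self.caches:
            if other is requester:
                continue
            ost, oval = other.lookup(addr)
            if ost is State.MODIFIED:
                self.memory[addr] = oval  # write the dirty copy back first
            if ost is not State.INVALID:
                other.lines[addr] = (State.SHARED, oval)
                shared = True
        val = self.memory.get(addr, 0)
        new_state = State.SHARED if shared else State.EXCLUSIVE
        requester.lines[addr] = (new_state, val)
        return val

    def write(self, requester, addr, value):
        for other in self.caches:
            if other is requester:
                continue
            ost, oval = other.lookup(addr)
            if ost is State.MODIFIED:
                self.memory[addr] = oval  # preserve the dirty copy
            if ost is not State.INVALID:
                other.lines[addr] = (State.INVALID, None)
        requester.lines[addr] = (State.MODIFIED, value)
```

In this sketch, a read that misses in every other cache yields the Exclusive state, a second reader demotes both copies to Shared, and a write invalidates all other copies and leaves the writer in the Modified state.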
However, I/O components, such as those coupled to a Peripheral Component Interconnect (PCI) bus (see “PCI Local Bus Specification”, version 2.1, Jun. 1, 1995, from the PCI Special Interest Group (PCI-SIG)), are generally non-cacheable devices. That is, they typically do not implement the same cache coherency protocol that is used by the processors. Typically, I/O components retrieve data from memory, or from a cacheable device, via a Direct Memory Access (DMA) operation. Accordingly, measures must be taken to ensure that I/O components only retrieve valid data for their operations. An I/O device may be provided as a connection point between various I/O bridge components, to which I/O components are attached, and ultimately, to the processor.
An I/O device may be utilized as a caching I/O device. That is, the I/O device includes a single, monolithic caching resource for data. Because an I/O device is typically coupled to several client ports, a monolithic I/O cache device will suffer the same detrimental architectural and performance impacts as previously discussed. Current I/O cache device designs are not efficient implementations for high-performance systems.
Implementing multiple cache systems for processors and I/O devices requires cache coherency among the caches. Cache coherency is the synchronization of data in a plurality of caches such that reading a memory location via any cache will return the most recent data written to that location via any other cache. Current solutions for synchronizing multiple cache systems include utilizing the MESI coherency protocol and having each cache broadcast its request to every other cache in the system and then wait for a response from those devices. This approach has the inherent problem of being non-scalable. As additional cache devices are added to a multiple cache system, latency throughout the system increases dramatically, thereby decreasing overall system performance. Multiple cache systems thus become impractical to implement, and therefore a need exists for a scalable method of improving the efficiency and latency performance of distributed caches. Furthermore, a need exists for an improved coherency protocol that improves synchronization performance under the increased data bandwidth inherent in a distributed cache system.
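The scaling problem of the broadcast approach can be made concrete with a rough count of coherency messages. The following is a back-of-the-envelope sketch only; the function names are hypothetical, and it assumes that each request sends one probe to every other cache and receives one response from each, and that a request completes only when the slowest response arrives.

```python
def broadcast_messages_per_request(num_caches):
    """Probes plus responses generated by one broadcast request (assumed model)."""
    if num_caches < 1:
        raise ValueError("num_caches must be at least 1")
    # One probe to each of the other caches, and one response back from each.
    return 2 * (num_caches - 1)


def total_broadcast_messages(num_caches, requests_per_cache=1):
    """Total messages when every cache issues the same number of requests."""
    per_request = broadcast_messages_per_request(num_caches)
    return num_caches * requests_per_cache * per_request


def request_latency(response_latencies):
    """A broadcast request waits for its slowest responder (assumed model)."""
    return max(response_latencies, default=0)
```

Under these assumptions, the traffic per request grows linearly with the number of caches and the total traffic grows quadratically, while each request is held up by its slowest responder.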
In view of the above, there is a need for a method and apparatus for synchronizing distributed caches in VLSI devices, namely, high-performance I/O systems.