Data buffers are conventionally used in countless applications to store data in a data processing system such as a computer. One specific application of a data buffer, for example, is in temporarily storing data received over a communications bus.
For example, data buffers are conventionally used in the memory controllers that interface one or more microprocessors with various components in a memory system. A memory system typically stores computer instructions from a computer program that are executed by the microprocessor(s), as well as other data that the microprocessor(s) manipulate in response to executed computer instructions. Moreover, a memory system is typically partitioned into a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a "memory address space," representing the addressable range of memory addresses that can be accessed by the microprocessor(s).
To cost-effectively improve the performance of a memory system, a memory system oftentimes utilizes a "multi-level" memory architecture, where smaller, but faster memory devices are combined with larger, but slower memory devices, with data transferred from the slower devices to the faster devices as needed so that future accesses to the data are made using the faster devices. Oftentimes, the faster devices are referred to as cache memories, or caches, which may be dedicated to one microprocessor or shared by multiple microprocessors. When caches are used, groups of memory addresses are typically referred to as "cache lines", and a memory controller is used to swap such groups collectively into and out of a cache to attempt to maximize the frequency with which requested memory addresses are stored in the fastest cache memory accessible by a microprocessor needing access to the requested addresses.
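As a rough illustration of how individual memory addresses group into cache lines, the following sketch maps an address to the line that contains it. The 64-byte line size and the function name cache_line_of are assumptions made for illustration only; the text above does not specify a line size.

```python
# Assumed line size in bytes; hypothetical, not taken from the text.
LINE_SIZE = 64

def cache_line_of(address: int, line_size: int = LINE_SIZE) -> tuple[int, int]:
    """Return (line_base_address, offset_within_line) for a memory address."""
    offset = address % line_size
    return address - offset, offset

# Two addresses that share a base address belong to the same cache line and
# would be swapped into or out of a cache together.
base_a, _ = cache_line_of(0x1234)
base_b, _ = cache_line_of(0x1201)
same_line = base_a == base_b
```

Under these assumptions, all addresses sharing a base address are moved as one group, which is why a single request can satisfy later accesses to neighboring addresses.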
One particular multi-level memory architecture suitable for use with multiple microprocessors is a non-uniform memory architecture (NUMA), which organizes multiple microprocessors into "clusters", each of which includes a few microprocessors (e.g., two or four) that share a "local" set of memory devices. In some designs, for example, each microprocessor has a dedicated, internal level one (L1) cache, as well as a level two (L2) cache that is either dedicated or shared with other microprocessors in the cluster. A main memory and/or level three (L3) cache may serve as the common memory for each cluster. In addition, the clusters are connected to one another over a common bus to permit the microprocessors within a given cluster to access data stored in the local memories of other clusters. Furthermore, additional main memory, shared by all clusters, may also be accessible over the common bus, or via a separate bus to which each cluster is further interfaced.
In many NUMA systems, a single memory controller is used to interface together the various communications buses in a cluster. For example, each cluster may have one or more local (or processor) buses that communicate data between a microprocessor and its L1 and/or L2 caches. Each cluster may also have one or more main memory buses that interface with a main memory and/or an L3 cache for the cluster. Furthermore, each cluster may also have one or more remote buses that interface with the memory devices in the other clusters. The memory controller within a cluster therefore provides an interface between all such buses, and serves to route data requests to appropriate buses, as well as retrieve the data and process the responses returned over such buses. Whenever data is retrieved from a data source, the memory controller typically stores the data within a data buffer that is accessible by a requesting microprocessor.
However, in a NUMA architecture, like many other multi-level memory architectures, data for any given memory address may be stored in any number of data sources at any given time. Moreover, the data stored in different data sources may be modified from time to time, causing other copies of the data stored in other data sources to become "stale", or invalid. As such, an additional function of a memory controller in such an architecture is to determine where the most recently updated copy of requested data can be found in the memory system.
Conventional attempts to locate the most recently updated copy of requested data often rely on one or more directories that keep track of where the most recently updated copies of data are located, a process typically referred to as maintaining "coherency" among the various data sources. Particularly in NUMA and other distributed architectures, the directories are typically distributed throughout a system, with each directory only containing coherency information for data stored local to that directory (e.g., within each cluster). With distributed directories, therefore, maintaining coherency typically requires remote directories to be accessed to determine where the most recently updated copy of requested data can be found.
Coherency is typically implemented by passing a data request to one or more data sources requesting data from a particular memory location. Each data source then returns a response indicating whether or not that data source has a copy of the requested data, and the responses are combined for use in updating the various directories distributed throughout a system. To speed access, often a data source that does have a valid copy of requested data also forwards the requested data to the requester concurrently with its response.
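The request-and-response exchange described above might be sketched as follows. This is a minimal model under stated assumptions, not any particular controller's protocol: the Response type, its modified flag (standing in for "holds the most recently updated copy"), and the combine function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Response:
    source: str                   # name of the responding data source
    has_copy: bool                # whether the source holds the requested data
    modified: bool = False        # assumed flag: source holds the newest copy
    data: Optional[bytes] = None  # data forwarded concurrently with the response

def combine(responses: Iterable[Response]) -> Optional[Response]:
    """Combine responses and return the one whose data should be used."""
    holders = [r for r in responses if r.has_copy]
    for r in holders:
        if r.modified:            # a modified copy supersedes all others
            return r
    # With no modified copy, any valid holder will do; None if nobody has it.
    return holders[0] if holders else None

chosen = combine([
    Response("L3 cache", has_copy=True, data=b"old"),
    Response("remote cluster", has_copy=True, modified=True, data=b"new"),
    Response("L2 cache", has_copy=False),
])
```

Note that the choice cannot be made until every response has been examined, since a later response may report a more recent copy than an earlier one.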
In some conventional systems, different levels of a memory architecture are polled separately to locate data. For example, a request from a microprocessor in one cluster of a NUMA system may be first passed to the processor bus to poll the processor's L1 and/or L2 caches, then subsequently passed to the L3 bus to poll the associated L3 cache only if it is determined that the requested data is not found in the L1 and L2 caches. Moreover, a request may not be passed to the memory bus to poll other clusters and/or main memory unless and until it is determined that the requested data is not found in the local L3 cache for the cluster.
A benefit of serially issuing requests in this manner is that the number of requests to lower levels of memory (i.e., the local L3 cache and the remote clusters in the above example) is reduced, thus occupying less available bandwidth on the buses connected thereto. Further, with this arrangement a memory controller typically requires only a single data buffer to service any given request. On the other hand, by serially issuing requests, the latency associated with retrieving data stored in a lower level of memory is increased since the request therefor is not issued until after higher levels of memory have already been checked.
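The tradeoff of serial polling can be illustrated with a small model. The level names and the latency figures below are hypothetical values in arbitrary time units, chosen only to show the shape of the tradeoff; they are not measurements of any real system.

```python
# Hypothetical (level name, lookup latency) pairs, polled in this order.
LEVELS = [("L1/L2", 2), ("L3", 10), ("remote/main", 50)]

def serial_lookup(found_at: str, levels=LEVELS) -> tuple[int, int]:
    """Poll one level at a time; return (total_latency, requests_issued)."""
    total = 0
    for issued, (name, latency) in enumerate(levels, start=1):
        total += latency          # each level is checked only after the last
        if name == found_at:
            return total, issued
    raise LookupError(f"data not found at any level: {found_at}")

near = serial_lookup("L1/L2")        # (2, 1): one request, low latency
far = serial_lookup("remote/main")   # (62, 3): latencies of all levels add up
```

In this model a hit in the nearest level issues a single request and leaves the lower buses idle, while a hit in the farthest level pays the accumulated latency of every level checked before it.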
In other conventional systems, the latency for lower level memories is reduced by issuing, or "broadcasting", a request on multiple buses at the same time. However, with such an arrangement, it often cannot be known in what order the responses will be returned. Furthermore, additional delay is required to combine responses to determine what data source has the most recent data. Moreover, in many conventional designs, a memory controller is implemented using separate integrated circuit devices, or chips, to handle dataflow and control logic. In such multi-chip designs, additional delay is often required for the control chip to decode the responses and inform the data chip as to which copy of the requested data to store in the data buffer.
Given that the requested data is often returned with the response of a data source that has a copy of the data, multiple data buffers may need to be used to store the incoming data from the buses so that all incoming data can be temporarily stored until the responses can be decoded. In the alternative, several levels of data staging latches may need to be interposed between each bus and the buffer to allow time to decode the responses and determine upon which bus the most recent copy of the data is found. Using multiple buffers, however, occupies more space and increases the complexity of a memory controller design. On the other hand, the addition of data staging latches increases latency and reduces performance.
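The multiple-buffer arrangement described above can be sketched as follows. This is an illustrative model only: the function name broadcast_lookup, the bus names, and the has_most_recent flag (standing in for the result of decoding a response) are assumptions, not part of any described design.

```python
from typing import Optional

def broadcast_lookup(arrivals) -> tuple[Optional[bytes], int]:
    """Model a broadcast request served with one data buffer per bus.

    arrivals: iterable of (bus, has_most_recent, data) tuples in the order
    the responses happen to return, which is not known in advance.
    Returns (selected_data, buffers_occupied).
    """
    buffers = {}                  # one temporary buffer per responding bus
    winner = None
    for bus, has_most_recent, data in arrivals:
        if data is not None:
            buffers[bus] = data   # park incoming data until decoding is done
        if has_most_recent:
            winner = bus          # known only once this response is decoded
    selected = buffers.get(winner) if winner is not None else None
    return selected, len(buffers)

selected, used = broadcast_lookup([
    ("L3 bus", False, b"stale"),
    ("processor bus", True, b"fresh"),
    ("remote bus", False, None),
])
```

Here two buffers are occupied to deliver a single piece of data, which reflects the added space and complexity noted above; a serial scheme would need only one buffer per request.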
Due to the inability to determine (1) the order in which responses may be received, and (2) which response will include a most recently updated copy of requested data, conventional memory controller designs typically are subject to a tradeoff between, on the one hand, performance, and on the other hand, complexity. Consequently, a significant need continues to exist for an improved manner of retrieving data from multiple available data sources that offers better performance without significantly increasing complexity.