1. Technical Field
The present invention relates in general to data processing systems and in particular to a method and system for preferentially ordering the retrieval of data from vertically configured caches. Still more particularly, the present invention relates to a method and system for preferentially ordering the retrieval of data from vertically configured caches utilizing preference order bits in the system read address bus.
2. Description of the Related Art
In conventional symmetric multiprocessor (SMP) data processing systems, all of the processors are generally identical. The processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system may comprise a system memory, a plurality of processing elements that each include a processor and one or more levels of cache memory, and a system bus coupling the processing elements to each other and to the system memory.
Conventional SMP data processing system processors have a number of execution units. Superscalar multiprocessors typically have more than one of each execution unit. They typically have two floating point units (FPUs), two fixed point units (FXUs) and two load/store units (LSUs). The processors are designed for high frequency and their corresponding internal caches are typically very small in order to operate with the high frequency processor. In part due to their relatively small size, these internal caches sustain a large number of cache misses during requests for data. Data is thus stored in lower level (L2 or L3, etc.) caches to maximize processing speed. The processors typically send multiple load requests simultaneously or within close proximity to each other. This is particularly true in superscalar processors with multiple LSUs.
A typical cache memory, for example, stores the contents of frequently accessed random access memory (RAM) locations and the addresses where these data items are stored. When the microprocessor references an address in memory, the cache memory checks to see whether it holds that address. If the cache memory does hold the address, the data is returned to the microprocessor; if it does not, a regular memory access occurs.
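By way of illustration only, the hit-or-miss behavior described above may be sketched as follows. The class name `SimpleCache`, the dictionary standing in for RAM, and the choice to fill the cache after a miss are assumptions made for this sketch and are not drawn from any particular implementation.

```python
class SimpleCache:
    """Toy cache: holds address/value pairs and falls back to main memory."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # dict standing in for RAM (assumption)
        self.entries = {}               # cached address -> cached value

    def read(self, address):
        """Return (value, hit) for the referenced address."""
        if address in self.entries:
            # The cache holds the address: return the data directly.
            return self.entries[address], True
        # The cache does not hold the address: a regular memory access occurs.
        value = self.main_memory[address]
        self.entries[address] = value   # fill on miss (assumed policy)
        return value, False


ram = {0x1000: 42, 0x1008: 7}
cache = SimpleCache(ram)
first = cache.read(0x1000)    # first reference misses
second = cache.read(0x1000)   # repeated reference hits
```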
In an SMP system with processors running at very high frequencies, system performance can be highly sensitive to main memory latency. One method to reduce latency is to use an L3 cache which may be shared by multiple CPUs in the system. Since many of today's CPUs have fairly large L2 caches, the shared cache (L3 cache) must be very large to have a marked impact on system performance.
In order to increase the speed of access to data stored within the main memory, modern data-processing systems generally maintain the most recently used data in the cache memory. The cache memory has multiple cache lines, with several bytes per cache line for storing information in contiguous addresses within the main memory. Each cache line essentially comprises a boundary between blocks of storage that map to a specific area in the cache memory or high-speed buffer. In addition, each cache line has an associated "tag" that typically identifies a partial address of a corresponding page of the main memory. Because the information within cache may come from different pages of the main memory, the tag provides a convenient way to identify the page of the main memory to which a cache line belongs.
In a typical cache memory implementation, information is stored in one or several memory arrays. In addition, the corresponding tags for each cache line are stored in a structure known as a directory or tag array. Usually, an additional structure, called a translation lookaside buffer (TLB), is also utilized to facilitate the translation of a virtual address to a real address during a cache memory access. Cache memory access thus involves reading out a line of the cache and its associated tag. The real address from a translation array is then compared with the real address from the tag array. If these real addresses are identical, then the line in the cache that was read out is the desired line, based on the effective or virtual address calculated by the algorithm in use.
As indicated above, data stored in a data cache or memory are stored on cache lines. A typical cache line, for example, may be 64 bytes, represented as eight 8-byte partial cache lines (i.e., 8 beats of 8 bytes).
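As a minimal sketch of this layout, a 64-byte cache line may be divided into 8 beats of 8 bytes each. The names `LINE_BYTES`, `BEAT_BYTES`, and `split_into_beats` are hypothetical and are used only for this example.

```python
LINE_BYTES = 64   # size of the example cache line
BEAT_BYTES = 8    # bytes transferred per beat


def split_into_beats(line):
    """Split one cache line into consecutive 8-byte beats (beat 0 first)."""
    if len(line) != LINE_BYTES:
        raise ValueError("expected a 64-byte cache line")
    return [line[i:i + BEAT_BYTES] for i in range(0, LINE_BYTES, BEAT_BYTES)]


line = bytes(range(LINE_BYTES))   # sample contents: byte n holds the value n
beats = split_into_beats(line)
```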
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit field and inclusivity bit field are used to maintain cache coherency in a multi-processor computer system (they indicate the validity of the value stored in the cache). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache "hit." The collection of all of the address tags in a cache (and sometimes the state bit and inclusivity bit fields) is referred to as a directory, and the collection of all of the value fields is the cache entry array.
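The fields of such a cache block, and the distinction between the directory and the cache entry array, may be sketched as follows. The type name `CacheBlock`, the integer representation of the state and inclusivity fields, and the helper functions are assumptions for illustration; field widths and encodings are not specified here.

```python
from dataclasses import dataclass


@dataclass
class CacheBlock:
    address_tag: int       # subset of the full address of the memory block
    state_bits: int        # coherency state (encoding assumed)
    inclusivity_bits: int  # inclusivity information (encoding assumed)
    value: bytes           # the actual instruction or data


def directory(blocks):
    """The collection of all address tags in the cache."""
    return [block.address_tag for block in blocks]


def entry_array(blocks):
    """The collection of all value fields in the cache."""
    return [block.value for block in blocks]


def is_hit(incoming_tag, blocks):
    """A compare match of the incoming tag with a stored tag is a hit."""
    return incoming_tag in directory(blocks)


blocks = [CacheBlock(0x1A, 1, 0, b"\x00" * 64), CacheBlock(0x2B, 1, 1, b"\xff" * 64)]
```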
In order to access a byte in a cache memory with an effective or virtual address, the line portion (mid-order bits) of the effective or virtual address is utilized to select a cache line from the memory array, along with a corresponding tag from the directory or tag array. The byte portion (low-order bits) of the effective or virtual address is then utilized to choose the indicated byte from the selected cache line. At the same time, the page portion (high-order bits) of the effective address is translated via the segment register or segment lookaside buffer and TLB to determine a real page number. If the real page number obtained by this translation matches the real address tag stored within the directory, then the data read from the selected cache line is the data actually sought by the program. If the real address tag and translated real page number do not agree, a cache "miss" occurs, meaning that the requested data was not stored in the cache memory. Accordingly, the requested data must be retrieved from the main memory or elsewhere within the memory hierarchy.
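The address decomposition and tag comparison described above may be sketched as follows. The bit widths (6 byte-offset bits for a 64-byte line and 6 line-select bits, giving a 4 KB page), the dictionary `tlb` standing in for the segment and TLB translation, and the function names are all assumptions chosen for this example.

```python
BYTE_BITS = 6   # low-order bits: byte within a 64-byte line (assumed width)
LINE_BITS = 6   # mid-order bits: which of 64 cache lines (assumed width)


def split_address(effective_address):
    """Return the (page, line, byte) portions of an effective address."""
    byte = effective_address & ((1 << BYTE_BITS) - 1)
    line = (effective_address >> BYTE_BITS) & ((1 << LINE_BITS) - 1)
    page = effective_address >> (BYTE_BITS + LINE_BITS)
    return page, line, byte


def lookup(effective_address, tlb, tag_array, data_array):
    """Return the requested byte on a hit, or None on a miss."""
    page, line, byte = split_address(effective_address)
    real_page = tlb[page]               # translation of the page portion
    if tag_array[line] == real_page:    # compare with the stored real tag
        return data_array[line][byte]   # hit: select the indicated byte
    return None                         # miss: fetch from elsewhere


tlb = {0x12: 0x7A}                       # effective page -> real page
tag_array = [None] * (1 << LINE_BITS)
data_array = [bytes(64)] * (1 << LINE_BITS)
tag_array[13] = 0x7A
data_array[13] = bytes(range(64))
hit_value = lookup(0x12345, tlb, tag_array, data_array)
```

In this sketch the address 0x12345 splits into page 0x12, line 13, and byte 5, so the lookup selects byte 5 of line 13 once the translated page matches the stored tag.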
Both address translation and cache access involve comparison of a value read from one array with another value read from a different array. In the case of address translation, the virtual segment identifier associated with a given effective address and stored in a segment register or segment lookaside buffer is compared with the virtual address stored as part of an entry in the translation lookaside buffer. Similarly, the translated real page number is compared with the real page number read from the cache tag array to determine whether the accessed line in the cache is the required real page number.
As the need for processor efficiency increases, the retrieval order of data from cache lines becomes increasingly important. Cache lines typically contain several data values stored as words, double words, octa-words, etc. Particular data values within a cache line may be considered critical (i.e., more important to processing efficiency than the other values or desired to be retrieved in a particular order) by a processor. Cache access and data retrieval are initiated with processor load requests, which are transmitted from the processor to the L1 cache first.
Load requests are comprised primarily of read addresses, which identify a location of the required data. When a read address misses in the internal memory caches (L1), it is sent over the system bus to the lower level caches (L2, L3, etc.). The addresses are sent over the system bus as snoop requests. These snoop requests are broadcast over the system bus to every component which is connected to the system bus. The components which actively snoop the system bus, particularly the lower level caches, look up their cache directories to see if the requested address is present in the cache. When the address is matched within the cache directory, the data is transmitted cache-to-cache over the data bus (referred to as intervention). In prior art data retrieval ordering schemes, the data was usually extracted sequentially (beat 0 through beat 7). Thus, a critical block (word) is transmitted only at the place it occurs in the particular sequence in the cache line.
Address-based ordering schemes are common in the industry. These "pre-set" ordering schemes are vendor specific and are static (i.e., cannot be adjusted after the system is manufactured) based on the lower address bits. Thus, in some cases, system buses and caches are designed with a set implied ordering. Two common types of ordering schemes are the International Business Machines (IBM) sequential ordering scheme, and the Intel 2N ordering scheme. Once the read address matches the address of the cache line, the system ordering scheme forces the requested data to be retrieved from the cache line and transmitted to the processor in the pre-set order.
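The cost of strictly sequential extraction may be illustrated with a short sketch that counts how many beats are transferred before the critical beat arrives. The wrap-around order shown for contrast is only one possible alternative, introduced here as an assumption for comparison, and all function names are hypothetical.

```python
BEATS_PER_LINE = 8


def sequential_order():
    """Prior art style extraction: beat 0 through beat 7."""
    return list(range(BEATS_PER_LINE))


def critical_first_order(critical_beat):
    """Illustrative alternative: start at the critical beat and wrap around."""
    return [(critical_beat + i) % BEATS_PER_LINE for i in range(BEATS_PER_LINE)]


def beats_before(order, critical_beat):
    """Number of beats transferred before the critical beat is sent."""
    return order.index(critical_beat)


critical = 5
wait_sequential = beats_before(sequential_order(), critical)
wait_wrapped = beats_before(critical_first_order(critical), critical)
```

With the critical data in beat 5, sequential extraction transfers five other beats first, whereas the wrap-around order transfers none.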
Thus in present systems, the processor has no way of changing the pre-defined address-based order for data retrieval from the cache line to maximize processor efficiency. As an example, a processor may prefer a different instruction cache reload order than a data cache reload order. The pre-set retrieval scheme dictates the order utilized at every data request. However, the various components involved in data retrieval and transmission may have preferences which lead to better component or system efficiency. These preferences may result in system-wide or component-based optimization. For example, the cache may also have a desired method of issuing data from its cache lines which would lead to more efficient overall cache access. Thus hardware design limitations exist in the current method of requesting and retrieving data from a data cache line.
As technology becomes increasingly advanced, the need arises for microprocessors that are able to more accurately and efficiently access lower level caches and extract critical data from cache lines in an order preferred by the processor and/or system components. Currently there is no way to change the pre-set order so that the processor can order data retrieval based on system preferences or to improve system efficiency.
The present invention recognizes that it would therefore be desirable to provide a method and system for enabling a dynamic ordering of data retrieval from vertical caches. It would be further advantageous to provide a method and system which allows a processor to determine, based on either its knowledge of cache configuration, system optimization and/or processor preference, the cache specific preference order in which data should be retrieved from the cache line of the vertical caches. It is also desirable for the processor to provide the preference ordering information on the read address bus to remove the requirement for extra bandwidth due to new/larger read address instruction set architecture.
It is therefore one object of the present invention to provide an improved data processing system.
It is another object of the present invention to provide an improved method and system for retrieving data within a data processing system.
It is yet another object of the present invention to provide an improved method and system for retrieving data from a cache of a data processing system having a vertical cache configuration, whereby preference order bits are placed on a system read address bus to direct the preferred order of retrieval at each cache.
The foregoing objects are achieved as is now described. A method for preferentially ordering the retrieval of data from a cache line of a cache within a vertical cache configuration is disclosed. The method includes the steps of first encoding a set of bits with a processor-preferred order of data retrieval based on the cache configuration. The set of bits is then sent along with the read request via the address bus to the first cache. The cache directory is checked to see if a "hit" occurs (i.e., the data is present in that cache). If the data is present, a modified cache controller having preference order logic or a preference order logic component interprets the set of bits and directs the retrieval of the requested data from the cache line according to the preferred order for that cache. If no hit (i.e., a miss) occurs, the read request and the preferred order set of bits are sent to the next level cache. In one embodiment, a single set of bits is utilized. The preference order logic encodes the set of bits with the preference order of the next level cache when a miss occurs, prior to sending the read request and the set of bits to the next level cache. When all levels of cache result in a miss, the read request is sent over the system bus with the preference order set of bits being encoded for the system wide preference.
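The single-set-of-bits embodiment described above may be sketched, under stated assumptions, as a walk down the vertical caches. The two-bit codes in `ORDERS`, the specific beat orders they select, and the function and parameter names are hypothetical; the actual encoding of the preference order bits is not specified in this summary.

```python
# Hypothetical mapping from preference order bits to a beat order.
ORDERS = {
    0b00: [0, 1, 2, 3, 4, 5, 6, 7],   # sequential
    0b01: [7, 6, 5, 4, 3, 2, 1, 0],   # reverse (illustrative only)
    0b10: [4, 5, 6, 7, 0, 1, 2, 3],   # upper half first (illustrative only)
}


def read(address, caches, level_prefs, system_pref):
    """Walk the vertical caches; return (source, preference bits, beats).

    caches      -- list of dicts (first cache first), address -> list of 8 beats
    level_prefs -- preference order bits to use at each cache level
    system_pref -- preference order bits for the system-wide request
    """
    for level, cache in enumerate(caches):
        pref_bits = level_prefs[level]   # bits (re)encoded for this level
        line = cache.get(address)        # directory check at this level
        if line is not None:
            # Hit: preference order logic interprets the bits and
            # directs retrieval of the beats in the preferred order.
            return level, pref_bits, [line[b] for b in ORDERS[pref_bits]]
        # Miss: the request and the set of bits go to the next level.
    # All levels missed: the request goes out on the system bus with
    # the bits encoded for the system-wide preference.
    return "system_bus", system_pref, None


caches = [{}, {0x40: list("ABCDEFGH")}]  # first cache empty, second holds 0x40
level_prefs = [0b00, 0b10]
hit_result = read(0x40, caches, level_prefs, system_pref=0b01)
miss_result = read(0x80, caches, level_prefs, system_pref=0b01)
```

In this sketch the request for address 0x40 misses in the first cache, is re-encoded with the second cache's preference, and hits there, while the request for 0x80 misses at every level and is forwarded with the system-wide preference bits.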
The above as well as additional objects, features, and advantages of an illustrative embodiment will become apparent in the following detailed written description.