With the advent of integrated memory and logic on one high performance chip, various opportunities become available for improving system performance. One significant enhancement is the ability to integrate two levels of a memory hierarchy together with a CPU on one chip. For instance, referring to FIG. 1, a processor with its cache, L1, and the next higher level which serves it, namely L2, can be integrated on the emerging type of chips. In such a system, a separate L1 directory 125 and a separate L2 directory 135, as well as storage arrays 120, 130, are used for each level. Each such directory has its own access decoders, compare circuits and associated logic to search for cache blocks in the respective storage arrays. Each level can have a different set-associativity, as well as a different cache organization. For instance, an L1 cache is often organized as a late-select, set-associative cache. However, the L2 level associated with it might use a sequential organization, or an early-select organization of the type described in the referenced application, Ser. No. 08/888,730. As the cache levels continue to increase in size, the multiple directories become large, consuming non-negligible chip area. In addition, access to each directory of each level consumes power, as well as time. Thus it would be advantageous to combine multiple L1 and L2 elements into a single L1/L2 element, if possible, while still allowing the individual cache levels to use their own organization which may be best for overall performance.
In order to understand the issue of ‘set-associativity’ in the L1 cache, it is helpful to consider first the effect of associativity on both the L1 directory and L1 storage array (the storage array can contain instructions or data) in the uncombined case. First the fundamental notions of set-associativity and a late-select organization are presented. It is shown that the larger the set-associativity, the larger the number of compares which must be made via the directory. Also, for a late-select L1 storage array, larger set-associativity requires a much larger data access width from the arrays. Thus the set-associativity is selected for optimum speed and cost of implementation. The trade-offs in the L2 cache are different, so its set-associativity and directory/storage-array organization are usually different. It is advantageous to provide all these features in a combined directory.
Since the earliest days, caches have, with few exceptions, all been organized in a set-associative configuration. This type of organization is often thought to be complex, but is extremely simple. In fact, this organization is very commonly used by everyone at one time or another, but we are just not aware of its application elsewhere. An ordinary, “tab-indexed” address book or telephone directory is a perfect analogy to a cache in every respect and is used in the following, to make the concepts understandable. The “tab-indexed” directory used is the ordinary desk-top type of phone directory which allows one to move a mechanical selector to some letter of the alphabet and then “push a button” to access the information contained under that letter. One could use an ordinary address book which has “tab-indices” on each page, just as well. In this case one would mechanically use one's fingers to select one of the tabs and lift “open” the desired page. All the principles and ideas are identical for both the address book and a mechanical tab-indexed desk directory.
In the most simple case, the DATA associated with a given “Search Address” is quite small, so both the DATA and Search Address can be contained in one physical structure. For instance, the Search Address is usually a person's Name, and the DATA is the Phone Number (or Address in the case of an Address Directory). Thus, for this case, the Search Address of the desired information, or that part of it used as the Compare Address, resides with the Data.
Such a Tab-Indexed phone directory 200 is shown in FIG. 2. One tab-index selector position is used for each letter of the alphabet. The directory entry for each such tab-index position is known as a ‘congruence class’. The reader is cautioned that there are different definitions of SET used throughout the computer industry: a ‘congruence class’ as defined here is sometimes called a SET elsewhere, leaving no name for what we define as a SET herein. Thus there is one congruence class for each letter of the alphabet. All names beginning with the letter belonging to a given congruence class must be found there, or reloaded as needed. In our Directory, it is ASSUMED that each congruence class can contain only four entries, with each entry consisting of a name plus its associated phone number.
This is EXACTLY a 4-way set-associative directory/cache and works as follows. Suppose we had previously reloaded congruence class K with four names shown in the Directory, namely Kern, Kagan, Knoll, and Krons. Internal to the directory, we do not have to include the letter K with each name since the external mechanical selector picks (translates) the letter K—the names cannot start with any other letter in this congruence class. (In an actual phone directory, we normally include the first letter as well, but only for convenience—it is fundamentally unnecessary). This K congruence class contains the numbers, 1745 for Kern 221, 2391 for Kagan 222, etc. Now suppose we wish to find the number for Kagan, which is the full address appearing in our address register at the top of FIG. 2. The first letter, K, is used to access the K congruence class by moving the tab-index selector to the letter K as shown. We “Open” the directory for this selection and retrieve four names and four numbers. The remaining portion of the starting address, namely “AGAN” (without the K) is compared “in our brains” with the four names accessed. If a match occurs, then we select the corresponding phone number (or address in an Address Directory) for use. In this case, a match (HIT) occurs on the second entry in the K congruence class, so the second number is chosen.
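The lookup just described can be sketched in a few lines of Python. This is only an illustrative model of the phone directory of FIG. 2, not an implementation taken from the text: the names `directory` and `lookup` are hypothetical, and the numbers for Knoll and Krons are left as `None` because the text does not give them.

```python
# Hypothetical model of the tab-indexed directory of FIG. 2.
# One congruence class per letter; each entry is (tag, data), where the tag
# is the name WITHOUT its first letter, since the selector supplies it.
# Only the numbers for Kern and Kagan appear in the text; the other two
# are left as None rather than invented.
directory = {
    "K": [("ERN", "1745"), ("AGAN", "2391"), ("NOLL", None), ("RONS", None)],
}

def lookup(name):
    """Return (way, data) on a HIT, or None on a MISS."""
    index, tag = name[0].upper(), name[1:].upper()
    congruence_class = directory.get(index, [])
    # Associative search: the tag is compared against every entry in the
    # selected congruence class (4 compares for a 4-way directory).
    for way, (stored_tag, data) in enumerate(congruence_class):
        if stored_tag == tag:
            return way, data
    return None

print(lookup("Kagan"))  # HIT on the second entry (way index 1)
print(lookup("Kline"))  # MISS, no entry matches
```

The first letter plays the role of the index bits of a cache address, and the remaining letters play the role of the tag held in the directory.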
Note that for the general case, the arrangement of the 4 names in the congruence class is purely random for reasons discussed below. This random arrangement, plus the fact that there can be no direct address relation between the large number of names and the 4 possible locations in a congruence class, requires us to perform an associative search on 4 entries, i.e. we compare the full character string with the given Search Address. The 4 compares make this a 4-way set-associative directory. If we had 8 entries per congruence class, then 8 associative compares would be required and would constitute an 8-way set-associative directory. If no compare match is obtained, a MISS has resulted, requiring a reload. The reloading strategy on a Miss is the mechanism which causes the 4 entries of a congruence class to be randomly arranged. This occurs as follows. When a Miss occurs for some given name, the usual strategy, and the one used for caches, is to subsequently enter that name into the directory, i.e. perform a Reload, under the assumption that it will be used a lot in later accesses. Under this assumption, the question then becomes, “Which entry to replace?” This has been the subject of considerable research over the years, but the most common and widely used strategy is to replace the entry which is LEAST RECENTLY USED (LRU). In a cache, there are special bits in each congruence class for keeping track of this usage. We could do the same in the phone directory, but usually do not bother. Rather, we would just look at the 4 names and use some similar criterion, such as, “which entry is least often used, or least important?” Since the physical location of the replaced entry in the congruence class, in general, occurs at random, there is no ORDER to the arrangement of entries in any congruence class.
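A minimal sketch of LRU reloading for one congruence class follows. It is an assumption-laden illustration: the class name `CongruenceClass` and the use of a counter as a timestamp are hypothetical stand-ins for the special usage bits mentioned above, which real caches encode far more compactly.

```python
class CongruenceClass:
    """One congruence class with LRU replacement (illustrative only)."""

    def __init__(self, ways=4):
        self.tags = [None] * ways        # physical slots; entries never move
        self.last_used = [-1] * ways     # -1 marks a slot that was never used
        self.clock = 0                   # stand-in for hardware usage bits

    def access(self, tag):
        """Return ("HIT", way) or ("MISS", way that was reloaded)."""
        self.clock += 1
        if tag in self.tags:
            way = self.tags.index(tag)
            self.last_used[way] = self.clock
            return "HIT", way
        # MISS: reload into the least recently used slot (empty slots first).
        way = self.last_used.index(min(self.last_used))
        self.tags[way] = tag
        self.last_used[way] = self.clock
        return "MISS", way

k_class = CongruenceClass()
for tag in ["ERN", "AGAN", "NOLL", "RONS"]:
    k_class.access(tag)        # four misses fill the four slots
k_class.access("AGAN")         # HIT, AGAN becomes most recently used
k_class.access("ERN")          # HIT
print(k_class.access("LINE"))  # MISS, replaces NOLL, the LRU entry
print(k_class.tags)
```

Because the victim slot depends on the access history rather than on the name itself, the physical order of entries within the class carries no meaning, which is why the full associative compare is needed.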
In an actual cache, the “block” of data associated with each Search Address is usually many times larger than just a phone number or address, so the DATA storage space required is many times larger than the Search Address needed for the associative compares. As a result, the DATA is maintained in an array which is separate from the Search Address array. The latter is generally referred to as the Directory Array, or Tag Array. As a result, some mapping structure is required to relate the directory addresses to the corresponding data in the separate array.
The following describes directory-data array cache organization and accessing. Consider once again the above case in which both the Search Address for comparing and the data reside in the same directory. Imagine that we wish to increase the size of the data by adding various records, such as home Address, Dept., Social Security #, work history, financial data, etc., shown in the box 220 in FIG. 2. We would also need to access only selected portions of this data at different times, e.g. find the Dept., Social Security #, or address for a given person. However, if we keep it all as shown in FIG. 2, every inquiry accesses all the data for each of the 4 members of the congruence class, which is not only inconvenient, but rather difficult to do in an actual random access storage array. A much better solution is to store the data in a separate storage array, and maintain the same logical relationship between Search Address and data. Such a logical structure is the basis for a Late-Select organization.
A perfect analogy to an actual late-select organization can be obtained by using two of the tab-indexed directories, of the type used above, as illustrated in FIG. 3(a). The addresses are contained in the Directory on the left side 200, while the data, phone numbers and all other corresponding records are contained in the Storage Array on the right side 300.
There is still one congruence class for each letter of the alphabet. Also, there is still an exact one to one correspondence between addresses in the Directory and data in the array. In addition, we can easily include additional decoding on only the storage array, to select desirable fields on a finer level than previously. The usage of this structure is fully analogous to that used previously. Suppose we want to again use Kagan as the Search Address, but suppose we want to get the Department name (Dept.) 331 rather than the phone number. Once again, we index to the K tab on both the directory array 200 and the storage array 300. We also provide another address field to the storage array 300 only, namely, the lower address bits for the “Dept.” field 331. Thus, the directory accesses 4 compare-addresses and the storage array accesses the 4 corresponding Department names. A compare HIT in the directory provides an Enable signal 341 to the storage array 300 for the correct 1 of 4 Dept. names as shown in FIG. 3. This is exactly the way a typical late-select cache works.
Notice that in the example above, access to the array was done at the same time as access to the directory. By the time the four address compares are completed, the four possible data fields are also accessed, so it is only necessary to select one of the four using the compare HIT enable signal. In an actual cache, the directory and storage array would generally both be arrays of Static RAM devices with appropriate address decoders, sense amplifiers, etc. An address K (in binary) would be applied to both arrays and the internal information would be latched at the edge of each array in sense amplifier/latches. The directory has four compare circuits on the periphery of the array which do the remaining address matching. If a match is obtained, a direct enable signal is sent to the corresponding register on the storage array and the data is gated off to its destination, usually the processor. This is called a late-select organization since the data is accessed simultaneously with the addresses. If a COMPARE=Hit (“match”) occurs in the directory, the late-select signal from this match only has to enable the data out of the latch on the edge of the storage array. Late-select is an extremely fast organization for accessing a cache and is used widely for the L1 cache level. If the directory and array can each be accessed in one processor cycle, which is usually the case, then a so-called one-cycle cache is achieved. This is facilitated by separating the Search Address directory 200 from the storage array 300, since one large array would be slower than 2 separate arrays in parallel as is used here. If a late-select organization is not used, some other method of identifying the desired L1 data logical word is needed. Depending on the method chosen, the directory could require additional address bits for this purpose. We do not consider such cases, but this would be important for determining the number of bits saved by a combined directory as is done later.
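The late-select access described above can be modeled roughly as follows. This is a sketch under stated assumptions: the function name `late_select_read` is hypothetical, the department names are placeholders that do not come from the text, and the two array reads that proceed in parallel in hardware are simply executed one after the other here.

```python
# Directory 200: compare addresses (names minus the first letter).
directory = {"K": ["ERN", "AGAN", "NOLL", "RONS"]}

# Storage array 300: one record per directory entry, in the same position.
# Phone numbers for Kern and Kagan are from the text; the department names
# are placeholders, and unknown phone numbers are left as None.
storage = {
    "K": [
        {"phone": "1745", "dept": "dept-A"},
        {"phone": "2391", "dept": "dept-B"},
        {"phone": None, "dept": "dept-C"},
        {"phone": None, "dept": "dept-D"},
    ],
}

def late_select_read(name, field):
    """Return the selected field on a HIT, or None on a MISS."""
    index, tag = name[0].upper(), name[1:].upper()
    if index not in directory:
        return None
    # Both arrays are indexed with the same congruence class. In hardware
    # these two reads happen at the same time.
    tags = directory[index]                                  # 4 compare addresses
    latched = [record[field] for record in storage[index]]   # 4 candidate fields
    # Four compares produce at most one enable signal.
    enables = [stored == tag for stored in tags]
    for enable, value in zip(enables, latched):
        if enable:
            return value   # late select: gate one latched field out
    return None

print(late_select_read("Kagan", "dept"))
print(late_select_read("Kagan", "phone"))
```

Note that all four candidate fields are read on every access, which is the data-width cost of late select: the width read from the storage array grows with the set-associativity.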
Now we consider the L2 Cache Organization. In an L2, the storage array is typically much slower than the L2 directory and usually requires multiple processor cycles to access the data. Also, the data path to the storage array typically accesses a full L1 block rather than a logical word. This is typically a 128 to 256 byte block versus an 8 byte L1 logical word. As a result, it usually makes no sense to use a late-select L2 organization. Thus the L2 directory/storage array access organization is often a sequential one in which the directory is accessed first, followed by access of the storage array. For a sequential organization, the storage array can still be logically partitioned into sets as for the late-select case above. However, the previous late-select signal which identifies the set is obtained before accessing the storage array, so this signal becomes part of the storage array address for the initial access. The L2 directory could be accessed at the same time as the L1, and the access aborted if not needed.
The L2 Early Select Organization is as follows. The basic concept can again be illustrated with the aid of the phone directory 350 and storage array 360 as shown in FIG. 3(b). The storage array typically has an access time significantly greater than the directory. Initially any access is started simultaneously to the directory 350 and the array 360. In our phone directory 350/array 360 example, we would access the directory 350 exactly as in the late-select or sequential cases. However, we only do a partial access into the data array. In this case, we just move the tab index selector. It is assumed, for instance, that since the data array is large and slow, the time to move this array index selector may equal the time to do a full directory access and address compares. Once the latter are completed, we can then decide, depending on Hit or Miss, to Continue or Abort the remainder of the array access. If the directory Misses and causes an Abort, we can start a new access immediately without having to wait for a full, useless array access. Obviously, the storage array 360 must have this inherent partial access and Abort/Continue capability. This could be achieved with some small modifications to a standard DRAM or SRAM array. For example, this initial access into the array could be the physical word line decoding up to, but not including, the word driver. This would be the equivalent of moving the array index selector 362 in FIG. 3(b). On a directory Hit, the Early Select signal would enable the word line driver and the remainder of the array access. A directory Miss would Abort the word line driving, and reset the word decoding for the next access on the next cycle. In this manner, one full cycle of access could be eliminated from the array access, depending on the actual array access parameters.
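The Continue/Abort behavior can be illustrated with a small model. Everything named here (`EarlySelectArray`, `start`, `continue_access`, `abort`, `early_select_read`) is hypothetical, the block contents are placeholders, and the code runs sequentially, so it shows only the control flow and not the timing overlap that gives early select its benefit.

```python
class EarlySelectArray:
    """Storage array with a partial-access phase (illustrative model)."""

    def __init__(self, blocks):
        self.blocks = blocks        # keyed by (congruence class, set)
        self.decoded = None         # congruence class whose decode is pending
        self.full_accesses = 0      # completed word-line drives

    def start(self, index):
        # Partial access: word-line decode only, the driver is not enabled.
        self.decoded = index

    def continue_access(self, way):
        # Early Select on a Hit: enable the driver and finish the access.
        index, self.decoded = self.decoded, None
        self.full_accesses += 1
        return self.blocks[(index, way)]

    def abort(self):
        # Miss: reset the decode so a new access can start next cycle.
        self.decoded = None

def early_select_read(directory, array, name):
    index, tag = name[0].upper(), name[1:].upper()
    array.start(index)   # in hardware this overlaps the directory access
    tags = directory.get(index, [])
    way = next((w for w, stored in enumerate(tags) if stored == tag), None)
    if way is None:
        array.abort()
        return None
    return array.continue_access(way)

directory = {"K": ["ERN", "AGAN", "NOLL", "RONS"]}
array = EarlySelectArray({("K", w): f"block K/{w}" for w in range(4)})
print(early_select_read(directory, array, "Kagan"))  # Continue
print(early_select_read(directory, array, "Kline"))  # Abort
```

In this model a Miss never increments `full_accesses`, mirroring the text's point that an aborted access does not pay for a full, useless array read.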