The present invention relates to a data processing system and method involving a data requesting element and a memory element from which said data requesting element requests data. An example of such a system is a processor and a first level cache memory, or two memories arranged in a hierarchy.
The concept of a hierarchical memory structure is known in the art. The term "hierarchical" implies that instead of having a single memory from which data is requested, a hierarchy of levels is used, where data is first requested by e.g. a processor from a first level memory, and if the requested data is present in the first level memory (which is also referred to as a "hit"), it is provided to the processor. If not (which is also referred to as a "miss"), a request is given to a second level memory provided below the first level memory in the hierarchy. If the data is present in the second level memory, then it is provided to the processor from there, and possibly also stored in the first level memory. A third level may be provided below the second level, and further levels below that. An example of such a structure is a processor using a memory structure having first and second level caches, below that a main memory, and below that a disk memory.
The memories are organized in such a way that higher level memories tend to be smaller and faster (in terms of access) than lower level memories. The advantages of such a structure will be explained further on.
In more detail, as shown schematically in FIG. 12, a conventional data processing arrangement with a hierarchical memory typically comprises a processor or CPU (central processing unit) 10 that contains a program counter 11 containing instruction addresses to be performed, said program counter being controlled by a control unit 12. A computational element 13 or ALU (arithmetic logic unit) performs operations on data held in registers 14 under the control of the control unit 12 in accordance with the instructions indicated by the addresses from the program counter. A main memory 30 is provided for storing program data under the corresponding instruction addresses. The main memory 30 is a RAM type memory that will typically be connected to a slow memory with large volume, such as a hard disk drive 40. A cache memory 20 is arranged as an intermediate memory between the main memory 30 and the CPU 10 for storing part of the program data under the corresponding instruction addresses.
The instruction execution performed by the processor is typically pipelined, which means that the multiple steps of successive instructions are performed in overlap. In other words, each instruction is broken down into a predetermined number of basic steps (e.g. fetch, decode, operate and write), and a separate hardware unit is provided for performing each of these steps. Then these steps can be performed in overlap for consecutive instructions during one cycle, e.g. while the write step is being performed for a first instruction, simultaneously the operate step is performed for a second instruction, the decode step is performed for a third instruction and the fetch step is performed for a fourth instruction. This is well known in the art and need not be explained further here.
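The overlap described above can be made concrete with a small sketch (purely illustrative and forming no part of the description; the four stage names are taken from the example in the preceding paragraph):

```python
# Sketch of four-stage pipelined overlap (stage names from the text above).
STAGES = ["fetch", "decode", "operate", "write"]

def pipeline_schedule(num_instructions):
    # Returns, per cycle, which (instruction, stage) pairs run in overlap.
    cycles = []
    for cycle in range(num_instructions + len(STAGES) - 1):
        active = []
        for instr in range(num_instructions):
            stage = cycle - instr  # instruction i enters the pipeline at cycle i
            if 0 <= stage < len(STAGES):
                active.append((instr, STAGES[stage]))
        cycles.append(active)
    return cycles
```

At cycle 3 this schedule reproduces the situation given in the text: the write step of a first instruction, the operate step of a second, the decode step of a third and the fetch step of a fourth all run in the same cycle.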
A memory hierarchy using a cache in addition to the main memory takes advantage of locality and of the cost/performance of memory technologies. The principle of locality says that most programs do not access all code or data uniformly. This principle, plus the guideline that smaller hardware is faster, leads to a hierarchy based on memories of different speeds and sizes. Since fast memory is expensive, a memory hierarchy is organized into several levels, each smaller, faster, and more expensive per byte than the next level. The goal is to provide a memory system with a cost almost as low as the cheapest level of memory and a speed almost as fast as the fastest level. The levels of the hierarchy usually subset one another; all data in one level is also found in the level below, and all data in that lower level is found in the one below it, and so on until the bottom of the hierarchy is reached. Normally, each level maps addresses from a larger memory to a smaller but faster memory higher in the hierarchy. Present terminology calls high-level memories cache memories. It is known to provide a plurality of cache levels.
For example, as can be seen in FIG. 12, the cache memory 20 stands higher in the hierarchy than main memory 30, and main memory 30 stands higher in the hierarchy than disk drive 40. When the CPU 10 requests data, it first requests the data from the cache 20. In the event of a miss, the data must be fetched from the main memory 30, and if again a miss occurs, it must be fetched from the disk drive 40. Typically, the CPU will output virtual addresses, i.e. addresses that define a virtual address space, whereas the data will be stored at physical addresses. The actual reading out of data from one of the memories therefore usually requires an address translation from virtual to physical.
Data is read into each of the memories in specific data units. In the case of the main memory 30 such a data unit is called a page, in the case of the cache memory 20 it is called a line or block. Each page or line consists of a number of data words. The CPU 10 can read data out of cache 20 in any desired way, be it in units of lines or in units of words.
Data in a cache memory are organized by directories which are called address tags. Usually, a group of data is associated with one tag. For example, data associated with tag 0123X might have addresses 01230 through 01237. This group of data e.g. forms the above mentioned cache line. Usually, a cache directory behaves associatively, that is, the cache directory retrieves information by key rather than by address. To determine if a candidate address is in the cache, the directory compares the candidate address with all addresses now in the cache. To maintain high speed, this operation must be done as quickly as possible, which should be within one machine cycle. Furthermore, a cache memory is called set associative if the cache is partitioned into distinct sets of lines, each set containing a small fixed number of lines. In this scheme, each address reference is mapped to a particular set by means of a simple operation on the address. If the address is in the cache, then it is stored as one of the lines in the set. Therefore, the cache need not be searched in its entirety. Only the set to which the address is mapped needs to be searched. If a match is found, then the corresponding data line of the cache is gated to the cache output-data buffer, and from there it is transmitted to the computational unit. In summary, there are three parameters for characterizing a cache, namely the number of bytes per line, the number of lines per set and the number of sets. A cache in which the directory search covers all lines in the cache is said to be fully associative, which corresponds to the case when the number of sets is 1.
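The set-associative lookup described above can be illustrated by the following sketch (the line size, set count and associativity are arbitrary example parameters and form no part of the description):

```python
# Minimal sketch of a set-associative cache lookup (hypothetical parameters).
LINE_SIZE = 8   # bytes per line
NUM_SETS = 4    # number of sets
WAYS = 2        # lines per set

def set_index(address):
    # Each address is mapped to a particular set by a simple operation
    # on the address, here: line address modulo the number of sets.
    return (address // LINE_SIZE) % NUM_SETS

def lookup(cache, address):
    # Only the set the address maps to is searched, not the whole cache.
    tag = address // LINE_SIZE
    return any(line_tag == tag for line_tag, _ in cache[set_index(address)])

cache = {s: [] for s in range(NUM_SETS)}
cache[set_index(0x40)].append((0x40 // LINE_SIZE, b"data"))
```

With a set count of 1 every line falls into the same set and the lookup degenerates into the fully associative search mentioned above.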
In the cache memory some active portion of the low-speed main memory is stored in duplicate. When a memory request is generated, the request is first presented to the cache memory, and if the cache cannot respond, the request is then presented to main memory. If an item is not resident in the cache but in the main memory, this constitutes the above mentioned cache miss. Assuming e.g. that a tag 0124X is not present, then a reference to address 01243 produces a miss for the cache since no tag matches this address. The item is then retrieved from main memory and copied into the cache. During the short period available before the main-memory operation is complete, some other item in the cache is removed from the cache to make room for the new item. Special replacement algorithms deal with the cache-replacement decision. A well known strategy is LRU (least recently used) replacement. According to the LRU replacement algorithm, the cache line which was not used for the longest time will be overwritten by a new line from the main memory.
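A minimal sketch of LRU replacement for a single cache set may clarify the strategy (the class and parameter names are merely illustrative):

```python
from collections import OrderedDict

# Sketch of LRU replacement within one cache set (capacity is hypothetical).
class LRUSet:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # tag -> data, ordered oldest-used first

    def access(self, tag, fetch):
        if tag in self.lines:                # hit: mark as most recently used
            self.lines.move_to_end(tag)
            return self.lines[tag]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict least recently used line
        self.lines[tag] = fetch(tag)         # miss: copy line in from main memory
        return self.lines[tag]
```

On a miss with a full set, the line that was not used for the longest time is the one overwritten, exactly as described above.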
A similar situation exists when fetching data from the main memory 30, except that the lack of the requested data is referred to as a page fault. In the event of a page fault, a new page containing the requested data must be loaded from the disk drive 40, and another page in the memory must be discarded in order to make room for the new page. The main memory therefore also has a dedicated replacement algorithm.
It is understandable that a primary goal of designing a memory system is to avoid misses as far as possible, and it is equally understandable that one aspect in this connection is the choice of an appropriate replacement algorithm at each level.
Misses in caches can be classified into four categories: conflict, compulsory, capacity and coherence misses (see e.g. N. P. Jouppi: Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, The 17th International Symposium on Computer Architecture Conference proceedings (ISCA-17), 1990; internet publication http://www.research.digital.com/wrl/techreports/abstracts/TN-14.html). Conflict misses are misses that would not occur if the cache were fully associative and had LRU replacement. Compulsory misses are misses required in any cache organization because they are the first references to an instruction or piece of data. Capacity misses occur when the cache size is not sufficient to hold data between references. Coherence misses are misses that occur as a result of invalidation to preserve multiprocessor cache consistency.
One obvious way of reducing the number of capacity and compulsory misses is to use longer line sizes, i.e. to increase the capacity of the memory. However, line sizes cannot be made arbitrarily large without increasing the miss rate and greatly increasing the amount of data to be transferred.
Another concept that complements the replacement algorithm is prefetching (see e.g. "Rechnerarchitektur" by J. L. Hennessy and D. A. Patterson, Vieweg Verlag). Prefetching means that an algorithm is implemented for selecting data units in expectation of their being requested later. In other words, in the example of a cache, this means that not only is the cache line containing the data belonging to the miss loaded, but one or more further cache lines are loaded as well, where the rules for choosing such supplementary lines are determined by the prefetch algorithm. These rules are associated with some concept of prediction of the future behaviour of the system. Prefetch techniques are interesting because they can be more adaptive to the actual access patterns of the program than simply increasing the cache size. This is especially important for improving the performance on long quasi-sequential access patterns such as instruction streams or unit-stride array accesses.
Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. A well known fetch and branch prediction mechanism (see e.g. B. Calder, D. Grunwald: Next Cache Line and Set Prediction, The 22nd International Symposium on Computer Architecture Conference proceedings (ISCA-22), 1995; internet publication http://www-cs.ucsd.edu/~calder/abstracts/ISCA-NLS-95.html) is the use of branch target buffers (BTB), for which the Intel Pentium is an example. The Intel Pentium has a 256-entry BTB organized as a four-way associative cache. Only branches that are "taken" are entered into the BTB. If a branch address appears in the BTB and the branch is predicted as taken, the stored address is used to fetch future instructions, otherwise the fall-through address is used. For each BTB entry, the Pentium uses a two-bit saturating counter to predict the direction of a conditional branch. In this BTB architecture the branch prediction information (the two-bit counter) is associated or coupled with the BTB entry. Thus, the dynamic prediction can only be used for branches in the BTB, and branches that miss in the BTB must use less accurate static prediction. In other words, the BTB keeps a dynamic record of branch events.
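The behaviour of such a two-bit saturating counter can be sketched as follows (the state encoding, with values 0-1 predicting not-taken and 2-3 predicting taken, is the conventional one and is assumed here for illustration):

```python
# Sketch of a two-bit saturating counter as used per BTB entry.
# States 0-1 predict not-taken; states 2-3 predict taken.
def update(counter, taken):
    if taken:
        return min(counter + 1, 3)  # saturate at strongly taken
    return max(counter - 1, 0)      # saturate at strongly not-taken

def predict_taken(counter):
    return counter >= 2
```

Saturation means a single anomalous branch outcome cannot immediately flip a strongly established prediction.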
From the same paper by B. Calder and D. Grunwald an alternative computer system having a cache memory with a fetch and branch prediction mechanism is known. The instruction following a branch is fetched by using an index into the cache, which is called a next cache line and set (NLS) predictor. An NLS predictor is a pointer into the instruction cache, indicating the target instruction of a branch. The NLS predictor is either decoupled from the cache line and stored in a separate tag-less memory buffer (referred to as an NLS-table), or is directly stored together with the cache lines (referred to as an NLS-cache). It is assumed that during the instruction fetch stage of the pipeline, each instruction can easily be identified as a branch or non-branch instruction. This can be done either by providing a distinguishing bit in the instruction set or by storing that information in the instruction cache.
For the next instruction fetch there are three predicted addresses available: the NLS predictor, the fall-through line (previous predicted line + fetch size) and the top of a return stack, e.g. holding the instructions after a return from a subroutine. The NLS predictor itself contains three fields: the type field, the line field and the set field. The type field shows the possible prediction sources, namely a conditional branch, other types of branches, the return instruction, and an invalid bit for an invalid NLS predictor. The line field contains the line number to be fetched from the instruction cache. The set field is used to indicate where the predicted line is located if a multi-associative cache is used; it is not needed for a direct mapped cache.
If the instruction being fetched from the instruction cache indicates that it is a branch instruction, the NLS predictor is used and the type field is examined to choose among the possible next fetch addresses. Return instructions use the return stack. Unconditional branches and indirect branches use the cache line specified by the NLS entry. If the type field indicates a conditional branch, the architecture uses the prediction given by a pattern history table (PHT) which combines the history of several recent branches to predict the outcome of a branch.
If the branch is predicted as taken, the NLS line and set fields are used to fetch the appropriate cache line and instruction from the instruction cache. If the conditional branch is predicted as not-taken, the pre-computed fall-through line address is used on the next instruction fetch.
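The selection among the three predicted addresses described above can be summarized in a short sketch (field and type names are assumed for illustration and do not reproduce the exact encoding of the reference):

```python
# Sketch of next-fetch selection with an NLS predictor (names assumed).
def next_fetch(nls, fall_through, return_stack, pht_predicts_taken):
    if nls["type"] == "return":
        return return_stack[-1]            # return instructions use the stack
    if nls["type"] == "conditional":
        if pht_predicts_taken:
            return (nls["set"], nls["line"])  # fetch the predicted cache line
        return fall_through                # pre-computed fall-through line
    # unconditional and indirect branches use the NLS entry directly
    return (nls["set"], nls["line"])
```

Note that for a conditional branch the NLS entry supplies only the target; the taken/not-taken decision itself comes from the PHT.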
The NLS entries are updated after instructions are decoded and the branch type and destinations are resolved. The instruction type determines the type field and the branch destination determines the set and line field. Only taken branches update the set and line field, but all branches update the type field. A conditional branch which executes the fall-through does not update the set and line field, so that the pointer to the target instruction is not erased. For conditional branches, this allows the branch prediction hardware to use either the NLS predictor for taken conditional branches or to use pre-computed fall-through line, depending on the outcome of the PHT.
From M. Johnson: Superscalar Microprocessor Design, Prentice Hall, Englewood Cliffs, N.J., 1990, pages 71-77, a branch prediction is known which is based on special instruction-fetch information included in the cache entries. The fetch information contains a conventional address tag and a successor index field as well as a branch block index field. The successor index field indicates both the next cache block predicted to be fetched and the first instruction within this next block predicted to be executed. The branch block index field indicates the location of a branch point within the corresponding instruction block.
To check each branch prediction, the processor keeps a list in an array of predicted branches ordered by the sequence in which branches were predicted.
When a branch is executed, the processor compares information related to this branch with the information at the front of the list of predicted branches, which is the oldest predicted-taken branch. The following conditions must hold for a successful prediction:
If the executed branch is taken, its location in the cache must match the location of the next branch on the list of predictions.
If the location of the executed branch matches the location of the oldest branch on the list of predictions, the predicted target address must equal the next instruction address determined by executing the branch.
If either of the foregoing conditions does not hold, the instruction fetcher has mispredicted a branch. The instruction fetcher uses the location of the branch determined by the execution unit to update the appropriate cache entry.
From the above mentioned article by Jouppi a memory hierarchy having a first level cache, a second level cache, and so called stream buffers in between is known. A stream buffer consists of a series of entries, each consisting of a tag, an available bit, and a data line. When a miss occurs in the cache that is at a higher hierarchical level than the stream buffer, the stream buffer begins prefetching successive lines starting at the miss target, from the memory element provided at a lower hierarchical level, e.g. a lower level cache. As each prefetch request is sent out, the tag for the address is entered into the stream buffer, and the available bit is set to false. When the prefetch data returns, it is placed in the entry with its tag and the available bit is set to true.
The stream buffers are considered as FIFO queues, where only the head of the queue has a tag comparator and elements removed from the buffer must be removed strictly in sequence without skipping any lines. A line miss will cause a stream buffer to be flushed and restarted at the miss address even if the requested line is already present further down in the queue.
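The head-comparator-only behaviour described above may be sketched as follows (a simplification: the prefetched lines are filled in synchronously here, so the available bit is always true; depth and naming are assumptions):

```python
from collections import deque

# Sketch of a simple stream buffer with a comparator only at the queue head.
class StreamBuffer:
    def __init__(self, depth=4):
        self.depth = depth
        self.entries = deque()  # entries of (tag, available, data)

    def restart(self, miss_tag):
        # Flush, then prefetch successive lines starting at the miss target.
        self.entries = deque(
            (miss_tag + i, True, f"line{miss_tag + i}") for i in range(self.depth)
        )

    def lookup(self, tag):
        # Only the head of the queue is compared; lines leave strictly in
        # sequence. A head miss flushes and restarts at the miss address.
        if self.entries and self.entries[0][0] == tag:
            return self.entries.popleft()[2]
        self.restart(tag)
        return None
```

A quasi-sequential stream buffer would additionally compare entries further down the queue instead of flushing whenever the head does not match.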
Furthermore, Jouppi also mentions more complicated stream buffers which can provide already-fetched lines out of sequence. Such a stream buffer, which also has comparators for entries other than the head of the queue, is referred to as a quasi-sequential stream buffer. Also, an arrangement is disclosed in which a number of stream buffers are connected in parallel, and when a miss occurs in the high level cache, all of the parallel stream buffers are searched. Such a parallel arrangement is referred to as a multi-way stream buffer. When a miss occurs in the data cache that does not hit in any stream buffer of the plurality, the stream buffer hit least recently is cleared (i.e. LRU replacement) and it starts fetching at the miss address.
Subsequent accesses to the cache also compare their address against the first item stored in the buffer. If a reference misses in the cache but hits in the buffer the cache can be reloaded in a single cycle from the stream buffer.
In summary, the Jouppi reference discloses placing a stream buffer between a first level cache and the next slower memory in the hierarchy, where a prefetch from said slower memory is initiated by a cache miss. The reference by Calder and Grunwald discloses the use of an NLS predictor, where prefetching is always conducted in accordance with this predictor.
Although the above mentioned prefetch mechanisms can already handle flow control, these mechanisms still show a decrease in computing speed for code with frequent and short jumps. In particular, such code portions are used in telecommunications applications such as exchange computing.
Therefore, it is an object of the invention to provide a data processing system with a hierarchical memory which shows an efficient data exchange control, especially for applications in telecommunications.
This object is solved by a data processing system according to claim 1. Advantageous embodiments are described in the dependent claims.
The first memory element can e.g. be an instruction cache memory and the data requesting element can e.g. be a processor. The data units to be read from the first memory element can then be cache lines, but equally well data words or any other unit suitable for the desired purpose. The data identifiers can e.g. be any suitable type of address, be it physical or virtual. In this example, in which the data requesting element is a processor, the element for establishing a sequence of data identifiers is the program counter in the processor, where said program counter defines a sequence of instruction data identifiers, i.e. instruction addresses. Although this is a preferred embodiment that will be described in detail further on, it may be noted that the data requesting element could itself comprise a memory that requests data from a lower level memory, where the element for establishing a sequence of data identifiers could then again be the program counter, but could also be any other suitable control element that determines a sequence as specified in claim 1. In other words, the present invention is by no means restricted to being implemented at the highest level of the memory hierarchy (i.e. next to the processor), but can also be implemented between lower level elements in the hierarchy.
The second memory element is provided between the first memory element and the data requesting element in terms of the hierarchy. In other words, the data requesting element provides data requests (e.g. a desired instruction address) to the second memory element, where the desired data is supplied if it is present in the second memory element (i.e. in case of a hit), and where a demand for this desired data is provided to the first memory element if the data is not present (i.e. in case of a miss). It may be remarked that in the present specification and claims, for the purpose of clarity, the term "request" will refer to data asked for by the data requesting element, and the term "demand" will refer to data asked for by the second memory element. It should be noted that the data request issued by the data requesting element can be identical to the data demand issued by the second memory element, e.g. one and the same virtual address, but it is equally well possible that the demands use a different addressing scheme than the requests.
The second memory element is preferably a stream buffer as described in the Jouppi reference, and more preferably a quasi-sequential multi-way stream buffer. However, it is clear that any suitable storage means can be used, e.g. a simple flip-flop could also be used, or the second memory element could also be arranged and organized like a cache memory.
In accordance with the present invention, the second memory element is operable to perform a prefetch procedure for data units from said first memory element. A first sub-procedure performs a prefetch in accordance with a prefetch data identifier stored in association with a given data unit. More specifically, upon detecting a first predetermined change in status of the second memory element, a first given data unit is determined in the second memory element, which is associated with this first predetermined change in status. The predetermined change in status can for example be the reading out of a data unit from the second memory element, in which case the given data unit can be the data unit that was read out, or the predetermined change in status can be the loading of a data unit into the second memory element, in which case the given data unit can be the data unit that was loaded.
Then it is checked if the first given data unit fulfils a predetermined condition, where the predetermined condition relates to a prefetch data identifier stored in association with said first given data unit. The prefetch data identifier identifies a different data unit than said first given data unit. In other words, the prefetch identifier is not the address of the given data unit, but rather the address of another data unit.
The storage of the prefetch data identifier in association with a data unit can be arranged in any desired way, e.g. together with the data unit itself but as separate units, together with the data unit and as a part of the data unit, or in a separate table using the data identifier (address) of the given data unit as a reference.
The predetermined condition that relates to the prefetch data identifier can be a simple check whether such an identifier is present at all, e.g. by checking whether a specific field that is reserved for the prefetch data identifier (be it in the data unit itself or in a separate table) contains data other than zero, or the predetermined condition can also be the checking of a specific indicator, such as a prefetch data identifier valid bit. If the predetermined condition is fulfilled, at least the data unit identified by the prefetch data identifier is fetched. "At least" means that other data units may also be fetched together with the data unit identified by the prefetch data identifier, for example data units identified by data identifiers following the prefetch data identifier in the sequence, or data units following the data identifier belonging to the given data unit.
A second sub-procedure is implemented for performing a sequential prefetch. In other words, upon detecting a second predetermined change in status of said second memory element, a given data unit associated with said second predetermined change in status is determined, and at least the next data unit in the sequence of data identifiers is fetched. "At least" again means that additionally other data units may be fetched together with the next data unit, e.g. the next two or three data units. The second predetermined change in status can be completely independent of the first predetermined change in status, e.g. it may involve reaching a limit related to the filling of the second memory element, such as a low water mark, or it can be coupled to the first predetermined condition. An example of the latter case is that the determination of the second change of status comprises determining the first change of status. This can mean that e.g. the second change in status is determined if the first change in status is determined (e.g. a read out of a specific data unit) and an additional condition is met, e.g. that the first given data unit does not fulfil the predetermined condition (e.g. the prefetch data identifier valid bit is not set). In this case the prefetch sub-procedure on the basis of the prefetch data identifier and the sequential sub-procedure are operated in the alternative. But it is equally well possible that the additional condition is identical to the first predetermined condition, namely that this condition is met (e.g. the prefetch data identifier valid bit is set), such that the two sub-procedures are conducted in conjunction.
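The case in which the two sub-procedures are operated in the alternative can be sketched as follows (a simplified illustration only, assuming the first predetermined change in status is the read-out of a data unit, that data identifiers are consecutive integers, and that a valid bit marks stored prefetch identifiers):

```python
# Sketch of the two prefetch sub-procedures operated in the alternative.
def on_read_out(unit, fetch, prefetch_table):
    entry = prefetch_table.get(unit["id"])
    if entry is not None and entry["valid"]:
        # first sub-procedure: follow the stored prefetch data identifier
        fetch(entry["target"])
    else:
        # second sub-procedure: sequential prefetch of the next data unit
        fetch(unit["id"] + 1)
```

In the conjunctive variant described above, the sequential fetch would be issued in addition to, rather than instead of, the fetch via the prefetch data identifier.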
As described above, the present invention comprises a prefetching concept that involves both sequential prefetch and prefetch in accordance with a prefetch data identifier. Each is conducted under a corresponding condition, where the conditions may be different, coupled or the same. This provides great flexibility and efficiency. In particular, it is possible to simultaneously cope both with program jumps and with sequential program flow in a simple and effective manner, as the prefetch procedure takes both situations into account.
It may be noted that although the Jouppi reference teaches the use of stream buffers, these stream buffers are placed between a cache and the next slower memory and controlled in accordance with the contents and status of the cache. The present invention, when applied to the situation described by Jouppi, would consist in placing the second memory element between the cache and the processor, i.e. above the cache, not below. Also, the prefetch indicated in the Jouppi reference is only initiated in the event of a miss in the cache, such that the system of the present invention is far more flexible. The reference by Calder and Grunwald teaches always using an NLS predictor with respect to a prefetch, so that again no flexibility is achieved.
The process of allocating certain prefetch data identifiers to certain data units and/or their respective data identifiers and not allocating such prefetch data identifiers to other data units, i.e. the selection of certain data units as having a prefetch data identifier and the validation of specific prefetch data identifiers can in principle be done in any suitable or desirable way. According to a preferred embodiment, this is done by introducing a third memory element for storing data identifiers that identify the data most recently requested by the data requesting element. These data identifiers are stored in the order of their last having been requested.
The management or updating of the prefetch data identifiers stored in association with certain data units is then accomplished by performing a procedure such that if data identified by a data identifier provided by the data requesting element to said second memory element as a data request is not present in said second memory element (i.e. in the event of a miss), the data identifier for which no related data is present in said second memory element is associated with a data identifier belonging to a previous request stored in the third memory element, and then the data identifier for which no related data is present in said second memory element is stored as a prefetch data identifier in association with the data unit in said first memory element identified by said previous data request identifier.
Preferably the third memory element will simply queue a predetermined number of data identifiers that belong to the last data units read out of the second memory element. If a miss occurs in the second memory element, then the data identifier (address) identifying the data unit that missed will be "written back" as a prefetch data identifier to one of the previous data identifiers in the queue of the third memory element. As is understandable, each of the data identifiers in the third memory element identifies a hit. The depth of writing back (i.e. whether the data identifier will be associated with the last data unit read out, the second last, the third last, etc.) depends on the specific system, such as on the latency etc. The depth of the third memory element (i.e. the number of queued identifiers) and the depth of writing back should be chosen appropriately. Namely, by performing the above write back procedure in the event of a miss in the second memory element, a connection is established between the previous data unit and the present data unit. As the miss of the present data unit is an indication of a jump, using the data identifier of the missed data unit as a prefetch data identifier for the previous data unit provides a selective record of this jump, such that the chances of avoiding a miss in the wake of the next request for the previous data unit are greatly improved, at least assuming that it is probable that the same jump will be performed again.
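The write-back mechanism may be sketched as follows (an illustrative simplification, assuming a write-back depth of 1, i.e. the missed identifier is associated with the most recently read-out data unit; class and field names are assumptions):

```python
from collections import deque

# Sketch of the miss write-back that creates prefetch data identifiers.
class PrefetchUpdater:
    def __init__(self, history_depth=4, write_back_depth=1):
        self.history = deque(maxlen=history_depth)  # third memory element
        self.write_back_depth = write_back_depth
        self.prefetch_ids = {}  # data identifier -> prefetch data identifier

    def record_hit(self, identifier):
        # Queue identifiers of data units read out of the second memory
        # element, most recent first.
        self.history.appendleft(identifier)

    def record_miss(self, identifier):
        # Associate the missed identifier with a previously read-out unit,
        # recording the jump so it can be prefetched on the next pass.
        if len(self.history) >= self.write_back_depth:
            previous = self.history[self.write_back_depth - 1]
            self.prefetch_ids[previous] = identifier
```

The bounded `deque` naturally discards identifiers older than the queue depth, matching the fixed size of the third memory element.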
The invention will now be described by way of preferred embodiments, which serve to exemplify the invention but should by no means be seen as restrictive, and with reference to the accompanying drawings in which:
FIG. 1 shows a basic arrangement of a data processing system of the invention;
FIG. 2 is a schematic representation of a stream buffer used in the present invention;
FIG. 3 is a schematic representation of the sequence of instruction addresses produced by a program counter and showing jumps in said sequence;
FIG. 4 is a flowchart showing a basic embodiment of the method of the present invention;
FIG. 5 is a flowchart that shows a specific embodiment based on the method of FIG. 4;
FIG. 6 shows a flowchart describing another modification of the basic method of FIG. 4;
FIG. 7 shows another flowchart which is a further modification of the basic method of FIG. 4;
FIG. 8 is a flowchart that shows a preferred method for updating the prefetch data identifiers associated with certain data units;
FIGS. 9a and 9b are schematic representations for explaining the preferred process of updating the prefetch data identifiers;
FIG. 10 is a preferred embodiment of the data processing system according to the present invention;
FIG. 11 is a schematic representation for describing the operation of the system of FIG. 10; and
FIG. 12 is a schematic representation that shows a basic memory hierarchy.