1. Field of the Invention
The present invention relates to techniques for accessing memory units in a data processing apparatus.
2. Description of the Prior Art
A data processing apparatus will typically include a processor core for executing data processing operations. A memory system will then be made available to the processor core for storing data and/or instructions required by the processor core to perform such data processing operations. Hence, the processor core will receive instructions and associated data from the memory system, will execute those instructions, and optionally will output data for storing back in the memory system. Hereafter, the term “data value” will be used to refer to both instructions and data. When a data value is to be transferred to/from the memory system, the processor core will issue an access request specifying that transfer.
A typical memory system will include a main memory, also referred to herein as an external memory, which can store the data values required by the processor core. However, the retrieval of data values from that main memory, and the writing of data values back to that main memory, is typically a relatively slow process, and accordingly it is known to provide one or more memory units in addition to the main memory within the memory system. A well-known example of such an additional memory unit is a cache, which can be used to store data values retrieved from the main memory, and/or data values output by the processor core, so that those data values are readily available to the processor core if required for subsequent data processing operations. It will be appreciated by those skilled in the art that there are a number of well-known techniques for determining which data values get stored within the cache, and which data values get evicted from the cache when new data values need storing within the cache. However, fundamentally, the cache is typically relatively small compared to the main memory, is significantly quicker to access than the main memory, and is aimed at temporarily storing data values that are likely to be needed by the processor core.
The memory system may include a single cache, or alternatively may contain a plurality of caches arranged, for example, in a hierarchical structure.
In addition, another type of memory unit that may be included within the memory system is a tightly-coupled memory (TCM), which is typically connected to the processor bus on which the processor core issues access requests, and is used to store data values for which a deterministic access time is required. The TCM presents a contiguous address space to a programmer, which can be used to store data values, and hence, as an example, a particular portion of code for which a deterministic access time is important can be stored directly in the TCM. The TCM can be used as if it were a particular portion of the main memory (i.e. the data values in the TCM are not replicated in the main memory), or alternatively the data values to be placed in the TCM can be copied from the main memory. Typically, a register somewhere within the data processing apparatus will keep a record of the address range of data values placed in the TCM so that it can be determined whether a particular data value the subject of an access request by the processor core will be found in the TCM or not. The TCM may be embodied in any appropriate form, for example, Random Access Memory (RAM), Read Only Memory (ROM), etc.
In a data processing apparatus of the above type, where the memory system comprises a plurality of memory units, an access request issued by a processor core is typically analysed to determine which memory unit should be used to perform the access. For example, if the access request relates to a read of a data value, and the address issued as part of the access request relates to a cacheable area of memory, then it is appropriate to access the cache to determine whether that data value is present in the cache. If it is, then the data value can be returned directly to the processor core, whereas if it is not, then typically a linefill procedure will be invoked to read a number of data values, including the data value of interest, from external memory, and to then place those retrieved data values in a line of the cache.
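By way of illustration only, the cache lookup and linefill behaviour described above can be sketched as follows. The direct-mapped organisation, the line length of four data values, and all names used (`DirectMappedCache`, `LINE_WORDS`, `main_memory`) are assumptions made for the sketch and are not taken from any particular apparatus.

```python
LINE_WORDS = 4  # assumed number of data values per cache line


class DirectMappedCache:
    """Toy direct-mapped cache; organisation and sizes are illustrative only."""

    def __init__(self, num_lines, main_memory):
        self.num_lines = num_lines
        self.main_memory = main_memory  # list standing in for external memory
        self.tags = [None] * num_lines
        self.lines = [[0] * LINE_WORDS for _ in range(num_lines)]
        self.linefills = 0

    def read(self, address):
        line_addr = address // LINE_WORDS
        index = line_addr % self.num_lines
        offset = address % LINE_WORDS
        if self.tags[index] != line_addr:
            # Cache miss: linefill the whole line containing the data value.
            base = line_addr * LINE_WORDS
            self.lines[index] = self.main_memory[base:base + LINE_WORDS]
            self.tags[index] = line_addr
            self.linefills += 1
        # Hit, or miss now satisfied by the linefill.
        return self.lines[index][offset]


main_memory = list(range(100, 164))  # 64 hypothetical data values
cache = DirectMappedCache(num_lines=4, main_memory=main_memory)
first = cache.read(9)    # miss: addresses 8 to 11 are fetched into one line
second = cache.read(10)  # hit: served from the line filled above
```

The second read is satisfied without a further linefill because the first read brought the neighbouring data values into the same cache line.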
Similarly, if, having reference to the register storing the address range of data values stored in the TCM, it is determined that the data value resides in the TCM, then it is clearly appropriate to access the TCM to retrieve the data value required by the processor core.
However, to achieve desired performance levels for performing accesses, there is not typically sufficient time to wait for the above-described analysis of the access request to be completed before the access to the appropriate memory unit is initiated. Instead, for performance reasons, it is typically required to perform the access to several of the memory units simultaneously, so that by the time the analysis of the access request has taken place, and the appropriate memory unit to access has hence been determined, that memory unit is already in a position to complete the access (for example by outputting the desired data value to the processor core for a read request, or storing the required data value for a write request). Further, any output generated by the other memory units that have been accessed, but which in hindsight need not have been, can be ignored.
For example, if a cache lookup took place and resulted in a cache miss, but the results of the analysis of the access request indicated that the data value was in a non-cacheable region of memory, then the fact that the cache miss occurred can be ignored, rather than invoking the usual procedure of performing a linefill to the cache. Similarly, if the address specified by the access request is outside of the range of the addresses stored within the TCM, then the TCM will still typically generate an output based on that portion of the address which is within the range of addresses for data stored within the TCM. However, once the analysis of the access request indicates that the data value is not within the TCM, that output from the TCM can be ignored.
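The prior art approach of accessing both memory units and then qualifying their outputs might be modelled, purely as a sketch, as shown below. The function name `parallel_read`, the use of a dictionary for the cache and for external memory, and the assumption that the TCM base address is a multiple of the TCM size are all hypothetical simplifications.

```python
def parallel_read(address, cache, tcm, tcm_base, cacheable, external_memory):
    """Drive the cache and the TCM together, then qualify their outputs."""
    # Both memory units are accessed before the analysis has completed.
    cache_out = cache.get(address)       # None models a cache miss
    tcm_out = tcm[address % len(tcm)]    # the TCM always produces some output
    units_clocked = 2

    # The analysis of the access request completes afterwards.
    in_tcm = tcm_base <= address < tcm_base + len(tcm)
    if in_tcm:
        return tcm_out, units_clocked    # cache result is ignored
    if cacheable and cache_out is not None:
        return cache_out, units_clocked  # TCM output is ignored
    # Miss or non-cacheable region: both outputs are ignored and the data
    # value comes from external memory (the linefill itself is not modelled).
    return external_memory[address], units_clocked


tcm = [i * 10 for i in range(16)]
cache = {0x20: 7}
external_memory = {0x30: 99}
from_tcm = parallel_read(0x104, cache, tcm, 0x100, True, external_memory)
from_cache = parallel_read(0x20, cache, tcm, 0x100, True, external_memory)
from_external = parallel_read(0x30, cache, tcm, 0x100, False, external_memory)
```

In every case two memory units are driven although only one of them, at most, supplies the data value.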
Whilst from a performance point of view the above approach of speculatively accessing multiple memory units, and then qualifying their outputs based on the results of the analysis of the access request, enables the required performance for accesses to be achieved, such an approach consumes significant power, since more memory units are accessed than are actually required to perform the access request issued by the processor core. For example, in a system employing a cache and a TCM, if the access request actually specifies a data value contained within the TCM, then the cache will unnecessarily have been driven to perform an access, whilst similarly if the access request relates to a cacheable data value, the TCM will unnecessarily have been driven to perform the access.
Accordingly, it would be desirable to provide a more power efficient technique for performing memory accesses, which does not unduly impact performance.
Viewed from a first aspect, the present invention provides a data processing apparatus, comprising: a plurality of memory units for storing data values; a processor core for issuing an access request specifying an access to be made to the memory units in relation to a data value; a memory controller for performing the access specified by the access request; attribute generation logic for determining from the access request one or more predetermined attributes identifying which of the memory units should be used when performing the access; prediction logic for predicting the one or more predetermined attributes; clock generation logic responsive to the predicted predetermined attributes from the prediction logic to select which one of the memory units is to be clocked during performance of the access, and to issue a clock signal to that memory unit; checking logic for determining whether the predetermined attributes generated by the attribute generation logic agree with the predicted predetermined attributes, and if not, for reinitiating the access, in which event the clock generation logic is arranged to reselect one of the memory units using the predetermined attributes as determined by the attribute generation logic.
Hence, in accordance with the present invention, attribute generation logic is provided to determine from an access request one or more predetermined attributes identifying which of the memory units should be used to perform the access. However, for performance reasons, the memory controller begins to perform the access specified by the access request without waiting for the attribute generation logic to finish its determination. In contrast to the earlier described prior art technique, the access is not speculatively performed across multiple memory units; instead, prediction logic is provided to predict the one or more predetermined attributes, and clock generation logic is provided that is responsive to the predicted predetermined attributes to select which one of the memory units to clock during performance of the access, and to issue a clock signal to that memory unit. Accordingly, taking the earlier example of a data processing apparatus that includes a cache and a TCM, if the predicted predetermined attributes indicate that the access request relates to a cacheable data value, then the cache will be clocked, but the TCM will not.
In accordance with the present invention, the data processing apparatus also includes checking logic which, once the attribute generation logic has determined the predetermined attributes, is arranged to determine whether those predetermined attributes agree with the predicted predetermined attributes. If they do, then no action is required, as the access will have been performed correctly based on the predicted predetermined attributes. However, if the predetermined attributes do not agree with the predicted predetermined attributes, the access is reinitiated, in which event the clock generation logic is arranged to reselect one of the memory units using the predetermined attributes rather than the predicted predetermined attributes.
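The sequence of predicting, clocking a single memory unit, checking and, where necessary, reinitiating the access may be sketched as follows. This is a behavioural model only; the names `access`, `determine_unit` and `units`, and the example TCM address range, are hypothetical and are introduced solely for the sketch.

```python
CACHE, TCM = "cache", "tcm"
TCM_BASE, TCM_SIZE = 0x100, 16  # assumed TCM address range


def determine_unit(address):
    """Stands in for the attribute generation logic (the slower, exact path)."""
    return TCM if TCM_BASE <= address < TCM_BASE + TCM_SIZE else CACHE


def access(address, predicted_unit, determine_unit, units):
    """Clock only the predicted unit, then check and reinitiate on mismatch."""
    clocked = [predicted_unit]
    value = units[predicted_unit](address)  # access begins on the prediction

    actual_unit = determine_unit(address)   # determination completes later
    mispredict = actual_unit != predicted_unit
    if mispredict:
        # Reinitiate the access using the determined attributes.
        clocked.append(actual_unit)
        value = units[actual_unit](address)
    return value, clocked, mispredict


# Each memory unit is modelled as a callable that tags the value it returns.
units = {CACHE: lambda a: ("cache", a), TCM: lambda a: ("tcm", a)}

correct = access(0x104, TCM, determine_unit, units)
wrong = access(0x104, CACHE, determine_unit, units)
```

With a correct prediction a single memory unit is clocked; with an incorrect prediction the access is repeated on the correct unit, so that two units are clocked in sequence rather than in parallel.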
Accordingly, it can be seen that the present invention, when used with a reasonably accurate prediction scheme, reduces power consumption by avoiding parallel accesses to multiple memory units, at the expense of a relatively small loss in performance due to occasional misprediction of the memory unit to be accessed.
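The trade-off can be expressed as a simple estimate. The model below is an assumption made for illustration: it supposes that a correctly predicted access clocks one memory unit, that a mispredicted access clocks two (the wrongly predicted unit and then the correct one), and that a reinitiated access costs a fixed number of additional cycles. The figures passed in the example are hypothetical inputs, not measured results.

```python
def expected_units_clocked(prediction_accuracy):
    """Average memory units clocked per access under the predictive scheme."""
    return 1.0 + (1.0 - prediction_accuracy)


def expected_extra_cycles(prediction_accuracy, reaccess_penalty_cycles):
    """Average additional cycles per access caused by reinitiated accesses."""
    return (1.0 - prediction_accuracy) * reaccess_penalty_cycles


# Hypothetical inputs: 90% of predictions correct, one cycle lost per mispredict.
units_per_access = expected_units_clocked(0.9)
cycles_per_access = expected_extra_cycles(0.9, 1)
```

Under these assumed inputs roughly 1.1 memory units are clocked per access, against 2 for the parallel scheme in a system with a cache and a TCM, at an average cost of about 0.1 cycle per access.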
It is possible for the data processing apparatus to include a generic memory controller for controlling accesses to any of the plurality of memory units. However, in preferred embodiments, the memory controller comprises a plurality of memory controllers, each memory controller being associated with a different memory unit, and the clock generation logic is arranged to clock the selected memory unit and its associated memory controller during performance of the access. With such an approach, it is possible not only to save power by not clocking any memory units other than the one indicated by the predicted predetermined attributes, but additionally power can be saved by not clocking any of the associated memory controllers for those non-clocked memory units.
It will be appreciated that the predetermined attributes can take a variety of forms, and may be determined in a number of different ways. However, in preferred embodiments, the access request specifies an address relating to the data value, and the attribute generation logic is arranged to determine the predetermined attributes dependent on the address. In such embodiments, it will be apparent that the address need not be used in isolation to determine the predetermined attributes, but may be used in combination with other information, such as the TCM region register settings, page table attributes, etc.
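A minimal sketch of attribute generation dependent on the address, used in combination with a TCM region register setting and a page table attribute, is given below. The `Attributes` structure, its two fields and the function `generate_attributes` are hypothetical; an actual apparatus may encode the predetermined attributes quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attributes:
    use_tcm: bool    # the access should be performed by the TCM
    use_cache: bool  # the access should be performed by the cache


def generate_attributes(address, tcm_base, tcm_size, page_cacheable):
    """Derive the attributes from the address and other settings.

    tcm_base and tcm_size model the TCM region register settings;
    page_cacheable models a page table attribute for the addressed page.
    """
    in_tcm = tcm_base <= address < tcm_base + tcm_size
    return Attributes(use_tcm=in_tcm, use_cache=(not in_tcm) and page_cacheable)


tcm_access = generate_attributes(0x104, 0x100, 16, True)
cache_access = generate_attributes(0x20, 0x100, 16, True)
uncached_access = generate_attributes(0x20, 0x100, 16, False)
```

In this sketch an address within the TCM region selects the TCM regardless of the page attribute, which is one possible priority and is stated here as an assumption.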
It will be apparent that the present invention may be utilised in any apparatus in which multiple memory units are used. However, in preferred embodiments, a first memory unit is tightly coupled memory for storing data values to which the processor core requires deterministic access. TCMs are typically relatively large compared with caches, and hence consume more power to clock speculatively as is done in the earlier described prior art techniques. Accordingly, in embodiments where one of the memory units is a TCM, significant power savings can be made by employing the techniques of the preferred embodiment of the present invention.
Furthermore, in preferred embodiments, a second memory unit is a cache.
It will be appreciated that the attribute generation logic may take a variety of forms. However, in preferred embodiments, the attribute generation logic is contained within a memory management unit (MMU) arranged to generate for each access request a number of attributes including the predetermined attributes. Typically, the data processing apparatus will already include an MMU, the MMU being responsible for analysing access requests in order to generate certain attributes, for example a physical address assuming the address output by the processor core is a virtual address, an indication as to whether the data value is cacheable, an indication as to whether the data value is bufferable, etc. By arranging the MMU to include within the attributes that it produces the predetermined attributes required in preferred embodiments of the present invention, a particularly efficient embodiment can be realised, since use is made of the pre-existing circuitry of the MMU.
In preferred embodiments, the MMU comprises a table lookaside buffer for comparing an address specified by the access request with predetermined addresses in the table lookaside buffer, for each predetermined address the table lookaside buffer containing the number of attributes needing to be generated by the MMU. Hence, in this embodiment, the attributes, including the predetermined attributes required in accordance with preferred embodiments of the present invention, are precoded into the table lookaside buffer, such that they can be output directly when an address match is determined by the table lookaside buffer. In an alternative embodiment, additional circuitry may be provided to generate the predetermined attributes from the attributes generated by a standard table lookaside buffer of an MMU.
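The precoding of attributes into the lookaside buffer may be illustrated by the following sketch, in which each entry holds the attributes to be output on an address match. The 4 KB page size, the class name `LookasideBuffer`, the field names and the omission of any miss handling are assumptions of the sketch.

```python
PAGE_SHIFT = 12  # assumed 4 KB pages


class LookasideBuffer:
    """Toy lookaside buffer whose entries carry precoded attributes."""

    def __init__(self):
        self.entries = {}  # virtual page number -> stored attributes

    def load(self, virtual_page, physical_page, cacheable, bufferable, use_tcm):
        self.entries[virtual_page] = {
            "physical_page": physical_page,
            "cacheable": cacheable,
            "bufferable": bufferable,
            "use_tcm": use_tcm,  # precoded attribute selecting the memory unit
        }

    def lookup(self, virtual_address):
        entry = self.entries.get(virtual_address >> PAGE_SHIFT)
        if entry is None:
            return None  # no match; miss handling is not modelled here
        offset = virtual_address & ((1 << PAGE_SHIFT) - 1)
        result = dict(entry)
        result["physical_address"] = (entry["physical_page"] << PAGE_SHIFT) | offset
        return result


tlb = LookasideBuffer()
tlb.load(virtual_page=0x10, physical_page=0x80,
         cacheable=True, bufferable=True, use_tcm=False)
hit = tlb.lookup(0x10234)
miss = tlb.lookup(0x20000)
```

On a match, all of the attributes, including the one identifying the memory unit, are available directly from the stored entry without further computation.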
It will be appreciated that there are a number of different ways in which the clock generation logic can be arranged to selectively provide clock signals to the various memory units dependent on the predicted predetermined attributes and/or the actual predetermined attributes from the attribute generation logic. However, in preferred embodiments, the checking logic is arranged to generate a mispredict signal if the predetermined attributes do not agree with the predicted predetermined attributes, and the clock generation logic comprises clock signal gating circuitry for each memory unit, each clock signal gating circuitry receiving a system clock signal and outputting that system clock signal to the associated memory unit if either the predicted predetermined attributes indicate that the associated memory unit should be used for the access, or the mispredict signal is generated and the actual predetermined attributes generated by the attribute generation logic indicate that the associated memory unit should be used for the access.
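The gating condition described above reduces to a small Boolean expression per memory unit. The sketch below models it in software; the function name `gated_clock` and its parameter names are hypothetical, and the clock is represented simply as a Boolean level.

```python
def gated_clock(system_clock, predicted_select, mispredict, actual_select):
    """Pass the system clock to a memory unit only when that unit is selected.

    The unit is clocked if the prediction selects it, or if a mispredict has
    been signalled and the determined attributes select it.
    """
    enable = predicted_select or (mispredict and actual_select)
    return bool(system_clock and enable)


# Initial access: the cache is predicted, so only the cache is clocked.
cache_clk = gated_clock(True, predicted_select=True, mispredict=False,
                        actual_select=False)
tcm_clk = gated_clock(True, predicted_select=False, mispredict=False,
                      actual_select=True)

# Reinitiated access: the mispredict signal lets the TCM be clocked.
tcm_clk_retry = gated_clock(True, predicted_select=False, mispredict=True,
                            actual_select=True)
```

Until the mispredict signal is generated, the determined attributes have no effect on the gating, so the unit that was not predicted receives no clock.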
It will be appreciated that the prediction logic can take a variety of forms, dependent on the prediction scheme used. Further, it will be appreciated that there are many different known prediction schemes, and any suitable prediction scheme can be used to predict the predetermined attributes. However, in preferred embodiments, the prediction logic bases the predicted predetermined attributes for a current access request on the actual predetermined attributes generated by the attribute generation logic for a preceding access request. It has been found that this provides reliable prediction in preferred embodiments of the present invention, since the processor core often issues a series of access requests relating to data values stored in the same memory unit.
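A prediction scheme of this kind, in which the prediction for the current access request is simply the outcome determined for the preceding one, may be sketched as follows. The class `LastValuePredictor`, the helper `count_mispredicts` and the example sequence of accesses are hypothetical and serve only to illustrate the behaviour.

```python
class LastValuePredictor:
    """Predicts that the next access uses the same unit as the previous one."""

    def __init__(self, initial):
        self.last = initial

    def predict(self):
        return self.last

    def update(self, actual):
        # Record the attributes actually determined for this access.
        self.last = actual


def count_mispredicts(actual_units, initial):
    """Count how many accesses in a sequence would be mispredicted."""
    predictor = LastValuePredictor(initial)
    mispredicts = 0
    for actual in actual_units:
        if predictor.predict() != actual:
            mispredicts += 1
        predictor.update(actual)
    return mispredicts


# Hypothetical sequence: runs of accesses directed at the same memory unit.
trace = ["cache"] * 4 + ["tcm"] * 3 + ["cache"] * 2
mispredicts = count_mispredicts(trace, initial="cache")
```

For such a sequence a misprediction arises only where the run changes from one memory unit to the other, which is why the scheme suits access patterns that dwell on one memory unit at a time.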
Viewed from a second aspect, the present invention provides a method of accessing memory units in a data processing apparatus, the data processing apparatus comprising a plurality of memory units for storing data values, a processor core for issuing an access request specifying an access to be made to the memory units in relation to a data value, and a memory controller for performing the access specified by the access request, the method comprising the steps of: a) determining from the access request one or more predetermined attributes identifying which of the memory units should be used when performing the access; b) prior to completion of said step (a), performing the steps of: (i) predicting the one or more predetermined attributes; (ii) responsive to the predicted predetermined attributes generated at said step (b)(i), selecting which one of the memory units is to be clocked during performance of the access; (iii) issuing a clock signal to the memory unit selected at said step (b)(ii); and (iv) causing the memory controller to perform the access; c) once the determination at said step (a) is completed, determining whether the predetermined attributes generated at said step (a) agree with the predicted predetermined attributes generated at said step (b)(i), and if not, reinitiating the access, in which event one of the memory units is selected using the predetermined attributes determined at said step (a), a clock signal is issued to that memory unit, and the memory controller then reperforms the access.