It is known to provide address translation circuitry configured to receive a first address in a first addressing system and to translate it into a second address in a second addressing system. For example, this may involve receiving a virtual address from a master device and translating the virtual address into a physical address used in a memory system to which the master device requires access. Such address translation circuitry may for example be provided in a memory management unit (MMU) comprising an MMU translation lookaside buffer (TLB) and potentially a number of satellite μTLBs. Contemporary high-end processor cores, such as graphics processing units (GPUs) and application central processing units (CPUs), are often implemented with a distributed MMU, comprising a central μTLB and several satellite μTLBs. Indeed, in a multi-core device, even more μTLBs are often provided.
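The virtual-to-physical translation performed by such a TLB can be illustrated with a minimal software sketch. This is not the circuitry described herein but an assumed model: the class name `TinyTLB`, the dictionary-backed page table, and 4 KiB pages are all illustrative choices, not details from the source.

```python
PAGE_SHIFT = 12  # assume 4 KiB pages for illustration

class TinyTLB:
    """Illustrative model of a TLB: caches virtual-page-to-physical-page mappings
    so repeated translations avoid a full page-table walk."""

    def __init__(self, page_table):
        self.page_table = page_table  # hypothetical backing store: VPN -> PPN
        self.entries = {}             # cached translations: VPN -> PPN

    def translate(self, virtual_address):
        vpn = virtual_address >> PAGE_SHIFT            # virtual page number
        offset = virtual_address & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.entries:                    # TLB miss: walk the table
            self.entries[vpn] = self.page_table[vpn]
        ppn = self.entries[vpn]                        # TLB hit thereafter
        return (ppn << PAGE_SHIFT) | offset
```

For example, with a page table mapping virtual page 0x1 to physical page 0x80, translating virtual address 0x1234 yields physical address 0x80234 (same page offset, remapped page number).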
Even though a distributed MMU will generally not consume a significant amount of power relative to the whole chip on which it is found, the absolute power dissipated in performing address translations is non-trivial. Furthermore, such address translation circuitry may often be provided in a pipelined format, which may bring efficiencies when the pipeline is relatively full, but which nevertheless has an inherent latency, due to the number of pipelined stages, which can be a disadvantage.
It is known to implement “memoization” techniques in which a memo table is provided in which previously seen address translations are stored, to enable a shortcut to be taken in providing those address translations. A similar technique involves the provision of a line buffer performing a similar function. If such a memo table or line buffer is provided to act as a variety of “single-line L0” for a cache, the whole cache can be bypassed on a hit, but this is only possible when some sort of arbitration is provided to determine where the response has come from (line buffer or cache pipeline), which may require the cache pipeline to be stallable, which is undesirable for performance reasons. Also, such techniques do not naturally preserve the ordering of operations, which can be a problem in some applications and would require additional logic and buffering to work. Furthermore, in the highly space-constrained environment of a contemporary on-chip device, the provision of an additional line buffer or memo table is disadvantageous because of the on-chip area which it consumes.
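The memoization technique described above can be sketched as follows. This is a hedged illustration, not the prior-art hardware itself: the single-entry memo table, the `pipeline_translate` callback standing in for the slow pipelined path, and all names are assumptions for the purpose of the example.

```python
class MemoizedTranslator:
    """Sketch of a single-entry memo table (a 'single-line L0') placed in front
    of a slower translation pipeline: on a memo hit the pipeline is bypassed."""

    def __init__(self, pipeline_translate, page_shift=12):
        self.pipeline_translate = pipeline_translate  # slow pipelined path (VPN -> PPN)
        self.page_shift = page_shift
        self.last_vpn = None   # the single memoized entry
        self.last_ppn = None

    def translate(self, virtual_address):
        vpn = virtual_address >> self.page_shift
        offset = virtual_address & ((1 << self.page_shift) - 1)
        if vpn == self.last_vpn:                       # memo hit: bypass pipeline
            return (self.last_ppn << self.page_shift) | offset
        self.last_vpn = vpn                            # memo miss: use pipeline,
        self.last_ppn = self.pipeline_translate(vpn)   # then record the result
        return (self.last_ppn << self.page_shift) | offset
```

Note that even this simple sketch hints at the arbitration issue raised above: a real implementation must decide, per request, whether the response came from the memo entry or from the pipeline, and must keep responses correctly ordered between the two paths.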
Some previous works dealing with “memoization” and line buffer techniques are: “Hardware Memoization of Mathematical and Trigonometric Functions”, D. Citron and D. Feitelson, School of Computer Science and Engineering, The Hebrew University of Jerusalem, 26 Mar. 2000; “Way Memoization to Reduce Fetch Energy in Instruction Caches”, A. Ma, M. Zhang and K. Asanovic, ISCA Workshop on Complexity Effective Design, 2001, MIT Laboratory for Computer Science; and “Reducing Power in Superscalar Processor Caches using Subbanking, Multiple Line Buffers and Bit-Line Segmentation”, K. Ghose and M. Kamble, International Symposium on Low Power Electronics and Design, 1999, pp. 70-75.
It would be desirable to provide address translation circuitry which improves on the prior art.