FIG. 1 shows an embodiment of a processor 100. The processor 100 may be any one of a variety of processors, such as a central processing unit (CPU) or a graphics processing unit (GPU). For instance, the processor 100 may be an x86 microprocessor that implements the x86 64-bit instruction set architecture and is used in desktops, laptops, servers, and superscalar computers, or it may be an Advanced RISC (Reduced Instruction Set Computer) Machines (ARM) processor that is used in mobile phones or digital media players. Other embodiments of the processor are contemplated, such as digital signal processors (DSPs), which are particularly useful in the processing and implementation of algorithms related to digital signals, such as voice data and communication signals, and microcontrollers, which are useful in consumer applications, such as printers and copy machines.
The processor 100 operates by executing instructions on data values stored in memory. Examples of instructions that operate on data values are additions, subtractions, logical conjunctions (ANDs), logical disjunctions (ORs), and shifts and rotations of binary numbers. Processor 100 may also be capable of performing other instructions, such as moving and copying data values from one memory location to another. Modern processors are capable of performing many millions of these instructions per second. Collectively, these instructions may, for instance, cause a GPU to produce images for display on a computer screen or enable the use of a word processing program on a desktop computer.
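By way of illustration only, the instructions named above can be sketched in software. The following is a minimal sketch, assuming an 8-bit data width; the width and the names WIDTH, MASK, add, sub, logical_and, logical_or, shift_left, and rotate_left are hypothetical and do not describe the execution units of any particular processor.

```python
# Illustrative only: an assumed 8-bit data width, not taken from the description.
WIDTH = 8
MASK = (1 << WIDTH) - 1  # 0xFF, keeps results within 8 bits


def add(a: int, b: int) -> int:
    """Addition that wraps around at the assumed data width."""
    return (a + b) & MASK


def sub(a: int, b: int) -> int:
    """Subtraction that wraps around at the assumed data width."""
    return (a - b) & MASK


def logical_and(a: int, b: int) -> int:
    """Bitwise logical conjunction (AND)."""
    return a & b


def logical_or(a: int, b: int) -> int:
    """Bitwise logical disjunction (OR)."""
    return a | b


def shift_left(a: int, n: int) -> int:
    """Shift left by n bits; bits shifted past the top are discarded."""
    return (a << n) & MASK


def rotate_left(a: int, n: int) -> int:
    """Rotate left by n bits; bits shifted past the top re-enter at the bottom."""
    n %= WIDTH
    return ((a << n) | (a >> (WIDTH - n))) & MASK


if __name__ == "__main__":
    value = 0b10010110
    print(format(shift_left(value, 1), "08b"))   # top bit is lost
    print(format(rotate_left(value, 1), "08b"))  # top bit wraps to bit 0
```

The difference between the last two operations is that a shift discards the bits that leave the word, whereas a rotation reinserts them at the opposite end.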
The processor 100 includes execution units 110, which are the computational cores of the processor 100 and are responsible for executing the instructions or commands issued to the processor 100. Execution units 110 operate on data values stored in a system memory and produce results that may thereafter be written back to memory.
Processor 100 is equipped with a load and store unit 120 that is coupled to the execution units 110 and is responsible for managing the loading and storing of data operated on by the execution units 110. The load and store unit 120 brings data from memory to the execution units 110 for processing and later stores the results of these operations in memory. Processor 100 is also equipped with a Level 1 (L1) data cache 130, which stores data for access by the processor 100. L1 data cache 130 is advantageous because the load and store unit 120 experiences only a small delay in accessing its data.
In most processors it is costly (in terms of silicon design) to store all the data the processor operates on in easily accessible L1 caches. Processors therefore usually have a hierarchy of memory storage locations. Small storage locations offer fast memory access but are expensive to implement, while large storage locations are cheaper to implement but offer slower memory access. A processor has to wait to obtain data from the larger storage locations, which slows its performance.
FIG. 2 shows a memory hierarchy of a processor, such as processor 100. Registers represent the fastest memory to access; however, in some instances they may only provide 100 Bytes of register space. Hard drives are the slowest in terms of memory access speed, but are both cheap to implement and offer very large storage space, e.g., 1 TeraByte (TB) or more. Level 1 (L1) through Level 3 (L3) caches range from several kilobytes (kB) in size to 16 megabytes (MB) or more, depending on the computer system.
Data stored in memory is organized and indexed by memory addresses. For instance, addressing 4 kB of data requires 4*1024=4096 distinct memory addresses, where each memory address holds a Byte (eight bits or an octet) of data. Therefore, to completely reference the memory addresses of a 4 kB memory, a minimum of 12 bits is required, since 2^12=4096. Processors also use a system of paging in addressing memory locations, where memory is sectioned into pages of memory addresses. For instance, a processor may use a 4 kB page system in sectioning memory and therefore may be able to point to a memory location within a page using 12 bits. On the other hand, a page may comprise 1 MegaByte (MB) of data, in which case 20 bits are required to point to each of the 1048576 (1024*1024) distinct addresses within the page.
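The bit counts given above for 4 kB and 1 MB pages can be checked with a short calculation. The following is a minimal sketch, assuming Byte-addressable memory as described; the function name address_bits and the constants KB and MB are hypothetical and are introduced only for illustration.

```python
KB = 1024        # Bytes in one kB, as used in the description
MB = 1024 * KB   # Bytes in one MB


def address_bits(num_bytes: int) -> int:
    """Minimum number of bits needed to give each Byte a distinct address.

    The highest address is num_bytes - 1, so the answer is the number of
    bits needed to represent that value.
    """
    if num_bytes < 1:
        raise ValueError("num_bytes must be at least 1")
    return (num_bytes - 1).bit_length()


if __name__ == "__main__":
    print(address_bits(4 * KB))  # 4096 addresses, 12 bits
    print(address_bits(1 * MB))  # 1048576 addresses, 20 bits
```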
Further, many pages may be indexed in order to completely cover the memory locations that are accessible to the processor. For instance, if the processor memory hierarchy includes 256 GigaBytes (GB) of data and a 4 kB paging system is used, then the memory system comprises 256*1024*256=67108864 pages (256 GB divided by 4 kB per page). Therefore, a further 8+10+8=26 bits are required to identify each of the 67108864 pages in the memory system. FIG. 3 graphically illustrates this example, where a 38-bit memory address comprises a 26-bit page address and a 12-bit Byte index within the page. This memory address of FIG. 3 is hereinafter referred to as a physical address (PA), to be distinguished from a linear address (LA) or a virtual address (VA). As will be described herein, a PA format is an external format, whereas an LA format is an internal processor address format.
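The 38-bit address layout of this example can be illustrated by separating an address into its two fields. The following is a minimal sketch, assuming the 26-bit page address occupies the upper bits and the 12-bit Byte index occupies the lower bits; the names PAGE_OFFSET_BITS, PAGE_ADDRESS_BITS, split_physical_address, and join_physical_address are hypothetical, and the sample address is arbitrary.

```python
PAGE_OFFSET_BITS = 12    # Byte index within a 4 kB page
PAGE_ADDRESS_BITS = 26   # identifies one of 67108864 pages
PA_BITS = PAGE_ADDRESS_BITS + PAGE_OFFSET_BITS  # 38 bits in total

OFFSET_MASK = (1 << PAGE_OFFSET_BITS) - 1  # low 12 bits set


def split_physical_address(pa: int) -> tuple[int, int]:
    """Return (page address, Byte index) for a 38-bit physical address."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("address does not fit in 38 bits")
    return pa >> PAGE_OFFSET_BITS, pa & OFFSET_MASK


def join_physical_address(page: int, offset: int) -> int:
    """Rebuild the physical address from its page address and Byte index."""
    return (page << PAGE_OFFSET_BITS) | offset


if __name__ == "__main__":
    # Page count for 256 GB of memory divided into 4 kB pages.
    num_pages = (256 * 1024**3) // (4 * 1024)
    print(num_pages)  # 67108864, i.e. 2**26

    page, offset = split_physical_address(0x12345678)
    print(hex(page), hex(offset))  # 0x12345 0x678
```

Because the page size is a power of two, the split needs only a shift and a mask, and the two fields can be recombined without loss.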
It is desirable to have a method and an apparatus that efficiently translate LAs to PAs. It is also desirable to have a memory address translation device, such as a Translation Look-aside Buffer (TLB), that translates LAs to PAs in a power-efficient way.