Conventionally, a processor employing virtual storage includes a TLB (translation lookaside buffer), which is a cache memory dedicated to holding a copy of a page table managed in an operating system (hereinafter referred to as “OS”) in order to perform high-speed address translation from a virtual address space, which is an address space unique to a process, to a real address space, which is an address space of the entire computer system including the processor.
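The role of a TLB as a cache of page-table entries can be illustrated with a minimal sketch. The 4 KB page size, the four-entry capacity, the FIFO replacement policy, and the names `TLB`, `translate` and `page_table` are assumptions made for illustration only; none of them is specified in the description above.

```python
PAGE_SIZE = 4096  # assumed page size (4 KB)


class TLB:
    """Toy TLB: caches virtual-page -> physical-page mappings of a page table."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # hypothetical OS page table: vpn -> ppn
        self.capacity = capacity      # assumed number of TLB entries
        self.entries = {}             # cached copy: vpn -> ppn (insertion ordered)
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Translate a virtual address to a real (physical) address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
        else:
            # TLB miss: copy the mapping from the page table into the TLB.
            self.misses += 1
            if len(self.entries) >= self.capacity:
                # Evict the oldest entry (FIFO, an assumed policy).
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


page_table = {0: 7, 1: 3, 2: 9}
tlb = TLB(page_table)
paddr = tlb.translate(0x1234)  # virtual page 1, offset 0x234 -> 0x3234
```

A second access to the same page is served from the cached entry without consulting the page table, which is the source of the speed-up.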
Meanwhile, in order to hide memory access latency, a processor copies data in a memory to a cache memory (hereinafter also referred to as “cache”) and uses the copy. In order to identify the memory address of data whose copy is held in the cache, the processor includes, in addition to a data memory configured to hold the data, a tag memory configured to store the addresses of the data and the states of the data (e.g., whether or not the data is valid, and whether or not the memory content has been updated). In general, a tag memory is configured to use low-order bits of a memory address as an index for a cache, and hold high-order bits (tag) of the memory address and the state of data as data.
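The division of a memory address into an index (low-order bits) and a tag (high-order bits) can be sketched as follows. The 64-byte line size, the 256 sets, the direct-mapped organization and the names `split_address` and `TagMemory` are assumptions chosen for illustration.

```python
LINE_SIZE = 64   # assumed bytes per cache line
NUM_SETS = 256   # assumed number of index entries (direct-mapped)

OFFSET_BITS = LINE_SIZE.bit_length() - 1  # 6 bits select a byte within a line
INDEX_BITS = NUM_SETS.bit_length() - 1    # 8 bits select a tag memory entry


def split_address(addr):
    """Split a memory address into (tag, index, offset)."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


class TagMemory:
    """Each entry holds the tag and the state (valid, dirty) of one line."""

    def __init__(self):
        self.entries = [
            {"tag": 0, "valid": False, "dirty": False} for _ in range(NUM_SETS)
        ]

    def fill(self, addr):
        tag, index, _ = split_address(addr)
        self.entries[index] = {"tag": tag, "valid": True, "dirty": False}

    def is_hit(self, addr):
        tag, index, _ = split_address(addr)
        entry = self.entries[index]
        return entry["valid"] and entry["tag"] == tag


tags = TagMemory()
tags.fill(0x12345678)
parts = split_address(0x12345678)  # (0x48D1, 0x59, 0x38)
```

Two addresses that share an index but differ in the tag map to the same entry, so only one of them can be held at a time in this direct-mapped sketch.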
The aforementioned address translation is often on the critical path for timing in processor design. Where a processor employs hierarchical memories, a configuration is often employed in which a level 1 cache positioned close to the processor (hereinafter referred to as “L1 cache”) is accessed using a virtual address, and caches of level 2 (“L2 cache”) onward are accessed using a physical address, e.g., as a countermeasure against aliases, which will be described later.
Since address translation is performed in all memory accesses for instruction fetches, load instructions and store instructions, the effect of TLB misses on performance is larger than that of ordinary cache misses. Accordingly, a TLB is provided as a dedicated memory separately from a cache.
However, the configurations of the aforementioned conventional TLB and cache memory have the following problem.
The problem relates to the capacities of the tag memories in a TLB and a cache.
A TLB holds virtual page numbers, physical page numbers, page attributes and page states as its data. In a processor having a physical address size of 32 bits or more, the virtual page numbers and physical page numbers account for a large percentage of the data held by the TLB. The size of a TLB is determined mainly by the size of the physical address space, the minimum page size, and the number of entries in the TLB.
A tag memory of a cache holds tags and cache states as its data. In a processor having a physical address size of 32 bits or more, the tags account for a large percentage of the data held by the tag memory of the cache. The size of a tag memory of a cache is determined mainly by the size of the physical address space, the cache line size, and the cache capacity.
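How these parameters determine the two sizes can be estimated with a short calculation. The following is a rough sketch, not a figure taken from the description: the attribute and state bit counts, the virtual address width, and the function names are assumptions.

```python
def log2_exact(n):
    """Return log2(n) for a power of two, rejecting other values."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def tlb_bits(va_bits, pa_bits, page_size, entries, attr_bits=8):
    """Estimated storage of a fully associative TLB, in bits.

    attr_bits (page attributes and states) is an assumed value.
    """
    page_offset_bits = log2_exact(page_size)
    vpn_bits = va_bits - page_offset_bits  # virtual page number
    ppn_bits = pa_bits - page_offset_bits  # physical page number
    return entries * (vpn_bits + ppn_bits + attr_bits)


def cache_tag_bits(pa_bits, line_size, cache_size, ways, state_bits=2):
    """Estimated storage of a cache tag memory, in bits.

    state_bits (e.g. valid and dirty) is an assumed value.
    """
    lines = cache_size // line_size
    sets = lines // ways
    tag_bits = pa_bits - log2_exact(sets) - log2_exact(line_size)
    return lines * (tag_bits + state_bits)


# Assumed example: 32-bit addresses, 4 KB pages, 64 TLB entries,
# and a 32 KB 2-way set associative cache with 64-byte lines.
tlb_total = tlb_bits(32, 32, 4096, 64)             # 64 * (20 + 20 + 8) = 3072
tag_total = cache_tag_bits(32, 64, 32 * 1024, 2)   # 512 * (18 + 2) = 10240
```

In both results the page numbers or tags, rather than the attribute or state bits, dominate the total, which matches the observation above.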
According to FIG. 5.28 (p. 341) in “Computer Architecture: A Quantitative Approach, Fourth Edition”, in a recent processor:
        an L1 cache is a 2-way set associative cache with a size of 8 to 64 KB;
        a TLB is a fully associative cache with an entry count of 40 to 1024; and
        the minimum page size is 4 to 64 KB.
See also “Integrating Virtual Memory, TLBs, and Caches” (pp. 524-527), FIG. 7.24 (p. 525), and FIG. 7.25 (p. 526) by David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Third Edition, Morgan Kaufmann Publishers, 2007, and “Avoiding address translation during indexing of the cache to reduce hit time” (p. 291) and FIG. 5.3 (p. 292) by John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Morgan Kaufmann Publishers, 2007.
(First Problem)
Conventionally, when a task switch occurs, in which the task being executed is switched to another task, the content of the TLB is rewritten, and processing for invalidating the cache memory is performed. Here, when data in the cache memory has been updated and a dirty bit is set, a write-back of the cache data to the main memory is performed.
However, the time required for a write-back of data to the main memory is extremely long compared to the time required for other processing in task switching, causing a problem in that the responsiveness of task switching in the processor deteriorates.
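The conventional invalidation at a task switch can be sketched as a loop over the cache lines in which every dirty line triggers a write-back. The `CacheLine` structure, the function name `flush_on_task_switch` and the dictionary standing in for the main memory are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    addr: int = 0
    data: int = 0
    valid: bool = False
    dirty: bool = False


def flush_on_task_switch(cache, main_memory):
    """Invalidate every line; write dirty lines back first.

    Returns the number of write-backs, each of which is a slow
    main-memory access in a real system.
    """
    write_backs = 0
    for line in cache:
        if line.valid and line.dirty:
            main_memory[line.addr] = line.data
            write_backs += 1
        line.valid = False
        line.dirty = False
    return write_backs


main_memory = {0x100: 1, 0x140: 2, 0x180: 3}
cache = [
    CacheLine(addr=0x100, data=10, valid=True, dirty=True),
    CacheLine(addr=0x140, data=2, valid=True, dirty=False),
    CacheLine(addr=0x180, data=30, valid=True, dirty=True),
    CacheLine(),
]
write_backs = flush_on_task_switch(cache, main_memory)  # 2 dirty lines
```

The number of write-backs grows with the number of dirty lines, so the task switch latency depends on how much data the outgoing task has modified.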
(Second Problem)
Furthermore, conventionally, in some multiprocessor systems, data reads and writes are performed between the main memory and a cache memory in each processor, and between the respective cache memories.
For example, in a multiprocessor system, when a processor writes an operation result to its own cache memory, the value of the data in the main memory corresponding to the operation result data differs from the data in the cache memory. Accordingly, when another processor refers to the operation result data, a write-back, that is, a castout, of the value written to the cache memory by the processor is performed from the cache memory to the main memory. As a result of the write-back being performed, correct data is stored in the main memory, enabling the operation result data to be used by another processor as well. In other words, in a multiprocessor system, in order to make data rewritten by a processor (CPU1) available to another processor (CPU2), the processor that has rewritten the data (CPU1) needs to write the data back to the main memory.
The aforementioned write-back involves access to the main memory, which decreases the performance of the multiprocessor system due to large latency and, furthermore, increases power consumption due to operation of the input/output circuit.
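The CPU1/CPU2 exchange described above can be sketched with two private write-back caches and a main memory that counts its accesses. The classes `MainMemory` and `Cpu` and the method `read_shared` are hypothetical names, and the sketch assumes the reader already knows which processor holds the modified data.

```python
class MainMemory:
    def __init__(self):
        self.cells = {}
        self.accesses = 0  # counts slow main-memory accesses

    def load(self, addr):
        self.accesses += 1
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        self.accesses += 1
        self.cells[addr] = value


class Cpu:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}  # addr -> [value, dirty]

    def write(self, addr, value):
        # The result stays in the private cache; main memory becomes stale.
        self.cache[addr] = [value, True]

    def write_back(self, addr):
        entry = self.cache.get(addr)
        if entry and entry[1]:
            self.memory.store(addr, entry[0])
            entry[1] = False

    def read_shared(self, addr, owner):
        # The owner must cast out its data before this CPU can read it.
        owner.write_back(addr)
        value = self.memory.load(addr)
        self.cache[addr] = [value, False]
        return value


memory = MainMemory()
cpu1, cpu2 = Cpu(memory), Cpu(memory)
cpu1.write(0x40, 42)
stale = memory.cells.get(0x40, 0)           # main memory still holds 0
value = cpu2.read_shared(0x40, owner=cpu1)  # forces a write-back, then a load
```

Sharing a single value costs two main-memory accesses in this sketch (one store by CPU1 and one load by CPU2), which is the overhead the second problem refers to.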
(Third Problem)
Furthermore, conventionally, a DMA (direct memory access) technique is used for data transfer between different address areas of a main memory or between the main memory and an input/output device without increasing the load on the CPU. For multiprocessor systems, a technique in which each processor performs data transfer between a main memory and its own local memory using DMA has been put into practical use.
For example, the CPU of each processor, that is, each CPU core in a multiprocessor system, accesses its local memory according to load instructions and store instructions. According to a load instruction, the CPU reads data from the local memory and writes the data to a register file in the CPU; according to a store instruction, the CPU retrieves data from the register file and writes the data to the local memory. Each CPU performs a read (GET) of data from the main memory to the local memory and a write (PUT) of data from the local memory to the main memory using DMA.
DMA transfer is controlled by a DMA controller by designating a source address and a destination address, which are physical addresses. Furthermore, for the aforementioned multiprocessor system, DMA transfer of a cacheable area is not supported.
Accordingly, since a source address and a destination address are also designated using physical addresses in DMA, a programmer can write programs to be executed in the respective CPUs after estimating the data processing time and the data access time.
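The GET/PUT flow between a main memory and a local memory can be sketched as plain copies between physical address ranges. The memory sizes, the addresses and the function names `dma_get` and `dma_put` are assumptions made for illustration and do not model any particular DMA controller.

```python
main_memory = bytearray(1024)   # assumed main memory size
local_memory = bytearray(256)   # assumed local memory size


def _check_range(memory, addr, size):
    if addr < 0 or size < 0 or addr + size > len(memory):
        raise ValueError("transfer outside physical memory")


def dma_get(local, local_addr, main, main_addr, size):
    """GET: copy `size` bytes from main memory into local memory."""
    _check_range(main, main_addr, size)
    _check_range(local, local_addr, size)
    local[local_addr:local_addr + size] = main[main_addr:main_addr + size]


def dma_put(local, local_addr, main, main_addr, size):
    """PUT: copy `size` bytes from local memory into main memory."""
    _check_range(local, local_addr, size)
    _check_range(main, main_addr, size)
    main[main_addr:main_addr + size] = local[local_addr:local_addr + size]


# Input data placed at physical address 0x100 of the main memory.
main_memory[0x100:0x104] = bytes([1, 2, 3, 4])

dma_get(local_memory, 0, main_memory, 0x100, 4)

# The CPU works only on its local memory (load, compute, store).
for i in range(4):
    local_memory[i] = local_memory[i] + 1

dma_put(local_memory, 0, main_memory, 0x200, 4)
result = bytes(main_memory[0x200:0x204])  # b"\x02\x03\x04\x05"
```

Because every transfer names fixed physical ranges and has a known size, its duration does not depend on cache state, which is what makes the access time predictable for the programmer.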
In order to increase the processing power of a processor in each CPU, a cache memory can be provided in the processor; however, accessing a main memory via a cache memory causes problems in that data access time differs between the case of a cache hit and the case of a cache miss, and the time required for transfer of cache data from the main memory in the case of a cache miss cannot be estimated.
Furthermore, in order to enhance the flexibility of program development, programs executed in the aforementioned multiprocessor system can be made executable using virtual addresses; however, this causes a problem in that DMA cannot be used in a system that uses virtual addresses for a local memory.
(Fourth Problem)
In general, cache memories may be organized in multiple levels, and some processors include multilevel caches. In that case, the storage capacity of an L1 cache is smaller than that of an L2 cache; more generally, the storage capacity of a higher-order cache is smaller than that of a lower-order cache.
However, the hit rate of an L1 cache is generally high, and whenever an L1 cache is accessed, a TLB is always referred to for translation from a virtual address to a physical address. Accordingly, the TLB hardware consumes a large amount of power in the processor.
A present embodiment has been provided in view of the aforementioned first problem, and a first object of the present embodiment is to provide a cache memory and a processor, which provide a TLB function in the cache memory, enabling reduction of the circuit amount, and have enhanced task switching responsiveness.
Another present embodiment has been provided in view of the aforementioned second problem, and a second object of the present embodiment is to provide a multiprocessor system enabling reduction of the amount of access to a main memory based on data write-back processing performed by each processor.
Still another present embodiment has been provided in view of the aforementioned third problem, and a third object of the present embodiment is to provide a processor enabling DMA to be executed using a virtual address, enhancing the cache hit rate for DMA transfer, or enhancing the cache hit rate for the case where the relevant processor accesses the cache after DMA transfer.
A still further present embodiment has been provided in view of the aforementioned fourth problem, and a fourth object of the present embodiment is to provide a processor including multilevel cache memories, the processor enabling the reference frequency of a TLB to be reduced, thereby decreasing the power consumption of the processor.