1. Field of the Invention
This invention generally relates to computer memory management and, more particularly, to a means for managing physical memory addresses in a system using multiple processors.
2. Description of the Related Art
Physical memory, as used herein, is the actual memory device(s) (DRAM, SRAM, FLASH, etc.) where data is stored. In processing, two types of information are generally stored in this memory—data and instructions. Data is the working set of constants and variables that software acts upon. Instructions are the list of commands or operations that are to be carried out on the data. Access to the memory is done through an address. Each location in memory has a unique address. The address of the actual physical devices is referred to as the physical address, or sometimes, real address.
In the early days of microprocessors, the address generated by software (SW) to access a memory location, was always a physical address. The working area that contained both the instructions and the data was called the working set. In the early days one, and only one, program would execute at a time on the computer, so operations were simple. Later, the notion of operating systems and applications was introduced. This meant that more than one SW program was resident on the computer and the processor could switch back and forth between these programs. Since multiple programs had access to all of physical memory it was possible for a bug or mistake in one program to corrupt the working set of another program. For example, if a first program made a mistake with an address pointer calculation, the wrong address might be written, perhaps overwriting the instructions for a second program. When the second program sequentially stepped through its instruction list and landed on a corrupted instruction, the computer would crash.
To get around this problem, the notion of virtual memory addressing was introduced. Each application is given a virtual address space to work within. The memory management unit and address translation mechanism permit the virtual address space to be translated to the actual physical memory where the storage of data and instructions actually exists. Alternately stated, software executes in what is called virtual address space. Each application, as well as the operating system (OS), “live” in their own virtual address map. However, the processor must ultimately use physical addresses in memory. So, an association has to be made between the virtual address space and the physical address space. The OS does this association and makes assignments to the individual applications using a memory allocation software routine.
When the system first boots, the OS builds an overall physical address map of the system. Memory is mapped in chunks called pages. A page table is built in memory by the OS with entries for each page called page table entries (PTE). Each page table entry includes the virtual page number, the associated physical page number, and any additional attribute bits related to the virtual or physical address. For example, each virtual address also includes the a Process Number associating a particular application with its physical address space.
A programmer writes their program with a specific address map in mind for the data structures to be accessed. Physical address cannot be used in the program because the programmer cannot know in advance if the addresses they might select are available or being used by another program. The memory management unit acts as a translation mechanism between the virtual address space where the program is executing and the actual physical address space where the instructions and data actually reside. As an example, when both application A and application B want to write address 0x0001 4000, a translation might be made such that the actual physical location for A is 0x00F0 1000 and for B is 0x000C 1000. This assignment of virtual to physical address translation is made by the Operating System in what is called a memory allocation routine or MALLOC.
But, if more than one operating system is being used in the system, it becomes possible for a first OS to assign the same physical address space to a first SW application, as a second OS might assign to a second application. In this circumstance, a Hypervisor and Virtualization become necessary. Now, a second level of address management software must run on the microprocessor, which assigns virtual address spaces and associated physical address translations to the individual OSs. The current art for cross-referencing virtual and physical addresses requires adding “extra” bits to the virtual side of the address, essentially expanding the virtual address. This expansion of the virtual address requires running some additional code (e.g., the Hypervisor). The advantage of this approach is that multiple OSs can then coexist on the same processor core. However, this approach does require an additional software layer (Hypervisor) to be active to manage that assignment on the virtual side.
It is not possible to use a Hypervisor if the system is using multiple heterogeneous asymmetric processors. Symmetric multiprocessing (SMP) is a system of computer architecture where two or more identical processors are connected to a single shared main (physical) memory. Further, each processor participating in the SMP system must coordinate together to manage memory. SMP systems permit any processor to work on any task no matter where the data for that task is located in memory. SMP systems can move tasks between processors to balance the workload efficiently. Asymmetric multiprocessing (AMP) refers to a system whereby multiple processors independently run operating systems with no awareness of each other. In this case there is no memory management coordination between the operating systems. Heterogeneous processors in this context are processors that have different programming models especially where memory management is concerned. Given the incompatibilities in memory management mechanisms between processors in a heterogeneous asymmetric multiprocessor, it is generally not pragmatic to use a Hypervisor.
Modern general purpose Harvard architecture processors typically include a multi-level cache hierarchy. The cache memory subsystem aids in delivering of commonly used instructions or data to the execution unit with the lowest latency possible. The average access latency is a key component to the execution performance of a software application.
The access time of a cache is based on the physical constraints of the access time of the SRAM arrays and logic associated with the cache controller. A larger cache has a physically larger array and, thus, the access latency due to lookup overhead and wire delays increases. Therefore, a processor typically has a moderately small first level cache (L1) in order to provide the best trade off in access latency vs. cache hit ratios. Subsequently, a second level cache (L2) is responsible for reducing cache miss penalty by caching a larger portion of the working set. This is done by providing a much larger cache array size, and comes with a penalty of longer access latency.
FIG. 1 is a schematic diagram depicting a plurality of processors sharing an L2 cache and main memory (prior art). It is common for systems to have software partitioned into several processes or threads. Further, it is becoming more common to break a workload or set of processes or threads across multiple processors such as in a multicore processor. In such systems, the cache hierarchy is typically shared amongst threads running on a single processor core. Further, it is often common in multicore processors to share a common L2 cache. A shared cache provides two benefits—first, data structures are shared between processors residing in a common location, thus, reducing transfer overhead from one cache to another. Secondly, not all software threads can leverage a cache equally. Some threads benefit more from a larger cache because they have a larger working set than other threads. Given that the exact workload that a processor will run in the future is not known when a processor is designed, it is usual practice to provide as large a cache as economically and physically practical. For a multicore device, an independent cache hierarchy can be provided for each processor. This cache hierarchy comes at the cost of potentially great inefficiency with respect to the resulting size, power, and cost. Instead, a shared cache (e.g., L2) is used when practical.
Certain applications require deterministic behavior as part of their operating characteristics. For example, real-time or deadline based computing often found in embedded applications requires a certain amount of computation be completed within a predetermined time period. Given a cache shared by multiple concurrent software processes, and further by multiple processors, the access latency for a thread is not guaranteed to be consistent due to the varied interactions of the other threads.
One solution has been to allow software configurable partitioning of the shared cache based on each physical processor that is sharing the cache. Such partitioning is implemented as part of the cache allocation scheme of the cache controller. For a two-CPU system, software running on CPU A is allocated use of space A in the cache, while CPU B is allocated space B. Such partitioning is very coarse and does not allow for inter-processor behaviors, especially where larger numbers of cores exist. Further, it does not address the specific behaviors and needs of different software operating on the same processor core.
The reduction in performance and access determinism is primarily due to two factors—the first is cache line replacement. This is the case where two or more threads are concurrently sharing a common cache. As these threads interact with the cache they compete for the limited resource, thus, randomly replacing cache elements that the other is potentially using, now or in the near future. In this circumstance, a change of code in one thread may adversely impact the performance of another thread.
The second item that impacts cache access latency is blocking. Blocking is the condition whereby two processors are accessing a common cache tag in order to examine if the desired cache element is currently resident in the cache. Since coherency must be maintained, one and only one access to a particular cache address can occur at a time.
FIG. 2 is a schematic diagram of a multi-processor system using an L2 cache bank (prior art). Larger shared caches have deployed the notion of cache banks. A cache of dimension X can be partitioned into N banks each of dimension Y. The banks each cache a smaller portion of the overall address space. Partitioning the cache into banks enables concurrent access. Such partitioning can be done using a low-level address-interleave. Conceptually, software randomly accesses memory locations located across the banks, thus enabling more concurrent accesses and a net reduction in average access latency.
It would be advantageous if a mechanism existed that permitted a physical memory to be efficiently shared between processors.
It would be advantageous if the above-referenced mechanism could be enabled in hardware, without implementing an additional software layer such as a Hypervisor.