Processors are now transitioning from multi-core (e.g., a few cores) to many-core (e.g., hundreds of cores). In particular, many-core processors are finding specialized applications where processing on large chunks of data can be carried out in a massively parallel configuration. Scaling a memory architecture to accommodate many-core processing systems can be challenging. Further, maintaining an appearance of a single unified memory architecture, when multiple memory subsystems are involved, may be expensive. One problem that contributes to the expense is that the power and performance overheads of automatic memory management in hardware, such as caches, are becoming prohibitive. Caches consume about half of the processor energy on a single-core processor and consume an even larger fraction in multi-core and many-core processors. Another problem that contributes to the expense is that cache coherency protocols do not scale well to hundreds and thousands of cores.
One conventional solution to improve memory systems in many-core processing systems is the use of limited local memory (LLM) architectures. FIG. 1 is a block diagram illustrating a conventional limited local memory (LLM) architecture. One example of an LLM architecture, such as shown in FIG. 1, is the IBM Cell Broadband Engine. An LLM architecture 100 may include a 9-core processor, with one main core 112 (the Power Processing Element or PPE) and eight execution cores 102 (the Synergistic Processing Elements or SPEs). The main core 112 in the Cell processor may be a two-way simultaneous multi-threaded core, and each of the execution cores 102 may work on only one thread at a time in a non-pre-emptive fashion. The main core 112 may execute the operating system, and the main core 112 may have direct access to the global memory through a coherent L2 cache, while each execution core 102 may have a local store memory 106 having, for example, 256 KB of available space. Data communications between the local memory on the execution core and the global memory may be explicitly managed in software through a direct memory access (DMA) engine 108. The DMA engine 108 may have access to an interconnect bus with a 128-byte width, up to or more than 300 GB/s capacity, and/or have a 100-deep request queue.
In an LLM architecture, each core of a many-core processor has a small local memory. Each core has direct access to only its small local memory, and transfers between local and global memory have to be explicitly specified in the application code. The explicit transfer requirement presents challenges to developing applications for many-core processors with LLM. One challenge is that applications must be rewritten in a parallelized fashion to operate on the many-core processor. A second challenge is to efficiently execute the application in a threaded manner with the limited memory. The application, and the data accessed by the application, are stored in and executed within the limited memory available. Heap data, in particular, is dynamic in nature and its size may not be known at compile time, increasing the difficulty of writing applications for the limited memory. Heap data may overwrite stack data inadvertently during execution and cause program failures, such as an application crash, entry into an infinite loop, or generation of an incorrect result.
Programming on an LLM architecture, such as shown in FIG. 1, may be based on a message passing interface (MPI)-style thread model. A main controller thread creates and distributes data and tasks, and may also collect results from the execution threads. The main thread runs on the main core, while the execution threads are scheduled on the execution cores. A very simple application in this multi-core programming paradigm is illustrated in FIG. 2. FIG. 2 is pseudocode for programming an LLM architecture, such as that of FIG. 1. In the pseudocode, the main thread, executing on the main core, initiates several execution threads on the execution cores. In the execution core thread, student data structures are initialized and operated on. The student data structure contains two fields, id (int) and score (float), for each student.
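By way of illustration only, the thread structure described above may be sketched in C as follows. The sketch is not the pseudocode of FIG. 2. The names `Student`, `execution_thread`, and `main_thread` are hypothetical, the placeholder score values are invented for the example, and the execution threads are run one after another on a host machine rather than being scheduled on execution cores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One record per student, with the two fields named in the text. */
typedef struct {
    int   id;
    float score;
} Student;

/* Hypothetical body of one execution thread: allocate n student records
 * on the heap, initialize them, and return the sum of their scores.
 * Returns -1.0f if the allocation fails. On a host, malloc reports
 * failure; a local store without protection offers no such guarantee. */
float execution_thread(size_t n)
{
    if (n == 0)
        return 0.0f;

    Student *students = malloc(n * sizeof *students);
    if (students == NULL)
        return -1.0f;

    for (size_t i = 0; i < n; i++) {
        students[i].id = (int)i;
        students[i].score = 1.0f;      /* placeholder score (assumption) */
    }

    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += students[i].score;

    free(students);
    return total;
}

/* Hypothetical main thread: hands the same task to each execution thread
 * and collects the results. The threads run sequentially here; on an LLM
 * architecture each would run on its own execution core. */
float main_thread(int num_threads, size_t n)
{
    assert(num_threads >= 0);

    float grand_total = 0.0f;
    for (int t = 0; t < num_threads; t++)
        grand_total += execution_thread(n);
    return grand_total;
}
```

In this sketch the value of n is known only at run time, which is the property that makes the heap footprint of each execution thread hard to bound at compile time.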
Normally, the local memory on the execution core is divided into three segments by the software: the text region (program code and data), heap variable region, and stack variable region. The text region is where the compiled code of the program itself resides. The function frames reside in the stack space, which starts from the top of the memory, growing downwards, while the heap variables (defined through a malloc command) are allocated in the heap region starting from the top of the code region and growing upwards. The three segments share the local store, and because the local store is a constrained resource and lacks any hardware protection, heap data can easily overflow into the stack region and corrupt the program state.
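The layout described above may be modeled, as a minimal sketch only, by two offsets into a 256 KB region: a heap top that grows upward from the end of the text region, and a stack pointer that grows downward from the top of memory. The names `LocalStore`, `ls_init`, `ls_malloc`, and `ls_push_frame` are hypothetical. The bounds check in the sketch is added only to make the collision visible; the local store described in the text has no such protection.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_STORE_SIZE ((size_t)256 * 1024)   /* 256 KB, as in the text */
#define LS_FAIL          ((size_t)-1)           /* hypothetical error value */

typedef struct {
    size_t heap_top;    /* first free byte above the heap; grows upward   */
    size_t stack_ptr;   /* lowest byte used by the stack; grows downward  */
} LocalStore;

/* The heap starts at the top of the text region and the stack starts at
 * the top of the local store. */
void ls_init(LocalStore *ls, size_t text_size)
{
    assert(text_size <= LOCAL_STORE_SIZE);
    ls->heap_top  = text_size;
    ls->stack_ptr = LOCAL_STORE_SIZE;
}

/* Reserve bytes on the heap. Returns the offset of the block, or LS_FAIL
 * when the block would run into the stack region. */
size_t ls_malloc(LocalStore *ls, size_t bytes)
{
    if (bytes > ls->stack_ptr - ls->heap_top)
        return LS_FAIL;
    size_t offset = ls->heap_top;
    ls->heap_top += bytes;
    return offset;
}

/* Reserve bytes for a function frame. Returns 1 on success, or 0 when the
 * frame would run into the heap region. */
int ls_push_frame(LocalStore *ls, size_t bytes)
{
    if (bytes > ls->stack_ptr - ls->heap_top)
        return 0;
    ls->stack_ptr -= bytes;
    return 1;
}
```

For example, assuming a 64 KB text region, a 16 KB stack, and 8-byte student records, 1,000 records fit easily, whereas 30,000 records (240,000 bytes) exceed the remaining space and, without the added check, would overwrite the stack.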
In the pseudocode of FIG. 2, the program will execute correctly for small values of N, but large values of N may cause catastrophic failures. Even worse is the case in which the output is only subtly incorrect. One way to avoid these problems is to avoid using heap variables. However, this approach severely limits both the creativity and the productivity of the programmer.
One conventional method of managing heap data in local memory in an LLM processor is through the use of a software cache. FIG. 3 is source code illustrating managing heap data of an application, such as shown in FIG. 2, through a software cache. A software cache is a semi-automatic way to manage large amounts of data in a constant amount of local memory space. A software cache data structure may be located in the execution core with a predetermined size in the global data segment. To use a software cache, an application may include a declaration to manage a certain data structure through the software cache, and the application may then replace every access to that data with a read/write from/to the software cache. A software cache access first checks whether the data is in the cache data structure in the local memory. If it is, the program can directly read/write the data from/to the cache. Otherwise, a direct memory access (DMA) is performed to retrieve the required data from the global memory and store the data in the local memory, where it can be accessed. As new data comes into the cache data structure, older data may be evicted to the main memory.
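The access sequence described above (check the cache, and on a miss evict older data and fetch the required data) may be sketched as a small direct-mapped cache. The sketch is not the source code of FIG. 3 and does not reproduce any vendor software cache library. The names `SoftCache`, `sc_read`, `sc_write`, `dma_get`, and `dma_put` are hypothetical, the line and cache sizes are arbitrary, global memory is modeled as a host array, and DMA transfers are modeled with `memcpy`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GLOBAL_WORDS 4096u   /* size of the simulated global memory     */
#define LINE_WORDS   16u     /* words per cache line (assumption)       */
#define NUM_LINES    8u      /* lines in the local cache (assumption)   */

int32_t global_mem[GLOBAL_WORDS];        /* stand-in for global memory  */

typedef struct {
    int      valid;
    int      dirty;
    uint32_t base;                       /* global index of first word  */
    int32_t  data[LINE_WORDS];
} CacheLine;

typedef struct {
    CacheLine lines[NUM_LINES];          /* fixed local memory footprint */
    unsigned  misses;
} SoftCache;

/* Simulated DMA transfers of one whole line between global and local. */
static void dma_get(int32_t *local, uint32_t base)
{
    memcpy(local, &global_mem[base], sizeof(int32_t) * LINE_WORDS);
}

static void dma_put(const int32_t *local, uint32_t base)
{
    memcpy(&global_mem[base], local, sizeof(int32_t) * LINE_WORDS);
}

/* Every access performs this lookup, even when the data is already
 * cached. On a miss, a dirty resident line is written back before the
 * requested line is fetched. */
static CacheLine *sc_lookup(SoftCache *c, uint32_t addr)
{
    assert(addr < GLOBAL_WORDS);
    uint32_t   base = addr - addr % LINE_WORDS;
    CacheLine *line = &c->lines[(base / LINE_WORDS) % NUM_LINES];

    if (!line->valid || line->base != base) {
        if (line->valid && line->dirty)
            dma_put(line->data, line->base);   /* evict older data */
        dma_get(line->data, base);             /* fetch on miss    */
        line->valid = 1;
        line->dirty = 0;
        line->base  = base;
        c->misses++;
    }
    return line;
}

int32_t sc_read(SoftCache *c, uint32_t addr)
{
    return sc_lookup(c, addr)->data[addr % LINE_WORDS];
}

void sc_write(SoftCache *c, uint32_t addr, int32_t value)
{
    CacheLine *line = sc_lookup(c, addr);
    line->data[addr % LINE_WORDS] = value;
    line->dirty = 1;
}
```

In this sketch a written value reaches the simulated global memory only when its line is evicted, so a complete implementation would also need a flush step at the end of the computation.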
FIG. 3 gives an example of how the heap data of the application described in FIG. 2 may be managed through software cache. The first line of FIG. 3B shows the declaration of the software cache named HEAP. Because the number of students is unknown and can be large, the student data structures must be allocated in the global memory. However, in the original code, the student data structures are malloc-ed in the execution thread. Therefore, there is a need to communicate memory requirements from the execution threads/cores to the main thread/core. This requires a change in the structure of the multi-threaded program.
In the example shown in FIG. 3, the execution thread/core (SPE) sends the size of the malloc request to the main thread/core (PPE) through a mailbox. The main thread/core (PPE) allocates space for the student data structure in the global memory and sends its address back to the execution thread/core (SPE). The execution core (SPE) uses this address to access the student data structure, which actually resides in the global memory, through the software cache. To enable this scheme, a new thread on the main core (PPE), heapManage, may be initiated, which waits for requests from the execution thread/core (SPE), allocates the requested data structure in the global memory heap, and sends the allocated address back to the execution thread/core (SPE). Similar steps are taken when freeing the allocated memory, but are omitted from the example for simplicity.
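The request and reply exchange described above may be sketched as follows. The sketch is a single-threaded host simulation, not the source code of FIG. 3: the `Mailbox` type and the functions `mbox_send`, `mbox_recv`, and `heap_manage_step` are hypothetical, each mailbox holds a single entry, and the width and depth of any real mailbox hardware are not modeled. The function `heap_manage_step` performs one iteration of what a heapManage thread on the main core might do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* A one-entry mailbox (assumption: real mailbox hardware is not modeled). */
typedef struct {
    int       full;
    uintptr_t value;
} Mailbox;

/* Returns 1 if the value was posted, or 0 if the mailbox was occupied. */
int mbox_send(Mailbox *m, uintptr_t value)
{
    assert(m != NULL);
    if (m->full)
        return 0;
    m->value = value;
    m->full  = 1;
    return 1;
}

/* Returns 1 and stores the entry in *out, or 0 if the mailbox was empty. */
int mbox_recv(Mailbox *m, uintptr_t *out)
{
    assert(m != NULL && out != NULL);
    if (!m->full)
        return 0;
    *out    = m->value;
    m->full = 0;
    return 1;
}

/* One iteration of a hypothetical heapManage loop on the main core:
 * take a requested size from the execution core, allocate that many
 * bytes in the global heap, and send the address back.
 * Returns 1 if a request was served, or 0 if there was nothing to do. */
int heap_manage_step(Mailbox *request, Mailbox *reply)
{
    uintptr_t size;

    if (reply->full || !mbox_recv(request, &size))
        return 0;

    void *block = malloc((size_t)size);   /* lives in global memory */
    mbox_send(reply, (uintptr_t)block);   /* 0 signals allocation failure */
    return 1;
}
```

On the execution side, the corresponding steps under these assumptions would be a call to `mbox_send` on the request mailbox with the desired size, followed by a call to `mbox_recv` on the reply mailbox to obtain the global address that is then accessed through the software cache.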
One complexity with software caching of heap data is that the interface of the software cache requires that the data be allocated on the main core and that the execution cores access the data using the global address. To use a software cache, if an execution thread/core allocates or frees certain variables (using malloc/free), then these requests must be transmitted to the main core. Users have to program this communication and the allocation/free operations manually. In addition, to enable the main core to handle the memory management requests of the execution threads, users have to manually create a new thread, which waits for and serves requests from the execution threads. Normally the execution cores do the bidding of the main core, but to support this heap management the main core serves the requests of the execution cores. This reversal of roles makes the programming non-intuitive and complicated.
A second complexity with software caching of heap data is that the software cache library supports only one data type in a cache. A software cache does not support, for example, both an integer element and a pointer element in the same cache, so the pointer element must be redeclared as a non-structure, non-pointer data type. For instance, because the weight element is an int, the pointer element should be changed to an integer so that the two elements can use one cache instead of two different caches. This is unnatural for C programming and severely reduces readability.
A third complexity with software caching of heap data is that, even if the data is already in the cache, the application must still use the cache functions cache_rd and cache_wr to access the data in the software cache. The programmer cannot avoid the lookup on each access, and therefore there is little scope for optimizing the management overhead.
Software cache is best suited to handling global data, which is declared and allocated once. Because heap data is allocated dynamically, software caching of heap data is inefficient. Software caching of heap data would require changes in application coding and changing the thread on the main core of the many-core processor system. Further, software caching is difficult to implement and debug as the number of processors increases. What is needed is a scheme that limited local memory (LLM) multi-core programmers and applications can use to efficiently and intuitively manage heap memory of the application.