Parallel computer architectures generally provide multiple processors that can each execute different tasks simultaneously. One such parallel computer architecture is referred to as a multithreaded architecture (MTA). The MTA supports not only multiple processors but also multiple streams executing simultaneously in each processor. The processors of an MTA computer are interconnected via an interconnection network. Each processor can communicate with every other processor through the interconnection network. FIG. 1 provides a high-level overview of an MTA computer. Each processor 101 is connected to the interconnection network and memory 102. Each processor contains a complete set of registers 101a for each stream. In addition, each processor also supports multiple protection domains 101b so that multiple user programs can execute simultaneously within that processor.
Each MTA processor can execute multiple threads of execution simultaneously. Each thread of execution executes on one of the 128 streams supported by an MTA processor. Every clock time period, the processor selects a stream that is ready to execute and allows it to issue its next instruction. Instruction interpretation is pipelined by the processor, the network, and the memory. Thus, a new instruction from a different stream may be issued in each time period without interfering with other instructions that are in the pipeline. When an instruction finishes, the stream to which it belongs becomes ready to execute the next instruction. Each instruction may contain up to three operations (i.e., a memory reference operation, an arithmetic operation, and a control operation) that are executed simultaneously.
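By way of illustration only, the following Python sketch models the issue policy described above: in each clock period, at most one ready stream issues an instruction, and the stream becomes ready again when that instruction finishes. The function name `simulate_issue`, the fixed `latency` parameter, and the one-instruction-in-flight-per-stream simplification are assumptions made for the sketch and are not taken from the MTA design.

```python
from collections import deque


def simulate_issue(num_streams, latency, cycles):
    """Count instructions issued over `cycles` clock periods.

    Assumption: each stream has at most one instruction in flight, and a
    stream becomes ready again `latency` periods after it issues.
    """
    ready = deque(range(num_streams))   # streams able to issue now
    in_flight = deque()                 # (finish_cycle, stream), oldest first
    issued = 0
    for cycle in range(cycles):
        # Instructions that have finished make their streams ready again.
        while in_flight and in_flight[0][0] <= cycle:
            ready.append(in_flight.popleft()[1])
        # At most one ready stream issues per clock period.
        if ready:
            stream = ready.popleft()
            in_flight.append((cycle + latency, stream))
            issued += 1
    return issued
```

Under these assumptions, a single stream issues only once every `latency` periods, whereas a processor with at least `latency` ready streams can issue in every period, which is the motivation for supporting many streams per processor.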
The state of a stream includes one 64-bit Stream Status Word (“SSW”), 32 64-bit General Registers (“R0-R31”), and eight 32-bit Target Registers (“T0-T7”). Each MTA processor has 128 sets of SSWs, of general registers, and of target registers. Thus, the state of each stream is immediately accessible by the processor without the need to reload registers when an instruction of a stream is to be executed.
Each MTA processor supports as many as 16 active protection domains that define the program memory, data memory, and number of streams allocated to the computations using that processor. Each executing stream is assigned to a protection domain, but which domain (or which processor, for that matter) is assigned need not be known by the user program.
The MTA divides memory into program memory, which contains the instructions that form the program, and data memory, which contains the data of the program. The MTA uses a program mapping system and a data mapping system to map addresses used by the program to physical addresses in memory. The mapping systems use a program page map and a data segment map. The entries of the data segment map and program page map specify the location of the segment in physical memory along with the level of privilege needed to access the segment.
Each memory location in an MTA computer has four access state bits in addition to a 64-bit value. These access state bits allow the hardware to implement several useful modifications to the usual semantics of memory reference. These access state bits are two data trap bits, one full/empty bit, and one forward bit. The two data trap bits allow for application-specific lightweight traps, the forward bit implements invisible indirect addressing, and the full/empty bit is used for lightweight synchronization. The behavior of these access state bits can be overridden by a corresponding set of bits in the pointer value used to access the memory. The two data trap bits in the access state are independent of each other and are available for use, for example, by a language implementer. If a trap bit is set in a memory location and the corresponding trap bit is not disabled in the pointer, then a trap will occur whenever that location is accessed.
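The interaction between the data trap bits of a memory location and the trap disable bits of a pointer may be sketched as follows. This is a software model under stated assumptions, not the hardware mechanism: the names `Word`, `Pointer`, `DataTrap`, and `load` are hypothetical, and the sketch does not distinguish load traps from store traps.

```python
from dataclasses import dataclass


class DataTrap(Exception):
    """Raised when an enabled data trap bit is encountered (hypothetical)."""


@dataclass
class Word:
    value: int = 0
    trap0: bool = False   # data trap bit 0 of the access state
    trap1: bool = False   # data trap bit 1 of the access state


@dataclass
class Pointer:
    address: int
    trap0_disabled: bool = False   # pointer overrides trap bit 0
    trap1_disabled: bool = False   # pointer overrides trap bit 1


def load(memory, ptr):
    """Read a word, trapping if a set trap bit is not disabled in the pointer."""
    word = memory[ptr.address]
    if word.trap0 and not ptr.trap0_disabled:
        raise DataTrap("trap 0 at address %d" % ptr.address)
    if word.trap1 and not ptr.trap1_disabled:
        raise DataTrap("trap 1 at address %d" % ptr.address)
    return word.value
```

The two trap bits are checked independently, so a pointer that disables only trap 0 still traps on a location whose trap 1 bit is set.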
The forward bit implements a kind of “invisible indirection.” Unlike normal indirection, forwarding is controlled by both the pointer and the location pointed to. If the forward bit is set in the memory location and forwarding is not disabled in the pointer, the value found in the location is interpreted as a pointer to the target of the memory reference rather than the target itself. Dereferencing continues until either the pointer found in the memory location disables forwarding or the addressed location has its forward bit cleared.
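The dereferencing rule described above may be sketched as a loop. The names `Pointer`, `Word`, and `resolve` are hypothetical, memory is modeled as a Python dictionary, and the sketch ignores the full/empty bit and assumes that the forwarding chain contains no cycle.

```python
from typing import NamedTuple


class Pointer(NamedTuple):
    address: int
    forward_disabled: bool = False   # access control bit carried by the pointer


class Word(NamedTuple):
    value: object                    # plain data, or a Pointer when forwarded
    forward: bool = False            # forward bit of the access state


def resolve(memory, ptr):
    """Return the address finally referenced after following forwarding.

    Forwarding continues while the current pointer does not disable it and
    the addressed location has its forward bit set.
    """
    while not ptr.forward_disabled and memory[ptr.address].forward:
        ptr = memory[ptr.address].value   # value is interpreted as a pointer
    return ptr.address
```

For example, if locations 0 and 1 are both forwarded and location 2 is not, a reference through a pointer to location 0 resolves to location 2, whereas the same reference through a pointer that disables forwarding resolves to location 0 itself.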
The full/empty bit supports synchronization behavior of memory references. The synchronization behavior can be controlled by the full/empty control bits of a pointer or of a load or store operation. The four values for the full/empty control bits are shown below.
VALUE  MODE      LOAD                          STORE
0      normal    read regardless               write regardless and set full
1      reserved  reserved                      reserved
2      future    wait for full and leave full  wait for full and leave full
3      sync      wait for full and set empty   wait for empty and set full

When the access control mode (i.e., synchronization mode) is future, loads and stores wait for the full/empty bit of the memory location to be accessed to be set to full before the memory location can be accessed. When the access control mode is sync, loads are treated as “consume” operations and stores are treated as “produce” operations. A load waits for the full/empty bit to be set to full and then sets the full/empty bit to empty as it reads, and a store waits for the full/empty bit to be set to empty and then sets the full/empty bit to full as it writes. A forwarded location (i.e., its forward bit is set) that is not disabled (i.e., by the access control of a pointer) and that is empty (i.e., the full/empty bit is set to empty) is treated as “unavailable” until its full/empty bit is set to full, irrespective of access control.
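The load and store behavior of the normal, future, and sync modes may be emulated in software as sketched below. The class name `SyncWord`, the use of a condition variable in place of hardware retry, and the initially empty state are assumptions of the sketch; forwarding and traps are not modeled.

```python
import threading

NORMAL, RESERVED, FUTURE, SYNC = 0, 1, 2, 3   # full/empty control values


class SyncWord:
    """Software model of one memory word with a full/empty bit."""

    def __init__(self, value=0, full=False):   # initial state is an assumption
        self._value = value
        self._full = full
        self._cond = threading.Condition()

    def load(self, mode):
        if mode not in (NORMAL, FUTURE, SYNC):
            raise ValueError("reserved or unknown mode")
        with self._cond:
            if mode in (FUTURE, SYNC):
                self._cond.wait_for(lambda: self._full)   # wait for full
            value = self._value
            if mode == SYNC:                              # consume: set empty
                self._full = False
                self._cond.notify_all()
            return value

    def store(self, value, mode):
        if mode not in (NORMAL, FUTURE, SYNC):
            raise ValueError("reserved or unknown mode")
        with self._cond:
            if mode == FUTURE:
                self._cond.wait_for(lambda: self._full)      # wait for full
            elif mode == SYNC:
                self._cond.wait_for(lambda: not self._full)  # wait for empty
            self._value = value
            self._full = True   # normal and sync set full; future leaves full
            self._cond.notify_all()
```

In sync mode, a producer thread that repeatedly stores and a consumer thread that repeatedly loads hand off each value exactly once, because each store waits for the previous value to be consumed.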
The full/empty bit may be used to implement arbitrary indivisible memory operations. The MTA also provides a single operation that supports extremely brief mutual exclusion during “integer add to memory.” The FETCH_ADD operation loads the value from a memory location and stores the sum of that value and another value back into the memory location.
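The effect of FETCH_ADD may be sketched in software as follows. The class `Memory` and its methods are hypothetical; a lock stands in for the indivisibility that the hardware provides, and returning the previously stored value to the caller is an assumption based on the name of the operation.

```python
import threading


class Memory:
    """Toy word-addressed memory with an indivisible add-to-memory."""

    def __init__(self, size):
        self._words = [0] * size
        self._lock = threading.Lock()   # stand-in for hardware atomicity

    def fetch_add(self, address, increment):
        """Add `increment` to the word and return the value it held before."""
        with self._lock:
            old = self._words[address]
            self._words[address] = old + increment
            return old

    def load(self, address):
        return self._words[address]
```

Because the load and the store occur as one indivisible step, concurrent callers each observe a distinct prior value, so the operation can be used, for example, to hand out unique indices to many threads without a separate lock.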
Conventional computer systems provide memory allocation techniques that allow programs or applications to allocate and de-allocate (i.e., free) memory dynamically. To allocate a block of memory, a program invokes a memory allocation routine (e.g., “malloc”) passing the size of the requested block of memory. The memory allocation routine locates a free block of memory, which is usually stored in a “heap,” marks the block as being allocated, and returns to the program a pointer to the allocated block of memory. The program can then use the pointer to store data in the block of memory. When the program no longer needs that block of memory, the program invokes a memory free routine (e.g., “free”) passing a pointer to the block of memory. The memory free routine marks the block as free so that it can be allocated to a subsequent request.
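A minimal sketch of such allocate and free routines is given below, assuming a first-fit search over a sorted list of free blocks with coalescing of adjacent free blocks on release. The class `Heap`, its method names, and the use of integer offsets in place of pointers are illustrative assumptions; the sketch is single-threaded and does not represent any particular conventional allocator.

```python
class Heap:
    """First-fit allocator over a contiguous range of `size` units."""

    def __init__(self, size):
        self.free_blocks = [(0, size)]   # sorted list of (offset, length)
        self.allocated = {}              # offset -> length of allocated block

    def malloc(self, size):
        """Return the offset of a block of `size` units, or None if none fits."""
        if size <= 0:
            raise ValueError("size must be positive")
        for i, (offset, length) in enumerate(self.free_blocks):
            if length >= size:
                if length == size:
                    del self.free_blocks[i]
                else:
                    # Split: the remainder of the block stays free.
                    self.free_blocks[i] = (offset + size, length - size)
                self.allocated[offset] = size   # mark the block as allocated
                return offset
        return None

    def free(self, offset):
        """Mark the block at `offset` as free and merge adjacent free blocks."""
        size = self.allocated.pop(offset)
        self.free_blocks.append((offset, size))
        self.free_blocks.sort()
        merged = []
        for off, length in self.free_blocks:
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((off, length))
        self.free_blocks = merged
```

Both routines read and update the same free list, which is why their unsynchronized concurrent execution could expose an inconsistent state to another thread.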
A program executing on a single-threaded processor may have multiple threads that execute concurrently, but not simultaneously. Each of these threads may request that memory be allocated or freed. Conventional memory allocation techniques, however, do not support the concurrent execution of memory allocation or memory free routines. If such routines were executed concurrently, a thread may find the state of the data structures used when allocating and freeing memory to be inconsistent because another thread is in the process of updating the state. Conventional memory allocation techniques may use a conventional locking mechanism (e.g., a semaphore) to prevent the concurrent execution of the memory allocation and memory free routines. Thus, the locked-out threads will wait until the thread holding the lock completes its memory allocation or memory free request. Such waiting may be acceptable in a single-threaded processor environment because only one thread can be executing at any one time, so the processor may always be kept busy. Such waiting, however, is unacceptable in a multithreaded processor environment because many streams of the processor may be left idle waiting for a thread executing on another stream to complete its memory allocation request.
FIG. 2 is a block diagram that illustrates the layout of a word of memory and in particular a pointer stored in a word of memory. Each word of memory contains a 64-bit value and a 4-bit access state. The 4-bit access state is described above. When the 64-bit value is used to point to a location in memory, it is referred to as a “pointer.” The lower 48 bits of the pointer contain the address of the memory location to be accessed, and the upper 16 bits of the pointer contain access control bits. The access control bits indicate how to process the access state bits of the addressed memory location. One forward disable bit indicates whether forwarding is disabled, two full/empty control bits indicate the synchronization mode, and four trap 0 and 1 disable bits indicate whether traps are disabled for stores and loads, separately. If the forward disable bit is set, then no forwarding occurs regardless of the setting of the forward enable bit in the access state of the addressed memory location. If the trap 1 store disable bit is set, then a trap will not occur on a store operation, regardless of the setting of the trap 1 enable bit of the access state of the addressed memory location. The trap 1 load disable, trap 0 store disable, and trap 0 load disable bits operate in an analogous manner. Certain operations include a 5-bit access control operation field that supersedes the access control field of a pointer. The 5-bit access control field of an operation includes a forward disable bit, two full/empty control bits, a trap 1 disable bit, and a trap 0 disable bit. The bits effect the same behavior as described for the access control pointer field, except that each trap disable bit disables or enables traps on any access and does not distinguish load operations from store operations.
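The division of a 64-bit pointer into a 48-bit address and 16 access control bits may be sketched as follows. The function names `pack_pointer` and `unpack_pointer` are hypothetical, and the sketch deliberately treats the access control bits as an opaque 16-bit field, because the positions of the individual control bits within that field are not given here.

```python
ADDRESS_BITS = 48
CONTROL_BITS = 16
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1   # lower 48 bits: address
CONTROL_MASK = (1 << CONTROL_BITS) - 1   # upper 16 bits: access control


def pack_pointer(address, control):
    """Combine an address and an access control field into a 64-bit value."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in 48 bits")
    if not 0 <= control <= CONTROL_MASK:
        raise ValueError("access control does not fit in 16 bits")
    return (control << ADDRESS_BITS) | address


def unpack_pointer(word):
    """Split a 64-bit value into (address, access control field)."""
    return word & ADDRESS_MASK, (word >> ADDRESS_BITS) & CONTROL_MASK
```

Packing followed by unpacking returns the original address and control field unchanged, and the packed value always fits in 64 bits.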
Conventional memory allocation routines are typically optimized to allocate memory based on the expected allocation patterns of the programs. For example, if it is expected that the programs will allocate many small blocks of memory, the memory allocation routines are optimized to allocate small blocks of memory efficiently. If, however, a program requests that a large block of memory be allocated, it may be very inefficient to service the request because, for example, it may be necessary to coalesce many free small blocks of memory into a single block of memory large enough to satisfy the request. Conversely, a conventional memory allocation routine may be optimized to allocate large blocks of memory efficiently. In such a case, it may be very efficient to allocate large blocks of memory but inefficient either computationally or in memory usage to allocate many small blocks.
Memory allocation from a shared and unpartitioned global address space to support parallel applications is a fundamental need of shared-memory parallel computers. In the absence of faster processors, one way to increase application performance is by increasing concurrency. As a result of increased concurrency, future applications will likely have an ever-increasing number of threads allocating memory concurrently. Even where applications themselves do not perform a great deal of dynamic memory allocation, the mechanisms supporting parallelization such as stack allocation and thread creation do perform such memory allocation.
Hoard is an allocator that attempts to attain scalable and memory-efficient allocator performance as described in Berger, E., McKinley, K., Blumofe, R., and Wilson, P., “Hoard: A Scalable Memory Allocator for Multithreaded Applications,” Proceedings of ASPLOS '00, 2000. Unfortunately, Hoard cannot be scaled up without impractical expenditure of space. Hoard instantiates a number (e.g., two) of “local” allocators per processor. In addition, Hoard instantiates a global pool managed by a global allocator. There is an affinity of allocators to processors in that each processor first tries to allocate from one of its local allocators; if that allocator is empty, it tries to refill from the global allocator; if that fails, it allocates a new block from the operating system.
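The order in which the three levels are consulted may be sketched as follows. This is a simplified model, not the Hoard implementation: the classes `OperatingSystem`, `GlobalPool`, and `LocalAllocator` are hypothetical, blocks are represented by integer identifiers, and allocation is performed at the granularity of whole blocks.

```python
import itertools
import threading


class OperatingSystem:
    """Stand-in for the operating system: hands out fresh block ids."""

    def __init__(self):
        self._ids = itertools.count()
        self._lock = threading.Lock()

    def new_block(self):
        with self._lock:
            return next(self._ids)


class GlobalPool:
    """Pool of free blocks shared by all local allocators."""

    def __init__(self):
        self._blocks = []
        self._lock = threading.Lock()

    def give(self, block):
        with self._lock:
            self._blocks.append(block)

    def take(self):
        with self._lock:
            return self._blocks.pop() if self._blocks else None


class LocalAllocator:
    """Per-processor allocator that falls back to the pool, then the OS."""

    def __init__(self, pool, operating_system):
        self._blocks = []
        self._lock = threading.Lock()
        self._pool = pool
        self._os = operating_system

    def allocate(self):
        """Return (block, source), where source names the level that served it."""
        with self._lock:
            if self._blocks:
                return self._blocks.pop(), "local"
        block = self._pool.take()            # local allocator is empty
        if block is not None:
            return block, "global"
        return self._os.new_block(), "os"    # global pool is empty as well

    def release(self, block):
        with self._lock:
            self._blocks.append(block)
```

Each level has its own lock, so threads using different local allocators contend only when they fall through to the shared pool or to the operating system.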
According to Hoard, each heap is specific to a range of request sizes, which implies internal fragmentation at a rate proportional to the ratio of successive ranges. This internal fragmentation seems to be acceptable in practice. Requests larger than half of an implementation specific block-size are directed to the operating system. Each heap manages its list of blocks, never touching blocks belonging to other heaps. However, when its blocks are sparsely utilized, heaps move an unused or nearly unused block into the global pool. Once a block in the global pool is completely unused, it becomes available to all heaps.
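The relationship between the ratio of successive ranges and internal fragmentation may be illustrated with geometrically spaced size classes. The function `size_class`, the ratio of 2, and the smallest class of 8 bytes are assumptions chosen for the illustration and are not parameters taken from Hoard.

```python
def size_class(request, base=2, smallest=8):
    """Return the smallest class size of the form smallest * base**k >= request."""
    if request <= 0:
        raise ValueError("request must be positive")
    size = smallest
    while size < request:
        size *= base
    return size


def wasted_fraction(request, base=2, smallest=8):
    """Fraction of the allocated block that the request does not use."""
    allocated = size_class(request, base, smallest)
    return (allocated - request) / allocated
```

With a ratio of 2, a request just above a class boundary (for example, 65 bytes served from the 128-byte class) leaves almost half of the block unused, while a request exactly at a boundary wastes nothing. A smaller ratio between successive classes reduces this worst case at the cost of more heaps.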
A lock-free approach is described in Michael, M., “Scalable Lock-Free Dynamic Memory Allocation,” Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, Washington, D.C., Jun. 9-11, 2004. This approach is similar to Hoard in that it also uses repeated heaps of super blocks in proportion to the required concurrency, backed by a global pool implemented as a shared FIFO of super blocks. Again, larger requests are directed to the operating system. Its lock-free approach, however, provides robustness against deadlock should threads fail to progress.
Memory allocation for the MTA was initially implemented in a similar spirit to Hoard; that is, concurrency is achieved by virtue of repeated data structures, with various forms of locking used to ensure atomicity. This implementation, however, may suffer contention under allocation surges, suggesting that it would not scale with system size.
These parallel memory allocators rely on heap repetition to support concurrency. When individual heaps are exhausted, they can be refilled from the next level in a small hierarchy. In some instances, the first level is the heap itself, the second is a global pool or FIFO, and the third is the OS.