The present invention relates generally to the field of memory allocation and, in particular, to the field of memory allocation in a multiprocessing environment.
Parallel computer architectures generally provide multiple processors that can each be executing different tasks simultaneously. One such parallel computer architecture is referred to as a multithreaded architecture (MTA). The MTA supports not only multiple processors but also multiple streams executing simultaneously in each processor. The processors of an MTA computer are interconnected via an interconnection network. Each processor can communicate with every other processor through the interconnection network. FIG. 1 provides a high-level overview of an MTA computer. Each processor 101 is connected to the interconnection network and memory 102. Each processor contains a complete set of registers 101a for each stream. In addition, each processor also supports multiple protection domains 101b so that multiple user programs can be executing simultaneously within that processor.
Each MTA processor can execute multiple threads of execution simultaneously. Each thread of execution executes on one of the 128 streams supported by an MTA processor. Every clock time period, the processor selects a stream that is ready to execute and allows it to issue its next instruction. Instruction interpretation is pipelined by the processor, the network, and the memory. Thus, a new instruction from a different stream may be issued in each time period without interfering with other instructions that are in the pipeline. When an instruction finishes, the stream to which it belongs becomes ready to execute the next instruction. Each instruction may contain up to three operations (i.e., a memory reference operation, an arithmetic operation, and a control operation) that are executed simultaneously.
The state of a stream includes one 64-bit Stream Status Word (“SSW”), 32 64-bit General Registers (“R0-R31”), and eight 32-bit Target Registers (“T0-T7”). Each MTA processor has 128 sets of SSWs, of general registers, and of target registers. Thus, the state of each stream is immediately accessible by the processor without the need to reload registers when an instruction of a stream is to be executed.
The MTA uses program addresses that are 32 bits long. The lower half of an SSW contains the program counter (“PC”) for the stream. The upper half of the SSW contains various mode flags (e.g., floating point rounding, lookahead disable), a trap disable mask (e.g., data alignment and floating point overflow), and the four most recently generated condition codes. The 32 general registers are available for general-purpose computations. Register R0 is special, however, in that it always contains a 0. The loading of register R0 has no effect on its contents. The instruction set of the MTA processor uses the eight target registers as branch targets. However, most control transfer operations only use the low 32 bits to determine a new program counter. One target register (T0) points to the trap handler, which may be an unprivileged program. When a trap occurs, the trapping stream starts executing instructions at the program location indicated by register T0. Trap handling is lightweight and independent of the operating system and other streams. A user program can install trap handlers for each thread to achieve specific trap capabilities and priorities without loss of efficiency.
Each MTA processor supports as many as 16 active protection domains that define the program memory, data memory, and number of streams allocated to the computations using that processor. Each executing stream is assigned to a protection domain, but which domain (or which processor, for that matter) need not be known by the user program.
The MTA divides memory into program memory, which contains the instructions that form the program, and data memory, which contains the data of the program. The MTA uses a program mapping system and a data mapping system to map addresses used by the program to physical addresses in memory. The mapping systems use a program page map and a data segment map. The entries of the data segment map and program page map specify the location of the segment in physical memory along with the level of privilege needed to access the segment.
The number of streams available to a program is regulated by three quantities, slim, scur, and sres, associated with each protection domain. The current number of streams executing in the protection domain is indicated by scur: it is incremented when a stream is created and decremented when a stream quits. A create can only succeed when the incremented scur does not exceed sres, the number of streams reserved in the protection domain. The operations for creating, quitting, and reserving streams are unprivileged. Several streams can be reserved simultaneously. The stream limit slim is an operating system limit on the number of streams the protection domain can reserve.
When a stream executes a CREATE operation to create a new stream, the operation increments scur, initializes the SSW for the new stream based on the SSW of the creating stream and an offset in the CREATE operation, loads register T0, and loads three registers of the new stream from general purpose registers of the creating stream. The MTA processor can then start executing the newly created stream. A QUIT operation terminates the stream that executes it and decrements both sres and scur. A QUIT_PRESERVE operation only decrements scur, which gives up a stream without surrendering its reservation.
The MTA supports four levels of privilege: user, supervisor, kernel, and IPL. The IPL level is the highest privilege level. All levels use the program page and data segment maps for address translation, and represent increasing levels of privilege. The data segment map entries define the minimum levels needed to read and write each segment, and the program page map entries define the exact level needed to execute from each page. Each stream in a protection domain may be executing at a different privilege level.
Two operations are provided to allow an executing stream to change its privilege level. A “LEVEL_ENTER lev” operation sets the current privilege level to the program page map level if the current level is equal to lev. The LEVEL_ENTER operation is located at every entry point that can accept a call from a different privilege level. A trap occurs if the current level is not equal to lev. The “LEVEL_RETURN lev” operation is used to return to the original privilege level. A trap occurs if lev is greater than the current privilege level.
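By way of illustration only, the stream accounting described above can be modeled in a few lines of Python. This is a minimal sketch and not the hardware implementation: the names ProtectionDomain, StreamLimitError, and reserve are hypothetical, and reporting a failed reservation with a Python exception is an assumption, since the text states only that slim bounds the number of streams that can be reserved.

```python
class StreamLimitError(Exception):
    """Hypothetical stand-in for a failed reserve or create operation."""


class ProtectionDomain:
    """Toy model of the slim, scur, and sres counters of one protection domain."""

    def __init__(self, slim):
        self.slim = slim  # operating system limit on reserved streams
        self.sres = 0     # streams currently reserved
        self.scur = 0     # streams currently executing

    def reserve(self, count):
        # Several streams can be reserved at once, but never more than slim.
        if self.sres + count > self.slim:
            raise StreamLimitError("reservation would exceed slim")
        self.sres += count

    def create(self):
        # A create succeeds only when the incremented scur does not exceed sres.
        if self.scur + 1 > self.sres:
            raise StreamLimitError("no reserved stream is available")
        self.scur += 1

    def quit(self):
        # QUIT gives up both the stream and its reservation.
        self.scur -= 1
        self.sres -= 1

    def quit_preserve(self):
        # QUIT_PRESERVE gives up the stream but keeps the reservation.
        self.scur -= 1
```

In this model, a domain that reserves two streams can create two streams, a third create fails, and a stream that ends with quit_preserve leaves its reservation available for a later create.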
An exception is an unexpected condition raised by an event that occurs in a user program, the operating system, or the hardware. These unexpected conditions include various floating point conditions (e.g., divide by zero), the execution of a privileged operation by a non-privileged stream, and the failure of a stream create operation. Each stream has an exception register. When an exception is detected, then a bit in the exception register corresponding to that exception is set. If a trap for that exception is enabled, then control is transferred to the trap handler whose address is stored in register T0. If the trap is currently disabled, then control is transferred to the trap handler when the trap is eventually enabled assuming that the bit is still set in the exception register. The operating system can execute an operation to raise a domain_signal exception in all streams of a protection domain. If the trap for the domain_signal is enabled, then each stream will transfer control to its trap handler.
Each memory location in an MTA computer has four access state bits in addition to a 64-bit value. These access state bits allow the hardware to implement several useful modifications to the usual semantics of memory reference. These access state bits are two data trap bits, one full/empty bit, and one forward bit. The two data trap bits allow for application-specific lightweight traps, the forward bit implements invisible indirect addressing, and the full/empty bit is used for lightweight synchronization. The behavior of these access state bits can be overridden by a corresponding set of bits in the pointer value used to access the memory. The two data trap bits in the access state are independent of each other and are available for use, for example, by a language implementer. If a trap bit is set in a memory location, then an exception will be raised whenever that location is accessed if the trap bit is not disabled in the pointer. If the corresponding trap bit in the pointer is not disabled, then a trap will occur.
The forward bit implements a kind of “invisible indirection.” Unlike normal indirection, forwarding is controlled by both the pointer and the location pointed to. If the forward bit is set in the memory location and forwarding is not disabled in the pointer, the value found in the location is interpreted as a pointer to the target of the memory reference rather than the target itself. Dereferencing continues until either the pointer found in the memory location disables forwarding or the addressed location has its forward bit cleared.
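The dereferencing rule for forwarding can be sketched as follows. The sketch assumes a toy memory held in a Python dictionary; the names Pointer, Word, and resolve are hypothetical, and a chain of forwarded locations that loops back on itself is not detected.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Pointer:
    address: int
    forward_disabled: bool = False  # forward disable bit carried by the pointer


@dataclass
class Word:
    value: Union[int, Pointer] = 0
    forward: bool = False  # forward bit of the access state


def resolve(memory, pointer):
    """Return the address that a reference through pointer finally targets."""
    # Forwarding continues while the addressed word has its forward bit set
    # and the pointer used to reach it has not disabled forwarding.
    while memory[pointer.address].forward and not pointer.forward_disabled:
        pointer = memory[pointer.address].value
    return pointer.address
```

For example, if word 0 forwards to word 1 and word 1 forwards to word 2, a reference through Pointer(0) resolves to address 2, whereas a reference through Pointer(0, forward_disabled=True) resolves to address 0.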
The full/empty bit supports synchronization behavior of memory references. The synchronization behavior can be controlled by the full/empty control bits of a pointer or of a load or store operation. The four values for the full/empty control bits are shown below.
When the access control mode (i.e., synchronization mode) is future, loads and stores wait for the full/empty bit of the memory location to be accessed to be set to full before the memory location can be accessed. When the access control mode is sync, loads are treated as “consume” operations and stores are treated as “produce” operations. A load waits for the full/empty bit to be set to full and then sets the full/empty bit to empty as it reads, and a store waits for the full/empty bit to be set to empty and then sets the full/empty bit to full as it writes. A forwarded location (i.e., its forward bit is set) that is not disabled (i.e., by the access control of a pointer) and that is empty (i.e., full/empty bit is set to empty) is treated as “unavailable” until its full/empty bit is set to full, irrespective of access control.
The full/empty bit may be used to implement arbitrary indivisible memory operations. The MTA also provides a single operation that supports extremely brief mutual exclusion during “integer add to memory.” The FETCH_ADD operation loads the value from a memory location and stores the sum of that value and another value back into the memory location.
Each protection domain has a retry limit that specifies how many times a memory access can fail in testing the full/empty bit before a data blocked exception is raised. If the trap for the data blocked exception is enabled, then a trap occurs. The trap handler can determine whether to continue to retry the memory access or to perform some other action. If the trap is not enabled, then the next instruction after the instruction that caused the data blocked exception is executed.
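The future and sync behaviors can be illustrated with a small single-threaded model. This is a sketch under stated assumptions: Cell, WouldBlock, load, and store are hypothetical names, an access that would have to wait raises WouldBlock instead of being retried by hardware, and only the two modes described in the text are modeled.

```python
class WouldBlock(Exception):
    """The access would have to wait; real hardware would retry it."""


class Cell:
    """One memory word reduced to a value and its full/empty bit."""

    def __init__(self, value=0, full=False):
        self.value = value
        self.full = full


def load(cell, mode):
    if mode not in ("future", "sync"):
        raise ValueError("only the future and sync modes are modeled")
    if not cell.full:
        raise WouldBlock("load waits for the location to be full")
    if mode == "sync":
        cell.full = False  # a sync load consumes the value
    return cell.value


def store(cell, value, mode):
    if mode == "future":
        if not cell.full:
            raise WouldBlock("future store waits for the location to be full")
    elif mode == "sync":
        if cell.full:
            raise WouldBlock("sync store waits for the location to be empty")
    else:
        raise ValueError("only the future and sync modes are modeled")
    cell.value = value
    if mode == "sync":
        cell.full = True  # a sync store produces the value
```

A sync store into an empty cell followed by a sync load returns the stored value and leaves the cell empty again, so a second sync load would block until another producer stores into the cell.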
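The effect of FETCH_ADD can be approximated in software as shown below. This is only a sketch: the MTA performs the operation in hardware, whereas the example uses a Python lock to make the load and the store indivisible. The class name Counter is hypothetical, and returning the previously stored value to the caller is an assumption about how the loaded value is used.

```python
import threading


class Counter:
    """A memory location that supports an indivisible integer add."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def fetch_add(self, amount):
        # The lock stands in for the brief mutual exclusion of the hardware.
        with self._lock:
            old = self.value
            self.value = old + amount
            return old
```

When several threads each call fetch_add(1) on the same Counter, every call observes a distinct old value and no increment is lost.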
FIG. 2A illustrates the layout of the 64-bit exception register. The upper 32 bits contain the exception flags, and the lower 32 bits contain poison bits. There is one poison bit for each of the 32 general registers. When a poison bit is set, an exception is raised when the contents of the corresponding general register are accessed. The poison bits are used primarily for speculative loads. Bits 40-44 contain the flags for the user exceptions, which include a create stream exception, a privileged instruction exception, a data alignment exception, and a data blocked exception. A data blocked exception is raised when a data memory retry exception, a trap 0 exception, a trap 1 exception, or a long memory latency timeout is generated. The program handling a data blocked exception is responsible for determining the cause of the data blocked exception.
FIG. 2B illustrates the layout of the 64-bit stream status word. The lower 32 bits contain the program counter, bits 32-39 contain mode bits, bits 40-51 contain a trap mask, and bits 52-63 contain the condition codes of the last four instructions executed. Bit 37 within the mode bits indicates whether speculative loads are enabled or disabled. Bit 48 within the trap mask indicates whether a trap on a user exception is enabled (bits 40-44 of the SSW). Thus, traps for the user exceptions are enabled or disabled as a group.
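The field boundaries of the stream status word translate directly into shifts and masks. The following sketch assumes that bit 0 is the least significant bit; the names SSW, decode_ssw, and bit are hypothetical. Because the text does not state whether a set bit means enabled or disabled, the helper returns the raw bit value only.

```python
from typing import NamedTuple


class SSW(NamedTuple):
    pc: int               # bits 0-31, program counter
    mode: int             # bits 32-39, mode bits
    trap_mask: int        # bits 40-51, trap mask
    condition_codes: int  # bits 52-63, codes of the last four instructions


def decode_ssw(word):
    """Split a 64-bit stream status word into its four fields."""
    return SSW(
        pc=word & 0xFFFFFFFF,
        mode=(word >> 32) & 0xFF,
        trap_mask=(word >> 40) & 0xFFF,
        condition_codes=(word >> 52) & 0xFFF,
    )


def bit(word, position):
    """Return the raw value (0 or 1) of one bit of the word."""
    return (word >> position) & 1
```

With these helpers, bit(word, 37) reads the speculative load bit within the mode bits, and bit(word, 48) reads the user exception bit within the trap mask.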
FIG. 2C illustrates the layout of a word of memory and in particular a pointer stored in a word of memory. Each word of memory contains a 64-bit value and a 4-bit access state. The 4-bit access state is described above. When the 64-bit value is used to point to a location in memory, it is referred to as a “pointer.” The lower 48 bits of the pointer contain the address of the memory location to be accessed, and the upper 16 bits of the pointer contain access control bits. The access control bits indicate how to process the access state bits of the addressed memory location. One forward disable bit indicates whether forwarding is disabled, two full/empty control bits indicate the synchronization mode, and four trap 0 and 1 disable bits indicate whether traps are disabled for stores and loads, separately. If the forward disable bit is set, then no forwarding occurs regardless of the setting of the forward enable bit in the access state of the addressed memory location. If the trap 1 store disable bit is set, then a trap will not occur on a store operation, regardless of the setting of the trap 1 enable bit of the access state of the addressed memory location. The trap 1 load disable, trap 0 store disable, and trap 0 load disable bits operate in an analogous manner. Certain operations include a 5-bit access control operation field that supersedes the access control field of a pointer. The 5-bit access control field of an operation includes a forward disable bit, two full/empty control bits, a trap 1 disable bit, and a trap 0 disable bit. The bits effect the same behavior as described for the access control pointer field, except that each trap disable bit disables or enables traps on any access and does not distinguish load operations from store operations.
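The way an operation's access control field supersedes that of a pointer can be sketched for the data trap bits as follows. The sketch uses named fields rather than bit positions, because the exact positions within the upper 16 bits are not given here; PointerControl, OperationControl, and would_trap are hypothetical names, and only the trap disable bits are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PointerControl:
    # A pointer disables traps separately for loads and for stores.
    trap0_load_disable: bool = False
    trap0_store_disable: bool = False
    trap1_load_disable: bool = False
    trap1_store_disable: bool = False


@dataclass(frozen=True)
class OperationControl:
    # An operation disables a trap for any access, load or store alike.
    trap0_disable: bool = False
    trap1_disable: bool = False


def would_trap(trap, is_store, location_trap_bit, pointer_ctl,
               operation_ctl: Optional[OperationControl] = None):
    """Return True if an access would trap on data trap 0 or data trap 1."""
    if trap not in (0, 1):
        raise ValueError("trap must be 0 or 1")
    if not location_trap_bit:
        return False  # the trap bit of the addressed location is clear
    if operation_ctl is not None:
        # The operation field supersedes the access control of the pointer.
        disabled = getattr(operation_ctl, f"trap{trap}_disable")
    else:
        kind = "store" if is_store else "load"
        disabled = getattr(pointer_ctl, f"trap{trap}_{kind}_disable")
    return not disabled
```

In this model, a pointer with only its trap 1 store disable bit set suppresses the trap on a store but not on a load, while an operation whose trap 1 disable bit is set suppresses the trap on either kind of access.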
When a memory operation fails (e.g., synchronized access failure), an MTA processor saves the state of the operation. A trap handler can access that state. That memory operation can be redone by executing a redo operation (i.e., DATA_OP_REDO) passing the saved state as parameters of the operation. After the memory operation is redone (assuming it does not fail again), the trapping stream can continue its execution at the instruction after the trapping instruction.
The appendix contains the “Principles of Operation” of the MTA, which provides a more detailed description of the MTA.
Conventional computer systems provide memory allocation techniques that allow programs to allocate and de-allocate (i.e., free) memory dynamically. To allocate a block of memory, a program invokes a memory allocation routine (e.g., “malloc”) passing the size of the requested block of memory. The memory allocation routine locates a free block of memory, which is usually stored in a “heap,” marks the block as being allocated, and returns to the program a pointer to the allocated block of memory. The program can then use the pointer to store data in the block of memory. When the program no longer needs that block of memory, the program invokes a memory free routine (e.g., “free”) passing a pointer to the block of memory. The memory free routine marks the block as free so that it can be allocated to a subsequent request.
A program executing on a single-threaded processor may have multiple threads that execute concurrently, but not simultaneously. Each of these threads may request that memory be allocated or freed. Conventional memory allocation techniques, however, do not support the concurrent execution of memory allocation or memory free routines. If such routines were executed concurrently, a thread may find the state of the data structures used when allocating and freeing memory to be inconsistent because another thread is in the process of updating the state. Conventional memory allocation techniques may use a conventional locking mechanism (e.g., a semaphore) to prevent the concurrent execution of the memory allocation and memory free routines. Thus, the locked out threads will wait until another thread completes its memory allocation. Such waiting may be acceptable in a single-threaded processor environment, because only one thread can be executing at any time, so the processor may always be kept busy. Such waiting, however, is unacceptable in a multithreaded processor environment because many streams of the processor may be left idle waiting for a thread executing on another stream to complete its memory allocation request.
Conventional memory allocation routines are typically optimized to allocate memory based on the expected allocation patterns of the programs. For example, if it is expected that the programs will allocate many small blocks of memory, the memory allocation routines are optimized to allocate small blocks of memory efficiently. If, however, a program requests that a large block of memory be allocated, it may be very inefficient to service the request because, for example, it may be necessary to coalesce many free small blocks of memory into a single block of memory large enough to satisfy the request. Conversely, a conventional memory allocation routine may be optimized to allocate large blocks of memory efficiently. In such a case, it may be very efficient to allocate large blocks of memory but inefficient either computationally or in memory usage to allocate many small blocks.
It would be desirable to have a memory allocation technique that would maximize the concurrent execution of memory allocation routines and optimize the allocation of both large and small blocks of memory.
Embodiments of the present invention provide a method and system for allocating memory. The computer system on which the memory allocation system executes supports the simultaneous execution of multiple threads. Under control of a thread, the memory allocation system first identifies a bin associated with blocks (“lockers”) of memory large enough to satisfy a memory allocation request. When the identified bin has a free locker, the memory allocation system searches a circular list of headers associated with the identified bin for a collection of lockers (“warehouse”) that contains a locker that is available to be allocated. The memory allocation system allocates the found available locker to satisfy the request. If, however, the identified bin has no free lockers, the memory allocation system allocates a warehouse with lockers large enough to satisfy the memory allocation request. The memory allocation system then adds a warehouse header for the allocated warehouse to a circular list of warehouse headers associated with the identified bin. The memory allocation system allocates a locker from the allocated warehouse to satisfy the memory allocation request.
In another aspect of the present invention, a technique in a computer system is provided for removing an item from a circular list that is simultaneously accessible by multiple threads of execution. Each item in the circular list points to a next item in the circular list. During execution of one thread, the technique identifies an item to be removed from the circular list. The technique then sets the item before the identified item to point to the item after the identified item. The technique then ensures that the identified item points to an item of the circular list so that when another thread accesses the identified item after the identified item has been removed from the circular list, the identified item still points to an item on the circular list.
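The allocation flow summarized above can be illustrated with a single-threaded Python sketch. The sketch makes several assumptions that are not taken from the text: the locker sizes of the bins, the number of lockers per warehouse, the names Allocator, Bin, Warehouse, and release, and the use of a rotating deque to stand in for the circular list of warehouse headers. The synchronization needed for simultaneous threads is omitted entirely.

```python
from collections import deque

LOCKERS_PER_WAREHOUSE = 4  # assumed for illustration


class Bin:
    def __init__(self, locker_size):
        self.locker_size = locker_size
        self.free_lockers = 0
        self.headers = deque()  # rotated so that it behaves as a circular list


class Warehouse:
    """A collection of equally sized lockers, tracked through its header."""

    def __init__(self, owner):
        self.owner = owner  # bin whose list holds this header
        self.locker_size = owner.locker_size
        self.free = list(range(LOCKERS_PER_WAREHOUSE))  # free locker indices


class Allocator:
    def __init__(self, sizes=(16, 64, 256, 1024)):
        self.bins = [Bin(size) for size in sorted(sizes)]

    def allocate(self, size):
        # Identify the bin whose lockers are large enough for the request.
        bin_ = next((b for b in self.bins if b.locker_size >= size), None)
        if bin_ is None:
            raise ValueError("request exceeds the largest locker size")
        if bin_.free_lockers == 0:
            # No free locker: allocate a warehouse and add its header.
            bin_.headers.appendleft(Warehouse(bin_))
            bin_.free_lockers += LOCKERS_PER_WAREHOUSE
        # Search the circular list for a warehouse with an available locker.
        for _ in range(len(bin_.headers)):
            warehouse = bin_.headers[0]
            if warehouse.free:
                bin_.free_lockers -= 1
                return warehouse, warehouse.free.pop()
            bin_.headers.rotate(-1)
        raise RuntimeError("free locker count disagrees with the headers")

    def release(self, warehouse, locker):
        warehouse.free.append(locker)
        warehouse.owner.free_lockers += 1
```

In this sketch, five requests of 10 bytes each are served from the bin of 16-byte lockers; the first four exhaust one warehouse, and the fifth causes a second warehouse to be allocated and its header added to the list.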
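The pointer invariant of this removal technique can be shown in a short single-threaded sketch. Node, insert_after, and remove are hypothetical names. The sketch shows only which pointers are changed and which are deliberately left alone; it does not show the synchronization that simultaneous access would require, and it assumes that the list holds more than one item.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.next = self  # a single item forms a circular list by itself


def insert_after(node, new):
    """Link new into the circular list immediately after node."""
    new.next = node.next
    node.next = new


def remove(previous, item):
    """Unlink item, which must directly follow previous in the list."""
    if previous.next is not item or previous is item:
        raise ValueError("previous must be a different item directly before item")
    # Bypass the item: its predecessor now points to its successor.
    previous.next = item.next
    # The next pointer of the removed item is intentionally left unchanged,
    # so a thread still holding the item can step back onto the list.
```

After b is removed from the list a, b, c, the list is a, c, and a thread that still holds b can follow b.next to reach c, which remains on the list.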
In another aspect of the present invention, a method in a computer system is provided for detecting unauthorized access of a first word of memory. The technique establishes forwarding for the first word of memory (e.g., by setting the forward bit) and sets the first word of memory to point to a second word of memory. The second word of memory is a valid memory location. The technique establishes forwarding for the second word of memory and sets the second word of memory to point to an invalid memory location. When the first word is accessed with forwarding enabled, the access is forwarded to the second word. The access to the second word is in turn forwarded to the invalid memory location and unauthorized access to the first word is indicated. When the first word is accessed with forwarding disabled, the pointer to the second word of memory is retrieved and can be used to further access memory in an authorized manner.
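The two-level forwarding arrangement can be modeled with a toy memory as follows. This is a sketch under stated assumptions: INVALID_ADDRESS, UnauthorizedAccess, Word, guard, and load are hypothetical names, an address is treated as invalid simply because it is absent from the dictionary, and pointers stored in memory are plain addresses without access control bits.

```python
from dataclasses import dataclass

INVALID_ADDRESS = -1  # assumed never to be a key of the memory dictionary


class UnauthorizedAccess(Exception):
    """Raised when an access is forwarded to an invalid memory location."""


@dataclass
class Word:
    value: int = 0
    forward: bool = False  # forward bit of the access state


def guard(memory, first, second):
    """Protect the word at first by using the word at second as a relay."""
    memory[second] = Word(INVALID_ADDRESS, forward=True)
    memory[first] = Word(second, forward=True)


def load(memory, address, forward_disabled=False):
    """Read a word, following forward bits unless forwarding is disabled."""
    while not forward_disabled and memory[address].forward:
        address = memory[address].value
        if address not in memory:
            raise UnauthorizedAccess("access forwarded to an invalid location")
    return memory[address].value
```

In this model, an ordinary load of the guarded word raises UnauthorizedAccess, whereas a load with forwarding disabled returns the address of the second word, which authorized code can then use for further access.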