1. Technical Field
This invention relates to memory allocation in a computer system. More specifically, this invention relates to memory allocation in the environment of a shared memory multiprocessor computer system.
2. Description of the Prior Art
Multiprocessor systems by definition contain multiple processors, also referred to herein as CPUs that can execute multiple processes or multiple threads within a single process simultaneously in a manner known as parallel computing. In general, multiprocessor systems execute multiple processes or threads faster than conventional uniprocessor systems, such as personal computers (PCs), that execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded process and/or multiple distinct processes can be executed in parallel and the architecture of the particular multiprocessor system at hand.
The architecture of shared memory multiprocessor systems may be classified by how their memory is physically organized. In distributed shared memory (DSM) machines, the memory is divided into modules physically placed near one or more processors, typically on a processor node. Although all of the memory modules are globally accessible, a processor can access local memory on its node faster than remote memory on other nodes. Because the memory access time differs based on memory location, such systems are also called non-uniform memory access (NUMA) machines. In centralized shared memory machines, on the other hand, the memory is physically in one location. Centralized shared memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time for each of the processors. Both forms of memory organization typically use high-speed caches in conjunction with main memory to reduce execution time.
The use of NUMA architecture to increase performance is not restricted to NUMA machines. A subset of processors in an UMA machine may share a cache. In such an arrangement, even though he memory is equidistant from all processors, data can circulate among the cache-sharing processors faster (i.e. with lower latency) than among the other processors in the machine. Algorithms that enhance the performance of NUMA machines can thus be applied to any multiprocessor system that has a subset of processors with lower latencies. These include not only the noted NUMA and shared-cache machines, but also machines where multiple processors share a set of bus-interface logic as well as machines with interconnects that xe2x80x9cfan outxe2x80x9d (typically in hierarchical fashion) to the processors.
Multiprocessor computers with distributed shared memory are often organized into multiple nodes with one or more processors per node. The nodes interface with each other through a memory interconnect network by using a protocol.
In a NUMA machine, the memory is physically closer to a processor on the same node than a processor on another node. Consequently, processes run faster if their memory is placed on the node containing the processor running that process since the processor and memory would not need to communicate between nodes. In contrast, in a UMA machine the memory is substantially equidistant from all processors, and there is no performance advantage to placing a process""s memory in any particular range of physical addresses.
In a multiprocessor computing environment it is often necessary to provide a system for preventing any given type of structure from consuming all of the memory. This system requires that a count of the number of structures allocated be maintained to ensure that the value of the cache in each CPU is maintained within a predetermined range. Distributed reference counters have been implemented in the prior art multiprocessor computer systems to address the issue of cache overflow and cache underflow. The distributed reference counters are used to track cache memory accesses by each CPU and each quad to make sure a cache overflow situation does not occur. FIG. 3 is a diagram of the data structure 50 of the prior art distributed reference counter. As shown in FIG. 3, the first field 52 is divided into three sections 54, 56 and 58. The first section 54 is a lock for the data structure, the second section 56 is the identifier of the CPU corresponding to the data structure, and the third section 58 is the identifier of the quad corresponding to the data structure. The second and third fields, 60 and 62, respectively, of the distributed reference counter store cache value and target value at a specific CPU. The target value is the desired value of cache. The remaining fields of the prior art distributed reference counter pertain to the number of increments per CPU 64, number of decrements per CPU 66, the number of increments per quad 68, and the number of decrements per quad 70. Each of the fields in the data structure have a cache size of 32 bits. Accordingly, the prior art reference counter contains separate fields for cache value and target value per CPU and per quad.
In conjunction with the reference counter, the prior art system includes a primitive for either increasing or decreasing the reference counter. The following is the per-CPU fastpath pseudocode for decrementing the reference counter of the prior art (pointed to by the argument xe2x80x9crepxe2x80x9d) by a specified amount and is used when the per-CPU cache has enough value to cover the amount of the decrement:
1. If the per engine counter structure referenced by a pointer to the current CPU""s value-cache structure belongs to some other CPU, invoke a function to handoff to an appropriate quad.
2. Atomically increment a count of the number of decrements.
3. If the value to be decremented by is too large, invoke the refcnt_decr_eng_large( ) to perform the decrement to avoid arithmetic overflow.
4. Select the field named rcp_value from the structure pointed to by pointer rep and copy to local variable xe2x80x9csnapshotxe2x80x9d, and select the field named rcp_target from the structure pointed to by pointer rep and copy to local variable xe2x80x9csnaptgtxe2x80x9d.
5. If xe2x80x9csnapshotxe2x80x9d is at least as large as the amount to be decremented by:
a) Atomically compare the value cache at this CPU with xe2x80x9csnapshotxe2x80x9d, and (if they are equal) exchange with the quantity xe2x80x9csnapshot-valuexe2x80x9d.
b) If the compare and exchange operation in step (5a) succeeds, return REFCNTxe2x80x94NONZERO to indicate that the aggregate value of the reference counter is greater than zero, otherwise continue with step 6.
6. If we reach this point, the fastpath for decrementing the reference counter failed, and conventional locking with interrupts disabled to complete the update should be invoked.
The following is the per-CPU fastpath pseudocode for incrementing the reference counter of the prior art by a specified amount:
1. If the per engine counter structure referenced by a pointer to the current CPU""s value-cache structure belongs to some other CPU, invoke a primitive handoff to an appropriate quad.
2. Atomically increment the field named xe2x80x9cnincrsxe2x80x9d from the structure pointed to by the pointer rep to count the increment.
3. There is no race-free fastpath for incrementing the reference counter, so conventional locking methods are used to complete the update.
Both the increment and decrement primitives of the prior art allocate 32 bit integers for both the value and target fields of the data structure. The value and target fields are separate fields within the structure and as such must be independently manipulated. When an increment of the counter takes place, a comparison and exchange of the cache and target values need to occur in a single operation in order to maintain accuracy. Although both the cache and target values may be stored in a single field of the data structure as 32 bit integers with the cache and target values limited to 16 bits each, the limits associated with storing such data in a 16 bit integer can result in overflow problems. Accordingly, there is a need for a distributed reference counter in which the cache values and the target values may be simultaneously modified as an atomic action.
It is therefore an object of the invention to provide a method for tracking a quantity of memory elements in a multiprocessor computer system. It is a further object of the invention to provide a distributed reference counter that combines the cache value and the target value into a single quantity to provide an efficient and accurate system for mitigating cache overflow.
A first aspect of the invention is a method of tracking a quantity of memory elements in a multiprocessor. The cache value and the target value are combined into a single quantity. Both the cache and target values are atomically manipulated as a single quantity. Furthermore, the target value of each CPU and each node is maintained and dynamically varied based upon the distance between the total value and the value limits.
A second aspect of the invention is a reference counter for use in a multiprocessor computer system including a single quantity for storing a cache value and a target value, and an instruction for controlling the quantity. The cache value and the target value are atomically manipulated by the instruction as a single quantity. In addition, the reference counter includes a cache manager to dynamically vary the desired level of cache in each CPU and node based upon the distance between a total value and the cache limits of the distributed reference counter.
A third aspect of the invention is an article comprising a computer-readable signal bearing medium, with multiple processors operating in the medium. The article includes combining a cache value and a target value into a single quantity and a means for atomically manipulating the quantity. In addition, the article includes a cache manager to dynamically vary a desired level of value in a node and a CPU based upon the distance between a total value and value limits.
Other features and advantages of this invention will become apparent from the following detailed description of the presently preferred embodiment of the invention, taken in conjunction with the accompanying drawings.