The present invention is concerned with computer programs, and is more particularly concerned with computer programs that support multiple concurrent control threads.
Many conventional operating systems support multiple concurrent control threads to allow multitasking of application programs. In some cases, duplicate control threads are operated concurrently, as occurs, for example, when plural input/output devices of the same type are independently controlled by respective control threads.
It is also conventional that each control thread be provided with a call stack. As is very familiar to those who are skilled in the art, a call stack is a collection of data structures called stack frames in which a previous stack frame pointer, return instruction pointer and relevant data are stored upon initialization of a call routine. The next instruction pointer and the relevant data are retrieved from the previous stack frame on the call stack at the end of the call routine.
The present invention addresses a problem that may be encountered in connection with caching of call stacks for multiple duplicate control threads.
FIG. 1 is a simplified block diagram of a conventional computer system in which the present invention may be applied. Reference numeral 100 generally indicates the computer system, from which many customary components have been omitted to simplify the drawing. The computer system 100 includes a processor (CPU) 110, which may, for example, be one of the PowerPC™ family of devices available from International Business Machines. A cache memory 120 is associated with the processor 110. For example, the cache memory 120 may be on board the processor 110. Also accessible by the processor 110 are a non-volatile memory (e.g., ROM) 130 and a volatile memory (e.g., RAM) 140. The non-volatile memory 130 and the volatile memory 140 are connected to the processor 110 via a memory controller 150, memory busses 160 and a processor bus 170.
When the processor 110 performs a load or store instruction, the cache memory 120 is interrogated to determine if the needed data resides within the cache memory 120. If so, then the processor 110 either loads the data from or modifies the data within the cache memory 120. If the data does not reside within the cache memory 120, then the processor 110 issues a cache line fill operation to the memory controller 150. The memory controller 150 retrieves the data from the appropriate source, which may be either the non-volatile memory 130 or the volatile memory 140. The memory controller 150 then returns the data to the processor 110. If there is a usable empty line in the cache memory 120, then the data received via the memory controller 150 is placed in the empty line. If not, then a line of data is evicted from the cache memory 120, and the new data is put in the place of the evicted data. The processor 110 then loads the data from or modifies the requested data within the cache memory 120.
Because the cache memory 120 is on board or otherwise closely associated with the processor 110, accessing data in the cache memory 120 is much more rapid than accessing data in the non-volatile memory 130 or the volatile memory 140. Thus the use of the cache memory 120 may improve the performance of the computer system 100.
A cache memory or “cache” is customarily defined by its number of ways (described below), line size and total size. The number of blocks that a cache memory is divided into is equal to the total size of the cache memory divided by the line size.
When the data required by the processor 110 is not present in the cache memory 120 (which is sometimes referred to as a “cache miss”), the necessary data, in a length equal to the cache line size, is brought into the cache memory 120 from external memory such as the non-volatile memory 130 or the volatile memory 140 referred to above. The line of data is placed in one of the blocks of the cache memory 120. The “associativity” of the cache determines which blocks the data can be placed in. In a “fully associative” cache, the data can be placed in any of the blocks. In a “direct mapped” cache, the data can be placed in only one block, which is indicated by the least significant bits of the memory block number (the memory address from which the data was obtained, divided by the cache line size).
In an “n-way set associative” cache, the memory address from which the data was obtained maps to one “set” of the cache. A set contains a number of blocks that is equal to n. The number of sets in a cache is determined by dividing the number of blocks in the cache by n. (A direct mapped cache can be thought of as a one-way set associative cache. A fully associative cache with m blocks can be thought of as an m-way set associative cache. Alternatively, a direct mapped cache can be thought of as having m sets, assuming m blocks in the cache, and a fully associative cache can be thought of as having one set.)
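By way of illustration only, the relationships among total size, line size, number of ways, number of blocks and number of sets may be sketched as follows. The function name cache_geometry and the 8K-byte, 32-byte-line figures are assumptions chosen for the example and are not part of any particular cache design.

```python
def cache_geometry(total_size, line_size, ways):
    """Return (blocks, sets) for a cache of the given size, line size and ways."""
    blocks = total_size // line_size  # total size divided by line size
    sets = blocks // ways             # number of blocks divided by n
    return blocks, sets


# Hypothetical 8K-byte cache with 32-byte lines (256 blocks).
print(cache_geometry(8192, 32, 2))    # 2-way set associative
print(cache_geometry(8192, 32, 1))    # direct mapped: one way, one set per block
print(cache_geometry(8192, 32, 256))  # fully associative: one set holding every block
```

The second and third calls reflect the observation above that a direct mapped cache has as many sets as blocks, while a fully associative cache has a single set.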
It is common to provide cache memories that are 2-way or 4-way set associative.
The particular set in which data will be placed in an n-way set associative cache is determined based on the memory block number (data address divided by cache line size) modulo the number of sets in the cache. The particular block within the set that is used to store the data may be determined, for example, using a Least Recently Used (LRU) algorithm.
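The set selection and LRU replacement just described may be modeled, purely as a sketch, by the following toy class. The class name SetAssociativeCache, its method names and the 2K-byte cache used in the example are assumptions made for illustration; real cache hardware tracks recency with dedicated status bits rather than an ordered mapping.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy model of an n-way set associative cache with LRU replacement."""

    def __init__(self, total_size, line_size, ways):
        self.line_size = line_size
        self.ways = ways
        self.num_sets = total_size // line_size // ways
        # One ordered mapping per set; the first entry is the least recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def set_index(self, address):
        """Memory block number modulo the number of sets."""
        block_number = address // self.line_size
        return block_number % self.num_sets

    def access(self, address):
        """Return True on a hit, False on a miss (the line is filled on a miss)."""
        block_number = address // self.line_size
        lines = self.sets[self.set_index(address)]
        if block_number in lines:
            lines.move_to_end(block_number)  # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)        # evict the least recently used line
        lines[block_number] = True
        return False


# Hypothetical 2K-byte, 2-way cache with 32-byte lines: 32 sets.
cache = SetAssociativeCache(2048, 32, 2)
for address in (0x000, 0x400, 0x000, 0x800, 0x000, 0x400):
    print(hex(address), "hit" if cache.access(address) else "miss")
```

In this example all three addresses fall in set 0, so the third distinct address forces an eviction; the line for address 0x400 is the one evicted because address 0x000 was used more recently.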
The “span” of a cache is determined by dividing the total size of the cache by the number of ways. The span determines the range of data addresses that can be used before data addresses start to be placed within the same set. For example, assume that a cache has a span of 1K bytes and a cache line size of 32 bytes so that the cache has 32 sets. Data from a main memory data address of 0x400 is placed within set 0. Data from a main memory data address of 0x500 is placed in set 8, and data from a main memory data address of 0x800 is placed again within set 0. The number of different memory addresses that can be serviced in one set before a conflict occurs is determined by the number of ways of the cache memory.
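The arithmetic of the foregoing example may be checked with a short sketch such as the following. The function names span and set_for_address are assumptions introduced for illustration, as is the 2K-byte, 2-way cache used to arrive at the 1K-byte span.

```python
def span(total_size, ways):
    """Span of a cache: total size divided by the number of ways."""
    return total_size // ways


def set_for_address(address, span_bytes, line_size):
    """Set number for an address: memory block number modulo the number of sets."""
    num_sets = span_bytes // line_size
    return (address // line_size) % num_sets


LINE_SIZE = 32               # 32-byte cache lines
SPAN = span(2048, 2)         # hypothetical 2K-byte, 2-way cache: 1K-byte span

for address in (0x400, 0x500, 0x800):
    print(hex(address), "-> set", set_for_address(address, SPAN, LINE_SIZE))
# 0x400 -> set 0
# 0x500 -> set 8
# 0x800 -> set 0
```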
FIGS. 2A-D are schematic illustrations of mapping of main memory addresses to different types of cache memories, in accordance with conventional practices. In FIGS. 2A-D, reference numeral 210 (FIG. 2A) indicates a simplified representation of main memory (such as the non-volatile and/or volatile memory 130, 140), reference numeral 230 (FIG. 2B) indicates a direct mapped cache, reference numeral 240 (FIG. 2C) indicates a fully associative cache, and reference numeral 250 (FIG. 2D) indicates a 2-way set associative cache. It is assumed that block 220 in memory 210 is to be cached. Block 220 has a block address of 13. The direct mapped cache 230 (FIG. 2B) has eight blocks or sets (indicated by reference numeral 235). Since 13 modulo 8 equals 5, the data from block 220 (block address 13) of the main memory 210 would be placed in block number 5 of the direct mapped cache 230, as indicated by reference numeral 280.
In the case of the fully associative cache 240 (FIG. 2C), there are eight blocks (reference numeral 245) corresponding to eight ways or one set. The data from block 220 of the main memory 210 can be placed in any of the eight blocks of the fully associative cache 240, as indicated by reference numeral 283.
In the case of the 2-way set associative cache 250 (FIG. 2D), there are four sets (reference numeral 255) of two blocks each. Since 13 modulo 4 equals 1, the data from block 220 (block address 13) of the main memory 210 (FIG. 2A) may be stored in either of the two blocks (blocks 2 and 3) of set 1 (reference numeral 260), as indicated by reference numeral 286 (FIG. 2D).
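The three mappings of FIGS. 2B-D may be reproduced by a single sketch that treats each cache as an n-way set associative cache having eight blocks. The function name candidate_blocks is an assumption, as is the convention that the blocks of a set are numbered consecutively, which is consistent with blocks 2 and 3 forming set 1 in FIG. 2D.

```python
def candidate_blocks(block_address, num_blocks, ways):
    """Return the cache block numbers that may hold the given memory block."""
    num_sets = num_blocks // ways
    set_number = block_address % num_sets
    first_block = set_number * ways  # blocks of a set assumed to be consecutive
    return list(range(first_block, first_block + ways))


print(candidate_blocks(13, 8, 1))  # direct mapped (FIG. 2B): [5]
print(candidate_blocks(13, 8, 8))  # fully associative (FIG. 2C): [0, 1, 2, 3, 4, 5, 6, 7]
print(candidate_blocks(13, 8, 2))  # 2-way set associative (FIG. 2D): [2, 3]
```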
FIG. 3 schematically illustrates a problem identified by the present inventor that may be encountered in connection with caching of call stacks for duplicate control threads. In the example shown in FIG. 3, it is assumed that four duplicate threads respectively have stacks 305 (stack A), 310 (stack B), 315 (stack C) and 320 (stack D).
It is further assumed that each of the stacks 305, 310, 315, 320 is allocated one page (4K) of virtual memory space, and that stack A (reference numeral 305) starts at address 0x4000, stack B (reference numeral 310) begins at address 0x8000, stack C (reference numeral 315) begins at address 0x6000 and stack D (reference numeral 320) begins at address 0xC000.
It is further assumed that a 2-way (Way A and Way B) set associative cache 325 is employed. In this example, the cache 325 has a total size of 8K bytes, with cache lines 32 bytes in length, providing 128 sets.
As noted above, the stacks 305, 310, 315 and 320 are assumed to be used by duplicate control threads, i.e., four separate instantiations running the same thread of instructions. It is further assumed that each stack has a highly utilized area, indicated respectively at 330, 332, 334 and 336 for the stacks 305, 310, 315 and 320. Since the stacks 305, 310, 315 and 320 correspond to duplicate threads, the highly utilized areas 330, 332, 334 and 336 are at identical offsets (bytes 0x0200 to 0x07FF) within each stack. These highly utilized areas 330, 332, 334 and 336 all map to the same sets (0x10 through 0x3F) within the cache 325, as indicated by reference numeral 340. Because the cache 325 is only 2-way, only two of the stacks 305, 310, 315, 320 can have data within a set at any one time. Since four identical threads having the stacks 305, 310, 315, 320 are in competition for the sets indicated by reference numeral 340, there are frequent conflicts, leading to “thrashing”, i.e., data for one of stacks 305, 310, 315, 320 frequently being evicted from the cache 325 to make room for data of another one of the stacks 305, 310, 315, 320. The overhead resulting from frequent eviction of data from the cache 325, and frequent cache misses, may adversely affect the performance of the computer system.
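The conflict of FIG. 3 may be confirmed numerically with a sketch along the following lines, which uses the stack addresses, highly utilized offsets and cache parameters assumed above. The names hot_sets, contenders and oversubscribed are introduced only for this illustration.

```python
LINE_SIZE = 32
WAYS = 2
NUM_SETS = 8192 // LINE_SIZE // WAYS  # 8K-byte, 2-way cache: 128 sets

STACK_BASES = {"A": 0x4000, "B": 0x8000, "C": 0x6000, "D": 0xC000}
HOT_START, HOT_END = 0x0200, 0x07FF   # inclusive offsets within each stack


def hot_sets(base):
    """Return the cache set numbers touched by a stack's highly utilized area."""
    first_block = (base + HOT_START) // LINE_SIZE
    last_block = (base + HOT_END) // LINE_SIZE
    return {block % NUM_SETS for block in range(first_block, last_block + 1)}


# Count how many stacks compete for each cache set.
contenders = {}
for base in STACK_BASES.values():
    for set_number in hot_sets(base):
        contenders[set_number] = contenders.get(set_number, 0) + 1

# Sets wanted by more stacks than the cache has ways.
oversubscribed = sorted(s for s, count in contenders.items() if count > WAYS)
print(hex(oversubscribed[0]), hex(oversubscribed[-1]), len(oversubscribed))
# 0x10 0x3f 48
```

Each of the 48 sets from 0x10 through 0x3F is sought by all four stacks, whereas only two ways are available per set.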
According to a first aspect of the invention, a method of initializing a control thread in a multi-thread software program is provided. The method includes receiving an instruction to initialize a new control thread, determining whether the new control thread is a duplicate of an existing control thread, and setting a stack offset for the new control thread based on a result of the determining step. A “stack offset” will be understood to mean the offset within a memory space allocated to a call stack at which the first stack frame for the call stack is placed. A “stack frame” means a collection of data placed on a call stack for a particular instruction. A stack frame may include, for example, a return instruction pointer, a previous frame pointer and local variables.
In at least one embodiment, the determining step may include comparing a first instruction pointer for the new control thread with a first instruction pointer for the existing control thread. If the new control thread is determined to be a duplicate of the existing control thread, the stack offset for the new control thread may be set to be different from a stack offset of the existing control thread. In one or more embodiments, the stack offset of the existing control thread may be zero. In one or more embodiments the stack mapping for the new control thread may have a last virtual page that is equal in size to the stack offset for the new control thread and is mapped to begin at a zero address of a physical memory page. A “stack mapping” will be understood to mean a mapping to physical memory of one or more virtual memory pages allocated to a call stack.
In at least one embodiment, the setting of the stack offset for the new control thread may include adding a predetermined value to the stack offset of the existing control thread. For example, the stack offset of the new control thread may be set to zero if the sum of the predetermined value and the stack offset of the existing control thread equals a span length of a cache used to store call stacks for the new control thread and the existing control thread.
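One possible reading of this offset computation is sketched below. The function name next_stack_offset, the predetermined value of 0x400 and the span length of 0x1000 are assumptions chosen for the example; the sketch further assumes that the predetermined value divides the span length evenly, so that the sum eventually equals the span length exactly.

```python
def next_stack_offset(existing_offset, predetermined_value, span_length):
    """Offset for a new duplicate thread, wrapping to zero at the cache span."""
    candidate = existing_offset + predetermined_value
    if candidate == span_length:
        return 0  # sum equals the span length: start again at zero
    return candidate


# Hypothetical values: 0x400-byte increment, 0x1000-byte (4K) cache span.
offset = 0
sequence = [offset]
for _ in range(5):
    offset = next_stack_offset(offset, 0x400, 0x1000)
    sequence.append(offset)
print([hex(value) for value in sequence])
# ['0x0', '0x400', '0x800', '0xc00', '0x0', '0x400']
```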
According to a second aspect of the invention, another method of initializing a control thread in a multi-thread software program is provided. The inventive method according to the second aspect of the invention includes receiving an instruction to initialize a new control thread and traversing a list of existing control threads. The inventive method according to the second aspect of the invention further includes determining, for each existing control thread in the list of existing control threads, whether the new control thread is a duplicate of the existing control thread; and setting a stack offset for the new control thread based on a result of the determining step.
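A minimal sketch of such a list traversal follows, in which a duplicate is recognized by comparing first instruction pointers. The names ControlThread, entry_point and initialize_thread are hypothetical, as are the predetermined value and span length used in the example. The sketch also assumes that the list is kept in creation order, so that the last duplicate encountered carries the most recently assigned stack offset; other policies are equally possible.

```python
from dataclasses import dataclass


@dataclass
class ControlThread:
    entry_point: int       # first instruction pointer of the thread
    stack_offset: int = 0  # offset of the first stack frame within the stack


def initialize_thread(entry_point, existing_threads, predetermined_value, span_length):
    """Create a thread whose stack offset differs from that of its latest duplicate."""
    new_thread = ControlThread(entry_point)
    for existing in existing_threads:            # traverse the list of existing threads
        if existing.entry_point == entry_point:  # duplicate: same first instruction
            candidate = existing.stack_offset + predetermined_value
            new_thread.stack_offset = 0 if candidate == span_length else candidate
    existing_threads.append(new_thread)
    return new_thread


threads = []
for entry in (0x1000, 0x1000, 0x2000, 0x1000):
    created = initialize_thread(entry, threads, 0x400, 0x1000)
    print(hex(created.entry_point), hex(created.stack_offset))
```

In this example the three threads that begin at the hypothetical address 0x1000 receive stack offsets of 0, 0x400 and 0x800, while the unrelated thread that begins at 0x2000 keeps a stack offset of zero.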
According to a third aspect of the invention, a method includes comparing a new control thread with an existing control thread, determining whether the new control thread is a duplicate of the existing control thread on the basis of the comparing step, and, if the new control thread is determined to be a duplicate of the existing control thread, setting a stack offset for the new control thread to be different from a stack offset of the existing control thread.
Numerous other aspects are provided, as are computer systems which implement the above-described methods, and computer program products. Each inventive computer program product may be carried by a medium readable by a computer (e.g., a carrier wave signal, a floppy disk, a hard drive, a random access memory, etc.).
By setting respective stack offsets to be different for each duplicate control thread, the present invention may reduce cache conflicts, thereby enhancing the efficiency of operation of a multi-threaded program when the multiple threads include duplicate threads.
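The potential benefit may be illustrated with a sketch that reuses the stack addresses, highly utilized offsets and cache parameters of the FIG. 3 example. The function name max_contenders and the staggered stack offsets of 0 and 0x800 are assumptions chosen for illustration; the sketch simply shifts each highly utilized area by its stack offset and does not model how the stack is mapped to physical memory.

```python
from collections import Counter

LINE_SIZE = 32
NUM_SETS = 128                         # 8K-byte, 2-way cache with 32-byte lines
STACK_BASES = (0x4000, 0x8000, 0x6000, 0xC000)
HOT_START, HOT_END = 0x0200, 0x07FF    # inclusive offsets of the highly utilized area


def max_contenders(stack_offsets):
    """Return the largest number of stacks whose hot areas share one cache set."""
    counts = Counter()
    for base, offset in zip(STACK_BASES, stack_offsets):
        first_block = (base + offset + HOT_START) // LINE_SIZE
        last_block = (base + offset + HOT_END) // LINE_SIZE
        # Each stack is counted once per cache set that its hot area touches.
        counts.update({block % NUM_SETS for block in range(first_block, last_block + 1)})
    return max(counts.values())


print(max_contenders((0, 0, 0, 0)))          # identical offsets: 4 stacks per set
print(max_contenders((0, 0x800, 0, 0x800)))  # staggered offsets: 2 stacks per set
```

With the staggered offsets, no set is sought by more than two stacks, which matches the two ways available in the cache of the example.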
Other objects, features and advantages of the present invention will become more fully apparent from the following detailed description of exemplary embodiments, the appended claims and the accompanying drawings.