Automatic memory management is one of the services the Common Language Runtime (CLR) provides to an application during execution. Such memory management includes, for example, garbage collection (GC), which manages the allocation and release of an application's memory. GC implementations, such as the CLR GC, are often generational, based on the observation that newly allocated objects tend to be short-lived, small, and frequently accessed. To exploit this, a generational GC (GGC) groups objects into generations by age and keeps track of references from older generations to younger ones, so that a younger generation can be garbage-collected without inspecting every object in the older generation(s). For instance, generation zero (G0) contains young, frequently used objects that are collected often, whereas generations one (G1) and two (G2) hold larger, older objects that are collected less frequently.
To facilitate GGC, an application's memory heap is divided into multiple equally sized cards, each usually larger than a word and smaller than a page. The GGC uses a “card table”, typically a bitmap, that maps each card to one or more bits (usually a byte). Whenever a store instruction writes into a card and thereby creates or modifies a pointer from an older object to a younger one, the GGC marks that card by setting its corresponding card table bits. Later, when a younger generation is collected and the older generation must be scanned for intergenerational references, only the cards in the older generation whose card table bits are marked are scanned.
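The card-table bookkeeping described above can be sketched in C++ as follows. This is a minimal illustration, not the CLR's implementation: the 512-byte card size, the one-byte-per-card layout, and the `CardTable` class and its methods are hypothetical choices made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical card size in bytes; real runtimes choose their own value.
constexpr std::size_t kCardSize = 512;

// Card table covering one contiguous heap region, one byte per card.
class CardTable {
public:
    CardTable(std::uintptr_t heap_base, std::size_t heap_size)
        : base_(heap_base),
          cards_((heap_size + kCardSize - 1) / kCardSize, 0) {}

    // Index of the card that contains the given heap address.
    std::size_t card_index(std::uintptr_t addr) const {
        return (addr - base_) / kCardSize;
    }

    // Record that the card containing addr may now hold an old-to-young pointer.
    void mark(std::uintptr_t addr) { cards_[card_index(addr)] = 1; }

    // Indices of marked cards: the only parts of the old generation that a
    // young-generation collection would need to scan.
    std::vector<std::size_t> dirty_cards() const {
        std::vector<std::size_t> out;
        for (std::size_t i = 0; i < cards_.size(); ++i)
            if (cards_[i] != 0) out.push_back(i);
        return out;
    }

    // Reset every card, e.g. after the scan has been performed.
    void clear() { std::fill(cards_.begin(), cards_.end(), 0); }

    std::size_t size() const { return cards_.size(); }

private:
    std::uintptr_t base_;
    std::vector<std::uint8_t> cards_;
};
```

For example, a table covering a 4 KiB region has eight cards; marking addresses 600 and 3000 bytes into the region dirties cards 1 and 5, so only those two cards would be scanned.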
Card marking is also a well-known technique for implementing a “write barrier”. In particular, the compiler inserts a write barrier call wherever there is an instruction that stores an object reference. The write barrier stores the object reference and also marks the card corresponding to the location of the store. Such card marking must be atomic with respect to other processors/threads so that one thread does not undo another thread's work. Although this thread synchronization maintains data integrity, it typically slows down thread execution and, therefore, overall system performance.
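A write barrier that performs the store and then marks the card atomically might look like the following sketch. It assumes collections run only while mutator threads are stopped at a synchronization point (so relaxed memory ordering suffices here), a one-byte atomic entry per card, and a contiguous address range for the young generation. `AtomicCardTable`, `YoungRange`, and `write_barrier` are hypothetical names, and production barriers are usually emitted as tuned machine code rather than a C++ function.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

constexpr std::size_t kCardSize = 512;  // hypothetical card size in bytes

struct Object;  // stand-in for a managed object; its layout is irrelevant here

// Card table whose one-byte entries are std::atomic, so concurrent markers
// can write the same entry without a data race.
class AtomicCardTable {
public:
    AtomicCardTable(std::uintptr_t heap_base, std::size_t heap_size)
        : base_(heap_base),
          count_((heap_size + kCardSize - 1) / kCardSize),
          cards_(new std::atomic<std::uint8_t>[count_]) {
        for (std::size_t i = 0; i < count_; ++i)
            cards_[i].store(0, std::memory_order_relaxed);
    }

    void mark(std::uintptr_t addr) {
        std::atomic<std::uint8_t>& card = cards_[(addr - base_) / kCardSize];
        // Test before set: skipping the store when the card is already dirty
        // avoids needlessly writing to (and invalidating) its cache line.
        if (card.load(std::memory_order_relaxed) == 0)
            card.store(1, std::memory_order_relaxed);
    }

    bool is_marked(std::uintptr_t addr) const {
        return cards_[(addr - base_) / kCardSize].load(std::memory_order_relaxed) != 0;
    }

private:
    std::uintptr_t base_;
    std::size_t count_;
    std::unique_ptr<std::atomic<std::uint8_t>[]> cards_;
};

// Hypothetical address range [lo, hi) of the young generation (G0).
struct YoungRange {
    std::uintptr_t lo, hi;
    bool contains(const Object* p) const {
        std::uintptr_t a = reinterpret_cast<std::uintptr_t>(p);
        return a >= lo && a < hi;
    }
};

// Sketch of the barrier a compiler could emit for `*slot = value`: perform the
// store, then mark the slot's card only if value points into the young
// generation (a possible old-to-young reference).
void write_barrier(AtomicCardTable& table, const YoungRange& young,
                   Object** slot, Object* value) {
    *slot = value;
    if (young.contains(value))
        table.mark(reinterpret_cast<std::uintptr_t>(slot));
}
```

Because every marker writes the same value, a plain atomic byte store is enough and no read-modify-write is needed; a table packed as individual bits would instead need an atomic `fetch_or` so that concurrent markers do not overwrite neighboring bits. Even so, threads marking different cards whose entries share a cache line still contend for that line.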
To mitigate this, certain programming techniques may be used to reduce the probability that more than one thread will compete for access to any particular object at any one time. Such techniques generally involve storing each object in its own cache line (i.e., no two objects share a cache line). This effectively reduces contention among threads for the same cache line during object store operations. Unfortunately, it does not alleviate the contention that arises during card marking operations, when multiple threads compete for the same cache line in the card table (in which each card of the system's main memory is represented by one or more bits). Worse, such conventional techniques cannot realistically be transferred to the card table, because giving each of the card table's atomic values (the one or more bits mapped to a card) its own cache line would require a prohibitive amount of memory.
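A short calculation shows why padding card table entries is impractical. The figures below are assumptions chosen for illustration (64-byte cache lines, 512-byte cards, one byte per card entry), not values taken from any particular runtime, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative parameters only.
constexpr std::size_t kCacheLine = 64;  // bytes per cache line (common on x86-64)
constexpr std::size_t kCardSize = 512;  // bytes of heap per card (assumed)
constexpr std::size_t kEntrySize = 1;   // bytes of card table per card (assumed)

// Heap bytes whose card entries share one card-table cache line; stores
// anywhere in this span contend for the same line during card marking.
constexpr std::size_t heap_bytes_per_table_line() {
    return (kCacheLine / kEntrySize) * kCardSize;
}

// Card table size for a heap, with entries packed or each padded to a full line.
constexpr std::size_t card_table_bytes(std::size_t heap_bytes, bool padded) {
    return ((heap_bytes + kCardSize - 1) / kCardSize) * (padded ? kCacheLine : kEntrySize);
}

constexpr std::size_t kGiB = std::size_t{1} << 30;

static_assert(heap_bytes_per_table_line() == 32 * 1024, "64 cards x 512 B");
static_assert(card_table_bytes(kGiB, false) == 2 * 1024 * 1024, "2 MiB packed");
static_assert(card_table_bytes(kGiB, true) == 128 * 1024 * 1024, "128 MiB padded");
```

Under these assumptions, one card-table cache line covers 32 KiB of heap, so stores anywhere in that span contend for the same line, while padding every entry to a full line grows the table for a 1 GiB heap from 2 MiB to 128 MiB, a 64-fold increase.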
Accordingly, systems and methods that improve system performance during card marking/write barrier operations are greatly desired.