Computer technologies continue to develop in the direction of multicore computing, driven by power consumption and thermal performance concerns and by the need to continue the trend toward higher-performance computing. To make full use of a multicore architecture, an application program is often divided into multiple threads, each of which runs separately on a single core (processor), to realize parallel computing with higher computing efficiency.
FIGS. 1A and 1B show a schematic design of an existing multicore architecture. FIG. 1A shows 16 CPU cores P1, P2, . . . P16, interconnected using a routing system (represented by thicker lines) that allows the cores to access one another. FIG. 1B shows a schematic structure of each CPU core with caches *Ln and LLC, where *Ln represents First Level Cache (L1) and/or Second Level Cache (L2), while LLC stands for Last Level Cache. *Ln and LLC are connected through the routing system, and LLC has a directory which is also connected through the routing system. As the processors read data from a memory (not shown), the data may be distributed among the caches of the multiple cores (processors).
In order to keep the data synchronized, different threads may need to be managed by a synchronization mechanism when accessing shared regions, which traditionally required the threads to access those regions serially. Transactional memory design has been introduced to increase the level of parallelism. Transactional memory handles computing by dividing the program into many transactions and processing each transaction separately. During the processing of each transaction, the state of the transaction is hidden from and unaffected by the other processors. After the transaction is processed, the results are committed to the global system. Instead of assuming “pessimistically” that different threads will clash and that locks are therefore required, transactional memory takes a more “optimistic” approach by assuming that different threads will generally not clash, and acting only when a clash is detected. When a clash is detected, the state of the program can be rolled back to the state before the clash, thus maintaining data integrity. Transactional memory is presently used in CPU architectures, including IBM's Blue Gene and Intel's Haswell.
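The optimistic execute, commit, and roll-back cycle described above can be illustrated with a small software analogy. This is only a sketch under stated assumptions, not the hardware mechanism this disclosure is directed to: the names `SharedMemory`, `Transaction`, `Conflict`, and `demo` are hypothetical, a per-address version counter stands in for whatever conflict detection a real design uses, and the two transactions are interleaved by hand in a single thread.

```python
class Conflict(Exception):
    """Raised at commit when an address this transaction read has since changed."""


class SharedMemory:
    """Hypothetical global state: each address maps to (value, version)."""

    def __init__(self, initial):
        self.cells = {addr: (value, 0) for addr, value in initial.items()}


class Transaction:
    """Buffers writes privately so they stay hidden until commit (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory
        self.read_versions = {}  # address -> version seen at first read
        self.write_buffer = {}   # address -> speculative value, invisible to others

    def read(self, addr):
        if addr in self.write_buffer:          # read our own speculative write
            return self.write_buffer[addr]
        value, version = self.memory.cells[addr]
        self.read_versions.setdefault(addr, version)
        return value

    def write(self, addr, value):
        self.write_buffer[addr] = value        # global state is not touched here

    def commit(self):
        # Validate: every address we read must still be at the version we saw.
        # (Assumption: only the read set is checked; blind writes are not.)
        for addr, seen in self.read_versions.items():
            if self.memory.cells[addr][1] != seen:
                raise Conflict(addr)           # caller discards the transaction
        # Publish the buffered writes to the global state.
        for addr, value in self.write_buffer.items():
            _, version = self.memory.cells.get(addr, (None, -1))
            self.memory.cells[addr] = (value, version + 1)


def demo():
    mem = SharedMemory({"x": 10})
    t1, t2 = Transaction(mem), Transaction(mem)
    t1.write("x", t1.read("x") + 1)            # both start from x == 10
    t2.write("x", t2.read("x") + 5)
    t1.commit()                                # succeeds, x becomes 11
    try:
        t2.commit()                            # x changed under t2: clash detected
    except Conflict:
        retry = Transaction(mem)               # roll back = drop t2, run it again
        retry.write("x", retry.read("x") + 5)
        retry.commit()
    return mem.cells["x"][0]
```

In this sketch, rolling back costs nothing beyond discarding the private write buffer, because no speculative value ever reached the global state; that is the property the hidden-state requirement provides.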
Transactional memory can be realized in two different ways, using either software or hardware. Software transactional memory suffers from low efficiency and low speed, while hardware transactional memory has significantly increased the usefulness of the technology. This disclosure is directed to hardware transactional memory.
Transactional memory assumes that, among threads running on multiple cores, accesses to shared data rarely cause write-read, read-write, or write-write conflicts, and therefore multiple threads are allowed to run in parallel. By hiding the modified states of the data during a transaction, and rolling back upon a conflict, system performance and scalability are increased without sacrificing data integrity.
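The three conflict types named above can be made concrete by comparing the sets of addresses that two concurrent transactions read and write. The following is a minimal sketch: `AccessSets`, `find_conflicts`, and `has_conflict` are hypothetical names, and the labeling convention (the first word describes the access by transaction `a`, the second the access by transaction `b`) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AccessSets:
    """Addresses one transaction has read and written so far (hypothetical)."""
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def find_conflicts(a, b):
    """Return the addresses on which transactions a and b conflict, by type.

    Assumed convention: "write-read" means a wrote an address that b read.
    Two reads of the same address never conflict.
    """
    return {
        "write-read": a.writes & b.reads,
        "read-write": a.reads & b.writes,
        "write-write": a.writes & b.writes,
    }


def has_conflict(a, b):
    """True if any of the three conflict types occurs between a and b."""
    return any(find_conflicts(a, b).values())
```

For example, if `a` reads `{"x", "y"}` and writes `{"z"}` while `b` reads `{"z"}` and writes `{"y"}`, the sketch reports a write-read conflict on `"z"` and a read-write conflict on `"y"`; two transactions that only read the same addresses report none and can proceed in parallel.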
Although transactional memory increases the parallelism of multicore systems, the collision rate rises as the level of parallelism increases. The result can be an excessive amount of rolling back, which may have a large negative impact on program performance.
Theoretically, the pre-invalidation technique may improve the execution of critical regions by significantly reducing conflicts when modifying shared data. However, the pre-invalidation technique requires that the global data state be changed. If pre-invalidation were applied directly to the existing transactional memory design, it would directly contradict that design, which requires that the state be hidden during modification. Pre-invalidation and transactional memory therefore cannot simply be combined.