The present technique relates to an apparatus and method for processing thread groups.
In highly multithreaded architectures, such as those often adopted by graphics processing units (GPUs), it is known to arrange threads into thread groups. Whilst each thread group may contain one or more threads, in systems such as GPUs each thread group often comprises a plurality of threads arranged to execute associated program code; such thread groups are often referred to as warps. An apparatus arranged in this way can often achieve high computational throughput, since many threads can issue instructions each cycle, and a stall in one thread can be hidden by switching to processing another thread. However, to achieve such high computational throughput, the apparatus must store the context of every active thread in a way that makes it available when required.
Registers make up a very significant proportion of each thread's state. As a result, such an apparatus has typically required a very large register file to ensure that the registers needed by every active thread can be accessed when required. However, a large register file increases both circuit area and energy consumption. Accordingly, it would be desirable to reduce the area and energy consumption involved in providing the required registers, whilst avoiding any adverse impact on performance.