The present invention relates in general to multithreaded processor systems and in particular to a memory that can be shared by concurrent threads, where multiple threads can access the shared memory in parallel.
Parallel processing computer systems, including processors that can manage multiple concurrent threads, are known in the art. For large processing tasks, parallel processing can increase throughput by enabling the computer system to work on multiple independent parts of the processing task at once. For example, in graphics processors, each vertex or pixel is typically processed independently of all other vertices or pixels. Accordingly, graphics processors are usually designed with a large number of parallel processing pipelines for vertices and for pixels, allowing many vertices and/or pixels to be processed in parallel threads, which accelerates rendering of an image. The graphics pipelines generally do not share data with each other, apart from state parameters (also referred to as constants) that are common to large groups of vertex threads or pixel threads. The constants are typically stored in on-chip registers to which the pipelines have read access; any required updating of constants is handled via a separate control path.
For other types of processing tasks, it is sometimes desirable to allow different threads to share data. For instance, multiple threads may operate on different, overlapping parts of an input data set. As another example, it may be desirable for one thread to consume data produced by another thread. Sharing of data is usually managed by allowing multiple threads to access a common set of memory locations.
Existing shared memory systems tend to have significant overhead. In one model, shared memory is located on a separate chip from the parallel processors. Because the shared memory is off-chip, access is relatively slow. Further, semaphores or the like are typically used to prevent conflicting access requests so that, in effect, only one thread at a time has access to the shared memory. In another model, each processor in a multiprocessor parallel system maintains its own cached copy of all or part of the shared memory. Keeping the caches coherent, however, can incur considerable overhead.
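By way of illustration only, the serializing effect of semaphore-guarded access can be sketched in software as follows. This is a minimal sketch under stated assumptions, not a description of any particular existing system: a `std::mutex` stands in for the semaphore, and the names `GuardedSharedMemory`, `add`, and `run_guarded_demo` are hypothetical and do not appear in the original text. The sketch records the largest number of threads ever present inside the guarded region at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical model of a shared memory guarded by a single lock.
// All names here are illustrative assumptions, not taken from the source.
struct GuardedSharedMemory {
    std::vector<int> cells;
    std::mutex lock;                 // stands in for the semaphore
    std::atomic<int> inside{0};      // threads currently in the guarded region
    std::atomic<int> max_inside{0};  // highest concurrency observed so far

    explicit GuardedSharedMemory(std::size_t n) : cells(n, 0) {}

    // Every access takes the lock, so accesses to different cells
    // are serialized even though they do not conflict.
    void add(std::size_t index, int value) {
        assert(index < cells.size());
        std::lock_guard<std::mutex> guard(lock);
        int now = ++inside;
        int seen = max_inside.load();
        while (now > seen && !max_inside.compare_exchange_weak(seen, now)) {
        }
        cells[index] += value;
        --inside;
    }
};

// Launches num_threads workers that each perform updates_per_thread updates.
// Returns the peak number of threads observed inside the guarded region and
// writes the sum of all cells to *total_out.
int run_guarded_demo(int num_threads, int updates_per_thread, int* total_out) {
    GuardedSharedMemory mem(8);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&mem, t, updates_per_thread] {
            for (int i = 0; i < updates_per_thread; ++i) {
                mem.add(static_cast<std::size_t>((t + i) % 8), 1);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    int total = 0;
    for (int c : mem.cells) {
        total += c;
    }
    *total_out = total;
    return mem.max_inside.load();
}
```

In this sketch, calling `run_guarded_demo(4, 1000, &total)` leaves all updates accounted for, and the observed peak never exceeds one thread, regardless of how many threads are launched or which cells they target.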
It would therefore be desirable to provide a shared memory subsystem with low latency and support for multiple parallel access operations.