1. Field:
The following relates generally to utilization of parallel processing resources, and in a more particular aspect, to programming interfaces and architectures that can be used in different classes of parallel computing problems, and in particular to graphics processing, including ray tracing.
2. Related Art
Computer architectures seek to increase parallelism at a hardware level and at a software/operating system level. A variety of techniques have been proposed to address parallelism for hardware and software. One general concept is to use multi-threading as a mechanism to increase parallelism.
Regarding software-based multithreading, an operating system can concurrently execute multiple processes (typically, multiple independent programs) on a single hardware execution resource by interleaving in time the execution of the processes. Such interleaving can be accomplished by an OS kernel, which determines which process has priority and whether that process can access the resources it needs to proceed with execution.
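By way of illustration only, the time interleaving described above can be sketched with a simple round-robin loop. The generator-based `process` model and the `round_robin` function are hypothetical names introduced for this sketch; an actual kernel scheduler also accounts for priority and resource availability, which are omitted here.

```python
from collections import deque

def process(name, steps):
    """Hypothetical 'process' modeled as a generator; each yield is one time slice."""
    for i in range(steps):
        yield f"{name}:{i}"

def round_robin(processes):
    """Interleave processes in time on a single simulated execution resource."""
    ready = deque(processes)
    trace = []
    while ready:
        proc = ready.popleft()
        try:
            trace.append(next(proc))  # run the process for one time slice
            ready.append(proc)        # not finished: return to the back of the queue
        except StopIteration:
            pass                      # finished: drop it from the ready queue
    return trace

trace = round_robin([process("A", 2), process("B", 3)])
```

In this sketch, the two processes alternate slices until the shorter one finishes, after which the remaining process runs alone.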
When a process has two or more semi-independent subtasks, multithreading can increase throughput, give better response time, speed operations, improve program structure, use fewer system resources, and make more efficient use of multiprocessors. With multi-threading, a process has multiple threads of control. The concept of multithreaded computing is increasingly common, as faster processor clocks become prohibitively power consumptive, and transistor budgets are used on wider architectures. As such, the term “thread” has come into increasing usage, but it is not always used to refer to the same concept; rather, the context of usage informs the meaning of the term.
A process can be viewed as an individual program or application that can be executed. As such, a process has a sequence of instructions to be executed. A thread also has a sequence of instructions for execution, but typically is more granular. For example, a function can represent a thread's sequence of instructions. If instantiated by a process, a thread shares that process's address space, yet can own unique resources within the process. Instructions from each thread can be individually scheduled, which allows threads of a process to be executed concurrently, as resource availability dictates.
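As a minimal sketch of the relationship described above, the following uses Python's standard `threading` module, in which a function supplies each thread's sequence of instructions. The names `shared` and `worker` are assumptions introduced for illustration. The list `shared` resides in the process's address space and is visible to every thread, whereas `local` exists separately on each thread's own stack.

```python
import threading

shared = []  # lives in the process's address space, visible to all threads

def worker(tag):
    """The function body is this thread's sequence of instructions."""
    local = tag * 2       # local variable: private to this thread's stack
    shared.append(local)  # write into memory shared with the other threads

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The order of appends depends on scheduling, so sort before inspecting.
result = sorted(shared)
```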
Threads can be managed by a creating process and be invisible to a kernel of an operating system (also referred to as user-level or application-level threads). User threads are handled in user space, and controlled using a thread API provided in a library. Such control includes operations such as creation, termination, synchronization, and scheduling. Creation, termination, and synchronization operations can be performed on user-space threads without kernel involvement. Because user-space threads are not directly visible to the kernel (which is aware only of the enclosing process containing the user-space threads), if a user-space thread blocks, its entire process blocks. When this happens, the benefit of thread parallelism is lost or reduced. Additional layers of complexity can be introduced to reduce blocking, but at a performance cost.
Kernel threads are handled in kernel space and created by the thread functions in the threads library. Kernel threads are kernel schedulable entities visible to the operating system. Kernel threads exist within the context of a process and provide the operating system the means to address and execute smaller segments of the process. Kernel threads also enable programs to take advantage of capabilities provided by the hardware for concurrent and parallel processing. With kernel threads, each user thread can have a corresponding kernel thread. Each thread is independently schedulable by the kernel, so if an instruction being executed from one thread blocks, instructions from other threads may be able to run. Creation, termination, and synchronization can be slower with kernel threads than user threads, since the kernel must be involved in thread management. Overhead may be greater, but more concurrency is possible using kernel threads, even with a uniprocessor system. As a result, total application performance with kernel-space threads can surpass user-space thread performance. However, developers must be more careful when creating large numbers of threads, as each thread adds more weight to the process and burdens the kernel.
Kernel threads can change the role of processes, in that a process is more of a logical container used to group related threads of an application in such an environment. Each process contains at least one thread. This single (initial) thread is created automatically by the system when the process starts up. An application must explicitly create any additional threads. An application with only one thread is “single-threaded.” An application with more than one thread is “multi-threaded.” An example treatment of a thread with respect to other threads is provided below.
A process's “state” information includes a program counter indexing a sequence of instructions to execute, and register state. The register context and program counter contain values that indicate the current state of program execution. The sequence of instructions to execute is the actual program code. For example, when a process context switch takes place, the newly scheduled process's register information tells the processor where the process left off in its execution. More specifically, a thread's program counter would contain the address of the current instruction to be executed upon start up.
Like the context of a process, the context of a thread consists of instructions, attributes, user structure with register context, private storage, thread structure, and thread stack. Like a process, a thread has a kind of life cycle based on the execution of a set of control instructions. Through the course of time, threads, like processes, are created, run, sleep, and are terminated. New processes (and in a multi-threaded OS, at least one thread for the process) typically are created using a fork() call, which produces a process and thread ID. The process and its thread are linked to the active list. The new thread is flagged runnable, and thereafter it is placed in a run queue. The kernel schedules the thread to run, changing the thread state to an active running state. While in this state, the thread is given the resources it requests. This continues until a clock interrupt occurs, or the thread relinquishes its time to wait for a requested resource, or the thread is preempted by another (higher priority) thread. If any of these events occurs, the thread's context is switched out.
A thread is switched out if it must wait for a requested resource (otherwise, the processor would block). This causes the thread to go into a sleep state. The thread sleeps until its requested resource returns and makes it eligible to run again. During the thread's sleep state, the kernel charges the currently running thread with CPU usage. After a period of time, the kernel can cause a context switch to another thread. The next thread to run will be the thread with the highest priority of the threads that are ready to run. For the remaining threads in that ready state, their priority can be adjusted upwards. Once a thread acquires the requested resource, it calls the wakeup routine and again changes state from sleep to run, making the thread eligible to run again. On the next context switch the thread is permitted to run, provided it is the next eligible candidate. When allowed to run, the thread state changes again to active running. Once the thread completes, it exits, releases all resources and can transfer to a zombie state. Once all resources are released, the thread and the process entries are released and the linked lists of active processes/threads can be updated.
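The life cycle described in the preceding paragraphs can be summarized, by way of a simplified sketch, as a small state machine. The state names, the event names (`schedule`, `preempt`, `wait`, `wakeup`, `exit`), and the `TRANSITIONS` table are assumptions introduced for illustration and do not correspond to any particular kernel's data structures; priority adjustment and CPU accounting are omitted.

```python
from enum import Enum, auto

class State(Enum):
    RUNNABLE = auto()  # in the run queue, eligible to be scheduled
    RUNNING = auto()   # active running state
    SLEEPING = auto()  # waiting for a requested resource
    ZOMBIE = auto()    # exited; entries not yet released

# (current state, event) -> next state; hypothetical event names
TRANSITIONS = {
    (State.RUNNABLE, "schedule"): State.RUNNING,
    (State.RUNNING, "preempt"): State.RUNNABLE,   # clock interrupt or higher priority thread
    (State.RUNNING, "wait"): State.SLEEPING,      # thread must wait for a resource
    (State.SLEEPING, "wakeup"): State.RUNNABLE,   # resource acquired, eligible again
    (State.RUNNING, "exit"): State.ZOMBIE,        # thread completes
}

def step(state, event):
    """Apply one event, rejecting transitions the table does not list."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event} from {state.name}") from None

def run(events, state=State.RUNNABLE):
    """Replay a sequence of events starting from a newly created, runnable thread."""
    for event in events:
        state = step(state, event)
    return state

final = run(["schedule", "wait", "wakeup", "schedule", "exit"])
```

Note that in this sketch a sleeping thread cannot move directly to the running state; it first becomes runnable and then waits for the next scheduling decision, consistent with the description above.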
Regarding more hardware-oriented parallelism, the capability of implementing parallel computing hardware is available in almost all computing devices, ranging from powerful mobile or desktop CPUs that may have 4, 8, 16, 32 or several dozen relatively complex processing elements to Graphics Processing Units (GPUs) with many (e.g., hundreds of) relatively simple processors.
While GPUs were originally designed and still are primarily used to accelerate raster graphics, GPUs have gradually become more programmable. Such increased programmability has allowed for some multi-threaded computing problems to be expressed and executed on GPUs. Although GPU architectures have extremely high theoretical peak Floating Point Operations per Second (FLOPS), only rasterized graphics and certain well behaved, highly streamable compute problems can come close to realizing an actual throughput near to the theoretical throughput.
Increasing parallelism of computation typically involves a tradeoff between algorithmic complexity and overhead incurred in managing sharing of computation resources. Another important consideration is that algorithms execute correctly, such that data values are not corrupted. Parallel execution of threads that may use shared data can cause data corruption through improperly timed reads and writes to shared data values. Negotiating access or serializing access to such shared data values incurs overhead.
Another major concern is ensuring correctness of data during increasingly parallel execution. Principally, avoidance of data corruption among different threads is handled through locks on variables that are at risk of being written (or read) out of correct order by conflicting processes in a system. The concept essentially is that when a given process wants to write to such a variable (e.g., a global variable), the process attempts to lock the variable. If no other process has a lock on the variable, then the lock can be granted to the requesting process. The process performs its write, and then releases the lock. As can be discerned, the usage of locks depends on a mechanism to detect or identify which variables should be protected by locks, and an arbiter to check lock status, grant lock status, and revoke lock status. Often, which variables need to be protected depends also on whether a given software procedure is intended to be available for usage on multithreaded platforms, in which a number of instances of the procedure may be executing in concurrently executing processes.
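The lock, write, and release sequence described above can be illustrated with the following minimal sketch, which uses `threading.Lock` from the Python standard library. The variable names `counter` and `counter_lock` and the function `add` are assumptions introduced for illustration. Each thread acquires the lock before updating the global variable and releases it afterward, so that the read-modify-write of `counter` is serialized.

```python
import threading

counter = 0                      # global variable at risk of conflicting updates
counter_lock = threading.Lock()  # the lock that protects it

def add(n):
    """Increment the shared counter n times, locking around each write."""
    global counter
    for _ in range(n):
        with counter_lock:  # attempt to lock; blocks while another thread holds it
            counter += 1    # perform the write
        # leaving the 'with' block releases the lock

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, the final value equals the total number of increments requested. The cost is that every increment pays for acquiring and releasing the lock, which is the overhead of serializing access noted earlier.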
Programmers can specify portions of code in a process to be protected by a lock. At compile time, a compiler can process such code, and in some cases also could detect variables that may be at risk, but which were not protected, and can insert such locks into the compiled object code.
One implementation of a lock is Mutual Exclusion (mutex), which controls how different portions of computer code can access a shared resource, such as a global variable. For example, a portion of code that needs to read a given variable should not be allowed to execute during an update of that variable by another process. The term mutex also is used in the sense of a program element that operates to negotiate a mutual exclusion of different program elements for conflicting variables. In a multi-threaded execution environment, a number of threads can be executing, and ultimately may need to access a given data object protected by a mutex. If the mutex currently is active, then typically the operating system will cause the threads that are waiting for that mutex to go to sleep (i.e., cease execution). Thus, as more and more threads reach the mutex, they each are made to go to sleep to await a time when the operating system indicates that they can continue execution. The operating system can wake these threads in the relative order in which they arrived at the mutex, or the order in which they are awakened may simply be indeterminate.
Other approaches to allocation of shared resources include semaphores, which allocate a set of undifferentiated resource elements among processes that desire to use the elements. A semaphore essentially counts how many elements of the resource are available for use from the set, adjusting the count as resources are reserved and released. When no elements of a given resource are available, then a process requesting use of the resource waits until an element of the resource has been released. Typically, semaphores are implemented in operating systems for resource allocation.
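As a minimal sketch of the counting behavior described above, the following uses `threading.Semaphore` from the Python standard library to allocate a set of two undifferentiated resource elements among six threads. The names `POOL_SIZE`, `pool`, `in_use`, `peak`, and `use_resource` are assumptions introduced for illustration; the bookkeeping variables exist only to observe how many elements are reserved at once.

```python
import threading

POOL_SIZE = 2                           # number of interchangeable resource elements
pool = threading.Semaphore(POOL_SIZE)   # count of elements currently available
stats_lock = threading.Lock()           # protects the bookkeeping below
in_use = 0
peak = 0

def use_resource():
    """Reserve one element, record the concurrency level, then release it."""
    global in_use, peak
    with pool:                    # decrements the count; waits while it is zero
        with stats_lock:
            in_use += 1
            peak = max(peak, in_use)
        # ... the reserved element would be used here ...
        with stats_lock:
            in_use -= 1
    # leaving 'with pool' increments the count, releasing the element

threads = [threading.Thread(target=use_resource) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the threads happen to be scheduled, the number of elements reserved at any one time does not exceed `POOL_SIZE`.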
Variations on basic locking concepts exist, such as spinlocking, which allows threads to continue to actively ask whether a given lock is free, such that their responsiveness is increased by not having to be awakened, which incurs overhead (context switching). However, spinlocking does use processor cycles, and consequently inherently reduces efficiency. Other variations include recursive locking and implementing time limits on locks, such that a process will not wait indefinitely to acquire the lock for a particular object, but will instead continue execution, if possible, if its lock request is not fulfilled in a given time period. Such an approach is useful in embedded systems, where absolute execution timing for critical processes may be necessary.
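The spinning and time-limited variations described above can be sketched as follows, using the `blocking` and `timeout` parameters of `threading.Lock.acquire` from the Python standard library. The helper names `try_with_timeout` and `spin_acquire` are assumptions introduced for illustration. For simplicity, contention is simulated by having a single thread hold the (non-reentrant) lock while it makes further requests, rather than by running competing threads.

```python
import threading

lock = threading.Lock()

def try_with_timeout(timeout_s):
    """Time-limited request: True if the lock was obtained within timeout_s seconds."""
    if lock.acquire(timeout=timeout_s):
        lock.release()  # obtained; release immediately for this demonstration
        return True
    return False        # request not fulfilled in time; caller continues execution

def spin_acquire(max_spins):
    """Spin: poll repeatedly without sleeping.

    Returns the number of failed polls before success (lock is then held),
    or None if the lock was never free within max_spins polls.
    """
    for spins in range(max_spins):
        if lock.acquire(blocking=False):
            return spins
    return None

lock.acquire()                          # simulate another holder of the lock
timed_out = not try_with_timeout(0.05)  # lock is held, so the request times out
gave_up = spin_acquire(1000) is None    # every poll fails while the lock is held
lock.release()

acquired_after_release = try_with_timeout(0.05)
spins = spin_acquire(1000)              # lock is free, so the first poll succeeds
lock.release()
```

The spinning variant consumes processor cycles on every failed poll, whereas the time-limited variant allows the waiting thread to be put to sleep, at the cost of being awakened later.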
A further example variation is recursive locking, which is a strategy proposed for computing architectures that have Non-Uniform Memory Access (NUMA) to a shared memory. Typically, groups of general purpose processors, each with an on-chip cache, and operating in a system with a shared main memory will have NUMA properties (practically, a large majority of low-cost computation platforms). Recursive locking is based on the realization that a process resident on a processor that has a locked memory location (standard test and set locking) is more likely to be granted access to that locked memory location than a process executing on a different chip, due to differences in communication delay incurred by interconnect between the processors. Recursive locking seeks to enforce a predictable and controllable preference for granting a lock to a process that may have more data locality. It does so by preferentially granting locks to processes executing on the node that already has the lock, which is achieved by adding to the lock granting process a condition that tests an identifier of the node hosting the thread requesting the lock.
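The node-identifier test described above can be illustrated with the following single-threaded simulation. This is only a sketch under stated assumptions, not the strategy as actually proposed: the class `NodePreferringLock`, its methods, and the `(thread_id, node_id)` tuples are hypothetical, and the grant policy shown (prefer the oldest waiter on the releasing node, otherwise the oldest waiter overall) is one simple way to express such a preference.

```python
from collections import deque

class NodePreferringLock:
    """Hypothetical lock model that prefers waiters on the current holder's node."""

    def __init__(self):
        self.owner = None       # (thread_id, node_id) of the holder, or None
        self.waiters = deque()  # pending requests in arrival order

    def request(self, thread_id, node_id):
        """Grant immediately if free; otherwise queue the request."""
        if self.owner is None:
            self.owner = (thread_id, node_id)
            return True
        self.waiters.append((thread_id, node_id))
        return False

    def release(self):
        """Release and re-grant, testing each waiter's node identifier."""
        _, node = self.owner
        # Added condition: look for the oldest waiter hosted on the same node.
        same_node = next((w for w in self.waiters if w[1] == node), None)
        if same_node is not None:
            chosen = same_node
        else:
            chosen = self.waiters[0] if self.waiters else None
        if chosen is not None:
            self.waiters.remove(chosen)
        self.owner = chosen
        return chosen

numa_lock = NodePreferringLock()
granted_first = numa_lock.request("t1", 0)   # free, so granted to node 0
queued_remote = numa_lock.request("t2", 1)   # queued: remote node, arrived first
queued_local = numa_lock.request("t3", 0)    # queued: same node, arrived later
next_owner = numa_lock.release()             # same-node waiter is preferred
last_owner = numa_lock.release()             # no same-node waiter remains
```

A preference of this kind can starve remote waiters if local requests keep arriving, so a practical design would bound how many consecutive same-node grants are permitted; that bound is omitted from this sketch.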