As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is in the area of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of an earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
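The cost of such bubbles may be illustrated with a simplified cycle count. The following C sketch is not taken from any particular execution unit; the function names pipeline_cycles and bubble_cycles, the single-issue assumption, and the absence of any result forwarding are all assumptions made for illustration only.

```c
/*
 * Hypothetical model (an assumption, not a description of real hardware):
 * cycles needed to complete n instructions on a single-issue pipeline
 * having `depth` stages, with no result forwarding.
 */
unsigned long pipeline_cycles(unsigned long n, unsigned long depth, int dependent)
{
    if (n == 0 || depth == 0) {
        return 0;
    }
    if (dependent) {
        /* Each instruction waits for its predecessor to exit the pipeline. */
        return n * depth;
    }
    /* Independent instructions issue back to back, one per cycle. */
    return depth + (n - 1);
}

/* Cycles lost to bubbles when every instruction depends on the previous one. */
unsigned long bubble_cycles(unsigned long n, unsigned long depth)
{
    return pipeline_cycles(n, depth, 1) - pipeline_cycles(n, depth, 0);
}
```

Under this model, and assuming a four stage pipeline purely as an example, nine independent instructions complete in 12 cycles, while nine instructions that each depend on their predecessor take 36 cycles, 24 of which are bubbles.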
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multithreading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives the utilization and hence the aggregate throughput up. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A SIMD or vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a SIMD or vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.
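The four lane arrangement described above may be modeled in scalar C as follows. This is only a sketch: the type vec4f, the function vec4f_madd, and the choice of a multiply-add as the per-lane operation are hypothetical, and the loop stands in for lanes that real hardware would operate in parallel.

```c
#define LANES 4  /* assumption: one vector holds four words */

/* Hypothetical vector type: one float per processing lane. */
typedef struct {
    float lane[LANES];
} vec4f;

/*
 * Applies the identical operation (here a multiply-add, chosen only as an
 * example) to every lane. Hardware would perform all four at the same time;
 * this loop merely models the per-lane behavior.
 */
vec4f vec4f_madd(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    for (int i = 0; i < LANES; ++i) {
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    }
    return r;
}
```

A single call thus stands in for one SIMD instruction that produces four results, one per lane, from the same operation.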
The aforementioned techniques may also be combined, resulting in a multi-threaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a SIMD execution unit to process “vectors” of data points at the same time. Typically, a scheduling algorithm is utilized in connection with issue logic to ensure that each thread is able to proceed at a reasonable rate, with the number of bubbles in the execution unit pipeline kept at a minimum.
It has been found, however, that while this configuration is highly desirable for a significant amount of code, there are certain algorithms that are inefficient to execute in a conventional execution unit. For example, oftentimes in computer graphics applications such as 3D video games, there is a need for computer-generated textures (also referred to as “procedural textures”) rather than bitmapped textures. Procedural textures have a number of advantages over bitmapped textures. They use much less memory and memory bandwidth, they generally do not require computationally intensive texture filtering, and in some circumstances they can look significantly more realistic. However, to get a satisfactory amount of realism from a procedural texture, there must be a good amount of “randomness” to emulate many visuals found in the natural world. The need for such randomness has traditionally been a major drawback to using procedural textures, because producing good random numbers in conventional hardware is typically a very computationally intensive operation.
Oftentimes, pseudorandom number generator functions are used to deterministically generate numbers suitable for use as random numbers. A pseudorandom function typically outputs the same value for any given input value, but the function itself generates numbers that appear to be random in nature. As an example, Table I below shows example pseudocode for a pseudorandom number generator function call for use in a Perlin noise procedural texture application:
TABLE I
Perlin Noise Function

float noise1(int32 in)
{
    int32 x = (in << 12) ^ in;
    int32 tmp = (x * (CONST_INT_PRIME1 * x * x + CONST_INT_PRIME2) + CONST_INT_LARGE1);
    tmp &= 0x7FFFFFFF;
    float returnf = (1.0 - tmp) / CONST_FLOAT_LARGE;
    return returnf;
}
With conventional execution units, it would take at least 9 full passes through an arithmetic pipeline to perform this function each time it is called (which would usually be multiple times per pixel), with many of those passes dependent on a previous pass, making this a very slow function. Table II below, for example, shows example pseudo assembly language for implementing the aforementioned Perlin noise function:
TABLE II
Perlin Noise Function Assembly Code

noise1:                            # arguments in register rin
    slwi      rin, rin, 12         # shift contents of GPR rin left by 12 bits
    mv        tmp, rin             # copy contents of rin to tmp
    mv        rslt, rin            # copy contents of rin to rslt
    subi      tmp, tmp, 1          # tmp = tmp - 1
    ble       done                 # skip loop if count is less than or equal to 0
loop:                              # this loop implements the exponentiation rin^rin
    mul       rslt, rslt, rin      # multiply temporary result with rin
    subi.     tmp, tmp, 1          # subtract count by 1
    bgt       loop                 # if count is greater than 0, branch to loop
done:
    mul       tmp1, rslt, rslt     # square rslt
    mul       tmp1, tmp1, prime1   # tmp1 = rslt^2 * prime1
    add       tmp1, tmp1, prime2   # tmp1 = rslt^2 * prime1 + prime2
    mul       tmp1, tmp1, rslt     # tmp1 = rslt*(rslt^2 * prime1 + prime2)
    add       tmp1, tmp1, const1   # tmp1 = rslt*(rslt^2 * prime1 + prime2) + const
    fconvfint fin, tmp1            # convert GPR value to float in fin
    fabs      fin, fin             # fin = absolute value of fin
    fsub      fin, fone, fin       # fin = 1.0 - fin
    fmul      frslt, fin, fconst2  # frslt = fin * small_normalizing_constant
    blr                            # branch to link register
                                   # (return from subroutine, result in frslt)
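For reference, the pseudocode of Table I may be rendered as compilable C along the following lines. This is a sketch under several assumptions: the numeric values given to CONST_INT_PRIME1, CONST_INT_PRIME2, CONST_INT_LARGE1 and CONST_FLOAT_LARGE are illustrative placeholders, since no values are specified herein; the circumflex operator is read as the bitwise exclusive-OR of the C language, whereas the pseudo assembly language of Table II treats it as exponentiation; and unsigned arithmetic is used so that overflow wraps around rather than being undefined.

```c
#include <stdint.h>

/* Assumed placeholder values; the constants are named but not given. */
#define CONST_INT_PRIME1  15731u
#define CONST_INT_PRIME2  789221u
#define CONST_INT_LARGE1  1376312589u
#define CONST_FLOAT_LARGE 1073741824.0

/*
 * Deterministic pseudorandom function: the same input always yields the
 * same output. Note the chain of dependent operations, each of which
 * consumes the result of the one before it.
 */
float noise1(int32_t in)
{
    uint32_t u = (uint32_t)in;

    /* Assumption: '^' is bitwise XOR, as in C. */
    uint32_t x = (u << 12) ^ u;

    /* Unsigned multiplication wraps modulo 2^32 by definition. */
    uint32_t tmp = x * (CONST_INT_PRIME1 * x * x + CONST_INT_PRIME2)
                 + CONST_INT_LARGE1;

    /* Clear the sign bit, as in Table I. */
    tmp &= 0x7FFFFFFFu;

    /* Normalization exactly as written in Table I. */
    return (float)((1.0 - (double)tmp) / CONST_FLOAT_LARGE);
}
```

Each statement in the sketch consumes the result of the statement before it, which is the dependency chain that forces repeated passes through a conventional arithmetic pipeline.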
In addition, oftentimes the pseudorandom number result is passed into another function, such as a multiply or trigonometric function, which takes additional cycles, since the dependent instruction must be stalled until the pseudorandom function completes. For instance, in some Perlin noise generation algorithms, the pseudorandom number from the first noise function call is passed into a smoothing function, and then interpolated. A smoothing function, however, may involve scaling the noise function result, or a sum of several noise function results, by a power of two.
As an example, Table III below shows example pseudocode for a one dimensional smoothing function call for use in a Perlin noise procedural texture application, while Table IV shows example pseudocode for a two dimensional smoothing function call for use in a Perlin noise procedural texture application:
TABLE III
One Dimensional Smoothing Function

float smthNoise1(int32 x)
{
    return (0.25 * noise1(x-1)) + (0.5 * noise1(x)) + (0.25 * noise1(x+1));
}
TABLE IV
Two Dimensional Smoothing Function

function SmoothNoise_2D(x, y)
    corners = ( Noise(x-1, y-1) + Noise(x+1, y-1) + Noise(x-1, y+1) + Noise(x+1, y+1) ) / 16
    sides = ( Noise(x-1, y) + Noise(x+1, y) + Noise(x, y-1) + Noise(x, y+1) ) / 8
    center = Noise(x, y) / 4
    return corners + sides + center
end function
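The smoothing functions of Tables III and IV may be sketched in C as shown below. The sketch takes the noise function as a parameter so that it remains self-contained; the names smooth_noise_1d, smooth_noise_2d, noise1_fn, noise2_fn, ramp1 and ramp2 are hypothetical, and the two ramp functions are merely stand-ins for a real noise function so that the weighting can be checked by hand.

```c
#include <stdint.h>

/* Hypothetical function pointer types for one and two dimensional noise. */
typedef float (*noise1_fn)(int32_t);
typedef float (*noise2_fn)(int32_t, int32_t);

/* Table III: weights 0.25, 0.5, 0.25 (all powers of two, summing to 1). */
float smooth_noise_1d(noise1_fn noise, int32_t x)
{
    return 0.25f * noise(x - 1) + 0.5f * noise(x) + 0.25f * noise(x + 1);
}

/* Table IV: four corners / 16, four sides / 8, center / 4. */
float smooth_noise_2d(noise2_fn noise, int32_t x, int32_t y)
{
    float corners = (noise(x - 1, y - 1) + noise(x + 1, y - 1) +
                     noise(x - 1, y + 1) + noise(x + 1, y + 1)) / 16.0f;
    float sides = (noise(x - 1, y) + noise(x + 1, y) +
                   noise(x, y - 1) + noise(x, y + 1)) / 8.0f;
    float center = noise(x, y) / 4.0f;
    return corners + sides + center;
}

/* Stand-in "noise" functions (assumptions for illustration only). */
float ramp1(int32_t x) { return (float)x; }
float ramp2(int32_t x, int32_t y) { return (float)(x + 2 * y); }
```

Because the weights in each function sum to one, smoothing a linear ramp returns the ramp value unchanged, which provides a simple check of the weighting. Each call nevertheless requires three (one dimensional) or nine (two dimensional) noise evaluations before the final sum can be formed.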
Table V below shows example pseudo assembly language for implementing the aforementioned one dimensional smoothing function. For simplicity, however, assembly language for the aforementioned two dimensional smoothing function is not provided herein:
TABLE V
One Dimensional Smoothing Function Assembly Code

smoothnoise1:                                   # input to function is in the GPR rin
    mv     rin_tmp, rin                         # copy rin to rin_tmp
    bl     noise1                               # call the noise1 function (result in frslt)
    fmul   frslt_tmp, frslt, fhalf              # frslt_tmp = frslt * 0.5
    sub    rin, rin_tmp, one                    # tmp = original_rin - 1
    bl     noise1                               # call the noise1 function (result in frslt)
    fmadd  frslt_tmp, frslt, ffourth, frslt_tmp # frslt_tmp += (frslt * 0.25)
    add    rin, rin_tmp, one                    # tmp = original_rin + 1
    bl     noise1                               # call the noise1 function (result in frslt)
    fmadd  frslt_tmp, frslt, ffourth, frslt_tmp # frslt_tmp += (frslt * 0.25)
                                                # final result in frslt_tmp
Even in a highly optimistic scenario with a seed value of 3, the aforementioned one dimensional smoothing function would need approximately 50 dependent pipeline passes to complete.
Pseudorandom number generation has other uses beyond generating noise for use in procedural texture generation. For example, pseudorandom number generation is used in cryptography, as well as in simulations. However, in many conventional execution units, the generation of pseudorandom numbers adversely impacts performance due to the highly dependent nature of known algorithms.
Therefore, a need exists in the art for a manner of improving the performance of an execution unit in generating pseudorandom numbers for use in arithmetic operations.