This invention relates to matrix-vector calculations on general-purpose computers, and particularly to systems and methods for a combined matrix-vector and matrix-transpose vector multiply for a block sparse matrix on a single instruction, multiple data (SIMD) processor.
A physics engine for an online game provides simulation of a simplified model of real-world physics. Typically, integration is required to determine, given position and momentum information about the objects in the world, where each object will be after the next time step. This determination can be done using semi-implicit integration of a penalty force-based system (a dynamic system that enforces constraints by applying a restorative force when a constraint is violated). The properties of the system are represented as a block-sparse matrix A and a vector b. The biconjugate gradient algorithm is used to solve A*Δv=b, and then the vector Δv is used to update the system for the next time step. For the online game application, speed of computation is more important than great accuracy; it is sufficient to perform these calculations in single precision floating point without taking special measures to ensure numerical accuracy.
The physics engine for an online game may operate as follows. In a typical online game physics engine, each object in the virtual world is represented as one or more rigid bodies; multiple bodies may be necessary to represent such things as robots with moving parts. The system stores information about the position and motion of each rigid body at a given time t0, along with an indication of whether each body is potentially moving (active) or at rest (sleeping). This indication reduces the amount of calculation required because unless something happens to a sleeping body, it remains where it is and there is no need to calculate a new position for it. In order to derive the positions and motions of the active bodies at some future time t0+Δt, the following steps are performed: 1) determine which of the rigid bodies are colliding; 2) wake any sleeping bodies with which an active body has collided; 3) partition the active bodies into non-interacting groups; 4) put to sleep the active bodies in any group which has come to rest; and 5) integrate each of the groups of active bodies to determine their position/motion for the next step.
For game systems developed to run on the Cell Broadband Engine Architecture (Cell BE), the first four steps run on the power-processing element (PPE), the main processor, which acts as a controller for the several synergistic processing elements (SPEs). The integration step (which is numerically intensive) is farmed out to the SPEs. Ideally, each partition of the set of active bodies is handled by a single SPE. In practice, the size of the group that fits on an SPE is constrained by the amount of local store available on the SPE. Therefore, an additional step of decoupling, which breaks up the groups into smaller groups, is used to obtain groups that fit on an SPE. This step introduces some errors into the simulation, but they can be managed. However, the necessity of decoupling places a premium on a space-efficient implementation of the integration algorithm; the less storage is required to integrate a group of a given size, the larger the group that can be integrated without decoupling.
To integrate a group on the SPE, the PPE supplies information on the positions and motions of the rigid bodies in the group, the forces which connect them (these can be used to define hinges and joints), and the points where one rigid body is in contact with another. The SPE then performs the following steps: 1) form the matrix A and the vector b which describe the system; 2) solve A*Δv=b using the biconjugate gradient method; and 3) return Δv to the PPE, which uses it to update the positions and motions of the rigid bodies.
In the maximal coordinates representation, each rigid body contributes six coordinates to b, and A is a square matrix containing six elements per dimension for each rigid body. A is typically block-sparse; that is, it can be divided into a grid of 6×6-element submatrices, most of which are all zero but a few of which have nonzero elements. In particular, the diagonal blocks are nonzero, and there is a pair of off-diagonal nonzero blocks for each pair of bodies that interact directly. It is well known that by storing only the nonzero blocks, along with information giving the position of each block within the matrix, the storage required for the array A can be greatly reduced, and the computation required to perform calculations which can only yield zero results can be avoided.
The biconjugate gradient algorithm used to solve A*Δv=b is well-known, and can be expressed as follows:
Pseudocode:Compute r(0) = b − Ax(0) for some initial guess x(0).Choose {tilde over (r)}(0) (for example, {tilde over (r)}(0) = r(0)).for i = 1, 2, ...solve Mz(i−1) = r(i−1)solve MT{tilde over (z)}(i−1) = {tilde over (r)}(i−1)ρi−1 = z(i−1)τ{tilde over (r)}(i−1)if ρi−1 = 0, method failsif i = 1p(i) = z(i−1){tilde over (p)}(i) = {tilde over (z)}(i−1)elseβi−1 = ρi−1/ρi−2p(i) = z(i−1) + βi−1p(i−1){tilde over (p)}(i) = {tilde over (z)}(i−1) + βi−1{tilde over (p)}(i−1)endifq(i) = Ap(i){tilde over (q)}(i) = AT{tilde over (p)}(i)αi = ρi−1/{tilde over (p)}(i)τ q(i)x(i) = x(i−1) + αip(i)r(i) = r(i−1) − αiq(i){tilde over (r)}(i) = {tilde over (r)}(i−1) − αi{tilde over (q)}(i)check convergence; continue if necessaryend
The pseudocode above is from a “Mathworld” (mathworld.com) article by Noel Black and Shirley Moore, adapted from Barret et al (1994). The computation at the top of the loop is sometimes simplified (at a cost of possibly having to do more iterations) by replacing the preconditioner M with the identity matrix. In this case the first two lines in the loop become z(i-1)=r(i-1) and {tilde over (z)}(i-1)={tilde over (r)}(i-1). Inspection of the subsequent calculations shows that it is not necessary to form z and {tilde over (z)} at all; r and {tilde over (r)} can be substituted. The most computationally intensive part of this procedure that remains is the formation of the two matrix-vector products q(i)=Ap(i) and {tilde over (q)}(i)=AT{tilde over (p)}(i). The straightforward way to do this computation is to have a procedure which multiplies a matrix and a vector together, to use it to obtain the first product, to form the transpose of A, and to use the routine again to form the second product. The disadvantages of this are the computation and storage required to form the transpose of A and the fact that the elements of A must be read out of storage twice per iteration for the multiplications, once from the original matrix and once from the transpose. In addition, for a processor such as the Cell BE SPE, the number of instructions used to load and store the data required for a matrix-vector multiply is much larger than the number of arithmetic instructions needed to perform the calculations, so that the full capability of the processor cannot be utilized.
There persists a need for modifications to the above-discussed pseudo-code for a combined matrix-vector and matrix-transpose vector multiply for a block sparse matrix on a single instruction, multiple data (SIMD) processor having the characteristics of the Cell BE SPE, namely, the ability to execute many instructions at once but with latencies of several cycles before results are available, and the desirability of having the number of arithmetic and logical computations at least as large as the number of loads and stores.