1. Field of the Invention
Embodiments of the present invention relate generally to circuits and methods for performing massively parallel computations. More particularly, embodiments of the invention relate to an integrated circuit architecture and related methods adapted to generate real-time physics simulations.
2. Description of Related Art
Recent developments in computer games have created an expanding appetite for sophisticated, real-time physics simulations. Relatively simple physics-based simulations have existed in several conventional contexts for many years. However, cutting edge computer games are currently a primary commercial motivator for the development of complex, real-time, physics-based simulations.
Any visual display of objects and/or environments interacting in accordance with a defined set of physical constraints (whether such constraints are realistic or fanciful) may generally be considered a “physics-based” simulation. Animated environments and objects are typically assigned physical characteristics (e.g., mass, size, location, friction, movement attributes, etc.) and thereafter allowed to visually interact in accordance with the defined set of physical constraints. All animated objects are visually displayed by a host system using a periodically updated body of data derived from the assigned physical characteristics and the defined set of physical constraints. This body of data is generically referred to hereafter as “physics data.”
Historically, computer games have incorporated some limited physics-based simulation capabilities within game applications. Such simulations are software based and implemented using specialized physics middleware running on a host system's Central Processing Unit (CPU), such as a Pentium®. “Host systems” include, for example, Personal Computers (PCs) and console gaming systems.
Unfortunately, the general purpose design of conventional CPUs dramatically limits the scale and performance of conventional physics simulations. Given a multiplicity of other processing demands, conventional CPUs lack the processing time required to execute the complex algorithms required to resolve the mathematical and logic operations underlying a physics simulation. That is, a physics-based simulation is generated by resolving a set of complex mathematical and logical problems arising from the physics data. Given typical volumes of physics data and the complexity and number of mathematical and logic operations involved in a “physics problem,” efficient resolution is not a trivial matter.
The general lack of available CPU processing time is exacerbated by hardware limitations inherent in the general purpose circuits forming conventional CPUs. Such hardware limitations include an inadequate number of mathematical/logic execution units and data registers, a lack of parallel execution capabilities for mathematical/logic operations, and relatively limited bandwidth to external memory. Simply put, the architecture and operating capabilities of conventional CPUs are not well correlated with the computational and data transfer requirements of complex physics-based simulations. This is true despite the speed and super-scalar nature of many conventional CPUs. The multiple logic circuits and look-ahead capabilities of conventional CPUs cannot overcome the disadvantages of an architecture characterized by a relatively limited number of execution units and data registers, a lack of parallelism, and inadequate memory bandwidth.
In contrast to conventional CPUs, so-called super-computers like those manufactured by Cray® are characterized by massive parallelism. Further, while programs are generally executed on conventional CPUs using Single Instruction Single Data (SISD) operations, super-computers typically include a number of vector processors executing Single Instruction-Multiple Data (SIMD) operations. However, the advantages of massively parallel execution capabilities come at enormous size and cost penalties within the context of super-computing. Practical commercial considerations largely preclude the approach taken to the physical implementation of conventional super-computers.
Thus, the problem of incorporating sophisticated, real-time, physics-based simulations within applications running on “consumer-available” host systems remains unsolved. Software-based solutions to the resolution of all but the simplest physics problems have proved inadequate. As a result, a hardware-based solution to the generation and incorporation of real-time, physics-based simulations has been proposed in several related and commonly assigned U.S. patent application Ser. Nos. 10/715,459; 10/715,370; and 10/715,440 all filed Nov. 19, 2003. The subject matter of these applications is hereby incorporated by reference.
As described in the above referenced applications, the frame rate of the host system display necessarily restricts the size and complexity of the physics problems underlying the physics-based simulation in relation to the speed with which the physics problems can be resolved. Thus, given a frame rate sufficient to visually portray a simulation in real-time, the design emphasis becomes one of increasing data processing speed. Data processing speed is determined by a combination of data transfer capabilities and the speed with which the mathematical/logic operations are executed. The speed with which the mathematical/logic operations are performed may be increased by sequentially executing the operations at a faster rate, and/or by dividing the operations into subsets and thereafter executing selected subsets in parallel. Accordingly, data bandwidth considerations and execution speed requirements largely define the architecture of a system adapted to generate physics-based simulations in real-time. The nature of the physics data being processed also contributes to the definition of an efficient system architecture.
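The relationship between frame rate, problem size, and execution speed can be illustrated with a minimal sketch. The function names, the operation counts, and the idealized assumption that operations divide evenly across parallel units with no data transfer cost are all hypothetical and are not taken from the referenced applications.

```python
import math


def frame_budget_seconds(frame_rate_hz):
    """Time available to resolve the physics problem for one frame."""
    return 1.0 / frame_rate_hz


def time_to_resolve(num_ops, ops_per_second_per_unit, parallel_units=1):
    """Idealized resolution time (assumption: operations split evenly
    across units and data transfer time is ignored)."""
    ops_per_unit = math.ceil(num_ops / parallel_units)
    return ops_per_unit / ops_per_second_per_unit


def fits_in_frame(num_ops, ops_per_second_per_unit, parallel_units, frame_rate_hz):
    """True if the physics problem can be resolved within one frame period."""
    elapsed = time_to_resolve(num_ops, ops_per_second_per_unit, parallel_units)
    return elapsed <= frame_budget_seconds(frame_rate_hz)
```

Under these assumptions, a problem that misses the frame period on a single execution unit may meet it either by raising the per-unit execution rate or by increasing the number of units executing subsets in parallel.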
Several exemplary architectural approaches to providing the high data bandwidth and high execution speed required by sophisticated, real-time physics simulations are disclosed in a related and commonly assigned U.S. patent application Ser. No. 10/839,155 filed May 6, 2004, the subject matter of which is hereby incorporated by reference. One of these approaches is illustrated by way of example in Figure (FIG.) 1 of the drawings. In particular, FIG. 1 shows a physics processing unit (PPU) 100 adapted to perform a large number of parallel computations for a physics-based simulation.
PPU 100 typically executes physics-based computations as part of a secondary application coupled to a main application running in parallel on a host system. For example, the main application may comprise an interactive game program that defines a “world state” (e.g., positions, constraints, etc.) for a collection of visual objects. The main application coordinates user input/output (I/O) for the game program and performs ongoing updates of the world state. The main application also sends data to the secondary application based on the user inputs and the secondary application performs physics-based computations to modify the world state. As the secondary application modifies the world state, it periodically and asynchronously sends the modified world state to the main application.
The various interactions between the secondary and main applications are typically implemented by reading and writing data to and from a main memory located in or near the host system, and various memories in the PPU architecture. Thus, proper memory management is an important aspect of this approach to generating physics-based simulations.
By partitioning the workload between the main and secondary applications so that the secondary application runs in parallel and asynchronously with the main application, the implementation and programming of the PPU, as well as of both applications, are substantially simplified. For example, the partitioning allows the main application to check for updates to the world state when convenient, rather than forcing it to conform to the timing of the secondary application.
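The asynchronous exchange of world state described above may be sketched as follows. This is a software illustration only; the class `WorldStateMailbox`, its `publish` and `poll` methods, and the simple falling-body update are hypothetical names and behavior chosen for the example, not the interface of the PPU or of either application.

```python
import threading


class WorldStateMailbox:
    """Holds the most recent world state published by the secondary application."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = None
        self._version = 0

    def publish(self, state):
        # Called by the secondary application whenever it has modified the world state.
        with self._lock:
            self._latest = dict(state)
            self._version += 1

    def poll(self, last_seen_version):
        # Called by the main application when convenient; never blocks on the physics.
        with self._lock:
            if self._version == last_seen_version:
                return last_seen_version, None
            return self._version, dict(self._latest)


def secondary_application(mailbox, steps, dt):
    """Hypothetical physics loop: a single body falling under gravity."""
    state = {"y": 10.0, "vy": 0.0}
    for _ in range(steps):
        state["vy"] -= 9.81 * dt
        state["y"] += state["vy"] * dt
        mailbox.publish(state)
```

In this sketch the main application would run `secondary_application` on its own thread and call `poll` once per frame with the last version it saw; a return value of `None` for the state indicates that no update is available yet, so the main application simply continues with its current world state.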
From a system level perspective, PPU 100 can be implemented in a variety of different ways. For example, it could be implemented as a co-processor chip connected to a host system such as a conventional CPU. Similarly, it could be implemented as part of one processor core in a dual core processor. Indeed, those skilled in the art will recognize a wide variety of ways to implement the functionality of PPU 100 in hardware. Moreover, those skilled in the art will also recognize that hardware/software distinctions can be relatively arbitrary, as hardware capability can often be implemented in software, and vice versa.
The PPU illustrated in FIG. 1 comprises a high-bandwidth external memory 102, a Data Movement Engine (DME) 101, a PPU Control Engine (PCE) 103, and a plurality of Vector Processing Engines (VPEs) 105. Each of VPEs 105 comprises a plurality of Vector Processing Units (VPUs) 107, each having a primary (L1) memory, and a VPU Control Unit (VCU) 106 having a secondary (L2) memory. DME 101 provides a data transfer path between external memory 102 (and/or a host system 108) and VPEs 105. PCE 103 is adapted to centralize overall control of the PPU and/or a data communications process between PPU 100 and host system 108. PCE 103 typically comprises a programmable PPU control unit (PCU) 104 for storing and executing PCE control and communications programming. For example, PCU 104 may comprise a MIPS64 5Kf processor core from MIPS Technologies, Inc.
Each of VPUs 107 can be generically considered a “data processing unit,” which is a lower level grouping of mathematical/logic execution units such as floating point processors and/or scalar processors. The primary memory L1 of each VPU 107 is generally used to store instructions and data for executing various mathematical/logic operations. The instructions and data are typically transferred to each VPU 107 under the control of a corresponding one of VCUs 106. Each VCU 106 implements one or more functional aspects of the overall memory control function of the PPU. For example, each VCU 106 may issue commands to DME 101 to fetch data from external memory 102 for various VPUs 107.
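The hierarchy of FIG. 1 (a PPU containing VPEs, each VPE containing one VCU with an L2 memory and several VPUs with their own L1 memories) can be modeled as a simple data structure. The class names mirror the reference labels, but the fields, the `build_ppu` helper, and all memory sizes are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VPU:
    l1_memory: bytearray  # primary (L1) memory holding instructions and data


@dataclass
class VCU:
    l2_memory: bytearray  # secondary (L2) memory


@dataclass
class VPE:
    vcu: VCU
    vpus: List[VPU]


@dataclass
class PPU:
    external_memory: bytearray  # high-bandwidth external memory
    vpes: List[VPE]


def build_ppu(num_vpes, vpus_per_vpe, l1_bytes, l2_bytes, external_bytes):
    """Build a PPU model; every VPU and VCU receives its own memory buffer."""
    vpes = [
        VPE(
            vcu=VCU(bytearray(l2_bytes)),
            vpus=[VPU(bytearray(l1_bytes)) for _ in range(vpus_per_vpe)],
        )
        for _ in range(num_vpes)
    ]
    return PPU(external_memory=bytearray(external_bytes), vpes=vpes)


def total_vpus(ppu):
    """Count the VPUs across all VPEs."""
    return sum(len(vpe.vpus) for vpe in ppu.vpes)
```

The DME and PCE are omitted from this model because they move data and coordinate control rather than hold per-unit state.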
As described in patent application Ser. No. 10/839,155, the PPU illustrated in FIG. 1 may include any number of VPEs 105, and each VPE 105 may include any number of VPUs 107. However, the overall computational capability of PPU 100 is not limited simply by the number of VPEs and VPUs. For instance, regardless of the number of VPEs and VPUs, memory bus bandwidth and data dependencies may still limit the amount of work that each VPE can do. In addition, as the number of VPUs per VPE increases, the VCU within each VPE may become overburdened by the large number of memory access commands that it has to perform between VPUs and external memory 102 and/or PCU 104. As a result, VPUs 107 may end up idly waiting for responses from their corresponding VCU, thus wasting valuable computational resources.
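The effect of an overburdened VCU can be illustrated with a deliberately simple model. It assumes, purely for the sake of the example, that each VPU alternates between a fixed compute period and a single memory request, and that the VCU services requests one at a time with a fixed service period. These assumptions and the function names are hypothetical and are not drawn from the referenced application.

```python
def vpu_utilization(num_vpus, compute_time, service_time):
    """Fraction of time each VPU spends computing rather than waiting.

    Assumed model: one VCU serves requests serially. A VPU's cycle is at
    least compute_time + service_time, and once the VCU is saturated it is
    num_vpus * service_time (every VPU must wait for all the others).
    """
    cycle = max(compute_time + service_time, num_vpus * service_time)
    return compute_time / cycle


def aggregate_throughput(num_vpus, compute_time, service_time):
    """Useful work delivered by the whole VPE, in units of busy VPUs."""
    return num_vpus * vpu_utilization(num_vpus, compute_time, service_time)
```

In this model, aggregate throughput grows with the number of VPUs only until the VCU saturates, after which it stays at `compute_time / service_time` and each additional VPU simply spends more time idle.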
In sum, while increasing the complexity of a PPU architecture may potentially increase a PPU's performance, other factors such as resource allocation and timing problems may equally impair performance in the more complex architecture.