1. Technical Field
The inventive concept relates to a systolic data processing apparatus and method, and more particularly, to an apparatus and method for systolic data processing while executing a parallel graphic shader program.
2. Discussion of Related Art
A graphics processing unit (GPU) is a hardware structure for performing graphics rendering. It is configured in the form of a pipeline so that a fixed-type rendering process is performed quickly and with improved parallelism. Graphics rendering systems typically assume that objects to be rendered are made of polygons such as triangles. In such a graphics system, each polygon is input to a graphics pipeline one vertex at a time, and the GPU assembles the vertices into the corresponding polygons as needed to render the corresponding object. Since the advent of GPUs, support for various forms of shading effects on the vertices has become an important aspect of rendering technology. Accordingly, the GPU has developed into a processor capable of executing a given shader program instead of the fixed-type pipeline.
Furthermore, to improve performance, a parallel processing architecture in the form of a single instruction multiple data (SIMD) architecture or a multiple instruction multiple data (MIMD) architecture may be employed. In particular, a shader processor structure including a single program multiple data (SPMD) structure has become a key technology for GPU operation, since the SPMD structure can improve parallelism with low control costs. To this end, a SIMD-type hardware structure may be employed, to which the SPMD structure is most easily and effectively applicable. When a plurality of shader programs are executed in the SIMD structure, a plurality of processing elements included in an array access identical operation resources such as Special Function Units (SFUs) and Texture Caches (TCs). In this case, usage efficiency of the limited operation resources can be maximized only if each of the processing elements delays its execution time point in a systolic fashion. Furthermore, once a task switching method is adopted in the SIMD operation, an operation delay time due to cache misses can be substantially removed; however, space is needed for storing the operating statuses of the multiple threads. When the systolic scheme is applied, new threads can be allocated respectively to the processing elements in which the cache misses have occurred, without saving the operating statuses of the threads. Thus, substantially the same level of performance can be achieved without additional storage space.
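The effect of delaying execution time points in a systolic fashion can be illustrated with a minimal sketch. The sketch is an idealized model under stated assumptions, not an implementation of any particular shader processor: every processing element runs the same instruction list, each instruction takes one cycle, no stalls are modeled, and the names `peak_demand`, `program`, and `stagger` are hypothetical. It counts the largest number of processing elements that request the shared SFU in the same cycle.

```python
from collections import Counter


def peak_demand(program, num_pes, stagger, resource="sfu"):
    """Return the largest number of PEs requesting `resource` in one cycle.

    Assumption: PE number `pe` starts executing at cycle `pe * stagger`
    and issues one instruction per cycle without stalling.
    """
    demand = Counter()
    for pe in range(num_pes):
        start = pe * stagger
        for index, op in enumerate(program):
            if op == resource:
                demand[start + index] += 1
    return max(demand.values(), default=0)


# Hypothetical program: nine ALU instructions followed by one SFU instruction.
program = ["alu"] * 9 + ["sfu"]

# stagger=0 models lock-step SIMD; stagger=1 models a one-cycle systolic delay.
lockstep = peak_demand(program, num_pes=4, stagger=0)
systolic = peak_demand(program, num_pes=4, stagger=1)
```

In this model all four processing elements request the SFU in the same cycle under lock-step execution, whereas with a one-cycle delay per processing element the requests fall in four different cycles. Programs with several closely spaced SFU instructions may still collide under a one-cycle delay, so the result depends on the assumed instruction mix.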
When instructions such as general arithmetic logic unit (ALU) instructions, special instructions, and texture instructions are executed by an existing SIMD-based shader processor, the special instructions and texture instructions account for only about 10% and 5%, respectively, of the shader program. In order to execute the special and texture instructions, however, all of the processing elements access the operation resources at the same time. Since the SFUs perform complicated mathematical functions, such as sine, cosine, log, and sqrt, the implementation of the SFUs is complicated. The implementation of the TCs is costly, since a single cache has a single port in a cache memory.
Furthermore, each of a plurality of processing elements included in a SIMD structure executes a plurality of threads. When a cache miss occurs in at least one of the processing elements during execution of the shader program, the operating statuses of all threads currently being executed are stored, and new threads are fetched and executed. Here, additional storage space is needed for storing the operating statuses.
Alternatively, when the corresponding processing element stops operations of the cache-missed thread and waits until operations of all remaining threads are completed in order to save the storage space, usage of the affected processing elements is significantly lowered, and thus overall performance significantly decreases.
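The loss of processing element usage in the waiting case can be estimated with a short sketch. The figures below are illustrative assumptions only and are not measurements from any shader processor. The function name `pe_utilization` and its parameters are hypothetical. The model assumes that a processing element that misses the cache at a given cycle stays idle until the other threads, each of the same length, have completed.

```python
def pe_utilization(thread_length, miss_cycles):
    """Fraction of PE cycles spent on useful work.

    miss_cycles[i] is the cycle at which PE i suffers a cache miss and
    stops, or None if PE i runs its thread to completion (assumption:
    a stopped PE idles until all remaining threads finish).
    """
    busy = sum(thread_length if miss is None else miss for miss in miss_cycles)
    return busy / (thread_length * len(miss_cycles))


# Hypothetical case: four PEs, 100-cycle threads, PE 0 misses at cycle 20.
utilization = pe_utilization(100, [20, None, None, None])
```

Under these assumed numbers, 320 of the 400 available processing element cycles do useful work. The earlier the miss occurs and the more processing elements miss, the lower this fraction becomes.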
Therefore, a need exists for systolic data processing in the execution of a parallel graphic shader program.