This invention relates to dynamic change of the load-balance in a multiprocessor system, particularly to dynamic change of the load balance between a host processor and a graphics adapter in a computer.
In polygon-based three-dimensional graphics, such as OpenGL and Direct3D, the main factors which determine overall performance are as follows:
(1) API: the speed at which graphics commands are issued from an application via the API;
(2) Geometry processing: the speed of geometry processing such as triangulation, coordinate transformation, and lighting calculation;
(3) Setup processing: the speed of gradient calculation of color values, Z coordinate values, and texture coordinate values along the faces and edges of a triangle; and
(4) Raster processing: the speed of generating pixels by interpolating color values, Z coordinate values, and texture coordinate values, and of reading/writing them from/to a frame buffer.
The first listed factor, the API, does not present a problem since, even if a method is used whereby the API is called for each vertex (which is the worst case), it only takes a few tens of clocks per vertex.
Raster processing corresponds to how many pixels can be drawn per second (pixel rate). This pixel rate has nothing to do with a polygon rate (mentioned later), and a required amount is determined by screen size (how many pixels, for instance, 640×480 or 1024×768, a screen is composed of), frame rate (how many frames are displayed per second, which is different from a CRT refresh rate and is generally around 12-60 frames/second), and average overlapping on the screen (normally three times or so). For recently developed graphics adapters, raster processing presents almost no difficulty up to a screen size such as that of SXGA (1280×1024 pixels).
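By way of illustration only, the required pixel rate described above may be estimated as the product of screen size, frame rate, and average overlapping. The following is a minimal sketch; the function name `required_pixel_rate` and the chosen inputs (SXGA, 60 frames/second, overlap of three) are assumptions picked from the ranges mentioned in the text, not figures stated for any particular adapter.

```python
def required_pixel_rate(width, height, frames_per_second, overlap):
    """Return the pixels per second that raster processing must sustain.

    width, height     -- screen size in pixels
    frames_per_second -- frame rate (not the CRT refresh rate)
    overlap           -- average number of times each screen pixel is drawn
    """
    return width * height * frames_per_second * overlap


# Illustrative inputs only: SXGA at 60 frames/second with an overlap of 3.
sxga_rate = required_pixel_rate(1280, 1024, 60, 3)
print(sxga_rate)  # 235929600 pixels per second
```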
Performance of geometry and setup processing, (2) and (3), directly corresponds to the number of polygons which can be processed per second (the aforementioned polygon rate). As setup processing is often considered a part of geometry processing, it is regarded as geometry processing here. Geometry processing requires a large amount of floating-point arithmetic. It takes a few hundred to a few thousand clocks to process each vertex. Therefore, the throughput of a host processor alone is often insufficient. For instance, when processing 10M vertexes per second, where 1,000 clocks are required to process each vertex, a processor which works at 10G clocks/second will be necessary. Thus, there are many cases where a functional unit dedicated to geometry processing is provided on a graphics adapter. Also, the work load varies greatly depending on conditions of processing, such as the number and types of light sources.
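The arithmetic of the above example may be checked with a short calculation. This is only a sketch; the function name `required_clock_rate` is a hypothetical label, and the inputs are the figures given in the example (10M vertexes per second, 1,000 clocks per vertex).

```python
def required_clock_rate(vertices_per_second, clocks_per_vertex):
    """Return the clocks per second needed to sustain the given vertex rate."""
    return vertices_per_second * clocks_per_vertex


# Figures from the example in the text.
rate = required_clock_rate(10_000_000, 1_000)
print(rate)  # 10000000000, i.e. 10G clocks/second
```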
Meanwhile, a host processor stores a sequence of graphics commands in a main storage device. This sequence of graphics commands is called a command queue. A graphics adapter obtains the contents of the command queue by using DMA, processes them, and displays the result on a display device. Because DMA transfer must be performed, this command queue must physically exist in the main storage device or on the graphics adapter. Thus, the size of a command queue is limited. If this command queue becomes full or empty in the course of processing, the host processor or the graphics adapter stops, so that the entire performance deteriorates. If the command queue is full, the host processor cannot write to the command queue any more, and therefore cannot proceed with processing until space becomes available. Also, if the command queue is empty, the graphics adapter cannot perform processing.
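The two stall conditions described above may be sketched with a bounded queue. The class `CommandQueue` and its method names are hypothetical and stand in for a DMA-backed command queue; an ordinary in-memory deque is assumed here purely for illustration.

```python
from collections import deque


class CommandQueue:
    """Bounded FIFO standing in for a command queue of limited size."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def is_full(self):
        return len(self._items) >= self.capacity

    def is_empty(self):
        return not self._items

    def try_put(self, command):
        """Host side: return False when the queue is full (host would stall)."""
        if self.is_full():
            return False
        self._items.append(command)
        return True

    def try_get(self):
        """Adapter side: return None when the queue is empty (adapter idles)."""
        if self.is_empty():
            return None
        return self._items.popleft()


queue = CommandQueue(capacity=2)
# The third write is rejected because the queue is already full.
results = [queue.try_put("draw A"), queue.try_put("draw B"), queue.try_put("draw C")]
first = queue.try_get()
second = queue.try_get()
third = queue.try_get()  # None: the adapter has nothing left to process
```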
While a command queue does not become full or empty if the processing speed of the host processor and that of the graphics adapter are equal, it has not been possible to make both processing speeds equal for the following reasons:
(a) it is difficult to estimate the throughput of a host processor available for graphics processing, since the type, operating frequency, and number of host processors vary, and the load placed on a host processor by uses other than graphics processing is difficult to estimate and changes dynamically;
(b) as in the case of the above-mentioned geometry processing, the work load of a graphics command on a host processor is difficult to estimate since it changes dynamically depending on a current state or data (for instance, the number of vertexes increases or decreases by clipping); and
(c) the work load of a graphics command on a graphics adapter is difficult to estimate since it changes dynamically depending on the current state or data.
Assuming that the throughput and work load of a host processor are Ph and Lh, respectively, and the throughput and work load of a graphics adapter are Pa and La, respectively, processing can go on without a command queue becoming empty or full if Lh/Ph=La/Pa holds. However, Lh, Ph, La and Pa cannot be estimated, and the system's performance could not always be fully exploited.
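The balance condition may be illustrated as follows, under the assumption that Lh/Ph and La/Pa are read as the time each side spends per command. The function name `queue_trend` and the returned labels are hypothetical. The comparison is done by cross-multiplication, which is valid when Ph and Pa are positive.

```python
def queue_trend(Lh, Ph, La, Pa):
    """Classify how the command queue evolves for given loads and throughputs.

    Lh/Ph is taken as the host's time per command and La/Pa as the adapter's.
    Comparing Lh*Pa with La*Ph is equivalent to comparing Lh/Ph with La/Pa
    for positive Ph and Pa, and avoids floating-point division.
    """
    host_side = Lh * Pa
    adapter_side = La * Ph
    if host_side < adapter_side:
        return "fills"    # host produces faster than the adapter consumes
    if host_side > adapter_side:
        return "drains"   # adapter consumes faster than the host produces
    return "steady"       # Lh/Ph = La/Pa: neither full nor empty


print(queue_trend(100, 50, 300, 100))  # fills
print(queue_trend(300, 100, 100, 50))  # drains
print(queue_trend(2, 1, 4, 2))         # steady
```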
Japanese Published Unexamined Patent Application No. Hei 2-275581 discloses a technology for improving the processing speed of the entire system, if the necessary time for using each function is known in advance, by changing the load on a plurality of processors every time a user switches on/off several functions which he or she uses. However, the partial charge cannot be changed appropriately when the necessary time for performing a function depends on the data to be processed. Moreover, a host processor is often in a multitasking OS environment and the computational ability assigned to graphics changes from moment to moment, so that the prerequisite of this patent (i.e., knowing the necessary time) does not hold in this respect either. In addition, this patent requires a table of partial charges corresponding to all combinations of on/off of functions to be made, which is not practical since the number of functions to be switched on/off is enormous in an actual environment.
Thus, an object of the present invention is to provide a computer system wherein Lh/Ph≅La/Pa in an environment where Lh, Ph, La and Pa are all unpredictable.
Another object is to enable the entire system's performance to be best exploited by bringing it close to Lh/Ph=La/Pa.
A further object is to allow adaptation to improved throughput of a future host processor thereby extending product life.
Still another object is, even when a command queue becomes full, to keep a host processor from stopping so that the entire system's performance does not deteriorate.
The foregoing and other objects are realized by the present invention which dynamically changes a partial charge, or assignment of processes, of each group in a sequence of processes from a first stage to an n-th stage in a computer having a plurality of processors, wherein said plurality of processors are grouped into at least two groups. The invention includes the steps of: detecting a change in a characteristic value in a queue for transferring a processing result between the groups; and changing the partial charge of each group based on the increase or decrease of the characteristic value. A characteristic value of data stored in a queue represents a value related to work load, and the queue seldom becomes full or empty if the load balance is changed by referring to this characteristic value. For instance, the characteristic value can be either the amount of information stored in a queue, the size (length) of a queue, or the number of vertex data stored in a queue in the case of processing related to graphics.
The aforementioned changing step may also comprise steps of determining if the characteristic value has increased by a predetermined threshold value or more and setting the charge of a group which performs processes up to an i-th stage (1≦i<n), where the i-th stage is a boundary between partial charges of the groups, to processes up to a stage following the i-th stage. A process of a stage following the i-th stage means a process of the (i+1)-th stage or subsequent stage. Also, if the characteristic value has decreased by a predetermined threshold value or more, it is possible to execute a step of setting the charge of a group which performs processes up to an i-th stage (1<i≦n), where the i-th stage is a boundary between partial charges of the groups, to processes up to a stage preceding the i-th stage. A process of a stage preceding the i-th stage means a process of the (i−1)-th stage or a preceding stage.
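The detecting and changing steps may be sketched as a small controller. This is an illustration under stated assumptions, not a definitive implementation: the class `BoundaryController` and its members are hypothetical, the boundary is moved by exactly one stage per change (the text also permits larger moves), and the reference value against which increase or decrease is measured is assumed to be reset each time a threshold is crossed.

```python
class BoundaryController:
    """Shift the boundary stage i according to changes in a queue's
    characteristic value (for example, the number of vertex data queued)."""

    def __init__(self, num_stages, boundary, threshold, initial_value=0):
        self.num_stages = num_stages     # n
        self.boundary = boundary         # i: first group runs stages 1..i
        self.threshold = threshold       # predetermined threshold value
        self._reference = initial_value  # value at the last threshold crossing

    def observe(self, value):
        """Take a new characteristic value and return the updated boundary."""
        delta = value - self._reference
        if delta >= self.threshold:
            # Queue is growing: the first group takes on one more stage.
            if self.boundary < self.num_stages:
                self.boundary += 1
            self._reference = value
        elif delta <= -self.threshold:
            # Queue is shrinking: the first group gives up one stage.
            if self.boundary > 1:
                self.boundary -= 1
            self._reference = value
        return self.boundary


controller = BoundaryController(num_stages=4, boundary=2, threshold=100)
history = [controller.observe(v) for v in [50, 120, 260, 150, 20]]
print(history)  # [2, 3, 4, 3, 2]
```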
Also, if partial charges are dynamically changed in this way, a group which performs processes after a stage may need information telling it from what stage processing should be performed. Accordingly, a processing result may include information of the stage of processing which has been completed.
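One way to carry the completed stage with a processing result is sketched below. The `Result` type, the functions `produce` and `consume`, and the four placeholder stages are all hypothetical; each placeholder merely records its name so that the order of execution can be inspected.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the n stages; each appends its own name.
STAGES = [
    lambda trace: trace + ["transform"],
    lambda trace: trace + ["lighting"],
    lambda trace: trace + ["clipping"],
    lambda trace: trace + ["setup"],
]


@dataclass
class Result:
    completed_stage: int  # last stage already performed (1-based)
    data: list


def run_stages(data, first, last):
    """Apply stages first..last (1-based, inclusive) to data."""
    for stage in STAGES[first - 1:last]:
        data = stage(data)
    return data


def produce(data, boundary):
    """First group: run stages 1..boundary and tag the result."""
    return Result(boundary, run_stages(data, 1, boundary))


def consume(result):
    """Second group: resume from the stage after the tagged one."""
    return run_stages(result.data, result.completed_stage + 1, len(STAGES))


# The final output is the same wherever the boundary was when produced.
a = consume(produce([], boundary=1))
b = consume(produce([], boundary=3))
print(a == b)  # True
```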
It is also possible to further comprise the steps of: examining whether or not usage of a queue has reached an upper bound; and if usage of the queue has reached an upper bound, a processor belonging to a group which performs processes up to an i-th stage (1≦i<n), where the i-th stage is a boundary between partial charges (i.e., assigned processes) of the groups, retrieving a processing result of the tail end of the queue and storing a processing result in the queue after performing processes up to a stage following the i-th stage. By doing this, even when a command queue becomes full, a host processor can be kept from stopping so that the entire system's performance does not deteriorate.
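The handling of a full queue may be sketched as follows. The sketch assumes that each queue entry is a pair of (completed stage, payload), that the host advances the tail entry by one stage, and that `advance` is a placeholder for the real work; the names `advance`, `enqueue_or_work_ahead`, and `NUM_STAGES` are hypothetical.

```python
from collections import deque

NUM_STAGES = 4  # n, an illustrative value


def advance(entry, stages=1):
    """Placeholder for real work: mark further stages as completed."""
    completed, payload = entry
    return (min(completed + stages, NUM_STAGES), payload)


def enqueue_or_work_ahead(queue, capacity, entry):
    """Store entry, or, if the queue is full, process the tail entry further.

    Returns True if entry was stored, and False if the host instead
    retrieved the tail result, performed one more stage, and stored it back.
    """
    if len(queue) < capacity:
        queue.append(entry)
        return True
    tail = queue.pop()           # retrieve the result at the tail end
    queue.append(advance(tail))  # store it again after one more stage
    return False


queue = deque([(2, "tri-0"), (2, "tri-1")])
stored = enqueue_or_work_ahead(queue, capacity=2, entry=(2, "tri-2"))
print(stored, list(queue))  # False [(2, 'tri-0'), (3, 'tri-1')]
```

In this sketch the rejected entry remains with the host, which would retry it once space becomes available; a tail entry that has already completed all stages is stored back unchanged.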
A computer which implements the present invention comprises: a plurality of processors which can be grouped into at least two groups and on which a partial charge in a sequence of processes from a first stage to an n-th stage is set for each group; a queue for transferring a processing result between the groups; and a controller for detecting increase or decrease of a characteristic value in the queue and changing the partial charge of each group based on the increase or decrease of the characteristic value.
While the structure of the present invention was explained as above, the present invention may also be implemented by a program which performs each step. In such a case, the program may be stored on a storage medium such as a CD-ROM or a floppy disk or on a storage device or a device such as a hard disk or a ROM. It is also possible to implement a dedicated circuit or device which performs the processing of the present invention.