Microprocessor designers and manufacturers continue to focus on improving microprocessor performance to execute increasingly complex software, which delivers increased utility. While manufacturing process improvements can help to increase the speed of a microprocessor by reducing silicon geometrics, the design of the processor, particularly the instruction execution core, relates to processor performance.
Many microprocessors use instruction pipelining to increase instruction throughput. An instruction pipeline processes several instructions through different phases of instruction execution concurrently, using an assembly line approach. Individual function blocks such as a decode block, as a nonlimiting example, may be further pipelined into several stages of hardware, with each stage performing a step in the instruction decode process on a separate instruction. Thus, processor hardware pipelines can be deep with many distinct pipeline stages.
Another method to improve instruction execution speed is known as “out-of-order” execution. Out-of-order execution provides for the execution of instructions in an order different from the order in which the instructions are issued by the compiler in an effort to reduce the overall execution latency of the program including the instructions. One approach to out-of-order instruction execution uses a technique referred to as “register scoreboarding,” in which instructions are issued in-order, but executed out-of-order. Another form of out-of-order scheduling employs a technique known as “dynamic scheduling.” For a processor that provides dynamic scheduling, even the issue of instructions to execution hardware is rescheduled to be different from the original program order. The results of instruction execution may be available out of order, but the instructions are retired in program order. Yet, instruction pipelining in out-of-order techniques, such as dynamic scheduling, may be used separately or together in the same microprocessor.
Dynamic scheduling of parallel instruction execution may include special associative tables for bookkeeping instruction and functional unit status as well as the availability of a result of a particular instruction for usage as an input operand according to prescribed instructions. Scheduling hardware uses these tables to issue, execute, and complete individual instructions.
The scope of the dynamic scheduling of parallel instruction execution is instruction level parallelism (ILP), which has been extended to multiple threads (hyperthreading or simultaneous multithreading (SMT)). This technique provides hardware assisted dispatch and execution of multiple threads providing multiple instructions per clock issue to process in a parallel functional unit. Dynamic scheduling hardware provides simultaneous instruction issue from the multiple active threads.
Scheduling hardware may use scoreboards for the bookkeeping of thread and instruction status to trace dependencies and to define the moment of. issue and execution. In addition, threads may be suspended because of long latency cache misses or other I/O reasons. Nevertheless, as a nonlimiting example, the scoreboard may be comprised of an instruction status, a functional unit status, as well as a register result status. All three of these tables interact in the process of instruction execution by updating their fields each clock cycle. In order to pass the stage and change status of an instruction, certain conditions should be fulfilled and certain actions should be taken on each stage.
Register renaming is another technique that may be implemented to overcome name dependency problems when architecture registers namespace is predetermined, which enables instructions to be executed in parallel. According to a register renaming technique, a new register may be allocated each time an assignment is made to a register. When an instruction is decoded, the hardware checks the destination field and renames the architecture register name space. As a nonlimiting example, if register R3 is assigned a value, a new register clone R3′ may be allocated and all reads of register R3 in the following instructions are directed to clone R3′ (replacing architecture name by clone name).
In continuing this nonlimiting example, when a new assignment is made to register R3, another register clone R3″ is allocated and the following references are redirected to new clone R3″. This process continues with all input instructions. This process not only removes name dependencies, but it also makes the processor appear to have more registers and may increase the instruction level parallelism so that more parallel units may operate.
Register renaming may also be used by reorder buffers so as to extend the architecture register space and create multiple copies of the same register associate with different commands. This results in the ability to provide out-of-order with in-order completion.
When an instruction is decoded, it may be assigned a reorder buffer entry associated with the appropriate function unit. The destination register of the decoded instruction may be associated with the allocated reorder buffer entry, which results in renaming the register. The processor hardware may generate a tag to uniquely identify this result. The tag may be stored in the reorder buffer entry. When a subsequent instruction refers to the rename destination register, it may receive the value or the tag stored in the reorder buffer entry, depending upon whether or not the data is received.
A reorder buffer may be configured as a content addressable memory (CAM) where the tag is used for a data search. In application, a destination register number of a subsequent instruction may be applied to a reorder buffer and the entry containing this register number may also be identified. Once identified, the calculated value is returned. If the value has not been computed, the tag, as described above, may be returned instead. If multiple entries contain this register number, then the latest entry is identified. If no entries contain the required register number, then the architecture register file is used. When the result is produced, the result and tag may be broadcasted to all functional units.
Another processing approach involves real-time scheduling and multiprocessor systems. This configuration involves loosely coupled MIMD microprocessors, where each processor has its own memory and I/O channels. Several tasks and subtasks (threads) may run on these systems simultaneously. However, the tasks may include synchronization in some type of ordering to keep the intended processing pattern. Plus, the synchronization needed may be different for various processing patterns.
Unlike instruction level parallelism processors, real-time scheduling processors use processor assignment to task in threads (resource allocation). With the instruction level parallelism configuration, there may be specialized functional blocks with few of them duplicated, which means that instruction assignment for distribution is relatively simple depending upon the number of available slots and the type of instruction.
However, for multiprocessor systems of the MIMD type, all processors are typically similar and have a more complicated task assignment policy. At least one nonlimiting approach is to consider the MIMD structure as a processor pool, which means to treat the processor as a pooled resource and assign processes to processors depending upon availability of memory and computational resources.
There are at least two methodologies for distributing tasks and threads in this environment. The first is static assignment, which occurs when each type of task or thread is preassigned to a particular processor or group of processors. The second configuration is dynamic assignment, as similarly described above, which calls for tasks being assigned to any processor from the pool depending upon available resources and task priority. In this configuration, the multiprocessor pool may have special dispatch cues where tasks and threads are waiting for assignment and execution, as well as for I/O event completion. Also in this configuration, threads are parts of a task, and some of the tasks may be split into the several threads that may be executed in parallel with some synchronization on data and order. Thus, the threads in general may execute separately from the rest of the process. Also, an application can be a set of threads that cooperate and execute concurrently in the same address space but using different processors. As a result, threads running concurrently on separate processors may yield dynamic gain in performance.
In a multiprocessor configuration, thread scheduling may be accomplished according to load sharing techniques. Load sharing may call for the load being distributed evenly across the various microprocessors in the pool. As a result, this ensures that no microprocessor is idle.
Multiprocessor thread scheduling may also use some of the static scheduling techniques described above, such as when a thread is assigned to a specific processor. However, in assigning certain threads to a specific processor, other processors may be idle while the assigned processor is busy, thereby causing the assigned thread to sit idly waiting for its assigned processor to become free. Thus, there may be instances where static scheduling results in inefficiency in the processor.
Dynamic scheduling of processors may be implemented in an object oriented graphics pipeline. An object is a structured data item representing something travelling down a logical pipeline, such as a vertex of a triangle, patch, pixel, or video data. At the logical level, both numeric and control data may be part of the object, though the physical implementation may handle the two separately.
In a graphics model, there are several types of objects that may be processed in the data flow. The first is a state object, which contains hardware controlled information and shader code. Second, a vertex object may be processed, which contains several sets of vertices associated with numerical control data. Third, a primitive object may be processed in the data flow model which may contain a number of sets of primitives' associated numerical and control data. More specifically, a primitive object may include a patch object, triangle object, line object and/or point object. Fourth, a fragment object may be part of the data flow model which may contain several sets of pixel associated numerical and control data. Finally, other types of objects such. as video data may be processed in a data flow model as well.
Each type of object may have a set of possible operations that may be performed on it and a (logically) fixed data layout. Objects may exist in different sizes and statuses, which also may be known as levels or stages to represent the position they have reached in the process in pipeline.
As a nonlimiting example, the levels of an object may be illustrated on a triangle object, which initially has three vertices that point to the actual location of vertex geometry and attribute data. When the references are resolved (check caches and retrieve data from API buffers if needed), the object level is upgraded so that the object is sent through other stages. The level of upgrade normally may reflect the availability of certain data in the object structure for immediate processing. An upgraded level includes the previous level in most cases.
One of ordinary skill in the art would know that there may generally be two types of sizes (layouts) of an object. A first is a logical layout, which may include all data structures. The logical layout may remain unchanged from the moment of object creation through termination. A second type of layout for objects is a physical layout that shows the data structure is available for immediate processing, which operates to match the logical layout in the uppermost level.
Both the logical and physical layouts may be expressed in terms of frames and buffers—logical frames and physical buffers. Logical frames may be mapped to physical buffers to make data structures available for immediate processing. Each object initially may contain few logical frames and one of them may be mapped to a physical buffer. All other frames used in later stages may not be mapped so as to save memory resources on the chip. Yet both frames and buffers may have variable size with flexible mapping to each other.
An object may refer to data held within other objects in the system. Pipeline lazy evaluation schemes track these dependencies and use them to compute the value stored inside an object on demand. Objects of the same type may be processed in parallel independent cues. Alternatively, a composite object may be created containing several vertices, fragments, or primitives to process in SIMD mode.
For graphics processing applications, the features described above have historically included fixed function and programmable hardware based pipeline solutions. However, these linear solutions oftentimes lead to inefficiencies resulting from the static configuration of the graphics pipeline. When the bandwidth of a particular stage as described above does not change during the execution time of the frame generation, inefficiencies and idle time in the processor are introduced, thereby decreasing the overall efficiency. This inefficiency is compounded in an application involving multiple parallel processors.
Thus, there is a heretofore-unaddressed need to overcome the problem of dynamic creating and execution management of multiple logic graphic pipelines in an MIMD structure of parallel multithread processors. There is a further need for improved resource utilization in parallel processing to achieve higher performance, which may be previously attributed to poor allocation and scheduling protocol resolution.