The present invention relates in general to data transfer devices, and in particular to systems and methods for synchronizing data between two buses such that one bus can be used to deliver a portion of a data stream that is primarily being delivered by the other bus.
Graphics processors typically include a rendering pipeline made up of several sections, e.g., a front-end section, a vertex processing engine, a rasterizer, a fragment shader, and so on. The processor provides buses for transferring various data streams from one section of the pipeline to another. For example, in one common architecture, all rendering instructions and associated data are received as commands by a front-end processing section. In response to these instructions, the front-end section may fetch vertex data from system memory and generate a stream of state commands (also referred to herein as state bundles); the vertices and state commands are then delivered to the vertex processing engine via one or more buses.
In general, the vertex data stream and the state command stream have very different properties and need not be synchronized. For instance, a state command typically contains significantly fewer bits than a vertex. In addition, the state command stream may occasionally stall, e.g., if one of the commands requires flushing the vertex processing engine (or other rendering pipeline components); such stalls need not affect the flow of vertex data into the input buffers of the vertex processing engine. These considerations suggest providing separate buses for state commands and vertex data. The vertex bus is preferably wide enough (e.g., 128 bits) to transmit one vertex attribute per clock cycle and includes little or no buffering; the state command bus is preferably narrower (e.g., 42 to 48 bits) and has a relatively deep FIFO (first-in, first-out buffer circuit) to accommodate stalls in the state command stream.
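As a rough illustration of why a deep FIFO lets the command stream absorb stalls, the following Python sketch models a producer (the front end) pushing one state command per cycle into a FIFO of a given depth, while the consumer (the vertex processing engine) pops one command per cycle except during stall cycles. The function name `producer_stall_cycles`, the one-command-per-cycle rates, and the stall pattern are illustrative assumptions, not details of any particular design.

```python
from collections import deque


def producer_stall_cycles(num_commands, depth, consumer_stalls):
    """Count cycles in which the producer cannot push because the FIFO is full.

    Hypothetical model: each cycle the consumer pops one command (unless the
    cycle is in `consumer_stalls`), then the producer pushes one command if
    the FIFO has room.
    """
    fifo = deque()
    sent = consumed = cycle = stalled = 0
    while consumed < num_commands:
        # Consumer side: drain one command unless the downstream unit is stalled.
        if fifo and cycle not in consumer_stalls:
            fifo.popleft()
            consumed += 1
        # Producer side: push the next command if there is space left.
        if sent < num_commands:
            if len(fifo) < depth:
                fifo.append(sent)
                sent += 1
            else:
                stalled += 1
        cycle += 1
    return stalled


stall_window = set(range(5, 15))  # a 10-cycle downstream stall (illustrative)
print(producer_stall_cycles(20, 8, stall_window))   # 3
print(producer_stall_cycles(20, 16, stall_window))  # 0
```

In this toy model, a FIFO too shallow for the stall pushes back on the producer, while a sufficiently deep FIFO hides the stall entirely, which is the role the text assigns to the deep command FIFO.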
In modern processors, the rendering pipeline often includes programmable components (e.g., the vertex processing engine and the fragment shader), and the need to deliver programs as well as other state commands to such components complicates the dual-bus design. Programs are, in effect, state commands and should be included in the state command stream so that they stay in order with respect to other state commands. Programs, however, are considerably wider than typical state commands (e.g., 128 bits), making them ill-suited for delivery via a narrow command bus. Widening the command bus enough to transfer a program in a reasonable number of cycles would increase chip area, because the FIFO would then need to be both deep and wide. Since programs make up a relatively small fraction of state commands in practice, much of that FIFO capacity would rarely be used, resulting in inefficiency.
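The trade-off can be made concrete with simple arithmetic. The sketch below assumes that a W-bit program is split into ceil(W/B) transfers on a B-bit bus and that raw FIFO storage scales as depth times width; the 64-entry depth is a hypothetical value chosen only for illustration, not one given in the text.

```python
import math


def transfer_cycles(payload_bits, bus_bits):
    """Cycles needed to move one payload over a bus of the given width."""
    return math.ceil(payload_bits / bus_bits)


def fifo_storage_bits(depth, width_bits):
    """Raw storage of a FIFO: one width-sized entry per slot."""
    return depth * width_bits


PROGRAM_BITS = 128  # example program width from the text
FIFO_DEPTH = 64     # hypothetical FIFO depth, for illustration only

for bus in (48, 128):
    print(bus, transfer_cycles(PROGRAM_BITS, bus), fifo_storage_bits(FIFO_DEPTH, bus))
# 48 3 3072
# 128 1 8192
```

Under these assumptions, widening the FIFO from 48 to 128 bits cuts a program transfer from three cycles to one but more than doubles the FIFO storage, even though most entries would still carry narrow state commands.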
Another option is to deliver both state commands and vertex data via a single, relatively wide (e.g., 128+48 bits) bus. While this option would make more efficient use of the bus width, it can reduce overall throughput, because vertex data transfer would then stall whenever the state command stream stalls.
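The throughput penalty of a shared bus can be sketched with a toy cycle count. This Python model assumes one vertex attribute per cycle and treats every cycle in which the state command stream is stalled as blocking the entire shared bus; the function name `vertex_delivery_cycles` and the stall pattern are hypothetical.

```python
def vertex_delivery_cycles(num_attrs, command_stalls, shared_bus):
    """Cycles to deliver `num_attrs` vertex attributes at one per cycle.

    Hypothetical model: on a shared bus, any cycle in `command_stalls` blocks
    the whole bus; on separate buses, command stalls do not affect vertex data.
    """
    delivered = cycle = 0
    while delivered < num_attrs:
        if not (shared_bus and cycle in command_stalls):
            delivered += 1
        cycle += 1
    return cycle


stalls = {2, 3, 4}  # a 3-cycle state command stall (illustrative)
print(vertex_delivery_cycles(10, stalls, shared_bus=False))  # 10
print(vertex_delivery_cycles(10, stalls, shared_bus=True))   # 13
```

In this simplified model every command stall cycle is added directly to the vertex delivery time on the shared bus, whereas separate buses keep vertex throughput independent of the command stream.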
It would therefore be desirable to provide a data transfer mechanism for efficiently accommodating programs or other unusually wide state commands.