1. Field of the Invention
This invention relates to performing operations on block operands.
2. Description of the Related Art
Blocks of data are typically transmitted and/or processed as a single unit in a computer or network system. While block size is typically constant within any given system, different systems may have block sizes that range from a few bytes to several thousand bytes or more. There is a tendency for block size to increase with time, since advances in technology tend to allow larger units of data to be transmitted and processed as a single unit than was previously possible. Thus, an older system may operate on 32 byte blocks while a newer system may operate on 4 Kbyte blocks or larger.
In computer and network systems, many situations arise where it is useful to perform operations on blocks of data. For example, a RAID storage system that implements striping may calculate a parity block for each stripe. Each stripe may include several blocks of data, and the parity block for that stripe may be calculated by XORing all the blocks in that stripe. Another block operation may reconstruct a block that was stored on a failed device by XORing the parity block and the remaining blocks in the stripe. Similarly, in graphics processing, operations are often performed on multiple blocks of data. These block operations may be implemented in a system's main processor or controller. However, block operations like these are often implemented in dedicated hardware, leaving general processors and controllers free to tend to other operations and often improving the performance of the block operations.
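The parity and reconstruction operations described above can be sketched as follows; the block size, block contents, and helper name are illustrative assumptions, not part of any particular system:

```python
# Sketch: stripe parity and failed-block reconstruction in a striped
# RAID set. The parity block is the XOR of all data blocks in the
# stripe, so XORing the parity with the surviving blocks recovers a
# lost block. Block size and contents are illustrative assumptions.

def xor_blocks(blocks):
    """XOR a list of equal-sized blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

stripe = [bytes([n] * 8) for n in range(4)]   # data blocks B0-B3
parity = xor_blocks(stripe)                   # parity block for the stripe
# Suppose the device holding B2 fails: recover B2 from the parity
# block and the remaining blocks in the stripe.
recovered = xor_blocks([parity, stripe[0], stripe[1], stripe[3]])
```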
Given the large amounts of data involved, block operations tend to consume large amounts of bandwidth. Returning to the parity example, if there are 5 blocks (B0-B4) of data in a particular stripe, the parity P for that stripe may equal B0 XOR B1 XOR B2 XOR B3 XOR B4. A RAID controller may be configured to calculate P using four instructions of the form A=A XOR Bn, where an accumulator A stores intermediate results:

        (0) A=B0
        (1) A=A XOR B1
        (2) A=A XOR B2
        (3) A=A XOR B3
        (4) A=A XOR B4
        (5) P=A
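The accumulator sequence above can be sketched as follows; the 32-byte block size and block contents are assumptions chosen for brevity (the document notes real blocks may be 4 Kbytes or larger):

```python
# Sketch of the accumulator-based parity calculation in steps (0)-(5)
# above. Block size and contents are illustrative assumptions.
BLOCK_SIZE = 32  # bytes; real systems may use 4 Kbyte blocks or larger

def parity_via_accumulator(blocks):
    """Compute P = B0 XOR B1 XOR ... using a single accumulator A."""
    a = bytearray(blocks[0])          # (0) A = B0
    for block in blocks[1:]:          # (1)-(4) A = A XOR Bn
        for i in range(len(a)):
            a[i] ^= block[i]
    return bytes(a)                   # (5) P = A

blocks = [bytes([n] * BLOCK_SIZE) for n in range(5)]  # B0-B4
p = parity_via_accumulator(blocks)
```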
Note that in steps 1-4 of the example, the accumulator A stores both an operand and a result. Accordingly, performing each of these steps involves both a read from and a write to the accumulator. Furthermore, since the operands for each step are blocks of data, each step 1-4 may represent multiple sub-steps of byte or word XOR calculations (the size of the sub-step calculations may depend on the width of the functional unit performing the XOR calculation). For example, if each block is 4 Kbytes, step 1 may involve (a) receiving a word from the accumulator and a word of B1, (b) XORing the two words to get a result word, (c) overwriting the word received from the accumulator in step a with the result word, and (d) repeating a-c for the remaining words in block B1. As this example shows, performing a multi-block operation may involve alternating between a read and a write to the accumulator during each sub-step. Each of these reads and writes takes a certain amount of time to perform, and there may be an additional amount of time required to switch between read and write mode (e.g., time to precharge an output driver, etc.). Since each sub-step involves both a read and a write, the accumulator memory may not be able to keep up with the full bandwidth of the memory that is providing Bn unless the accumulator is capable of being accessed at least twice as fast as the memory storing Bn. If the accumulator cannot keep up with the memory that stores Bn, the accumulator will present a bottleneck.
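The sub-steps (a)-(d) described above amount to a read-modify-write loop over the accumulator. A sketch follows, assuming a 4-byte functional-unit width; the function name and word width are hypothetical:

```python
WORD = 4  # assumed functional-unit width in bytes

def xor_block_into_accumulator(acc, block):
    """Perform A = A XOR Bn one word at a time, as in sub-steps (a)-(d).
    Each iteration reads a word from the accumulator, XORs it with the
    corresponding word of the operand block, and overwrites the word in
    the accumulator with the result, so the accumulator sees alternating
    reads and writes on every sub-step."""
    for off in range(0, len(block), WORD):
        a_word = int.from_bytes(acc[off:off + WORD], "big")    # (a) read word of A
        b_word = int.from_bytes(block[off:off + WORD], "big")  # (a) read word of Bn
        r_word = a_word ^ b_word                               # (b) XOR the words
        acc[off:off + WORD] = r_word.to_bytes(WORD, "big")     # (c) write result back
        # (d) the loop repeats a-c for the remaining words

acc = bytearray(b"\x0f" * 8)
xor_block_into_accumulator(acc, b"\xff" * 8)
```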
An additional concern that may arise when using an accumulator is that as the bytes of the result are written, the result bytes overwrite the operand bytes already stored in the accumulator. Thus, the previous value of A is lost during each step. If an error occurs as one of the block operands Bn is being transmitted or during a step of the XOR calculation, an erroneous result may overwrite the previous value of the operand. When the error is detected, the entire operation may have to be redone, beginning at step 0.
Thus, accumulators used when performing block operations such as a parity calculation may create a performance bottleneck. For example, if the rate at which an accumulator memory can both provide an operand and store a new result is less than the rate at which the other operand (e.g., Bn) can be provided from its source, the accumulator memory will limit how quickly the operation can be performed. One possible way to alleviate such an accumulator bottleneck is to include specialized components in the accumulator memory. For example, if a memory that can be read from and written to at least twice as fast as the source of Bn is used for the accumulator memory, the accumulator memory may be able to keep up with the Bn source. However, such a memory may be too expensive to be practical. Additionally, such an accumulator memory may be inefficient. Generally, operations that are performed on large groups of data may be inefficient if they frequently switch between reading and writing data. For example, instead of allowing data to be transmitted in bursts, where the costs of any setup and hold time and/or time required to switch between read and write mode are amortized over the entire burst, frequently switching between reads and writes may result in data being transmitted in smaller, less efficient units. Accordingly, if the multi-block operation is being performed one word at a time, it may be necessary to repeatedly alternate between reading from and writing to the accumulator, reducing the accumulator's efficiency. As a result of this inefficiency, the memory may need to be more than twice as fast as the source of the other operand to avoid presenting a bottleneck.
Another solution to the accumulator bottleneck problem may be to use a specialized memory such as a dual-ported VRAM (Video Random Access Memory) for the accumulator in order to increase the bandwidth of the operation. Dual-ported VRAM can be read from and written to in the same access cycle. This may alleviate the accumulator bottleneck and allow the block operation to be performed at the speed at which operand Bn can be fetched from its source.
While the dual-ported memory may alleviate the accumulator bottleneck, there are still several concerns that may arise when using a special-purpose memory. For example, special-purpose memories tend to be significantly more expensive than general-purpose memory devices. Additionally, special-purpose memories are more likely to be discontinued than general-purpose memories. There is also a greater possibility that upgraded versions of special-purpose memories may not be available in the future. Both of these tendencies may limit the viability of special-purpose memories as a long-term design solution. Also, because they are special-purpose, these memories may be available from fewer vendors than general-purpose devices, making suitable memories difficult to locate and obtain.