Conventional graphics processing often involves transferring rectangular blocks of image data from a source memory to a destination memory while preparing the image data for display. These block transfers, also referred to as BLTs, typically move data from system memory or a cache to video memory as part of the rasterization process. However, graphics hardware typically does not have access to the translation lookaside buffers (TLBs) and page tables needed to translate virtual addresses to the physical addresses used to access the image data in system memory. Accordingly, the task of performing BLTs is usually assigned to a central processing unit (CPU), which has ready access to the TLBs and page tables. In conventional systems, the CPU executes a software loop that accesses each line of the source memory and transfers the data at that line to the corresponding line of the destination memory. Because the CPU must execute this loop for every line of the block, block transfers can place a significant burden on the CPU. Moreover, in many instances the source memory and the destination memory have different pitches, or line widths, which can result in misalignment of the transferred block. Accordingly, an improved technique for performing block transfers would be advantageous.
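For illustration only, the conventional CPU-driven software loop described above might resemble the following minimal sketch in C. The function name blt_copy and all of its parameters are hypothetical and do not come from any particular system; the sketch assumes that both memories are already mapped into the address space of the CPU and that each pitch is expressed in bytes per line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from any real API): copy a rectangular block of
 * `rows` lines, each `width_bytes` wide, from `src` to `dst`.
 * `src_pitch` and `dst_pitch` are the byte distances between the starts of
 * consecutive lines in the source and destination memories; they may differ.
 */
static void blt_copy(uint8_t *dst, size_t dst_pitch,
                     const uint8_t *src, size_t src_pitch,
                     size_t width_bytes, size_t rows)
{
    /* A line of the block must fit within one line of each memory. */
    assert(width_bytes <= src_pitch && width_bytes <= dst_pitch);

    /* One iteration per line: this is the per-line work done by the CPU. */
    for (size_t y = 0; y < rows; ++y) {
        /* Each memory is stepped by its own pitch so that lines stay aligned. */
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width_bytes);
    }
}
```

In this sketch, advancing both pointers by a single shared pitch when the two pitches differ would shift every line after the first, which is one way the misalignment noted above can arise.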