As processor technology advances, newer software is written to run on machines with these processors. Users generally expect and demand higher performance from their computers regardless of the type of software being used. Performance issues can arise from the kinds of instructions and operations that are actually performed within the processor. Certain types of operations take longer to complete because of their complexity and/or the type of circuitry they need. This provides an opportunity to optimize the way certain complex operations are executed inside the processor.
Media applications are drivers of microprocessor development. Accordingly, the display of images and playback of audio and video data, which are collectively referred to as content, have become increasingly popular applications for current computing devices. Such operations are computationally intensive, but offer a high level of data parallelism that can be exploited through an efficient implementation using various data storage devices, such as single instruction multiple data (SIMD) registers. A number of current architectures also require multiple operations, instructions, or sub-instructions (often referred to as “micro-operations” or “μops”) to perform various mathematical operations or data transfer operations on a number of operands, thereby diminishing throughput and increasing the number of clock cycles required to perform these operations.
Masking is often used in SIMD or vectorized operations to let a programmer exclude selected parts of a vector from an operation. It is widely used for conditional operations, for the beginning and end of a vectorized loop, and for short vector support. Masked loads and stores of vector data are quite complex operations, typically requiring numerous individual instructions and clock cycles for execution. During such operations, some parts of the vectorized load/store operations (the “masked” parts) should not be executed at all. Because memory operations are typically done in blocks (e.g., load 128 bits; store 128 bits) without reference to a mask, it is quite challenging to support masked operations at a reasonable performance.
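The masking behavior described above can be illustrated with a minimal software sketch. It is not the mechanism of any particular processor. It assumes a 128-bit vector of four 32-bit lanes, a mask convention in which a set bit marks a lane as active (so the "masked" parts are the lanes whose bit is clear), and zero-filling of inactive lanes on a load. The names masked_load, masked_store, tail_mask, and add_one are hypothetical.

```python
from typing import List, Sequence

LANES = 4  # assumption: four 32-bit lanes in one 128-bit vector


def masked_load(memory: Sequence[int], base: int, mask: Sequence[bool]) -> List[int]:
    """Load LANES elements starting at base.

    Lanes whose mask bit is clear are never read from memory and are
    returned as zero (zero-fill is an assumption of this sketch).
    """
    return [memory[base + i] if mask[i] else 0 for i in range(LANES)]


def masked_store(memory: List[int], base: int,
                 values: Sequence[int], mask: Sequence[bool]) -> None:
    """Store only the lanes whose mask bit is set; other locations are untouched."""
    for i in range(LANES):
        if mask[i]:
            memory[base + i] = values[i]


def tail_mask(remaining: int) -> List[bool]:
    """Mask for a loop iteration with `remaining` elements left to process."""
    return [i < remaining for i in range(LANES)]


def add_one(data: List[int]) -> None:
    """Vectorized loop with a masked tail: adds 1 to every element of data."""
    for base in range(0, len(data), LANES):
        mask = tail_mask(len(data) - base)
        vec = masked_load(data, base, mask)
        vec = [v + 1 for v in vec]
        masked_store(data, base, vec, mask)
```

With a six-element array, the second loop iteration in add_one uses the mask [True, True, False, False], so the two lanes past the end of the array are neither read nor written. A plain block load or store of all four lanes would touch memory outside the array, which is the difficulty the paragraph above describes.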
Executing masked loads and stores on a processor architecture such as an Intel® Architecture (IA-32)-based processor is even more challenging because of misaligned loads, page/segmentation faults, data-breakpoint support, and so forth. For example, during a 128-bit masked load, part of the data can be located in one page while the other part is located in another page. If one of the pages is not present, a page fault should arise only if the part that belongs to this page is not masked. Thus, current architectures fail to adequately address efficient performance of masked load and store operations. Instead, such techniques require numerous processing cycles and may cause a processor or system to consume unnecessary power in order to perform these masked operations.
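The page-fault requirement in the example above can be modeled in a few lines. This is a behavioral sketch only, under several assumptions that are not taken from the text: 4096-byte pages, four 32-bit little-endian lanes, a set mask bit meaning the lane is active, and memory represented as a dictionary that contains only the pages that are present. PageFault and masked_load are hypothetical names.

```python
from typing import Dict, List, Sequence

PAGE_SIZE = 4096  # assumption: bytes per page
ELEM_SIZE = 4     # assumption: 32-bit elements
LANES = 4         # four elements form one 128-bit access


class PageFault(Exception):
    """Raised when an active (unmasked) element touches a page that is not present."""


def masked_load(pages: Dict[int, bytearray], addr: int,
                mask: Sequence[bool]) -> List[int]:
    """Load LANES little-endian elements starting at byte address addr.

    `pages` maps page number -> contents and holds only present pages.
    Inactive lanes are skipped entirely, so they can never fault.
    """
    result = []
    for lane in range(LANES):
        if not mask[lane]:
            result.append(0)  # inactive lane: no memory access at all
            continue
        elem_addr = addr + lane * ELEM_SIZE
        value = 0
        for b in range(ELEM_SIZE):
            # Byte-wise access also covers a misaligned element that straddles pages.
            page_no, offset = divmod(elem_addr + b, PAGE_SIZE)
            if page_no not in pages:
                raise PageFault(page_no)
            value |= pages[page_no][offset] << (8 * b)
        result.append(value)
    return result
```

For a load at byte address 4088, lanes 0 and 1 fall in page 0 and lanes 2 and 3 fall in page 1. If only page 0 is present, the mask [True, True, False, False] completes without a fault, while any mask that activates lane 2 or lane 3 raises PageFault. A block load that ignores the mask would fault in both cases.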