Moore's law describes a long-term trend in the history of computing hardware, in which the number of transistors on an integrated circuit doubles approximately every two years, a trend often paraphrased as the processing power of computers doubling approximately every 18 months. The capabilities of many digital electronic devices are strongly linked to Moore's law: processing speed, memory capacity, sensors, and even the number and size of pixels in digital cameras. In recent years, this has encouraged microprocessor makers to include more than one processor (i.e., core) in a single package to further enhance the processing speed and power of computing devices. In FIG. 1, for example, a Central Processing Unit 110 (CPU) contains a plurality of processors (Processor1 112, Processor2 114, Processor3 116, to ProcessorM 118). Each of the plurality of processors may also have a local memory attached to it (Memory1 122, Memory2 124, Memory3 126, . . . MemoryM 128). The plurality of processors may be embodied on separate substrates or may be embodied on a single substrate as processor cores, and the plurality of processors includes at least two processors.
For an M-processor machine, if an input data-stream could be evenly divided into M portions, each assigned to one of the M processors, then in theory a data-stream that a single-processor machine takes time T to decode would take only time T/M on the M-processor machine. In practice, however, there is always an overhead associated with splitting an input data-stream into M portions of data, distributing them to each processor, processing them in synchrony, and, after processing, assembling the results into a single coherent output form.
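The gap between the ideal T/M figure and what is achieved in practice can be illustrated with a small calculation. The following is only a sketch: the function names are hypothetical, and modeling the overhead as a single fixed cost added to the parallel time is an assumption made for illustration, not something the text specifies.

```python
def ideal_time(t_single: float, m: int) -> float:
    """Ideal decode time on m processors: the single-processor time divided by m."""
    return t_single / m


def time_with_overhead(t_single: float, m: int, overhead: float) -> float:
    """Decode time when a fixed overhead (assumed model) is added for
    splitting, distributing, synchronizing, and reassembling the data."""
    return ideal_time(t_single, m) + overhead


def speedup(t_single: float, m: int, overhead: float) -> float:
    """Ratio of single-processor time to multi-processor time."""
    return t_single / time_with_overhead(t_single, m, overhead)
```

Under this model, a task with T = 8 on M = 4 processors and an overhead of 2 finishes in time 4, a speedup of 2 rather than the ideal 4.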
Using data compression and encryption as an example, many data compression and cryptographic formats are optimized for serial transmission and processing. In order to save space, remove redundancies, and exploit some form of frame-to-frame coherency (i.e., where adjacent data items are related to each other rather than being completely random), an efficient form of compression or encryption typically produces a string of bits where later bits are determined by previous bits. For instance, an uncompressed numerical sequence “2 3 4 5 6” can be compressed as “2+1+1+1+1,” where only the initial value and the delta values (+1) are stored. As a result, to decompress such data, one must start from the very beginning (i.e., “2”) to recover all the subsequent values. If the decoding machine starts at an arbitrary point in the middle of the compressed data, it will only see a “+1” without knowing which base value that delta depends on.
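The “2+1+1+1+1” example can be sketched in a few lines. The function names below are hypothetical, and the list-of-integers representation is an assumption chosen for readability rather than an actual compressed format.

```python
from itertools import accumulate


def delta_encode(values: list[int]) -> list[int]:
    """Store the first value, then the difference between each pair of neighbors."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded: list[int]) -> list[int]:
    """Rebuild the values with a running sum, which must start at the first entry."""
    return list(accumulate(encoded))


encoded = delta_encode([2, 3, 4, 5, 6])   # [2, 1, 1, 1, 1]
restored = delta_decode(encoded)          # [2, 3, 4, 5, 6]
# Decoding from the middle treats a delta as if it were the base value,
# so the result no longer matches the original tail [4, 5, 6].
wrong_tail = delta_decode(encoded[2:])    # [1, 2, 3]
```

The running sum in `delta_decode` is the serial dependency: each output value needs every value before it.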
As another example, image compression is commonly used to reduce the large volumes of data in digitized images for economical storage and for transmission via communication networks having limited bandwidth. For the purpose of illustration, consider a 2-dimensional image that is stored by using the Windows Bitmap File format (BMP) in an uncompressed form. Assume each pixel in the image takes one byte of storage and an image with dimensions W by H (W×H) is stored in a linear sequence of H rows of W bytes each, W×H bytes in total. That is, the first row of W bytes is written first, followed by the second row of W bytes, and so on until all H rows are fully recorded. Processing this kind of uncompressed data in parallel by multiple processors for display is straightforward. For example, if there are exactly H processors available in a machine and the starting file position is FileStart, the machine can assign a starting position to each processor as follows:
    Processor 1: FileStart
    Processor 2: FileStart + (1 * W)
    Processor 3: FileStart + (2 * W)
    …
    Processor H: FileStart + ((H - 1) * W)

By doing so, each processor only reads W bytes of information starting from the starting position assigned to it so that the machine can read the whole image file in 1/H of the time it would have taken by using a single processor to read the same image data. It should be noted that in certain situations the machine with H processors can do the task in less than 1/H time. This factor here is merely used as a convenient placeholder metric of optimal performance of an algorithm.
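The per-row assignment of starting positions can be sketched as follows. This is a minimal illustration that follows the simplified layout described in the text (one byte per pixel, rows stored back to back with no padding). The function names are hypothetical, and worker threads merely stand in for the H processors.

```python
from concurrent.futures import ThreadPoolExecutor


def row_start_positions(file_start: int, width: int, height: int) -> list[int]:
    """Start position for processor k (1-based) is FileStart + (k - 1) * W."""
    return [file_start + row * width for row in range(height)]


def read_rows_in_parallel(data: bytes, file_start: int,
                          width: int, height: int) -> list[bytes]:
    """Let each worker read exactly W bytes from its own start position."""
    starts = row_start_positions(file_start, width, height)

    def read_row(start: int) -> bytes:
        return data[start:start + width]

    # One worker per row, mirroring the "exactly H processors" assumption.
    with ThreadPoolExecutor(max_workers=height) as pool:
        return list(pool.map(read_row, starts))


# Hypothetical 3x2 image preceded by a 2-byte stand-in header.
image_file = b"HD" + bytes([1, 2, 3, 4, 5, 6])
rows = read_rows_in_parallel(image_file, file_start=2, width=3, height=2)
```

No worker needs anything from another worker, because every start position is computed from FileStart, W, and the row number alone.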
Now consider an image that is stored in the compressed form of BMP. There are two different types: RLE8 (run-length encoded 8-bit pixels) and RLE4 (run-length encoded 4-bit pixels). In either format, the data are re-encoded but no positional information is stored. Hence, when given an arbitrary position in a compressed BMP file, a processor cannot tell which pixel of the image is associated with that file position without having decompressed all the bytes preceding it. Consequently, a multiple-processor machine cannot process the compressed BMP file in parallel, because it is unable to assign to each processor a file position together with the pixel of the image at which that processor should start.
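The loss of positional information can be seen in a toy run-length scheme. The sketch below is not the actual BMP RLE8 format, which also defines escape codes. It assumes a simplified stream of (count, value) byte pairs, and the function names are hypothetical.

```python
def rle_decode(encoded: bytes) -> bytes:
    """Expand (count, value) byte pairs into raw pixels (simplified scheme)."""
    pixels = bytearray()
    for i in range(0, len(encoded) - 1, 2):
        count, value = encoded[i], encoded[i + 1]
        pixels.extend([value] * count)
    return bytes(pixels)


def pixel_index_at(encoded: bytes, byte_offset: int) -> int:
    """Index of the first pixel produced by the pair at byte_offset.

    The answer is the sum of all preceding run counts, so every earlier
    pair has to be inspected; the offset alone says nothing.
    """
    return sum(encoded[i] for i in range(0, byte_offset, 2))


stream_a = bytes([3, 7, 2, 9, 4, 1])    # decodes to 7 7 7 9 9 1 1 1 1
stream_b = bytes([10, 7, 1, 9, 4, 1])   # same length, different run counts
index_a = pixel_index_at(stream_a, 4)   # 3 + 2 = 5
index_b = pixel_index_at(stream_b, 4)   # 10 + 1 = 11
```

The same byte offset maps to pixel 5 in one stream and pixel 11 in the other, so a processor handed only a file position cannot know where its output belongs in the image.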
JPEG, named after the Joint Photographic Experts Group, is another well-known color image compression standard. A JPEG data-stream consists of metadata and image information encoded as a compressed entropy stream. At the basic level, the stream consists of well-defined segments that contain the metadata or indicate the beginning of the compressed data-stream. Although several different encoding and decoding methods are specified in the International Telecommunication Union specification, Baseline Sequential is the one most often used in practice. In this method, the image data is organized and stored as a continuous linear sequence of 8×8 blocks that are quantized further for lossy compression. As shown in FIG. 2, an input JPEG data-stream includes a plurality of Minimum Coded Units (MCU) 210, each of which includes a plurality of blocks 212. Each block 212 is an 8×8 array of coefficients that includes one DC coefficient 214 and sixty-three AC coefficients 216. Below is a brief description of how a Huffman-encoded JPEG decoder functions at a logical level to decompress a JPEG data-stream:
    1. The decoder reads data from the JPEG data-stream byte by byte and determines if a JPEG marker is present. If so, the decoder reads only as many subsequent bytes as indicated by the length field that follows the marker.
    2. Depending on the markers, the decoder loads and updates Huffman tables and Quantization tables as they are encountered.
    3. The decoder decodes the Huffman-encoded entropy data in a way that produces an 8×8 block of coefficients.
    4. The decoder then dequantizes these coefficients using the currently active Quantization table. The decoder processes the dequantized data by an inverse Discrete Cosine Transform (iDCT) function to produce the raw pixels of the image.
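Step 1 of the decoder description, locating markers and skipping each segment by its length field, can be sketched as below. This is a minimal illustration rather than a complete parser. The function name is hypothetical, the scan stops at the Start of Scan marker, and the entropy-coded data that follows is not decoded.

```python
import struct

# Markers without a length field: TEM (0x01), RST0-RST7, SOI (0xD8), EOI (0xD9).
STANDALONE_MARKERS = {0x01} | set(range(0xD0, 0xDA))
SOI, SOS = 0xD8, 0xDA


def list_segments(data: bytes) -> list[tuple[int, int]]:
    """Return (marker, payload_length) pairs up to and including Start of Scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("data does not begin with an SOI marker")
    segments = [(SOI, 0)]
    pos = 2
    while pos + 1 < len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xFF:          # fill byte preceding a marker
            pos += 1
            continue
        pos += 2
        if marker in STANDALONE_MARKERS:
            segments.append((marker, 0))
            continue
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        segments.append((marker, length - 2))
        if marker == SOS:           # entropy-coded data follows; stop here
            break
        pos += length
    return segments


# Hypothetical stream: SOI, a 2-byte DQT payload, an SOS header, scan data, EOI.
sample = (b"\xff\xd8" + b"\xff\xdb\x00\x04\x01\x02"
          + b"\xff\xda\x00\x03\x05" + b"\x12\x34\xff\xd9")
found = list_segments(sample)
```

Because every segment before the scan carries its own length, the header portion of the stream can be skipped through quickly; it is the entropy-coded data after Start of Scan that has to be read bit by bit.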
JPEG data-streams are inherently sequential in nature, because the DC coefficient of a block is encoded relative to that of the previous block and can only be determined after the DC coefficient of the previous block has been decoded. This makes parallel processing of a JPEG data-stream extremely difficult and, as a result, conventionally only one processor in a multiple-processor machine is used to decode a JPEG data-stream.