As computer speed increased from 33 MHz to 1.0 GHz and beyond, computer operations could no longer be completed in a single cycle. As a result, the technique of pipelining was adopted to make the most efficient use of the higher processor performance and to improve throughput. Presently, deep pipelining uses 15 stages or more. Generally, in a pipelined computing system there are several parallel building blocks working simultaneously, where each block takes care of a different part of the whole process. For example, there is a compute unit (CU) that does the computation, an address unit including a data address generator (DAG) that fetches and stores the data in memory according to the selected address modes, and a sequencer or control circuit that decodes and distributes the instructions. The DAG is the only component that can address the memory. Thus, in a deeply pipelined system, if an instruction is dependent on the result of a previous one, a pipeline stall will happen: the pipeline stops and waits for the offending instruction to finish before resuming work. For example, if, after a computation, the output of the CU is needed by the DAG for the next data fetch, it cannot be delivered directly to the DAG to be conditioned for a data fetch; it must propagate through the pipeline before it can be processed by the DAG to do the next data fetch. This is so because only the DAG has access to the memory and can convert the compute unit result to an address pointer to locate the desired data. In multi-tasking general purpose computers this stall may not be critical, but in real time computer systems such as those used in, e.g., cell phones and digital cameras, these stalls are a problem. See U.S. patent application entitled IMPROVED PIPELINE DIGITAL SIGNAL PROCESSOR, by Wilson et al. (AD-432J), filed on even date herewith, herein incorporated in its entirety by this reference.
In one application, bit permutation is used to effect data encryption. This can be done in the CU, but the arithmetic logic units (ALUs) in the CU are optimized for 16-, 32-, or 64-bit operations and are not efficient for bit-by-bit permutation. For example, if the permutation is done by the ALU, each bit requires three cycles of operation: mask, shift and OR. Thus, permuting a single 32-bit word requires 96 cycles or more.
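The mask, shift and OR sequence described above can be sketched in C as follows. This is only an illustrative sketch: the function name permute_bitwise and the table format (output bit i takes input bit perm[i]) are assumptions made here, not taken from any particular processor or cipher.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical permutation format: output bit i is copied from
 * input bit perm[i], with 0 <= perm[i] < 32. */
uint32_t permute_bitwise(uint32_t in, const uint8_t perm[32])
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        assert(perm[i] < 32);
        uint32_t bit = in & (UINT32_C(1) << perm[i]); /* mask: isolate the source bit   */
        bit = (bit >> perm[i]) << i;                  /* shift: move it to position i   */
        out |= bit;                                   /* OR: merge it into the result   */
    }
    return out;
}
```

The loop body runs once per output bit, so a 32-bit word takes 32 passes of three operations each, which is the origin of the 96-cycle figure.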
In another approach, instead of performing the permutations in the ALU, the permutation values can be stored in a lookup table located in external storage. However, now the R register in the ALU must deliver the word, e.g., 32 bits, to a pointer (P) register in the DAG, which can address the external memory lookup table. But this requires an enormous lookup table (LUT), i.e., 2^32 bits or more than 33.5 megabytes of memory. To overcome this, the 32-bit word in the R register in the ALU can be processed, e.g., as four bytes (8 bits each) or eight nibbles (4 bits each). This reduces the memory size required: for four bytes, four tables of 256 entries, each of 32 bits, are needed (a 4 Kbyte LUT), and for eight nibbles, eight tables of sixteen entries, each of 32 bits, are needed (a 512 byte LUT). But this, too, creates problems: now the ALU requires four (bytes) or eight (nibbles) transfers to the DAG's P register for a single 32-bit word. Each transfer in turn causes a number of pipeline stalls as discussed, supra.
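The byte-wise variant can be sketched in C as four 256-entry tables of 32-bit words (4 x 256 x 4 bytes = 4 Kbytes) whose outputs are ORed together. The names perm_lut, perm_lut_build and permute_lut, and the table format (output bit i takes input bit perm[i]), are assumptions for illustration only. Plain C cannot show the pipeline cost; on the architecture described above, each of the four table reads would correspond to one transfer from the ALU to the DAG's P register.

```c
#include <assert.h>
#include <stdint.h>

/* Four tables, one per input byte: 4 * 256 * 4 bytes = 4 Kbytes. */
typedef struct { uint32_t t[4][256]; } perm_lut;

/* Precompute, for every byte position b and byte value v, the 32-bit
 * word that the permutation produces from that byte alone.
 * Hypothetical format: output bit i is copied from input bit perm[i]. */
void perm_lut_build(perm_lut *lut, const uint8_t perm[32])
{
    for (int b = 0; b < 4; b++) {
        for (int v = 0; v < 256; v++) {
            uint32_t word = 0;
            for (int i = 0; i < 32; i++) {
                int src = perm[i];
                assert(src < 32);
                /* Does source bit src live in byte b, and is it set in v? */
                if (src / 8 == b && ((v >> (src % 8)) & 1))
                    word |= UINT32_C(1) << i;
            }
            lut->t[b][v] = word;
        }
    }
}

/* Four lookups and three ORs replace 32 mask/shift/OR passes. */
uint32_t permute_lut(const perm_lut *lut, uint32_t in)
{
    return lut->t[0][in & 0xFF]
         | lut->t[1][(in >> 8) & 0xFF]
         | lut->t[2][(in >> 16) & 0xFF]
         | lut->t[3][in >> 24];
}
```

The nibble-wise variant would follow the same pattern with eight tables of sixteen entries (8 x 16 x 4 bytes = 512 bytes) and eight lookups per word.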
In a separate but related problem, linear feedback shift registers (LFSRs), e.g., CRCs, scramblers, de-scramblers and trellis encoders, are widely used in communication systems. The LFSR operations can be performed by the CU one bit at a time using mask/shift/OR cycles as explained above, with the same problems. Or a specific hardware block, e.g., an ASIC or FPGA, that solves the LFSR problem using 4, 8, or 16 bits per cycle can be used. Both the mask/shift/OR approach in the CU and the ASIC approach can be replaced by an external lookup table or tables, but with all the aforesaid shortcomings.
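To make the two software options concrete, the following C sketch computes a CRC first one bit at a time and then one byte at a time through a 256-entry lookup table. The text does not name a polynomial; the 16-bit polynomial 0x1021 with a zero initial value (the parameters commonly known as CRC-16/XMODEM) is assumed here purely for illustration, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC_POLY 0x1021u /* assumed polynomial, not taken from the text */

/* Advance the 16-bit LFSR by one bit (MSB-first). */
static uint16_t lfsr_step(uint16_t crc)
{
    return (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ CRC_POLY)
                           : (uint16_t)(crc << 1);
}

/* One bit per iteration: eight LFSR steps for every input byte. */
uint16_t crc16_bitwise(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t n = 0; n < len; n++) {
        crc ^= (uint16_t)(data[n] << 8);
        for (int k = 0; k < 8; k++)
            crc = lfsr_step(crc);
    }
    return crc;
}

/* Precompute the effect of eight LFSR steps for each possible byte. */
void crc16_table_build(uint16_t table[256])
{
    for (int v = 0; v < 256; v++) {
        uint16_t crc = (uint16_t)(v << 8);
        for (int k = 0; k < 8; k++)
            crc = lfsr_step(crc);
        table[v] = crc;
    }
}

/* One table lookup per input byte replaces the inner bit loop. */
uint16_t crc16_table(const uint16_t table[256], const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t n = 0; n < len; n++)
        crc = (uint16_t)((crc << 8) ^ table[(crc >> 8) ^ data[n]]);
    return crc;
}
```

Both functions return the same value for any input; the table version trades 512 bytes of memory for a lookup whose index depends on the running CRC, which is exactly the compute-result-to-address dependency that stalls the pipeline as described above.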