Microprocessors are often required to manipulate binary data having a wide range of bit lengths, from a single logic bit to high-precision arithmetic operands that may be more than 128 bits in length.
Hardware arithmetic logic units (ALUs) within microprocessors are generally constructed and arranged to handle fixed operand lengths. As a result, high precision arithmetic operations require multiple program steps and multiple microprocessor cycles. These data processing conditions lead to programs that are inefficient in terms of execution time because the microprocessor hardware and the supporting program instruction set are not optimized for operating on data having a wide range of operand lengths.
This inefficiency results in large part from repeated stores and loads to memory as well as software loop control overhead (compares, branches, etc.). For complex operations such as multiplication and operations involving extended-precision algorithms, the overhead is even more pronounced. In addition, the sign (negative or positive) and zero status of an arithmetic result must be handled separately for multi-word calculations, requiring even more processor time to complete the operation.
Digital Signal Processors (DSPs) are high-speed microprocessors optimized to carry out large numbers of arithmetic operations in a short period of time. As the name implies, DSPs were developed to carry out real-time processing (e.g., filtering, compression, encryption) algorithms on digital signals. DSPs incorporate performance-optimized arithmetic structures not found in conventional microprocessors. Among the devices incorporated into DSPs to improve number-crunching performance, two of the most important are the hardware multiplier block and the barrel shifter.
As with most structures in a DSP, hardware multiplier blocks are high-performance, speed-optimized structures. Unfortunately, DSP hardware multiplier blocks require large amounts of chip surface area for specialized signal-processing circuitry. In addition, an increase in the size of the multiplier block generally requires a concomitant increase in the size of a CPU's data buses and arithmetic logic units (ALUs). For example, the multiplication of two 16-bit numbers gives a 32-bit result. A multiplier block able to handle this 16-bit multiplication would generally require 32-bit data buses, ALUs, and accumulators to accommodate the result. This adds considerably to the complexity, and therefore the cost, of a microprocessor. It is also desirable, especially in signal processing applications, to have the capability to multiply numbers of more than one word in length. Unfortunately, the complexity and size of a multiplier block quickly grows unmanageable as the size of its operands increases.
Conventional microprocessors often do not incorporate a hardware multiplication block or barrel shifter, as the size and cost are prohibitive, especially in the cost-competitive consumer market. In a conventional lower-end processor, numbers are multiplied by repetitive additions in a multi-bit adder, a relatively slow process. A typical multiplication carried out in software requires a long series of shifts and adds, requiring a great deal of processor time. From a speed standpoint, it would be highly desirable to have the performance and functionality of a hardware multiplier block built into a low-cost microprocessor, but the complexity and space requirements of the related hardware have traditionally made such a design cost-prohibitive.
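The shift-and-add approach mentioned above can be illustrated with a minimal sketch. The function name and the 16-bit default width are assumptions chosen for illustration; this is the generic textbook algorithm, not code from any particular processor.

```python
def shift_add_multiply(a: int, b: int, width: int = 16) -> int:
    """Multiply two unsigned width-bit integers using only shifts and adds."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    # One loop iteration per multiplier bit: this is the "long series of
    # shifts and adds" that a software multiply has to execute.
    for _ in range(width):
        if b & 1:          # add the shifted multiplicand when the bit is set
            product += a
        a <<= 1            # single-position left shift of the multiplicand
        b >>= 1            # move to the next multiplier bit
    return product
```

For 16-bit operands the loop always runs 16 times, each iteration costing a test, a conditional add, and two shifts, which suggests why a single-cycle hardware multiply is attractive.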
A barrel shifter is another speed-optimized DSP structure extremely useful for large calculations. Barrel shifters are designed to shift a number several bit positions in a single operation. Although a barrel shifter is similar in function and structure to a multiplier block, it is conventionally designed in as a completely separate structure on the chip. In binary arithmetic, a shift left of one bit position equates to a multiply by two. A left shift of N bits is equivalent to a multiply by 2^N.
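The equivalence between a left shift and a multiplication by a power of two can be checked with a short sketch. Both function names are hypothetical and serve only to contrast a single multiply with a series of one-position shifts.

```python
def shift_left_by_multiply(value: int, n: int) -> int:
    """Shift value left by n bit positions with one multiplication by 2**n."""
    if n < 0:
        raise ValueError("shift count must be non-negative")
    return value * (2 ** n)


def shift_left_stepwise(value: int, n: int) -> int:
    """Shift value left by n bit positions, one position per step."""
    for _ in range(n):
        value <<= 1
    return value
```

Both functions return the same result for any non-negative shift count; the first does so in one arithmetic operation, the second in n operations.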
Conventional low-end processors do not incorporate barrel shifters, but carry out barrel shifting through a series of single position shifts, each shift comprising a single software operation. Barrel shifting is a highly desirable function in a processor, but the specialized circuitry required to carry out this function takes up too much space on a chip to be included in low-cost processors. The barrel-shifting function could be carried out by a multiplier block if the numbers were encoded properly, but if the encoding must be implemented in software, the overhead significantly reduces any performance gains from a hardware barrel shift.
It is known that prior microprocessors have included the capability of operating on chains, for example by repeating a given instruction a prescribed number of times. It is known that a repeat add with carry will execute a chain operation where the data memory address of the operands and the result are automatically incremented after each operation. It is also known that others have used fixed hardware multipliers to do extended precision multiplies by using a multiply-by-parts algorithm, a complex and relatively inefficient solution.
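A repeated add with carry over a chain of words might look like the following sketch. It assumes 16-bit words stored least significant word first; the function name and the list representation of memory are illustrative only.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1


def chain_add(a_words, b_words):
    """Add two equal-length word chains (least significant word first).

    Returns (sum_words, carry_out).
    """
    if len(a_words) != len(b_words):
        raise ValueError("operand chains must have the same length")
    carry = 0
    result = []
    # Walking both lists in step models the automatic address increment
    # applied after each add-with-carry in the chain.
    for a, b in zip(a_words, b_words):
        total = (a & WORD_MASK) + (b & WORD_MASK) + carry
        result.append(total & WORD_MASK)
        carry = total >> WORD_BITS
    return result, carry
```

The carry produced by each word-sized addition feeds the next one, so the chain behaves like a single wide adder spread over several cycles.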
As mentioned, microprocessors typically must manipulate operands of differing, sometimes widely differing, lengths. Operand lengths can vary from a single logic bit to 512 bits or more. Arithmetic logic units (ALUs), on the other hand, have a fixed width. Where high-precision arithmetic is necessary, requiring operands longer than the ALU, the processor must execute the operation in multiple steps. The programs become inefficient in terms of execution time and programming code efficiency because the basic hardware and supporting instruction set are not optimized for operating on extended-precision data represented by a sequential chain of data words.
Where a number that is one word in length must be multiplied by a number greater than one word in length (chain multiplication), the process can be extremely cumbersome even in designs incorporating a multiplier block. Such a multiplication is carried out by a series of one-word sub-multiplications beginning with the least significant word. For each sub-multiplication, the code must instruct the processor to (1) carry out the single-word multiplication (either in hardware or software), (2) store the lower word of the result to memory, (3) move the higher word of the result to the correct operand register for the next sub-multiplication, (4) check to see if the operand chain is complete, and if not, (5) loop back to start the process over again. This process carries with it a heavy amount of instruction-decode and data-shifting overhead, slowing the multi-word multiplication process down considerably.
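The sub-multiplication loop described above can be sketched as follows, assuming 16-bit words stored least significant word first. The names are hypothetical, and the high word of each product is modeled as a value added into the next sub-multiplication.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1


def chain_multiply(multiplier: int, chain):
    """Multiply a one-word value by an N-word chain (least significant first).

    Returns a chain of N + 1 words.
    """
    multiplier &= WORD_MASK
    high = 0          # upper word carried over from the previous step
    result = []
    for word in chain:                                    # (4), (5) loop control
        product = multiplier * (word & WORD_MASK) + high  # (1) sub-multiply
        result.append(product & WORD_MASK)                # (2) store lower word
        high = product >> WORD_BITS                       # (3) keep upper word
    result.append(high)   # final upper word is the most significant word
    return result
```

The intermediate value never exceeds 32 bits, since 0xFFFF * 0xFFFF + 0xFFFF equals 0xFFFF0000. In software, each numbered step costs at least one instruction per word, which is the overhead the passage describes.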
The need remains in the art for an enhanced multiplier with specialized architecture to address the problem of operating on long, multiple word length data in an efficient, consistent and unified manner.
This application discloses a multiplier block making use of a "chaining" device and integral barrel shift circuitry to increase arithmetic throughput while avoiding many of the liabilities of DSP logic. In order to avoid the inclusion of a double-width (32-bit) ALU and accumulator and yet still accommodate the double-precision (32-bit) result of the multiplier, the 32-bit product of the static multiplier block is split into two parts. The lower 16 bits pass directly to the ALU for immediate use (transfer, accumulate, etc.). The upper 16 bits are latched into a 16-bit register (known as Product High) for later use. The Product High register can be ported directly back to the multiplier block for continuous, multi-word (chain) multiplication. This optimization is totally consistent with the single data memory, single data bus designs found in most microprocessors, which require two cycles to set up two multiplier operands.
This invention optimizes the design of the multiplier block to minimize cost and increase flexibility, so that it can be included in very low cost microprocessors and single chip microcomputers. The basic function of the multiplier block is the execution of a single-cycle 17-bit × 17-bit multiplication yielding a 34-bit product. In the preferred embodiment, the operand can be interpreted as either signed or unsigned. The multiplier incorporates the capability to do a one-word (16-bit) by N-word chain (up to 496 bits) multiplication yielding up to a 512-bit result.
In addition, the multiplier can provide a general purpose barrel shift function. A 4-bit shift value register with a 4-to-16 bit decoder allows the multiplier block to do a 1- to 16-bit barrel shift on either a 16-bit operand or an N * 16-bit chain operand. The long chain barrel shift is unique to multiplier block functions.
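A barrel shift performed through a multiplier can be sketched as below. This is a simplified model under stated assumptions: 16-bit words stored least significant word first, a shift count passed directly as 1 to 16 (how the 4-bit shift value register encodes that range is not modeled), and a one-hot factor standing in for the decoder output. All names are hypothetical.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1


def barrel_shift_left_chain(chain, n: int):
    """Shift an N-word chain left by n bits (1..16) using multiplications.

    Returns (shifted_words, spill), where spill holds the bits shifted out
    of the most significant word.
    """
    if not 1 <= n <= WORD_BITS:
        raise ValueError("shift count must be between 1 and 16")
    factor = 1 << n       # one-hot multiplicand; needs 17 bits when n == 16
    spill = 0             # bits pushed out of the previous word
    shifted = []
    for word in chain:
        product = (word & WORD_MASK) * factor + spill
        shifted.append(product & WORD_MASK)   # lower word is the shifted word
        spill = product >> WORD_BITS          # upper word feeds the next word
    return shifted, spill
```

Because the factor for a 16-bit shift is 2^16, this model needs a 17-bit multiplier operand, which is consistent with the 17-bit operand width stated for the multiplier block.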
The relatively small area needed for the multiplier block and barrel shift circuitry, resulting from the elimination of specialized DSP structures such as look-ahead logic and pipeline circuitry, allows the processor to incorporate much of the functionality and performance of a digital signal processor in a low-cost product.
The multiplier block disclosed in this application is incorporated into a low-cost microprocessor suitable for incorporation into consumer products. The microprocessor and its specialized instruction set provide efficient data processing on data types ranging from one bit to 512 bits in length. The preferred embodiment incorporates specialized hardware structures and specific instruction set enhancements that facilitate operations on multiple word width data in an efficient, consistent, and unified way. Every instruction word that manipulates data has a reserved bit switch that will cause the instruction to be executed either once (operating on single word data) or as a repeated execution of the same instruction (operating on a chain or list of sequential data).
According to the preferred embodiment, several hardware structures are necessary to support this extended precision instruction set definition. First, a hardware chain register with counter was included to control the repeat count and number of words in the chain. Second, a register file was implemented to provide the accumulation function for the ALU. Third, specialized address control was necessary to control the sequential acquisition of up to two input operand chains and one output operand chain. Fourth, extra ALU status logic was included to handle arithmetic and logical status in a unified way independent of data width. Fifth, the product high register was routed to the partial sum input of the hardware multiplier to enable a consistent chain multiply function.
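To illustrate what width-independent status handling involves, the following sketch derives zero and sign status for a whole chain rather than for a single word. It assumes 16-bit words stored least significant word first and a two's complement interpretation; it is not a description of the actual status logic, and the function name is hypothetical.

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1
SIGN_BIT = 1 << (WORD_BITS - 1)


def chain_status(chain):
    """Return (zero, negative) for a two's complement word chain."""
    if not chain:
        raise ValueError("chain must contain at least one word")
    zero = True
    for word in chain:
        # The chain is zero only if every word in it is zero.
        zero = zero and (word & WORD_MASK) == 0
    # The sign of the chain is the top bit of its most significant word.
    negative = bool(chain[-1] & SIGN_BIT)
    return zero, negative
```

The zero status has to accumulate across all words while the sign depends only on the last one, which is why per-word status flags alone are not enough for multi-word results.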
The preferred embodiment provides a one-word length ALU (31 of FIG. 1), introduces the concept of word chains, and provides the ability to specify four different storage areas that each store a pair of chain values, thus overcoming disadvantages of prior microprocessors.
The disclosed embodiments look at data in a consistent way and have very few limitations on the number of bits used to represent data. Software instructions are essentially identical for operating on either single 16-bit words or extended 512-bit word chains. Execution time is extended linearly with the number of words in the operands, with little or no software overhead. The code is compact, logical and easy to understand.
These and other features and advantages of the innovative processor hardware will be apparent to those of skill in this art upon reference to the following detailed description of preferred embodiments of the invention, which description makes reference to the drawings.