This invention relates to computer technology, and to a processor and methods for performing outer product and outer product accumulate operations.
Communications products require increased computational performance to process digital signals in software on a real time basis. Increases in performance in the past twenty years have come through improvements in transistor technology and processor design. Transistor counts have doubled in accordance with Moore's law about every two years, increasing thousand fold from a few million to a few billion transistors per chip. Processor design has improved peak performance per instruction by architectural innovations that enabled effectively doubling datapath width about every four years, increasing from 32 bits (e.g. Intel's Pentium) to 1024 bits (e.g. Qualcomm's Hexagon HVX) over about the past twenty years.
Digital communications typically rely on linear algorithms that multiply and add data with 32 bits of precision or less. In fact, digital video and radio processing typically operate on 16 bit or even 8 bit data. As datapath width has increased far beyond these data widths, substantially peak usage has been maintained by partitioning operands and datapaths using a variety of methods, treated extensively, for example, in our commonly assigned U.S. Pat. Nos. 5,742,840; 5,794,060; 5,794,061; 5,809,321; and 5,822,603.
These patents describe systems and methods for enhancing the utilization of a processor by adding classes of instructions. These classes of instructions use registers as data path sources, partition the operands into symbols of a specified size, perform operations in parallel, catenate the results and place the catenated results into a register. These patents, as well as other commonly assigned patents, describe processors optimized for processing and transmitting data streams using significant parallelism.
In our earlier U.S. Pat. No. 5,953,241, we describe group multiply and sum operations (column 4 therein) in which each one of four multiplier operands a, b, c, and d is multiplied by a corresponding one of four multiplicand operands e, f, g, and h to produce products a*e, b*f, c*g, and d*h. See, e.g., FIGS. 1 and 3 therein. We also describe a multiply and add operation in which operands i, j, k, and l are added to the products of the multiplications to produce results a*e+i, b*f+j, c*g+k, and d*h+l. See, e.g., FIGS. 2 and 4 therein. These operations are described for both fixed-point and floating-point operands.
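For illustration only, the element-wise behavior of the group multiply and the multiply and add operations described above can be sketched in Python. The function names `group_multiply` and `group_multiply_add` are hypothetical and do not appear in the cited patent, and plain Python lists stand in for the partitioned register operands; this is a behavioral sketch under those assumptions, not a description of the hardware.

```python
# Behavioral sketch only. Function names are hypothetical (not from the
# cited patent); lists stand in for partitioned register operands.

def group_multiply(multipliers, multiplicands):
    """Multiply each multiplier by its corresponding multiplicand."""
    if len(multipliers) != len(multiplicands):
        raise ValueError("operand groups must have the same length")
    return [a * e for a, e in zip(multipliers, multiplicands)]


def group_multiply_add(multipliers, multiplicands, addends):
    """Multiply corresponding operands, then add the matching addend."""
    if not (len(multipliers) == len(multiplicands) == len(addends)):
        raise ValueError("operand groups must have the same length")
    return [a * e + i for a, e, i in zip(multipliers, multiplicands, addends)]


if __name__ == "__main__":
    a_to_d = [1, 2, 3, 4]      # multiplier operands a, b, c, d
    e_to_h = [5, 6, 7, 8]      # multiplicand operands e, f, g, h
    i_to_l = [1, 1, 1, 1]      # addend operands i, j, k, l
    print(group_multiply(a_to_d, e_to_h))
    print(group_multiply_add(a_to_d, e_to_h, i_to_l))
```

In both cases four operand pairs yield four results, so the number of multiplications grows only linearly with the number of operand elements read from the registers.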
Others have developed a processor in which a vector-by-scalar multiply reduction is performed. See, e.g., the Qualcomm HVX architecture with SIMD extensions. This processor allows a group of four vector operands to be multiplied by one scalar operand with the four results being summed. See, e.g., FIG. 11, taken from http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.24-Monday-Epub/HC27.24.20-Multimedia-Epub/HC27.24.211-Hexagon680-Codrescu-Qualcomm.pdf.
Emerging applications such as 5G communications, virtual reality, and neural networks, however, create an appetite for digital processing many orders of magnitude faster and more power efficient than these technologies. Moore's law is slowing as gate widths below 10 nm span fewer than 200 silicon lattice spacings. Advances in processor design are becoming more essential to accommodate the power performance needs of these applications.
Existing processor datapaths typically consume a small fraction of total processor power and area, so doubling their width doubles peak performance more efficiently than doubling the number of processor cores. There are practical constraints, however, on the number of doublings of the width of registers. The register complex typically comprises the central traffic interchange of the processor, operating at high clock rates. These registers have many input and output ports tightly coupled through a bypass network to multiple execution units. Wider execution units must avoid bottlenecks and sustain a large fraction of peak performance on targeted applications. These processor designs and methods must be capable of sustaining a large fraction of peak performance for algorithms needed by emerging applications such as 5G communications, virtual reality, and neural networks, yet at the same time be highly efficient in area and power.
Thus, there is a need for processor designs and methods that enable orders of magnitude increases in peak performance without greatly complicating the register complex. In particular, many practical applications for such processors, e.g., machine learning and image processing, would benefit from a processor capable of performing an outer product. In an outer product, each element of one vector is multiplied by each element of another vector. For example, given vectors U and V:
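\[
\vec{U} = u_1 e_1 + u_2 e_2 + u_3 e_3 \;\Rightarrow\; \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix},
\qquad
\vec{V} = v_1 e_1 + v_2 e_2 + v_3 e_3 \;\Rightarrow\; \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}
\]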
The outer product of vectors U and V is:
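\[
\vec{U}\,\vec{V}^{T} =
\begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix}
\begin{bmatrix} v_1 & v_2 & v_3 \end{bmatrix}
=
\begin{bmatrix}
u_1 v_1 & u_1 v_2 & u_1 v_3 \\
u_2 v_1 & u_2 v_2 & u_2 v_3 \\
u_3 v_1 & u_3 v_2 & u_3 v_3
\end{bmatrix}
\]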
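As a minimal sketch of the outer product defined above, the following Python code multiplies each element of one vector by each element of the other. The function names `outer_product` and `outer_product_accumulate` are hypothetical, nested lists stand in for the result matrix, and the accumulate variant assumes that "outer product accumulate" means adding the outer product element by element to a previously held result. It illustrates the arithmetic only, not the processor implementation.

```python
# Behavioral sketch only. Function names are hypothetical, and the
# accumulate semantics (element-wise add into a prior result) are an
# assumption made for illustration.

def outer_product(u, v):
    """Return the matrix whose (i, j) entry is u[i] * v[j]."""
    return [[ui * vj for vj in v] for ui in u]


def outer_product_accumulate(acc, u, v):
    """Return acc plus the outer product of u and v, element by element."""
    if len(acc) != len(u) or any(len(row) != len(v) for row in acc):
        raise ValueError("accumulator shape must be len(u) x len(v)")
    return [[acc[i][j] + u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


if __name__ == "__main__":
    u = [1, 2, 3]
    v = [4, 5, 6]
    result = outer_product(u, v)
    for row in result:
        print(row)
    # Accumulating the same outer product again doubles every entry.
    for row in outer_product_accumulate(result, u, v):
        print(row)
```

Two vectors of n elements each produce n*n products, so the number of multiplications grows quadratically while the number of operand elements read grows only linearly.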