The invention relates generally to the field of systems and computer-implemented methods for generating transforms of vectors, and more specifically to systems and computer-implemented methods for efficiently generating Walsh transforms.
The Walsh transform is used in a number of areas such as image processing, communications and the like as a fast way to generate an approximation to the fast Fourier transform ("FFT"). Recently, the Walsh transform has also been used in cryptography and in testing the randomness of sequences of pseudo-random numbers, for which it is often necessary to generate the Walsh transform of large data sets, sometimes on the order of a billion or more data items. Accordingly, it is desirable to be able to generate a Walsh transform as efficiently as possible.
Generally, the Walsh transform of a data set f(x) containing "N" data items is defined as:

$$W(u) = \frac{1}{N} \sum_{x=0}^{N-1} f(x) \prod_{i=0}^{n-1} (-1)^{b_i(x)\, b_{n-i-1}(u)}, \qquad (1)$$
where b_i(x) gives the i-th bit of "x" and n=Log2(N). The Walsh transform can be generated for data sets for which "N" is a power of two. The usual practice is to view the data set as a vector comprising N elements, and to generate the transform using N/2 radix-two butterflies organized in Log2(N) stages, with each radix-two butterfly being a pair of add and subtract operations, as follows:
(1) temp1=f(i1)+f(i2)
(2) temp2=f(i1)−f(i2)
(3) W(i1)=temp1
(4) W(i2)=temp2
where, at any stage, f(i1) and f(i2) are the i1-th and i2-th components of the input vector or of the output of the previous stage, and W(i1) and W(i2) are the i1-th and i2-th components of the output of the current stage. Thus, in a computer in which the processor is constructed according to the "load-store" architecture, each butterfly requires two loads from memory (retrieving f(i1) and f(i2) for use in lines (1) and (2)), two arithmetic operations (the addition and subtraction operations in lines (1) and (2)) and two memory storage operations (lines (3) and (4)), or six operations in total. Since there are N/2 butterflies in each stage, the total number of operations per stage is 3N. Further, since there are Log2N stages, to generate a Walsh transform for a vector of "N" components using radix-two butterflies, the processor would need to perform 3N Log2N operations. On a computer system capable of performing one memory access operation concurrently with an arithmetic operation during each processing cycle, the two arithmetic operations of each butterfly can be performed in parallel with two of its four memory access operations, so that each butterfly occupies four processing cycles, and therefore the total number of processing cycles required to perform a radix-two Walsh transform is 2N Log2N. It will be appreciated that the processor can over-write the input vector with the output Walsh transform vector in memory, thereby reducing the amount of storage space required for the Walsh transform operation.
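By way of illustration only, the radix-two procedure and its operation count can be sketched in Python as follows. The function names, the choice of i2=i1+stride with the stride doubling at each stage, and the treatment of bit 0 as the least significant bit are assumptions made for this sketch and are not taken from the description above. The butterflies omit the 1/N factor of equation (1), and with this index choice the components come out in bit-reversed order relative to equation (1), so the check against the direct evaluation applies both adjustments.

```python
def walsh_direct(f):
    """Evaluate equation (1) term by term. O(N^2); intended only as a check."""
    N = len(f)
    n = N.bit_length() - 1
    if N < 1 or N != 1 << n:
        raise ValueError("length must be a power of two")
    W = []
    for u in range(N):
        total = 0
        for x in range(N):
            # Exponent of -1: sum over i of b_i(x) * b_{n-i-1}(u),
            # with bit 0 taken as the least significant bit (assumption).
            e = sum(((x >> i) & 1) * ((u >> (n - i - 1)) & 1) for i in range(n))
            total += -f[x] if e & 1 else f[x]
        W.append(total / N)
    return W


def walsh_radix2(f):
    """Log2(N) stages of N/2 add/subtract butterflies on a copy of f.

    Returns (unnormalized result, number of load/store/arithmetic operations).
    """
    v = list(f)
    N = len(v)
    if N < 1 or N & (N - 1):
        raise ValueError("length must be a power of two")
    ops = 0
    stride = 1
    while stride < N:
        for start in range(0, N, 2 * stride):
            for i1 in range(start, start + stride):
                i2 = i1 + stride            # index choice is an assumption
                a, b = v[i1], v[i2]         # two loads
                v[i1] = a + b               # one addition, one store
                v[i2] = a - b               # one subtraction, one store
                ops += 6
        stride *= 2
    return v, ops


def bit_reverse(u, n):
    """Reverse the n low-order bits of u."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (u & 1)
        u >>= 1
    return r


f = [1, 2, 3, 4]
H, ops = walsh_radix2(f)   # H == [10, -2, -4, 0]; ops == 24 == 3 * 4 * Log2(4)
W = walsh_direct(f)        # W == [2.5, -1.0, -0.5, 0.0]
assert all(W[u] == H[bit_reverse(u, 2)] / len(f) for u in range(len(f)))
```

The result is built on a copy so that the input remains available for the comparison; operating on the list in place would correspond to over-writing the input vector as noted above.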
The number of operations required to be performed to generate a Walsh transform can be reduced significantly if higher-radix butterflies are used. If, for example, a radix-4 butterfly
(1) x1=f(i1)+f(i2)
(2) x2=f(i1)−f(i2)
(3) x3=f(i3)+f(i4)
(4) x4=f(i3)−f(i4)
(5) y1=x1+x3
(6) y2=x1−x3
(7) y3=x2+x4
(8) y4=x2−x4
(9) W(i1)=y1
(10) W(i2)=y2
(11) W(i3)=y3
(12) W(i4)=y4
is used, the Walsh transform would be generated using Log4N stages, with each stage containing N/4 butterflies. In that case, each butterfly would require eight memory accesses (that is, load and store operations, reflected in lines (1) through (4) and (9) through (12)) and eight arithmetic operations (reflected in lines (1) through (8)), requiring 4N Log4N (which corresponds to 2N Log2N) operations for all of the butterflies to generate the entire transform. On a computer system capable of performing one memory operation concurrently with an arithmetic operation, the total number of processing cycles required to perform the radix-four Walsh transform is 2N Log4N, which is half of the 2N Log2N processing cycles required with radix-two butterflies.
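The radix-four case can be sketched in the same way. In this illustrative Python fragment the butterfly follows lines (1) through (12), with y1=x1+x3 stored to W(i1); the function names and the selection of i1 through i4 as four components spaced one stride apart, with the stride multiplied by four at each stage, are assumptions made for this sketch. With that selection the same coefficients are produced as by a radix-two pass, although in a different order, and the 1/N factor is again omitted.

```python
def butterfly4(v, i1, i2, i3, i4):
    """Lines (1)-(12) of the radix-4 butterfly, applied in place to list v."""
    f1, f2, f3, f4 = v[i1], v[i2], v[i3], v[i4]   # four loads
    x1 = f1 + f2
    x2 = f1 - f2
    x3 = f3 + f4
    x4 = f3 - f4
    v[i1] = x1 + x3   # y1
    v[i2] = x1 - x3   # y2
    v[i3] = x2 + x4   # y3
    v[i4] = x2 - x4   # y4 (four stores in all)
    return 16         # eight memory accesses plus eight arithmetic operations


def walsh_radix4(f):
    """Log4(N) stages of N/4 radix-4 butterflies on a copy of f.

    Returns (unnormalized result, number of operations).
    """
    v = list(f)
    N = len(v)
    n = N.bit_length() - 1
    if N < 1 or N != 1 << n or n % 2:
        raise ValueError("length must be a power of four")
    ops = 0
    stride = 1
    while stride < N:
        for start in range(0, N, 4 * stride):
            for j in range(start, start + stride):
                # Index selection is an assumption of this sketch.
                ops += butterfly4(v, j, j + stride, j + 2 * stride, j + 3 * stride)
        stride *= 4
    return v, ops


out, ops = walsh_radix4([1, 2, 3, 4])   # out == [10, -4, -2, 0]; ops == 16
out16, ops16 = walsh_radix4(list(range(16)))
assert ops16 == 4 * 16 * 2              # 4N Log4(N), equal to 2N Log2(N) for N = 16
```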
Similarly, if a radix-8 butterfly
(1) x1=f(i1)+f(i2)
(2) x2=f(i1)−f(i2)
(3) x3=f(i3)+f(i4)
(4) x4=f(i3)−f(i4)
(5) x5=f(i5)+f(i6)
(6) x6=f(i5)−f(i6)
(7) x7=f(i7)+f(i8)
(8) x8=f(i7)−f(i8)
(9) y1=x1+x3
(10) y2=x1−x3
(11) y3=x5+x7
(12) y4=x5−x7
(13) y5=x2+x4
(14) y6=x2−x4
(15) y7=x6+x8
(16) y8=x6−x8
(17) W(i1)=y1+y3
(18) W(i2)=y5+y7
(19) W(i3)=y2+y4
(20) W(i4)=y6+y8
(21) W(i5)=y1−y3
(22) W(i6)=y5−y7
(23) W(i7)=y2−y4
(24) W(i8)=y6−y8
is used, the number of operations is (Log8N)(N/8)(24 arithmetic operations+16 memory accesses), or 5N Log8N operations. As with the radix-four butterfly described above, on a computer system capable of performing one memory access concurrently with an arithmetic operation, the total number of processing cycles required to perform the radix-eight Walsh transform is 3N Log8N. This corresponds to the number of processing cycles required for the radix-four Walsh transform, since 3N Log8N and 2N Log4N both equal N Log2N. In the radix-eight Walsh transform, however, the interval between the time the data are loaded and the time they are used in processing is larger than in the radix-four Walsh transform, and so the radix-eight Walsh transform can generally be implemented more efficiently.
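The radix-eight butterfly may likewise be sketched, for illustration only, as a Python function. Here W(i4) is computed as y6+y8, the combination that makes the eight outputs a complete eight-point transform. The function names and the selection of i1 through i8 as eight components spaced one stride apart are assumptions of this sketch, and the 1/N factor is omitted.

```python
def butterfly8(v, idx):
    """Radix-8 butterfly, lines (1)-(24), applied in place to v at the 8 indices idx."""
    f1, f2, f3, f4, f5, f6, f7, f8 = (v[i] for i in idx)   # eight loads
    x1, x2 = f1 + f2, f1 - f2
    x3, x4 = f3 + f4, f3 - f4
    x5, x6 = f5 + f6, f5 - f6
    x7, x8 = f7 + f8, f7 - f8
    y1, y2 = x1 + x3, x1 - x3
    y3, y4 = x5 + x7, x5 - x7
    y5, y6 = x2 + x4, x2 - x4
    y7, y8 = x6 + x8, x6 - x8
    outputs = (y1 + y3, y5 + y7, y2 + y4, y6 + y8,
               y1 - y3, y5 - y7, y2 - y4, y6 - y8)
    for i, w in zip(idx, outputs):                          # eight stores
        v[i] = w
    return 40   # 24 arithmetic operations plus 16 memory accesses


def walsh_radix8(f):
    """Log8(N) stages of N/8 radix-8 butterflies on a copy of f.

    Returns (unnormalized result, number of operations).
    """
    v = list(f)
    N = len(v)
    n = N.bit_length() - 1
    if N < 1 or N != 1 << n or n % 3:
        raise ValueError("length must be a power of eight")
    ops = 0
    stride = 1
    while stride < N:
        for start in range(0, N, 8 * stride):
            for j in range(start, start + stride):
                # Index selection is an assumption of this sketch.
                ops += butterfly8(v, [j + k * stride for k in range(8)])
        stride *= 8
    return v, ops


out, ops = walsh_radix8([1, 0, 0, 0, 0, 0, 0, 0])   # out == [1] * 8; ops == 40
out64, ops64 = walsh_radix8(list(range(64)))
assert ops64 == 5 * 64 * 2                           # 5N Log8(N) for N = 64
```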
Generally, use of higher-radix butterflies can further reduce the number of operations required to be performed to generate a Walsh transform. In addition, depending on its architecture and internal resources, such as the number of registers and the size of its cache, a particular processor will typically be able to realize this reduction for higher-radix butterflies. It will be appreciated, however, that beyond a certain radix, the number of results that would need to be stored internally (generally, the yn values in the descriptions above) in order to take advantage of the reduced number of operations would be greater than the internal resources available. When that occurs, those results would need to be stored externally of the processor, resulting in a leveling off of the advantage that might come from higher-radix butterflies.
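The trend can be made concrete with a small calculation. The following Python sketch generalizes the counts given above for radix two, four and eight to a radix of 2^k, on the assumption that a radix-2^k butterfly performs 2^k loads, 2^k stores and k layers of 2^k add or subtract operations, and that a butterfly occupies as many processing cycles as the larger of its memory access count and its arithmetic count. That generalization, the cycle model and the function name are assumptions of this sketch; only the radix-two, radix-four and radix-eight figures come from the description above.

```python
def butterfly_costs(k, n):
    """Costs of a Walsh transform of N = 2**n components using radix-2**k butterflies.

    Returns (total operations, total cycles, values held per butterfly).
    Assumes each butterfly does R loads, R stores and k*R add/subtracts (R = 2**k),
    and that one memory access can overlap one arithmetic operation per cycle.
    """
    if k < 1 or n < k or n % k:
        raise ValueError("n must be a positive multiple of k")
    N = 1 << n
    R = 1 << k
    stages = n // k
    butterflies = N // R
    memory = 2 * R
    arithmetic = R * k
    ops = stages * butterflies * (memory + arithmetic)
    cycles = stages * butterflies * max(memory, arithmetic)
    return ops, cycles, R


# N = 4096 components (n = 12), for radices 2, 4, 8, 16 and 64.
for k in (1, 2, 3, 4, 6):
    ops, cycles, held = butterfly_costs(k, 12)
    print(f"radix {1 << k:2d}: {ops:6d} operations, {cycles:5d} cycles, {held:2d} values held")
```

Under these assumptions the operation count continues to fall as the radix grows, while the number of values that must be held internally for each butterfly doubles with each step in k, which is the resource limit described above.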
The invention provides a new and improved system and computer-implemented method for efficiently generating Walsh transforms of input vectors.
In brief summary, the invention provides a system for generating a Walsh transform output vector from an "N"-component input vector. The system includes a vector store, a plurality of Walsh transform kernels and a control module. The vector store is configured to store the input vector. Each of the Walsh transform kernels is configured to generate a Walsh transform of a predetermined radix, with at least two of the Walsh transform kernels generating respective Walsh transforms of different radices A and B, with B less than A. The control module is configured to determine a factorization N=A^a B^b and, in each of "a" stages associated with the radix-A Walsh transform kernel and "b" stages associated with the radix-B Walsh transform kernel, to determine a stride value for the stage and, in each of several iterations, to use the stride value to select from the vector store the vector components to be processed during the iteration, to apply the radix-A or radix-B Walsh transform kernel associated with the stage to the selected vector components, and to store the result in the vector store.
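One possible reading of this arrangement is sketched below in Python, for illustration only. The names factorize, kernel and walsh_mixed, the rule that the stride starts at one and is multiplied by the radix after each stage, the ordering of the radix-A stages before the radix-B stages, and the use of a single generic kernel for both radices are all assumptions of this sketch; the summary above does not specify them. A Python list stands in for the vector store, and the 1/N factor of the transform is omitted.

```python
def factorize(N, A, B):
    """Find non-negative (a, b) with N == A**a * B**b, preferring larger a."""
    if not (A > B >= 2):
        raise ValueError("radices must satisfy A > B >= 2")
    a_max = 0
    while N % A ** (a_max + 1) == 0 and A ** (a_max + 1) <= N:
        a_max += 1
    for a in range(a_max, -1, -1):
        rest, b = N // A ** a, 0
        while rest > 1 and rest % B == 0:
            rest //= B
            b += 1
        if rest == 1:
            return a, b
    raise ValueError("N cannot be written as A**a * B**b")


def kernel(v, idx):
    """Walsh transform of the len(idx) components of v selected by idx, in place."""
    vals = [v[i] for i in idx]                 # load the selected components
    h = 1
    while h < len(vals):
        for s in range(0, len(vals), 2 * h):
            for j in range(s, s + h):
                p, q = vals[j], vals[j + h]
                vals[j], vals[j + h] = p + q, p - q
        h *= 2
    for i, w in zip(idx, vals):                # store the results back
        v[i] = w


def walsh_mixed(f, A=8, B=2):
    """Unnormalized Walsh transform using a radix-A stages and b radix-B stages."""
    for radix in (A, B):
        if radix < 2 or radix & (radix - 1):
            raise ValueError("radices must be powers of two")
    v = list(f)
    N = len(v)
    if N < 1:
        raise ValueError("input vector must not be empty")
    a, b = factorize(N, A, B)
    stride = 1                                  # stride rule is an assumption
    for radix in [A] * a + [B] * b:
        for start in range(0, N, radix * stride):
            for j in range(start, start + stride):
                kernel(v, [j + k * stride for k in range(radix)])
        stride *= radix
    return v


a, b = factorize(32, 8, 2)          # (1, 2): one radix-8 stage, two radix-2 stages
result = walsh_mixed([1, 2, 3, 4])  # [10, -2, -4, 0], using two radix-2 stages
```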