Field of Invention
The present invention pertains to the computing sciences generally, and, more specifically, to an apparatus and method of an execution unit for calculating multiple rounds of a Skein hashing algorithm.
Background
FIG. 1 shows a high level diagram of a processing core 100 implemented with logic circuitry on a semiconductor chip. The processing core includes a pipeline 101. The pipeline consists of multiple stages each designed to perform a specific step in the multi-step process needed to fully execute a program code instruction. These typically include at least: 1) instruction fetch and decode; 2) data fetch; 3) execution; 4) write-back. The execution stage performs a specific operation identified by an instruction that was fetched and decoded in prior stage(s) (e.g., in step 1) above) upon data identified by the same instruction and fetched in another prior stage (e.g., step 2) above). The data that is operated upon is typically fetched from (general purpose) register storage space 102. New data that is created at the completion of the operation is also typically “written back” to register storage space (e.g., at stage 4) above).
The logic circuitry associated with the execution stage is typically composed of multiple “execution units” or “functional units” 103_1 to 103_N, each of which is designed to perform its own unique subset of operations (e.g., a first functional unit performs integer math operations, a second functional unit performs floating point instructions, a third functional unit performs load/store operations from/to cache/memory, etc.). The collection of all operations performed by all the functional units corresponds to the “instruction set” supported by the processing core 100.
Two types of processor architectures are widely recognized in the field of computer science: “scalar” and “vector”. A scalar processor is designed to execute instructions that perform operations on a single set of data, whereas a vector processor is designed to execute instructions that perform operations on multiple sets of data. FIGS. 2A and 2B present a comparative example that demonstrates the basic difference between a scalar processor and a vector processor.
FIG. 2A shows an example of a scalar AND instruction in which a single operand set, A and B, is ANDed together to produce a singular (or “scalar”) result C (i.e., A.AND.B=C). By contrast, FIG. 2B shows an example of a vector AND instruction in which two operand sets, A/B and D/E, are respectively ANDed together in parallel to simultaneously produce a vector result C, F (i.e., A.AND.B=C and D.AND.E=F). As a matter of terminology, a “vector” is a data element having multiple “elements”. For example, a vector V=Q, R, S, T, U has five different elements: Q, R, S, T and U. The “size” of the exemplary vector V is five (because it has five elements).
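By way of illustration only, the distinction between the scalar AND of FIG. 2A and the vector AND of FIG. 2B may be sketched in software as follows. The function names `scalar_and` and `vector_and` and the operand values are hypothetical and chosen only for this example; a software loop processes elements one after another, whereas a vector execution unit would operate on them in parallel.

```python
from typing import List


def scalar_and(a: int, b: int) -> int:
    """Scalar AND: one operand pair in, one result out (A.AND.B=C)."""
    return a & b


def vector_and(xs: List[int], ys: List[int]) -> List[int]:
    """Vector AND: element-wise AND of two equally sized vectors."""
    if len(xs) != len(ys):
        raise ValueError("vector operands must have the same size")
    return [x & y for x, y in zip(xs, ys)]


# Scalar case: a single operand set A, B produces a single result C.
C = scalar_and(0b1100, 0b1010)

# Vector case: operand sets A/B and D/E produce the vector result C, F.
C_F = vector_and([0b1100, 0b1111], [0b1010, 0b0101])
```

Here the first input vector holds A and D and the second holds B and E, so that each result element is formed from the operands at the same position.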
FIG. 1 also shows the presence of vector register space 107 that is different than general purpose register space 102. Specifically, general purpose register space 102 is nominally used to store scalar values. As such, when any of the execution units perform scalar operations they nominally use operands called from (and write results back to) general purpose register storage space 102. By contrast, when any of the execution units perform vector operations they nominally use operands called from (and write results back to) vector register space 107. Different regions of memory may likewise be allocated for the storage of scalar values and vector values.
Note also the presence of masking logic 104_1 to 104_N and 105_1 to 105_N at the respective inputs to and outputs from the functional units 103_1 to 103_N. In various implementations, only one of these layers is actually implemented—although that is not a strict requirement. For any instruction that employs masking, input masking logic 104_1 to 104_N and/or output masking logic 105_1 to 105_N may be used to control which elements are effectively operated on for the vector instruction. Here, a mask vector is read from a mask register space 106 (e.g., along with input data vectors read from vector register storage space 107) and is presented to at least one of the masking logic 104, 105 layers.
Over the course of executing vector program code each vector instruction need not require a full data word. For example, the input vectors for some instructions may only be 8 elements, the input vectors for other instructions may be 16 elements, the input vectors for other instructions may be 32 elements, etc. Masking layers 104/105 are therefore used to identify a set of elements of a full vector data word that apply for a particular instruction so as to effect different vector sizes across instructions. Typically, for each vector instruction, a specific mask pattern kept in mask register space 106 is called out by the instruction, fetched from mask register space and provided to either or both of the mask layers 104/105 to “enable” the correct set of elements for the particular vector operation.
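The effect of a mask on a vector operation may be sketched as follows. This is a simplified software model and not a description of the masking logic 104/105 itself: the helper `masked_vector_add`, the 16 element full width, and the policy of leaving disabled destination elements unchanged are all assumptions made for the example (an implementation could instead zero the disabled elements).

```python
from typing import List


def masked_vector_add(xs: List[int], ys: List[int],
                      mask: int, dest: List[int]) -> List[int]:
    """Add xs and ys element-wise where bit i of mask enables element i.

    Disabled elements keep their prior destination value (assumed policy).
    """
    n = len(dest)
    if not (len(xs) == len(ys) == n):
        raise ValueError("all vectors must have the same size")
    return [
        (xs[i] + ys[i]) if (mask >> i) & 1 else dest[i]
        for i in range(n)
    ]


FULL_WIDTH = 16                     # assumed size of a full vector data word
xs = list(range(FULL_WIDTH))        # 0, 1, ..., 15
ys = [100] * FULL_WIDTH
dest = [-1] * FULL_WIDTH            # prior contents of the destination

mask_8 = (1 << 8) - 1               # enable only the low 8 elements
result = masked_vector_add(xs, ys, mask_8, dest)
```

With this mask, only the low eight elements of `result` hold sums, which models an instruction whose effective vector size is 8 elements within a 16 element data word.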
FIGS. 3a through 3d pertain to a Skein hashing algorithm. FIG. 3a shows an exemplary high level processing flow of a Skein hashing algorithm 300. Typically, the Skein hashing algorithm is performed on pairs of 64 bit data chunks. Each 64 bit data chunk can be referred to as a “quadword”. In the exemplary high level processing flow of FIG. 3a, inputs 301a through 301h correspond to respective quadwords. That is, a first quadword is presented at input 301a, a second quadword is presented at input 301b, etc.
In the case of Skein-256, 256 input bits (4 input quadwords) are presented to the hashing algorithm. In the case of Skein-512, 512 input bits (8 input quadwords) are presented to the hashing algorithm. In the case of Skein-1024, 1024 input bits (16 input quadwords) are presented to the hashing algorithm.
FIG. 3a shows an example of a Skein-512 algorithm. As observed in the example of FIG. 3a, a first “subkey addition” is performed 300 on the initial input quadwords 301a-h. A subkey addition is the addition of a numeric value equal in size to a value represented by the quadwords presented to it. For example, in the case of Skein-512, eight quadwords are used to construct a 512 bit value. As such, the subkey is also 512 bits and is added directly to the value represented by the eight quadwords of internal state. The value of a subkey, and/or its method of calculation, is readily available to those of ordinary skill and need not be discussed here.
According to the flow diagram of FIG. 3a, a “round” consists of a “mix” operational level 302 followed by a “permute” operational level 303. A single mix operation is performed on a pair of quadwords. As such, for Skein-512, four individual mix operations 302a through 302d are performed at the mix operational level 302 to build an internal state of 512 bits. The permute operational level 303 shuffles the outputs of the mix operations. An example of a Skein-512 permute pattern is observed in FIG. 3b (Skein-256 and Skein-1024 have their own permute patterns).
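The structure of a single Skein-512 round may be sketched as follows. Several items here are assumptions: the permute table `PERMUTE_512` is the Skein-512 word permutation as given in the published Skein specification and is not transcribed from FIG. 3b, the mix operations are assumed to act on adjacent quadword pairs, and the names `round_512` and `passthrough_mix` are hypothetical. The mix operation is passed in as a parameter so that the sketch shows only the mix/permute ordering.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed Skein-512 permute pattern (from the published Skein
# specification): output word i is taken from mixed word PERMUTE_512[i].
PERMUTE_512 = (2, 1, 4, 7, 6, 5, 0, 3)

# A mix takes (pair index, left input, right input) and returns
# (left output, right output).
MixFn = Callable[[int, int, int], Tuple[int, int]]


def round_512(state: Sequence[int], mix: MixFn) -> List[int]:
    """One round: four mixes on quadword pairs, then one permute."""
    if len(state) != 8:
        raise ValueError("Skein-512 internal state is eight quadwords")
    mixed: List[int] = []
    for j in range(4):                      # mix level: four mixes
        left_out, right_out = mix(j, state[2 * j], state[2 * j + 1])
        mixed += [left_out, right_out]
    return [mixed[src] for src in PERMUTE_512]   # permute level


def passthrough_mix(j: int, left: int, right: int) -> Tuple[int, int]:
    """Placeholder mix that changes nothing, to expose the permute alone."""
    return left, right


# With a pass-through mix, the output shows where each quadword lands.
shuffled = round_512(list(range(8)), passthrough_mix)
```

Using a pass-through mix on the quadwords 0 through 7 makes the shuffle itself visible, since each output value is simply the index of the quadword that was moved into that position.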
A sequence of four rounds 304a, 304b, 304c and 304d is followed by another subkey addition 305, and, the process of four rounds followed by a subkey addition repeats (e.g., 18 total times) until a preset total number of rounds is computed (e.g., 72 total rounds).
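The ordering of rounds and subkey additions described above may be sketched as a simple schedule. The function name `skein_schedule` and the step labels are hypothetical; the sketch only enumerates the order of operations for the exemplary figures of 72 total rounds in groups of four, and performs no hashing.

```python
from typing import List


def skein_schedule(total_rounds: int = 72,
                   rounds_per_group: int = 4) -> List[str]:
    """List the order of subkey additions and rounds."""
    if total_rounds % rounds_per_group != 0:
        raise ValueError("total rounds must be a multiple of the group size")
    steps = ["subkey_add"]          # first subkey addition on the input
    for r in range(total_rounds):
        steps.append(f"round_{r}")
        if (r + 1) % rounds_per_group == 0:
            steps.append("subkey_add")   # one addition after each group
    return steps


steps = skein_schedule()
num_subkey_adds = steps.count("subkey_add")
```

Under these figures the schedule contains 18 groups of four rounds, and, counting the first subkey addition performed on the input, 19 subkey additions in total.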
FIG. 3c shows a mix operation. As observed in FIG. 3c, a left input quadword 310a is added to a right input quadword 310b to produce a left output quadword 311a. The right input quadword 310b is also rotated 312. The left output 311a is XORed with the rotated right input quadword to produce a right output quadword 311b. The amount of rotation that is applied to the right input quadword 310b is a function of the specific round being executed and of where the quadword resides within the set of quadwords that make up the algorithm's internal state (e.g., 512 bit internal state for the Skein-512 algorithm). FIG. 3d shows an embodiment of a scheme used to determine the rotation.
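A single mix operation may be sketched in software as follows. The sketch assumes, consistent with the published Skein specification, that the addition is performed modulo 2^64 and that the rotation is a left rotation; the rotation amount is supplied by the caller rather than taken from the scheme of FIG. 3d, and the names `rotl64` and `mix` are hypothetical.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1              # keeps values within one quadword


def rotl64(x: int, r: int) -> int:
    """Rotate a 64 bit value left by r bit positions."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def mix(left_in: int, right_in: int, rotation: int) -> Tuple[int, int]:
    """One mix: add, rotate the right input, then XOR."""
    left_out = (left_in + right_in) & MASK64          # add (modulo 2^64)
    right_out = rotl64(right_in, rotation) ^ left_out  # rotate, then XOR
    return left_out, right_out


# Example with an assumed rotation of 4: the addition wraps around to 0,
# so the right output equals the rotated right input.
left_out, right_out = mix(0xFFFFFFFFFFFFFFFF, 0x1, 4)
```

In this example the left output is 0 because the sum overflows a quadword, and the right output is 16, i.e., the right input 1 rotated left by four bit positions and XORed with 0.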