SHA stands for Secure Hash Algorithm. It consists of five hash functions designed by the National Security Agency (NSA) and published by the National Institute of Standards and Technology (NIST). The family includes SHA-2, a set of secure hash functions (SHA-224, SHA-256, SHA-384, and SHA-512) developed by the NSA and intended to provide a higher level of security than the SHA-1 algorithm. SHA-224 and SHA-256 are similar algorithms based on a 32-bit word length, producing digests of 224 and 256 bits. SHA-384 and SHA-512 are based on 64-bit words and produce digests of 384 and 512 bits.
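As a quick check of the digest and word sizes listed above, the following Python sketch queries the standard-library hashlib module. The names VARIANTS, digest_bits, and word_bits are illustrative only, and the word-size computation assumes that each variant processes its input in blocks of 16 words.

```python
import hashlib

# Map each SHA-2 variant named in the text to its hashlib constructor.
VARIANTS = {
    "SHA-224": hashlib.sha224,
    "SHA-256": hashlib.sha256,
    "SHA-384": hashlib.sha384,
    "SHA-512": hashlib.sha512,
}


def digest_bits(name):
    """Return the digest length in bits for the named variant."""
    return VARIANTS[name]().digest_size * 8


def word_bits(name):
    """Return the word length in bits, assuming a block holds 16 words."""
    return VARIANTS[name]().block_size * 8 // 16
```

Calling digest_bits and word_bits for each key of VARIANTS gives the digest and word lengths quoted in the paragraph above.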
The SHA-2 algorithm is computationally more complex than SHA-1, relying on carry-propagate additions as well as logical operations and rotates. The critical path for a round of SHA-2 operations consists of four consecutive carry-propagate additions, with the adder inputs determined by complex logical and rotation functions. FIG. 1 depicts details of the SHA-2 algorithm. A, B, C, D, E, F, G, and H represent the 8 words of state (32 bits each for SHA-224/256 and 64 bits each for SHA-384/512). The following operations are performed for each iteration:
Ch(E,F,G) = (E AND F) ⊕ ((NOT E) AND G)
Ma(A,B,C) = (A AND B) ⊕ (A AND C) ⊕ (B AND C)
Σ0(A) = (A>>>2) ⊕ (A>>>13) ⊕ (A>>>22)
Σ1(E) = (E>>>6) ⊕ (E>>>11) ⊕ (E>>>25)
The rotation amounts given here are those for SHA-256; SHA-512 uses different constants. The addition of the constant K and the message input Wi can be performed ahead of the round critical path. The message scheduling function for the SHA-2 algorithm is also more complex than that of SHA-1, relying on rotated copies of previous message inputs to form message inputs:
for i from 16 to 63
    s0 := (w[i−15] ROTR 7) XOR (w[i−15] ROTR 18) XOR (w[i−15] SHR 3)
    s1 := (w[i−2] ROTR 17) XOR (w[i−2] ROTR 19) XOR (w[i−2] SHR 10)
    w[i] := w[i−16] + s0 + w[i−7] + s1
where ROTR (also used as “>>>”) denotes a bitwise right-rotate operator; SHR denotes a bitwise right-shift operator; and XOR denotes a bitwise exclusive-OR operator.
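The message schedule above can be sketched in Python as follows. This is a minimal illustration for the 32-bit (SHA-256) case; the names rotr and expand_schedule are hypothetical, and the additions are reduced modulo 2^32, which is how 32-bit word arithmetic behaves.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def rotr(x, n):
    """Bitwise right-rotate of a 32-bit word (ROTR in the text)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def expand_schedule(block_words):
    """Expand 16 message words w[0..15] into the 64 round inputs w[0..63]."""
    if len(block_words) != 16:
        raise ValueError("expected exactly 16 words of 32 bits")
    w = [word & MASK32 for word in block_words]
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK32)
    return w
```

Each derived word depends only on earlier words, which is why the schedule can be computed ahead of the rounds that consume it.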
For SHA-256, each iteration is performed as follows:
Σ0 := (a ROTR 2) XOR (a ROTR 13) XOR (a ROTR 22)
maj := (a AND b) XOR (a AND c) XOR (b AND c)
t2 := Σ0 + maj
Σ1 := (e ROTR 6) XOR (e ROTR 11) XOR (e ROTR 25)
ch := (e AND f) XOR ((NOT e) AND g)
t1 := h + Σ1 + ch + k[i] + w[i]
h := g
g := f
f := e
e := d + t1
d := c
c := b
b := a
a := t1 + t2
For rounds 1 to 16, the message inputs w[i] are the sixteen 32-bit words of the 512-bit block of data (16 × 32 bits = 512 bits). W[i] for rounds 17 to 64 must be derived. A constant K is specified for each round, so the W[i]+K[i] value for each round can be calculated ahead of the actual round iteration. Further detailed information concerning the SHA-2 specification can be found in the Secure Hash Standard, Federal Information Processing Standards Publication FIPS PUB 180-3, published October 2008.
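The round iteration above can be exercised end to end with the following Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the round constants and initial state are derived from the cube and square roots of the first primes as defined in the Secure Hash Standard, and the padding and the final addition of the working state to the chaining value also follow that standard rather than the text above.

```python
import hashlib
import math
import struct

MASK32 = 0xFFFFFFFF


def rotr(x, n):
    """Bitwise right-rotate of a 32-bit word."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def _icbrt(n):
    """Largest integer x with x**3 <= n (exact integer cube root)."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def _first_primes(count):
    """First `count` primes by trial division (fine for 64 values)."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


_PRIMES = _first_primes(64)
# First 32 fractional bits of the cube roots (K) and square roots (H0).
K = [_icbrt(p << 96) & MASK32 for p in _PRIMES]
H0 = [math.isqrt(p << 64) & MASK32 for p in _PRIMES[:8]]


def sha256_round(state, wk):
    """One iteration; wk is the precomputed w[i] + k[i] for this round."""
    a, b, c, d, e, f, g, h = state
    sigma0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (sigma0 + maj) & MASK32
    sigma1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & MASK32 & g)
    t1 = (h + sigma1 + ch + wk) & MASK32
    return ((t1 + t2) & MASK32, a, b, c, (d + t1) & MASK32, e, f, g)


def compress(chaining, block):
    """Process one 64-byte block: schedule, 64 rounds, then feed-forward."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK32)
    # W[i] + K[i] is computed ahead of the rounds, off the critical path.
    wk = [(w[i] + K[i]) & MASK32 for i in range(64)]
    state = tuple(chaining)
    for i in range(64):
        state = sha256_round(state, wk[i])
    return [(x + y) & MASK32 for x, y in zip(chaining, state)]


def sha256_sketch(message):
    """Hex digest of `message` (bytes), built from the round function above."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", 8 * len(message))
    h = list(H0)
    for offset in range(0, len(padded), 64):
        h = compress(h, padded[offset:offset + 64])
    return struct.pack(">8I", *h).hex()


def matches_hashlib(message):
    """Cross-check the sketch against the standard-library SHA-256."""
    return sha256_sketch(message) == hashlib.sha256(message).hexdigest()
```

Calling matches_hashlib on a few byte strings is a simple way to confirm that the round function and the precomputed W[i]+K[i] inputs agree with an independent implementation.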
The round processing requires 8 (32-bit) state variables, A through H. It is possible to split these across two 128-bit registers. However, to be able to compute a whole round in the data path, we also need the w[i]+k[i] inputs. Even if these are added earlier, they introduce at least one more source operand, which we cannot use in a 2-source processor. One option would be to add the WK value to H prior to each round and have the instruction process one round. This would add instructions that limit the throughput and, more importantly, add directly to the latency of a round, since the next round cannot start before the previous round completes. If we have 1-cycle and 3-cycle pipelines in the single instruction multiple data (SIMD) units, then we would be limited to 1+3 cycles per round at best. In a previous design, we proposed an instruction set on the 128-bit register set with two operands that is able to achieve 3 cycles per round by careful partitioning of the state variables, injection of the WK values, and computing multiple rounds per instruction. There has been a lack of efficient ways to perform the above operations.
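The option of adding the WK value to H before each round can be illustrated with a short Python sketch. It models only the arithmetic, not any actual instruction; the names round_three_source, inject_wk, round_two_source, and cycles_per_block are hypothetical, and the cycle estimate simply multiplies the per-round figures quoted above by an assumed 64 rounds per block.

```python
MASK32 = 0xFFFFFFFF


def rotr(x, n):
    """Bitwise right-rotate of a 32-bit word."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def _finish_round(state, t1):
    """Shared tail of a round: compute t2 and shift the state words."""
    a, b, c, d, e, f, g, _ = state
    sigma0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (sigma0 + maj) & MASK32
    return ((t1 + t2) & MASK32, a, b, c, (d + t1) & MASK32, e, f, g)


def _sigma1_plus_ch(state):
    """Sigma1(e) + Ch(e, f, g), the state-only part of t1."""
    e, f, g = state[4], state[5], state[6]
    sigma1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & MASK32 & g)
    return (sigma1 + ch) & MASK32


def round_three_source(state, wk):
    """Round that takes the state plus a separate WK operand."""
    t1 = (state[7] + _sigma1_plus_ch(state) + wk) & MASK32
    return _finish_round(state, t1)


def inject_wk(state, wk):
    """Separate add step: fold WK into H before the round starts."""
    return state[:7] + ((state[7] + wk) & MASK32,)


def round_two_source(state):
    """Round that reads only the state; assumes WK is already in H."""
    t1 = (state[7] + _sigma1_plus_ch(state)) & MASK32
    return _finish_round(state, t1)


def cycles_per_block(cycles_per_round, rounds=64):
    """Rough latency of one block when rounds run strictly in sequence."""
    return cycles_per_round * rounds
```

Because t1 is a plain sum, round_two_source(inject_wk(state, wk)) yields the same state as round_three_source(state, wk); the cost is the extra add on the serial path, which is where the 1+3 cycles per round figure comes from.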