The processing elements described in the aforementioned U.S. Pat. No. 7,080,110 can be operated in a more efficient fashion. In particular, it is noted that the cited U.S. Pat. No. 6,978,016 discusses two modes of operation: a CRT (Chinese Remainder Theorem) mode and a non-CRT mode. In CRT mode, the chain of processing elements can be split so as to perform two Montgomery multiplication operations at the same time. In non-CRT mode, all of the processing elements operate as a single chain. When operands of large size are presented to the engine, the rightmost processing element, PE0, bears a heavy processing load, while processing elements further “downstream” experience a very light load. For example, the RSA implementation for the cryptography engine described in the patents cited above exhibited poor load balancing: in one case, one processing element experienced 16 loadings while some others had only two. Such an imbalance makes timing goals more difficult to meet and lengthens the overall system time to completion.
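The distinction between the two modes can be illustrated in software. The following is a minimal sketch only, assuming a textbook RSA private-key operation with toy parameters; it does not model the processing elements, the Montgomery multiplier, or the loadings discussed above. The function names `modexp_single_chain` and `modexp_crt` and all numeric values are hypothetical and chosen for illustration. The sketch shows why CRT mode yields two independent half-size exponentiations, which is what allows a chain to be split into two halves working at the same time.

```python
# Illustrative sketch: non-CRT versus CRT evaluation of c^d mod n.
# Assumption: toy primes and textbook RSA; not the patented hardware.

def modexp_single_chain(c, d, n):
    """Non-CRT mode analogue: one full-width modular exponentiation."""
    return pow(c, d, n)

def modexp_crt(c, d, p, q):
    """CRT mode analogue: two independent half-width exponentiations,
    recombined with Garner's formula."""
    dp = d % (p - 1)          # reduced exponent for the mod-p half
    dq = d % (q - 1)          # reduced exponent for the mod-q half
    q_inv = pow(q, -1, p)     # modular inverse (Python 3.8+)

    # These two computations share no data and could run concurrently,
    # for example on the two halves of a split chain.
    m_p = pow(c % p, dp, p)
    m_q = pow(c % q, dq, q)

    # Recombine the two half-size results into the result mod p*q.
    h = (q_inv * (m_p - m_q)) % p
    return m_q + h * q

# Toy parameters (hypothetical, far too small for real use).
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))

message = 65
ciphertext = pow(message, e, n)

recovered_single = modexp_single_chain(ciphertext, d, n)
recovered_crt = modexp_crt(ciphertext, d, p, q)
```

Both paths recover the same value. The CRT path works on operands roughly half the width of the modulus, whereas the single-chain path processes full-width operands throughout.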