1: montgomery_product(A,B,N,R): 2: nHere, I need to explain a couple of variables:_{0}'= -n_{0}^{-1}3: T = 0 4: for i = 0 .. s-1 5: T += a_{i}B d^{i}6: m = t_{i}n_{0}' (mod d) 7: T += m N d^{i}8: return T/R

- s: length of A,B and N in digits.
- d: the base in which the numbers are expressed.
In our case, we will use the longest machine word that can do multiplications
without losing the upper half of the result. So, it will be something
like 2
^{64}.

R = 2If r is a multiple of the machine word length(w), then the division at (8) can be done easily by a copy operation:^{r}, N>>(r-1)=1

for i= 0 .. s-1 K[i]= T[i+s]If that is not the case, then we need to actually shift each word. Since nobody else talks about this issue, I won't either. I will do it if I need to do it. For now, I will assume r%w=0.

The OpenSSL implementation aligns R to the next machine word. Of course this necessiates some divisions to be done, although not too many.

moWORD moN0p(moWORD N0) { moWORD y, B; int i; for(y=1,B=(1<<1),i=1;i<moWORD_BITS;i++,B<<=1) y |= (N0*y) & B; return -y; }

100: function K += a × B: 101: C= 0 102: Hp= 0 103: for i = 0 .. |B|-1 104: H:L= a * B[i] 105: C:K[i] = K[i] + L + C + Hp 106: Hp= H 107: C:K[|B|]= K[|B|] + Hp + C 108: K[|B|+1]= K[|B|+1] + CWith this operator, we can re-write our original algorithm:

1: montgomery_product(A,B,N,R): 2: n0p= -nNotice that the multiplication with d_{0}^{-1}3: T = 0 4: for i = 0 .. s-1 5: T{i} += a[i] × B 6: H: m = T[i] * n0p // ignore H 7: T{i} += m × N 8: return T/R

CIOS algorithm in [4] makes use of the fact that T[0]==0 at the end of each iteration of the outer loop. It shifts the rest of T to the right to save some space. This also avoids the last shift to the right at (8).

1: montgomery_product(A,B,N,R): T 2: n0p= -nNotice that we no longer add to T{i} but T[i]. This is because at the i_{0}^{-1}3: T = 0 4: for i = 0 .. s-1 5: T += a[i] × B 6: H:m = T[0] * n0p // ignore H 7: T += m × N 8: for j= 0 .. s 9: T[j]= T[j+1] 10: return T≥N?T-N:T

In [4], the loop at (8-9) is also eliminated by incorporating the shift into the × operator. We can also do this by providing difference addresses to read and write from the accumulator:

100: function D = K + a × B 101: C= 0 102: Hp= 0 103: for i = 0 .. |B|-1 104: H:L= a * B[i] 105: C:D[i] = K[i] + L + C + Hp 106: Hp= H 107: C:D[|B|]= K[|B|] + Hp + C 108: D[|B|+1]= K[|B|+1] + CWith this modification, only this operator will be sufficient to write the product function.

1: montgomery_product(A,B,N,R): T 2: n0p= -nNotice that we need an extra space at T[-1] in order to store the first value of the second multiplication. So, the total memory requirement for T is s+3._{0}^{-1}3: T = 0 4: for i = 0 .. s-1 5: T{0} = T + a[i] × B 6: H:m = T[0] * n0p 7: T{-1} = T + m × N 8: return T≥N?T-N:T

The extra space can be avoided by unrolling the second loop and rewriting the initial iteration (as is done in [4]). However, the code is much simpler this way and an extra useless write won't hurt us too much, considering all the other operations done. In total, we will make s more writes than the original, which is not very much.

M(a,b)= ab/R (mod N)The montgomery representation of a number u is G(u):

G(u)= uR (mod N) M(u, RSince the arguments to M must be between 0 and N-1, we need to compute (R^{2}) = uR^{2}/R (mod N)= uR (mod N)= G(u)

If R=2At most one subtraction of N from 2^{k}R (mod N) = 2^{k}-1 + 1 (mod N) = (2^{k}-1 (mod N) + 1 (mod N)) (mod N) = (p(k) + 1) (mod N) We already know that 2^{k-1}<N<2^{k}therefore, 2^{k}-1 and N has the same number of bits.

Let's set, U= R (mod N) as calculated above. Then, VObserve that M can be used to compute G(2_{1}= 2R (mod N) can be easily computed with at most one subtraction: 0≤U<N 0≤2U<2N -N≤2U-N<N If we do the subtraction only when 2u≥N, then we have 0≤V_{1}<N with a single comparison and at most one subtraction. So, we got U= R (mod N) V_{1}= 2R (mod N)= 2^{1}R (mod N)

M(VWe can now compute any V_{1},V_{1})= M(2R (mod N),2R (mod N)) M(V_{1},V_{1})= 4R^{2}/R (mod N) M(V_{1},V_{1})= 4R (mod N) = V_{2}V_{2}= 2^{2}R (mod N) If we have V_{a}= 2^{a}R (mod N) V_{b}= 2^{b}R (mod N) Let's apply M to these values to see what happens M(V_{a},V_{b})= M(2^{a}R (mod N),2^{b}R (mod N)) = (2^{a}R (mod N) × 2^{b}R (mod N))/R (mod N) = (2^{a+b}R^{2}(mod N))/R (mod N) = 2^{a+b}R^{2}/R (mod N) = 2^{a+b}R (mod N) = V_{a+b}= G(2^{a+b}) So, V_{a+b}=M(V_{a},V_{b})

k= 2We see that it's enough to only compute V(2^{d0}+2^{d1}..2^{du}, d_{i}distinct Then, V(k)= V(2^{d0}+2^{d1}..2^{du}) V(k)= M(V(2^{d0}),V(2^{d1})..V(2^{du}))

r2_mod_n(2We can simplify the algorithm by getting rid of the A_set variable and the associated conditional. For this, we can initialize A with the identity element for M. This is of course the value R (mod N).^{k-1}<N<2^{k}): // R is 2^{k}U= (2^{k}-1)-N+1 // % is an optional subtraction V= 2U%N // V is V(1)=V(2^{1}) A_set= false for p=1 ; p<=k ; p=2p if (k & p) if (A_set) A= M(A,V) else A= V; A_set= true V= M(V,V) return A

M(a, R (mod N))= aR/R (mod N) = a (mod N)This value is already computed as U in the above algorithm. We can simply get alias U to A since U is not used later on in the algorithm.

31: r2_mod_n(2Well, this is embarassing to admit but after all that work, I ended up with the the exponentiation function. The only difference to the normal exponentiation is the fact that we don't convert the initial value to montgomery form but compute it ourselves. Anyway..^{k-1}<N<2^{k}): 32: A= (2^{k}-1)-N+1 33: V= 2A%N 34: for p=1 ; p<=k ; p=2p 35: if (k & p) A= M(A,V) 36: V= M(V,V) 37: return A

A similar algorithm is given in [3], but that one works only for k that
are powers of two, ie. length of the prime N is 2^{x} bits.

41: montgomery_reduce(A,N): 42: n0p= -n_{0}^{-1}43: T = 0 44: for i = 0 .. s-1 45: H:m = T[0] * n0p 46: T{-1} = T + m × N 47: return T≥N?T-N:T

51: montgomery_exponent(B, X, N): // compute B^{X}(mod N) 52: A= R (mod N) // identity element 53: K= convert_to_montgomery(B,N) 54: for i= 0 .. #bits(X)-1 55: if X{i} A= montgomery_product(A,K,N) 56: K= montgomery_product(K,K,N) 57: return convert_from_montgomery(A,N)

A common defense against this is to do a multiplication no matter X{i} is. When X{i} is 0, the second term of the product is just 1.

While this is a very good looking solution, it's wastes CPU cycles and power. I think an alternative can be used by consuming more memory. Instead of multiplying A immediately at each iteration, we can group the multiplications together so that an attacker can't tell which bits are set. Potentially, they will be able to tell the number of bits set in each group. I don't know how much effect this would have on the security of a key. Here is the method:

51: montgomery_exponent(B, X, N): 52: A= R (mod N) // identity element 53: K= convert_to_montgomery(B,N) 54: for i= 0 .. #words(X)-1 55: for j=0 .. #bits_per_word-1 56: U[j]= K 57: K= montgomery_product(K,K,N) 58: L= ∅ 59: for j=0 .. #bits_per_word-1 60: if X[i]{j}: L= L ⋃ {j} 61: for j in L 62: A= montgomery_product(A,U[j],N) 63: return convert_from_montgomery(A,N)Now the conditional part is much smaller in (60), and it can be made unconditional just by storing j in some dummy variable when it doesn't belong in L.

However, the loop at (58-60) still separates the two multiplication loops. This could expose the number of set bits in each word. This can be overcome by converting X into a list of L values before doing all the multiplications. It would require some extra memory but the setup time between the two multiplication loops would be minimal.

I don't really know how sensitive the equipment used in side-channel attacks can get. Apperantly, they are sensitive enough to detect the loop-setup code at (54) in the original algorithm. Therefore, it's best to stick with the commonly implemented method of multiplying always with something.

I will do the side-channel defense mechanism later on, when I have the code working correctly.

- A Cryptographic Library for the Motorola DSP56000 (S. R. Dusse and B. S. Kaliski Jr.)
- Modular multiplication without trial division (P. L. Montgomery)
- How to compute R2modN? Using Montgomery modular multiplication
- Analyzing and Comparing Montgomery Multiplication Algorithms (T. Acar,Burton S. Kaliski Jr. and C.K. Koc)