Montgomery Multiplication

The cleanest implementation I found is in [1], under the title "Improvement 2". This is the same algorithm as CIOS in [4], but without the lower-level details.
  1: montgomery_product(A,B,N,R):
  2:   n0'= -n0-1
  3:   T = 0
  4:   for i = 0 .. s-1
  5:     T += ai B di
  6:     m = ti n0' (mod d)
  7:     T += m N di
  8:   return T/R
Here, I need to explain a couple of variables:

Effect of R

R is chosen to be one bit longer than N. That is:
 R = 2r, N>>(r-1)=1 
If r is a multiple of the machine word length(w), then the division at (8) can be done easily by a copy operation:
  for i= 0 .. s-1
    K[i]= T[i+s]
If that is not the case, then we need to actually shift each word. Since nobody else talks about this issue, I won't either. I will do it if I need to do it. For now, I will assume r%w=0.

The OpenSSL implementation aligns R to the next machine word. Of course this necessiates some divisions to be done, although not too many.

n0'

The computation of n0' is also given in [2]. Here is the C version of it:
moWORD moN0p(moWORD N0)
{ 
  moWORD y, B;
  int i;
  for(y=1,B=(1<<1),i=1;i<moWORD_BITS;i++,B<<=1) y |= (N0*y) & B;
  return -y; 
}             

Moving On

The algorithm is in fact pretty simple. There is only one operator needed: scalar multiplication to accumulator.
100: function K += a × B:
101:   C= 0
102:   Hp= 0
103:   for i = 0 .. |B|-1
104:      H:L= a * B[i]
105:      C:K[i] = K[i] + L + C + Hp
106:      Hp= H
107:   C:K[|B|]= K[|B|] + Hp + C
108:   K[|B|+1]= K[|B|+1] + C
With this operator, we can re-write our original algorithm:
  1: montgomery_product(A,B,N,R):
  2:   n0p= -n0-1
  3:   T = 0
  4:   for i = 0 .. s-1
  5:      T{i} += a[i] × B
  6:      H: m = T[i] * n0p // ignore H
  7:      T{i} += m × N 
  8:   return T/R
Notice that the multiplication with di is removed from (5) and (7). This is because we start the addition at the ith position.

CIOS algorithm in [4] makes use of the fact that T[0]==0 at the end of each iteration of the outer loop. It shifts the rest of T to the right to save some space. This also avoids the last shift to the right at (8).

  1: montgomery_product(A,B,N,R): T
  2:   n0p= -n0-1
  3:   T = 0
  4:   for i = 0 .. s-1
  5:     T += a[i] × B
  6:     H:m = T[0] * n0p // ignore H
  7:     T += m × N 
  8:     for j= 0 .. s
  9:        T[j]= T[j+1]
 10:   return T≥N?T-N:T
Notice that we no longer add to T{i} but T[i]. This is because at the ith iteration of the loop T has been shifted right i times.

In [4], the loop at (8-9) is also eliminated by incorporating the shift into the × operator. We can also do this by providing difference addresses to read and write from the accumulator:

100: function D = K + a × B
101:   C= 0
102:   Hp= 0
103:   for i = 0 .. |B|-1
104:      H:L= a * B[i]
105:      C:D[i] = K[i] + L + C + Hp
106:      Hp= H
107:   C:D[|B|]= K[|B|] + Hp + C
108:   D[|B|+1]= K[|B|+1] + C
With this modification, only this operator will be sufficient to write the product function.
  1: montgomery_product(A,B,N,R): T
  2:   n0p= -n0-1
  3:   T = 0
  4:   for i = 0 .. s-1
  5:     T{0} = T + a[i] × B
  6:     H:m = T[0] * n0p 
  7:     T{-1} = T + m × N 
  8:   return T≥N?T-N:T
Notice that we need an extra space at T[-1] in order to store the first value of the second multiplication. So, the total memory requirement for T is s+3.

The extra space can be avoided by unrolling the second loop and rewriting the initial iteration (as is done in [4]). However, the code is much simpler this way and an extra useless write won't hurt us too much, considering all the other operations done. In total, we will make s more writes than the original, which is not very much.

Conversion into Montgomery Form

The product function works only if the operands are in Montgomery form. That, is both operands should have been multiplied by R in mod N for correct operation. For brevity, let's alias M(a,b) to montgomery_product(a,b,N) for a given N. We know that
  M(a,b)= ab/R (mod N)
The montgomery representation of a number u is G(u):
  G(u)= uR (mod N) 
  M(u, R2) = uR2/R (mod N)= uR (mod N)= G(u)
Since the arguments to M must be between 0 and N-1, we need to compute (R2 (mod N)). Computing R mod N is straightforward.
 If R=2k
 R (mod N) = 2k-1 + 1 (mod N)
           = (2k-1 (mod N) + 1 (mod N)) (mod N)
           = (p(k) + 1) (mod N)
We already know that 
  2k-1<N<2k
therefore,
  2k-1 and N has the same number of bits.
At most one subtraction of N from 2k-1 will let us compute p(k) since the highest bit 2k-1 will be zero after the subtraction. Also, the second % operator above is useless because the result of the subtraction will always be less than N-1.
Let's set,
   U= R (mod N)
as calculated above. Then,
   V1= 2R (mod N) 
can be easily computed with at most one subtraction:
  0≤U<N
  0≤2U<2N
  -N≤2U-N<N
If we do the subtraction only when 2u≥N, then we have
  0≤V1<N
with a single comparison and at most one subtraction.

So, we got
  U=  R (mod N)
  V1= 2R (mod N)= 21R (mod N)
Observe that M can be used to compute G(2x) using V1 as a starting point.
  M(V1,V1)= M(2R (mod N),2R (mod N))
  M(V1,V1)= 4R2/R (mod N)
  M(V1,V1)= 4R (mod N) = V2
  V2= 22R (mod N)
If we have
  Va= 2aR (mod N)
  Vb= 2bR (mod N)
Let's apply M to these values to see what happens
  M(Va,Vb)= M(2aR (mod N),2bR (mod N))
  = (2aR (mod N) × 2bR (mod N))/R (mod N)
  = (2a+bR2 (mod N))/R (mod N)
  = 2a+bR2/R (mod N)
  = 2a+bR (mod N) = Va+b = G(2a+b)
So,
 Va+b=M(Va,Vb)
We can now compute any Vx=G(2x) using the above identity. We need G(2k)=Vk=2kR (mod N)=R2 (mod N). We could simply start from V1 and feed it to M() for k times to reach Vk. However, there is a better way using the exponentiation algorithm. Let's set V(x)=Vx for clarity and represent k in binary:
  k= 2d0+2d1..2du, di distinct
Then,
  V(k)= V(2d0+2d1..2du) 
  V(k)= M(V(2d0),V(2d1)..V(2du))
We see that it's enough to only compute V(2i) in order to compute V(k), we don't need to walk one by one. This results in the following algorithm:
r2_mod_n(2k-1<N<2k):   // R is 2k
  U= (2k-1)-N+1       // % is an optional subtraction 
  V= 2U%N            //  V is V(1)=V(21)
  A_set= false
  for p=1 ; p<=k ; p=2p
    if (k & p) 
       if (A_set)  A= M(A,V)
             else  A= V; A_set= true
    V= M(V,V)
  return A
We can simplify the algorithm by getting rid of the A_set variable and the associated conditional. For this, we can initialize A with the identity element for M. This is of course the value R (mod N).
 M(a, R (mod N))= aR/R (mod N) = a (mod N)
This value is already computed as U in the above algorithm. We can simply get alias U to A since U is not used later on in the algorithm.
31: r2_mod_n(2k-1<N<2k):  
32: A= (2k-1)-N+1
33: V= 2A%N            
34: for p=1 ; p<=k ; p=2p
35:   if (k & p) A= M(A,V)
36:   V= M(V,V)
37: return A
Well, this is embarassing to admit but after all that work, I ended up with the the exponentiation function. The only difference to the normal exponentiation is the fact that we don't convert the initial value to montgomery form but compute it ourselves. Anyway..

A similar algorithm is given in [3], but that one works only for k that are powers of two, ie. length of the prime N is 2x bits.

Conversion from Montgomery Form

This can be done by computing M(A,1). If we propagate the value 1 to the prodct function, we get rid one of the calls to +× operator.
 41: montgomery_reduce(A,N):
 42:   n0p= -n0-1
 43:   T = 0
 44:   for i = 0 .. s-1
 45:     H:m = T[0] * n0p 
 46:     T{-1} = T + m × N 
 47:   return T≥N?T-N:T

Computing an Exponent

Now that we have a multiplier, we can use it for the original intent of making it. Here is some pseudo-code:
51: montgomery_exponent(B, X, N):  // compute BX (mod N)
52:   A= R (mod N)  // identity element
53:   K= convert_to_montgomery(B,N)
54:   for i= 0 .. #bits(X)-1
55:      if X{i}  A= montgomery_product(A,K,N)
56:      K= montgomery_product(K,K,N)
57:   return convert_from_montgomery(A,N)

Side-Channel Attacks

These work by measuring either the power consumption or the RF output of the CPU while calculating an exponent. The optional multiplication at (55) make it possible to deduce bits of X by looking at the fluctuations in the mentioned parameters.

A common defense against this is to do a multiplication no matter X{i} is. When X{i} is 0, the second term of the product is just 1.

While this is a very good looking solution, it's wastes CPU cycles and power. I think an alternative can be used by consuming more memory. Instead of multiplying A immediately at each iteration, we can group the multiplications together so that an attacker can't tell which bits are set. Potentially, they will be able to tell the number of bits set in each group. I don't know how much effect this would have on the security of a key. Here is the method:

51: montgomery_exponent(B, X, N):  
52:   A= R (mod N)  // identity element
53:   K= convert_to_montgomery(B,N)
54:   for i= 0 .. #words(X)-1
55:      for j=0 .. #bits_per_word-1
56:          U[j]= K
57:          K= montgomery_product(K,K,N)
58:      L= ∅
59:      for j=0 .. #bits_per_word-1
60:         if X[i]{j}: L= L ⋃ {j}
61:      for j in L
62:         A= montgomery_product(A,U[j],N)
63:   return convert_from_montgomery(A,N)
Now the conditional part is much smaller in (60), and it can be made unconditional just by storing j in some dummy variable when it doesn't belong in L.

However, the loop at (58-60) still separates the two multiplication loops. This could expose the number of set bits in each word. This can be overcome by converting X into a list of L values before doing all the multiplications. It would require some extra memory but the setup time between the two multiplication loops would be minimal.

I don't really know how sensitive the equipment used in side-channel attacks can get. Apperantly, they are sensitive enough to detect the loop-setup code at (54) in the original algorithm. Therefore, it's best to stick with the commonly implemented method of multiplying always with something.

I will do the side-channel defense mechanism later on, when I have the code working correctly.

Download

Here is the latest version. You'll need the OpenSSL library in order to compile the test program.

References

  1. A Cryptographic Library for the Motorola DSP56000 (S. R. Dusse and B. S. Kaliski Jr.)
  2. Modular multiplication without trial division (P. L. Montgomery)
  3. How to compute R2modN? Using Montgomery modular multiplication
  4. Analyzing and Comparing Montgomery Multiplication Algorithms (T. Acar,Burton S. Kaliski Jr. and C.K. Koc)