1: montgomery_product(A,B,N,R): 2: n0'= -n0-1 3: T = 0 4: for i = 0 .. s-1 5: T += ai B di 6: m = ti n0' (mod d) 7: T += m N di 8: return T/RHere, I need to explain a couple of variables:
R = 2r, N>>(r-1)=1If r is a multiple of the machine word length(w), then the division at (8) can be done easily by a copy operation:
for i= 0 .. s-1 K[i]= T[i+s]If that is not the case, then we need to actually shift each word. Since nobody else talks about this issue, I won't either. I will do it if I need to do it. For now, I will assume r%w=0.
The OpenSSL implementation aligns R to the next machine word. Of course this necessiates some divisions to be done, although not too many.
moWORD moN0p(moWORD N0) { moWORD y, B; int i; for(y=1,B=(1<<1),i=1;i<moWORD_BITS;i++,B<<=1) y |= (N0*y) & B; return -y; }
100: function K += a × B: 101: C= 0 102: Hp= 0 103: for i = 0 .. |B|-1 104: H:L= a * B[i] 105: C:K[i] = K[i] + L + C + Hp 106: Hp= H 107: C:K[|B|]= K[|B|] + Hp + C 108: K[|B|+1]= K[|B|+1] + CWith this operator, we can re-write our original algorithm:
1: montgomery_product(A,B,N,R): 2: n0p= -n0-1 3: T = 0 4: for i = 0 .. s-1 5: T{i} += a[i] × B 6: H: m = T[i] * n0p // ignore H 7: T{i} += m × N 8: return T/RNotice that the multiplication with di is removed from (5) and (7). This is because we start the addition at the ith position.
CIOS algorithm in [4] makes use of the fact that T[0]==0 at the end of each iteration of the outer loop. It shifts the rest of T to the right to save some space. This also avoids the last shift to the right at (8).
1: montgomery_product(A,B,N,R): T 2: n0p= -n0-1 3: T = 0 4: for i = 0 .. s-1 5: T += a[i] × B 6: H:m = T[0] * n0p // ignore H 7: T += m × N 8: for j= 0 .. s 9: T[j]= T[j+1] 10: return T≥N?T-N:TNotice that we no longer add to T{i} but T[i]. This is because at the ith iteration of the loop T has been shifted right i times.
In [4], the loop at (8-9) is also eliminated by incorporating the shift into the × operator. We can also do this by providing difference addresses to read and write from the accumulator:
100: function D = K + a × B 101: C= 0 102: Hp= 0 103: for i = 0 .. |B|-1 104: H:L= a * B[i] 105: C:D[i] = K[i] + L + C + Hp 106: Hp= H 107: C:D[|B|]= K[|B|] + Hp + C 108: D[|B|+1]= K[|B|+1] + CWith this modification, only this operator will be sufficient to write the product function.
1: montgomery_product(A,B,N,R): T 2: n0p= -n0-1 3: T = 0 4: for i = 0 .. s-1 5: T{0} = T + a[i] × B 6: H:m = T[0] * n0p 7: T{-1} = T + m × N 8: return T≥N?T-N:TNotice that we need an extra space at T[-1] in order to store the first value of the second multiplication. So, the total memory requirement for T is s+3.
The extra space can be avoided by unrolling the second loop and rewriting the initial iteration (as is done in [4]). However, the code is much simpler this way and an extra useless write won't hurt us too much, considering all the other operations done. In total, we will make s more writes than the original, which is not very much.
M(a,b)= ab/R (mod N)The montgomery representation of a number u is G(u):
G(u)= uR (mod N) M(u, R2) = uR2/R (mod N)= uR (mod N)= G(u)Since the arguments to M must be between 0 and N-1, we need to compute (R2 (mod N)). Computing R mod N is straightforward.
If R=2k R (mod N) = 2k-1 + 1 (mod N) = (2k-1 (mod N) + 1 (mod N)) (mod N) = (p(k) + 1) (mod N) We already know that 2k-1<N<2k therefore, 2k-1 and N has the same number of bits.At most one subtraction of N from 2k-1 will let us compute p(k) since the highest bit 2k-1 will be zero after the subtraction. Also, the second % operator above is useless because the result of the subtraction will always be less than N-1.
Let's set, U= R (mod N) as calculated above. Then, V1= 2R (mod N) can be easily computed with at most one subtraction: 0≤U<N 0≤2U<2N -N≤2U-N<N If we do the subtraction only when 2u≥N, then we have 0≤V1<N with a single comparison and at most one subtraction. So, we got U= R (mod N) V1= 2R (mod N)= 21R (mod N)Observe that M can be used to compute G(2x) using V1 as a starting point.
M(V1,V1)= M(2R (mod N),2R (mod N)) M(V1,V1)= 4R2/R (mod N) M(V1,V1)= 4R (mod N) = V2 V2= 22R (mod N) If we have Va= 2aR (mod N) Vb= 2bR (mod N) Let's apply M to these values to see what happens M(Va,Vb)= M(2aR (mod N),2bR (mod N)) = (2aR (mod N) × 2bR (mod N))/R (mod N) = (2a+bR2 (mod N))/R (mod N) = 2a+bR2/R (mod N) = 2a+bR (mod N) = Va+b = G(2a+b) So, Va+b=M(Va,Vb)We can now compute any Vx=G(2x) using the above identity. We need G(2k)=Vk=2kR (mod N)=R2 (mod N). We could simply start from V1 and feed it to M() for k times to reach Vk. However, there is a better way using the exponentiation algorithm. Let's set V(x)=Vx for clarity and represent k in binary:
k= 2d0+2d1..2du, di distinct Then, V(k)= V(2d0+2d1..2du) V(k)= M(V(2d0),V(2d1)..V(2du))We see that it's enough to only compute V(2i) in order to compute V(k), we don't need to walk one by one. This results in the following algorithm:
r2_mod_n(2k-1<N<2k): // R is 2k U= (2k-1)-N+1 // % is an optional subtraction V= 2U%N // V is V(1)=V(21) A_set= false for p=1 ; p<=k ; p=2p if (k & p) if (A_set) A= M(A,V) else A= V; A_set= true V= M(V,V) return AWe can simplify the algorithm by getting rid of the A_set variable and the associated conditional. For this, we can initialize A with the identity element for M. This is of course the value R (mod N).
M(a, R (mod N))= aR/R (mod N) = a (mod N)This value is already computed as U in the above algorithm. We can simply get alias U to A since U is not used later on in the algorithm.
31: r2_mod_n(2k-1<N<2k): 32: A= (2k-1)-N+1 33: V= 2A%N 34: for p=1 ; p<=k ; p=2p 35: if (k & p) A= M(A,V) 36: V= M(V,V) 37: return AWell, this is embarassing to admit but after all that work, I ended up with the the exponentiation function. The only difference to the normal exponentiation is the fact that we don't convert the initial value to montgomery form but compute it ourselves. Anyway..
A similar algorithm is given in [3], but that one works only for k that are powers of two, ie. length of the prime N is 2x bits.
41: montgomery_reduce(A,N): 42: n0p= -n0-1 43: T = 0 44: for i = 0 .. s-1 45: H:m = T[0] * n0p 46: T{-1} = T + m × N 47: return T≥N?T-N:T
51: montgomery_exponent(B, X, N): // compute BX (mod N) 52: A= R (mod N) // identity element 53: K= convert_to_montgomery(B,N) 54: for i= 0 .. #bits(X)-1 55: if X{i} A= montgomery_product(A,K,N) 56: K= montgomery_product(K,K,N) 57: return convert_from_montgomery(A,N)
A common defense against this is to do a multiplication no matter X{i} is. When X{i} is 0, the second term of the product is just 1.
While this is a very good looking solution, it's wastes CPU cycles and power. I think an alternative can be used by consuming more memory. Instead of multiplying A immediately at each iteration, we can group the multiplications together so that an attacker can't tell which bits are set. Potentially, they will be able to tell the number of bits set in each group. I don't know how much effect this would have on the security of a key. Here is the method:
51: montgomery_exponent(B, X, N): 52: A= R (mod N) // identity element 53: K= convert_to_montgomery(B,N) 54: for i= 0 .. #words(X)-1 55: for j=0 .. #bits_per_word-1 56: U[j]= K 57: K= montgomery_product(K,K,N) 58: L= ∅ 59: for j=0 .. #bits_per_word-1 60: if X[i]{j}: L= L ⋃ {j} 61: for j in L 62: A= montgomery_product(A,U[j],N) 63: return convert_from_montgomery(A,N)Now the conditional part is much smaller in (60), and it can be made unconditional just by storing j in some dummy variable when it doesn't belong in L.
However, the loop at (58-60) still separates the two multiplication loops. This could expose the number of set bits in each word. This can be overcome by converting X into a list of L values before doing all the multiplications. It would require some extra memory but the setup time between the two multiplication loops would be minimal.
I don't really know how sensitive the equipment used in side-channel attacks can get. Apperantly, they are sensitive enough to detect the loop-setup code at (54) in the original algorithm. Therefore, it's best to stick with the commonly implemented method of multiplying always with something.
I will do the side-channel defense mechanism later on, when I have the code working correctly.