Modular Multiplicative Inverse

This is accomplished by using the Extended Euclidean Algorithm, everywhere I look. Even NIST.FIPS.186 in [2] uses it.

Here is the algorithm as specified in [2]

modinv( z, a ):  // compute z^-1 st. z*z^-1≡ 1 (mod a)
  verify  0 < z < a
  i= a, j= z, y₂= 0, y₁= 1
  while j>0
    Q= i÷j, R= i%j
    y= y₂ - Qy₁
    i= j, j= R, y₂= y₁, y₁= y
  if i≠1 
    return ERROR
  return y₂≥0 ? y₂ : y₂+a

The most challenging operator to handle correctly is the division. In my implementation, the value 'z' will be a small integer that fits into a machine word. If we write down the values of i,j and R, we see that we only need a division-by-scalar operator.

 {0}y₂= 0
 {0}y₁= 1
 {0}i= a    
 {0}j= z  
 {0}Q= {0}i÷{0}j = a÷z 
 {0}R= {0}i%{0}j = a%z
 {0}y= {0}y₂ - {0}Q{0}y₁= 0 - (a÷z)*1= -a÷z
 {1}i= {0}j= z
 {1}j= {0}R= a%z
 {1}y₂= {0}y₁= 1
 {1}y₁= {0}y= -a÷z

After iteration 0, i and j are both less than or equal to z. Therefore, Only the first division will be a multi-precision one. After that, we can use the CPUs division instruction directly.

We also note that the value Q is a multi-precision one in iteration 0. After that, Q is also a small integer fitting into a machine word. Therefore, the product Qy₁ is calculated only once in multi-precision numbers and that happens when y₁ is 1. Therefore, I don't need a multi×multi multiplier, only multi×scalar.

So, let's write the algorithm with these in mind with doubled-up operators representing multi-precision operations (except for + and -).

modinv( z, a ): 
  verify  0 < z < a
  i= a, j= z, y₂= 0, y₁= 1
  
  Q= i÷÷j, R= i%%j ⇒ Q=a÷÷z, R=a%%z
  y= y₂ - Qy₁ ⇒ y= 0 - Q*1= -Q
  i= j, j= R, y₂= y₁, y₁= y ⇒ i= z, j= R= a%%z, y₂=1, y₁= y= -Q

  while j>0
    Q= i÷j, R= i%j
    y= y₂ - y₁××Q
    i= j, j= R, y₂= y₁, y₁= y
  if i≠1 
    return ERROR
  return y₂≥0 ? y₂ : y₂+a

If we re-arrange the part before the loop, we have:

modinv( z, a ): 
  verify  0 < z < a
  i= z, j= a%%z, y₂=1, y₁= -a÷÷z

  while j>0
    Q= i÷j, R= i%j
    y= y₂ - y₁××Q
    i= j, j= R, y₂= y₁, y₁= y
  if i≠1 
    return ERROR
  return y₂≥0 ? y₂ : y₂+a

We now need the following multi-precision operators:

Division by scalar, yielding a multi-precision quotient and a single-word remainder.
Multiplication by a word.
Addition.
Subtraction.

Since the algorithm generates negative numbers, we need a sign bit for each number. By simple inspection, all computed values will have a magnitude less than 'a'. Therefore, pre-allocating them with as many words as 'a' will suffice.

Division by Word

Since the multi-precision division will run only once, I don't need any fancy algorithms to make it super-fast. I can just use grade-school division for this. Even that will be quite fast, I'll need around 2N subtractions and comparisons where N is the number of bits in the dividend.

Let's make an example division A/F.

A= a₃a₂a₁a₀ b₃b₂b₁b₀ c₃c₂c₁c₀ d₃d₂d₁d₀
F= f₃f₂f₁f₀

Now, let's look at all the multiples of F we need to optionally subtract from A:

a₃a₂a₁a₀ b₃b₂b₁b₀ c₃c₂c₁c₀ d₃d₂d₁d₀

f₃f₂f₁f₀ 0000 0000 0000
0f₃f₂f₁ f₀000 0000 0000
00f₃f₂ f₁f₀00 0000 0000
000f₃ f₂f₁f₀0 0000 0000
0000 f₃f₂f₁f₀ 0000 0000
0000 0f₃f₂f₁ f₀000 0000
0000 00f₃f₂ f₁f₀00 0000
0000 000f₃ f₂f₁f₀0 0000
0000 0000 f₃f₂f₁f₀ 0000
0000 0000 0f₃f₂f₁ f₀000
0000 0000 00f₃f₂ f₁f₀00
0000 0000 000f₃ f₂f₁f₀0
0000 0000 0000 f₃f₂f₁f₀

Let's assume an in-place divider where A is modified at each round to contain the remainder from the last subtraction.

As you can see, we have a lot of zeros involved and the significant bits of the subtracted value always fits in two machine words. Zeros to the right of the significant two words have no effect at all. We don't need to store or process them. Zeros to the left don't need to be subtracted either. Since we subtract a value from the accumulator only when the accumulator value is already larger, then we guarantee that there will be no borrow past the most significant bit of the subtracted value, the position of f₃.

The basic algorithm can be summarized as follows, for a big-endian storage format:

100: div(A,F):
101:   hi = F  << #leading-zeros(F),  lo = 0
102:   Q= 0, bit= |A|*bits_per_word-#bits(F)
103:   P= A
104:   while P+1 < A + |A| 
105:     if P[0]:P[1] ≥ hi:lo then P[0]:P[1] -= hi:lo, Q(bit)= 1
106:     bit--
107:     if lo=F  then (hi,lo)= (F,0), P= P+1
108:     hi:lo >>= 1
110:   R= A[|A|-1]

The loop leaves P pointing at the least significant word of A.

It's worth noting that when we make a subtraction at (105), the highest set bit of P[0]:P[1] gets cleared. That is, at all times P[0]:P[1] ≤ 2(hi:lo)-1. If this wasn't true, then we must have missed the subtraction opportunity in the previous iteration of the loop, which is impossible.

At this point, all the words except the least significant word of A has been zeroed out by the subtractions. Which leads me to the question, why modify A at all? Since we won't be using those zeroed-out words of A later on, we don't need to store our intermediate values in A. We can simply store them in some temporary variables just like we did for F. So, here is the updated algorithm:

100: div(A,F):
101:   Fhi:Flo= F<<#leading-zeros(F):0
102:   Q= 0, bit= |A|*bits_per_word-#bits(F)
103:   P= A, Ahi:Alo= P[0]:P[1]
104:   while P+1 < A + |A| 
105:     if Ahi:Alo ≥ Fhi:Flo then Ahi:Alo-= Fhi:Flo, Q(bit)= 1
106:     bit--
108:     if Flo=F  then (Fhi,Flo)= (F,0)
109:                    P= P+1,
110:                    (Ahi,Alo)= (Alo,P[1])
111:     Fhi:Flo >>= 1
112:   R= Ahi

The above algorithm has a slight error. Access to P[1] is not guaranteed at (103) and (110). These should be made after checking that P[1] is indeed within bounds.

Downloads

Extended Euclidean source code for multi-precision numbers.

References

Wikipedia entry
NIST.FIPS.186 Contains a lot of commonly used algorithms.