# Modular Multiplicative Inverse

Everywhere I look, this is accomplished using the Extended Euclidean Algorithm. Even NIST FIPS 186 [2] uses it.

Here is the algorithm as specified in [2]:

```
modinv( z, a ):   // compute z⁻¹ s.t. z·z⁻¹ ≡ 1 (mod a)
  verify  0 < z < a
  i= a, j= z, y2= 0, y1= 1
  while j>0
    Q= i÷j, R= i%j
    y= y2 - Qy1
    i= j, j= R, y2= y1, y1= y
  if i≠1
    return ERROR
  return y2≥0 ? y2 : y2+a
```
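As a sanity check, the pseudocode above translates almost line for line into C when everything fits in a machine word. This is only a sketch of the word-sized case (the function name and error convention are mine); in the real implementation a, y1, y2, and the first quotient are multi-precision values.

```c
#include <stdint.h>

/* Word-sized sketch of the algorithm above.
 * Returns z⁻¹ mod a, or 0 on error (0 is never a valid inverse). */
int64_t modinv(int64_t z, int64_t a)
{
    if (!(0 < z && z < a)) return 0;      /* verify 0 < z < a */
    int64_t i = a, j = z, y2 = 0, y1 = 1;
    while (j > 0) {
        int64_t Q = i / j, R = i % j;     /* one Euclidean step */
        int64_t y = y2 - Q * y1;
        i = j; j = R; y2 = y1; y1 = y;
    }
    if (i != 1) return 0;                 /* gcd(z,a) ≠ 1: no inverse */
    return y2 >= 0 ? y2 : y2 + a;
}
```

For example, modinv(3, 7) yields 5, since 3·5 = 15 ≡ 1 (mod 7).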
The most challenging operator to handle correctly is the division. In my implementation, the value 'z' will be a small integer that fits into a machine word. If we write down the values of i, j, and R, we see that we only need a division-by-scalar operator.
```
{0}y2= 0
{0}y1= 1
{0}i= a
{0}j= z
{0}Q= {0}i÷{0}j = a÷z
{0}R= {0}i%{0}j = a%z
{0}y= {0}y2 - {0}Q{0}y1 = 0 - (a÷z)*1 = -(a÷z)
{1}i= {0}j= z
{1}j= {0}R= a%z
{1}y2= {0}y1= 1
{1}y1= {0}y= -(a÷z)
```
After iteration 0, i and j are both at most z. Therefore, only the first division is a multi-precision one; after that, we can use the CPU's division instruction directly.

We also note that the value Q is multi-precision only in iteration 0. After that, Q is also a small integer fitting into a machine word. Therefore, the product Qy1 is calculated in multi-precision only once, and that happens when y1 is 1. So I don't need a multi×multi multiplier, only a multi×scalar one.
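This claim is easy to check numerically. In the sketch below (the function name and test values are mine), a 64-bit a stands in for the multi-precision value and z is word-sized as in the text:

```c
#include <stdint.h>

/* Run plain Euclid on (a, z) and report whether every quotient after
 * iteration 0 fits in a 32-bit machine word. A 64-bit 'a' stands in
 * for the multi-precision value. */
int quotients_fit_after_first(uint64_t a, uint32_t z)
{
    uint64_t i = a, j = z;
    int iter = 0, ok = 1;
    while (j > 0) {
        uint64_t Q = i / j, R = i % j;
        if (iter > 0 && Q > UINT32_MAX)
            ok = 0;                 /* would contradict the claim above */
        i = j; j = R; iter++;
    }
    return ok;
}
```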

So, let's rewrite the algorithm with these observations in mind, with doubled-up operators representing multi-precision operations (except for + and -).

```
modinv( z, a ):
  verify  0 < z < a
  i= a, j= z, y2= 0, y1= 1

  Q= i÷÷j, R= i%%j             ⇒ Q= a÷÷z, R= a%%z
  y= y2 - Qy1                  ⇒ y= 0 - Q*1= -Q
  i= j, j= R, y2= y1, y1= y    ⇒ i= z, j= R= a%%z, y2= 1, y1= y= -Q

  while j>0
    Q= i÷j, R= i%j
    y= y2 - y1××Q
    i= j, j= R, y2= y1, y1= y
  if i≠1
    return ERROR
  return y2≥0 ? y2 : y2+a
```
If we re-arrange the part before the loop, we have:
```
modinv( z, a ):
  verify  0 < z < a
  i= z, j= a%%z, y2= 1, y1= -(a÷÷z)

  while j>0
    Q= i÷j, R= i%j
    y= y2 - y1××Q
    i= j, j= R, y2= y1, y1= y
  if i≠1
    return ERROR
  return y2≥0 ? y2 : y2+a
```
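In word-sized C, the hoisted version looks like the sketch below (again my names; the doubled-up operators become multi-precision calls in the real code):

```c
#include <stdint.h>

/* Rearranged algorithm with iteration 0 folded into the initializers.
 * Word-sized stand-in: a%%z and a÷÷z become plain % and / here. */
int64_t modinv_hoisted(int64_t z, int64_t a)
{
    if (!(0 < z && z < a)) return 0;
    int64_t i = z, j = a % z, y2 = 1, y1 = -(a / z);
    while (j > 0) {
        int64_t Q = i / j, R = i % j;
        int64_t y = y2 - y1 * Q;
        i = j; j = R; y2 = y1; y1 = y;
    }
    if (i != 1) return 0;
    return y2 >= 0 ? y2 : y2 + a;
}
```

It produces the same results as the unhoisted form, e.g. modinv_hoisted(3, 7) is again 5.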
We now need the following multi-precision operators:
• Division by scalar, yielding a multi-precision quotient and a single-word remainder.
• Multiplication by a word.
• Subtraction.
Since the algorithm generates negative numbers, we need a sign bit for each number. By simple inspection, all computed values will have a magnitude less than 'a'. Therefore, pre-allocating them with as many words as 'a' will suffice.
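Here is a minimal sketch of two of these operators in C, assuming 32-bit words in little-endian order and a sign bit kept separately by the caller (all names are mine, not a fixed API):

```c
#include <stdint.h>
#include <stddef.h>

/* r = x - y on n-word magnitudes, little-endian word order.
 * Caller guarantees x ≥ y, so the final borrow is always 0. */
void mp_sub(uint32_t *r, const uint32_t *x, const uint32_t *y, size_t n)
{
    uint64_t borrow = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t d = (uint64_t)x[i] - y[i] - borrow;
        r[i] = (uint32_t)d;
        borrow = d >> 63;           /* 1 iff the subtraction wrapped */
    }
}

/* r = x * m, where m is a single word; r must have n+1 words. */
void mp_mul_word(uint32_t *r, const uint32_t *x, size_t n, uint32_t m)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)x[i] * m + carry;
        r[i] = (uint32_t)p;
        carry = p >> 32;
    }
    r[n] = (uint32_t)carry;
}
```

The y= y2 - y1××Q step of the loop then becomes one mp_mul_word followed by one sign-aware mp_sub on these magnitudes.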

## Division by Word

Since the multi-precision division will run only once, I don't need any fancy algorithms to make it super-fast; grade-school division will do. Even that will be quite fast: I'll need around 2N subtractions and comparisons, where N is the number of bits in the dividend.

Let's work through an example division, A/F.

```
A= a3a2a1a0 b3b2b1b0 c3c2c1c0 d3d2d1d0
F= f3f2f1f0
```
Now, let's look at all the multiples of F we need to optionally subtract from A:
```
a3 a2 a1 a0  b3 b2 b1 b0  c3 c2 c1 c0  d3 d2 d1 d0

f3 f2 f1 f0   0  0  0  0   0  0  0  0   0  0  0  0
 0 f3 f2 f1  f0  0  0  0   0  0  0  0   0  0  0  0
 0  0 f3 f2  f1 f0  0  0   0  0  0  0   0  0  0  0
 0  0  0 f3  f2 f1 f0  0   0  0  0  0   0  0  0  0
 0  0  0  0  f3 f2 f1 f0   0  0  0  0   0  0  0  0
 0  0  0  0   0 f3 f2 f1  f0  0  0  0   0  0  0  0
 0  0  0  0   0  0 f3 f2  f1 f0  0  0   0  0  0  0
 0  0  0  0   0  0  0 f3  f2 f1 f0  0   0  0  0  0
 0  0  0  0   0  0  0  0  f3 f2 f1 f0   0  0  0  0
 0  0  0  0   0  0  0  0   0 f3 f2 f1  f0  0  0  0
 0  0  0  0   0  0  0  0   0  0 f3 f2  f1 f0  0  0
 0  0  0  0   0  0  0  0   0  0  0 f3  f2 f1 f0  0
 0  0  0  0   0  0  0  0   0  0  0  0  f3 f2 f1 f0
```
Let's assume an in-place divider where A is modified at each round to contain the remainder from the last subtraction.

As you can see, there are a lot of zeros involved, and the significant bits of the subtracted value always fit in two machine words. Zeros to the right of those two significant words have no effect at all; we don't need to store or process them. Zeros to the left don't need to be subtracted either. Since we subtract a value from the accumulator only when the accumulator is at least as large, we are guaranteed that no borrow propagates past the most significant bit of the subtracted value, the position of f3.

The basic algorithm can be summarized as follows, for a big-endian storage format:

```
100: div(A,F):
101:   hi= F << #leading-zeros(F),  lo= 0
102:   Q= 0, bit= |A|*bits_per_word - #bits(F)
103:   P= A
104:   while P+1 < A + |A|
105:     if P[0]:P[1] ≥ hi:lo then P[0]:P[1] -= hi:lo, Q(bit)= 1
106:     bit--
107:     if lo=F then (hi,lo)= (F,0), P= P+1
108:     hi:lo >>= 1
110:   R= A[|A|-1]
```
The loop leaves P pointing at the least significant word of A.

It's worth noting that when we make a subtraction at (105), the highest set bit of P[0]:P[1] gets cleared. That is, at all times P[0]:P[1] ≤ 2(hi:lo)-1. If this weren't true, we would have to have missed a subtraction opportunity in a previous iteration of the loop, which is impossible.

At this point, all the words except the least significant word of A have been zeroed out by the subtractions, which leads me to the question: why modify A at all? Since we won't be using those zeroed-out words of A later on, we don't need to store our intermediate values in A. We can simply store them in temporary variables, just as we did for F. So, here is the updated algorithm:

```
100: div(A,F):
101:   Fhi= F << #leading-zeros(F),  Flo= 0
102:   Q= 0, bit= |A|*bits_per_word - #bits(F)
103:   P= A, Ahi:Alo= P[0]:P[1]
104:   while P+1 < A + |A|
105:     if Ahi:Alo ≥ Fhi:Flo then Ahi:Alo -= Fhi:Flo, Q(bit)= 1
106:     bit--
108:     if Flo=F then (Fhi,Flo)= (F,0)
109:                   P= P+1
110:                   (Ahi,Alo)= (Alo,P[1])
111:     Fhi:Flo >>= 1
112:   R= Ahi
```
Note that line (101), initializing Fhi:Flo, carries over from the previous version, where the pair was called hi:lo.
The above algorithm has a slight error: access to P[1] is not guaranteed to be in bounds at (103) and (110). These accesses should be made only after checking that P+1 is indeed within bounds.
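For cross-checking the windowed divider, a plain bit-serial version is handy: it does the same roughly 2N work, but always carries the running remainder in a single double-word, so there are no window or bounds subtleties. This is a reference sketch (my names; 32-bit words, big-endian order as in the text), not the optimized algorithm above:

```c
#include <stdint.h>
#include <stddef.h>

/* Grade-school (restoring) division of a big-endian n-word A by a single
 * word F. Writes the quotient into Q (n words) and returns the remainder.
 * The running remainder stays below F, so a 64-bit 'rem' suffices. */
uint32_t div_by_word(const uint32_t *A, size_t n, uint32_t F, uint32_t *Q)
{
    uint64_t rem = 0;
    for (size_t w = 0; w < n; w++) {
        uint32_t qw = 0;
        for (int b = 31; b >= 0; b--) {
            rem = (rem << 1) | ((A[w] >> b) & 1u);  /* bring down one bit */
            if (rem >= F) {                         /* optional subtract  */
                rem -= F;
                qw |= 1u << b;
            }
        }
        Q[w] = qw;
    }
    return (uint32_t)rem;
}
```

A simple harness can compare its quotient and remainder for two-word inputs against the CPU's native 64-bit division.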