Here is the algorithm as specified in [2]
modinv( z, a ): // compute z-1 st. z*z-1≡ 1 (mod a) verify 0 < z < a i= a, j= z, y2= 0, y1= 1 while j>0 Q= i÷j, R= i%j y= y2 - Qy1 i= j, j= R, y2= y1, y1= y if i≠1 return ERROR return y2≥0 ? y2 : y2+aThe most challenging operator to handle correctly is the division. In my implementation, the value 'z' will be a small integer that fits into a machine word. If we write down the values of i,j and R, we see that we only need a division-by-scalar operator.
{0}y2= 0 {0}y1= 1 {0}i= a {0}j= z {0}Q= {0}i÷{0}j = a÷z {0}R= {0}i%{0}j = a%z {0}y= {0}y2 - {0}Q{0}y1= 0 - (a÷z)*1= -a÷z {1}i= {0}j= z {1}j= {0}R= a%z {1}y2= {0}y1= 1 {1}y1= {0}y= -a÷zAfter iteration 0, i and j are both less than or equal to z. Therefore, Only the first division will be a multi-precision one. After that, we can use the CPUs division instruction directly.
We also note that the value Q is a multi-precision one in iteration 0. After that, Q is also a small integer fitting into a machine word. Therefore, the product Qy1 is calculated only once in multi-precision numbers and that happens when y1 is 1. Therefore, I don't need a multi×multi multiplier, only multi×scalar.
So, let's write the algorithm with these in mind with doubled-up operators representing multi-precision operations (except for + and -).
modinv( z, a ): verify 0 < z < a i= a, j= z, y2= 0, y1= 1 Q= i÷÷j, R= i%%j ⇒ Q=a÷÷z, R=a%%z y= y2 - Qy1 ⇒ y= 0 - Q*1= -Q i= j, j= R, y2= y1, y1= y ⇒ i= z, j= R= a%%z, y2=1, y1= y= -Q while j>0 Q= i÷j, R= i%j y= y2 - y1××Q i= j, j= R, y2= y1, y1= y if i≠1 return ERROR return y2≥0 ? y2 : y2+aIf we re-arrange the part before the loop, we have:
modinv( z, a ): verify 0 < z < a i= z, j= a%%z, y2=1, y1= -a÷÷z while j>0 Q= i÷j, R= i%j y= y2 - y1××Q i= j, j= R, y2= y1, y1= y if i≠1 return ERROR return y2≥0 ? y2 : y2+aWe now need the following multi-precision operators:
Let's make an example division A/F.
A= a3a2a1a0 b3b2b1b0 c3c2c1c0 d3d2d1d0 F= f3f2f1f0Now, let's look at all the multiples of F we need to optionally subtract from A:
a3a2a1a0 b3b2b1b0 c3c2c1c0 d3d2d1d0 f3f2f1f0 0 0 0 0 0 0 0 0 0 0 0 0 0 f3f2f1 f00 0 0 0 0 0 0 0 0 0 0 0 0 f3f2 f1f00 0 0 0 0 0 0 0 0 0 0 0 0 f3 f2f1f00 0 0 0 0 0 0 0 0 0 0 0 0 f3f2f1f0 0 0 0 0 0 0 0 0 0 0 0 0 0 f3f2f1 f00 0 0 0 0 0 0 0 0 0 0 0 0 f3f2 f1f00 0 0 0 0 0 0 0 0 0 0 0 0 f3 f2f1f00 0 0 0 0 0 0 0 0 0 0 0 0 f3f2f1f0 0 0 0 0 0 0 0 0 0 0 0 0 0 f3f2f1 f00 0 0 0 0 0 0 0 0 0 0 0 0 f3f2 f1f00 0 0 0 0 0 0 0 0 0 0 0 0 f3 f2f1f00 0 0 0 0 0 0 0 0 0 0 0 0 f3f2f1f0Let's assume an in-place divider where A is modified at each round to contain the remainder from the last subtraction.
As you can see, we have a lot of zeros involved and the significant bits of the subtracted value always fits in two machine words. Zeros to the right of the significant two words have no effect at all. We don't need to store or process them. Zeros to the left don't need to be subtracted either. Since we subtract a value from the accumulator only when the accumulator value is already larger, then we guarantee that there will be no borrow past the most significant bit of the subtracted value, the position of f3.
The basic algorithm can be summarized as follows, for a big-endian storage format:
100: div(A,F): 101: hi = F << #leading-zeros(F), lo = 0 102: Q= 0, bit= |A|*bits_per_word-#bits(F) 103: P= A 104: while P+1 < A + |A| 105: if P[0]:P[1] ≥ hi:lo then P[0]:P[1] -= hi:lo, Q(bit)= 1 106: bit-- 107: if lo=F then (hi,lo)= (F,0), P= P+1 108: hi:lo >>= 1 110: R= A[|A|-1]The loop leaves P pointing at the least significant word of A.
It's worth noting that when we make a subtraction at (105), the highest set bit of P[0]:P[1] gets cleared. That is, at all times P[0]:P[1] ≤ 2(hi:lo)-1. If this wasn't true, then we must have missed the subtraction opportunity in the previous iteration of the loop, which is impossible.
At this point, all the words except the least significant word of A has been zeroed out by the subtractions. Which leads me to the question, why modify A at all? Since we won't be using those zeroed-out words of A later on, we don't need to store our intermediate values in A. We can simply store them in some temporary variables just like we did for F. So, here is the updated algorithm:
100: div(A,F): 101: Fhi:Flo= F<<#leading-zeros(F):0 102: Q= 0, bit= |A|*bits_per_word-#bits(F) 103: P= A, Ahi:Alo= P[0]:P[1] 104: while P+1 < A + |A| 105: if Ahi:Alo ≥ Fhi:Flo then Ahi:Alo-= Fhi:Flo, Q(bit)= 1 106: bit-- 108: if Flo=F then (Fhi,Flo)= (F,0) 109: P= P+1, 110: (Ahi,Alo)= (Alo,P[1]) 111: Fhi:Flo >>= 1 112: R= AhiThe above algorithm has a slight error. Access to P[1] is not guaranteed at (103) and (110). These should be made after checking that P[1] is indeed within bounds.