Here is the algorithm as specified in [2]

modinv( z, a ): // compute zThe most challenging operator to handle correctly is the division. In my implementation, the value 'z' will be a small integer that fits into a machine word. If we write down the values of i,j and R, we see that we only need a division-by-scalar operator.^{-1}st. z*z^{-1}≡ 1 (mod a) verify 0 < z < a i= a, j= z, y_{2}= 0, y_{1}= 1 while j>0 Q= i÷j, R= i%j y= y_{2}- Qy_{1}i= j, j= R, y_{2}= y_{1}, y_{1}= y if i≠1 return ERROR return y_{2}≥0 ? y_{2}: y_{2}+a

{0}yAfter iteration 0, i and j are both less than or equal to z. Therefore, Only the first division will be a multi-precision one. After that, we can use the CPUs division instruction directly._{2}= 0 {0}y_{1}= 1 {0}i= a {0}j= z {0}Q= {0}i÷{0}j = a÷z {0}R= {0}i%{0}j = a%z {0}y= {0}y_{2}- {0}Q{0}y_{1}= 0 - (a÷z)*1= -a÷z {1}i= {0}j= z {1}j= {0}R= a%z {1}y_{2}= {0}y_{1}= 1 {1}y_{1}= {0}y= -a÷z

We also note that the value Q is a multi-precision one in iteration 0.
After that, Q is also a small integer fitting into a machine word.
Therefore, the product Qy_{1} is calculated only once
in multi-precision numbers and that happens when y_{1} is 1.
Therefore, I don't need a multi×multi multiplier, only multi×scalar.

So, let's write the algorithm with these in mind with doubled-up operators representing multi-precision operations (except for + and -).

modinv( z, a ): verify 0 < z < a i= a, j= z, yIf we re-arrange the part before the loop, we have:_{2}= 0, y_{1}= 1 Q= i÷÷j, R= i%%j ⇒ Q=a÷÷z, R=a%%z y= y_{2}- Qy_{1}⇒ y= 0 - Q*1= -Q i= j, j= R, y_{2}= y_{1}, y_{1}= y ⇒ i= z, j= R= a%%z, y_{2}=1, y_{1}= y= -Q while j>0 Q= i÷j, R= i%j y= y_{2}- y_{1}××Q i= j, j= R, y_{2}= y_{1}, y_{1}= y if i≠1 return ERROR return y_{2}≥0 ? y_{2}: y_{2}+a

modinv( z, a ): verify 0 < z < a i= z, j= a%%z, yWe now need the following multi-precision operators:_{2}=1, y_{1}= -a÷÷z while j>0 Q= i÷j, R= i%j y= y_{2}- y_{1}××Q i= j, j= R, y_{2}= y_{1}, y_{1}= y if i≠1 return ERROR return y_{2}≥0 ? y_{2}: y_{2}+a

- Division by scalar, yielding a multi-precision quotient and a single-word remainder.
- Multiplication by a word.
- Addition.
- Subtraction.

Let's make an example division A/F.

A= aNow, let's look at all the multiples of F we need to optionally subtract from A:_{3}a_{2}a_{1}a_{0}b_{3}b_{2}b_{1}b_{0}c_{3}c_{2}c_{1}c_{0}d_{3}d_{2}d_{1}d_{0}F= f_{3}f_{2}f_{1}f_{0}

aLet's assume an in-place divider where A is modified at each round to contain the remainder from the last subtraction._{3}a_{2}a_{1}a_{0}b_{3}b_{2}b_{1}b_{0}c_{3}c_{2}c_{1}c_{0}d_{3}d_{2}d_{1}d_{0}f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }0_{ }f_{3}f_{2}f_{1}f_{0}

As you can see, we have a lot of zeros involved and the significant bits
of the subtracted value always fits in two machine words.
Zeros to the right of the significant two words have no effect at all. We
don't need to store or process them. Zeros to the left don't need to be
subtracted either. Since we subtract a value from the accumulator only
when the accumulator value is already larger, then we guarantee that there
will be no borrow past the most significant bit of the subtracted value,
the position of f_{3}.

The basic algorithm can be summarized as follows, for a big-endian storage format:

100: div(A,F): 101: hi = F << #leading-zeros(F), lo = 0 102: Q= 0, bit= |A|*bits_per_word-#bits(F) 103: P= A 104: while P+1 < A + |A| 105: if P[0]:P[1] ≥ hi:lo then P[0]:P[1] -= hi:lo, Q(bit)= 1 106: bit-- 107: if lo=F then (hi,lo)= (F,0), P= P+1 108: hi:lo >>= 1 110: R= A[|A|-1]The loop leaves P pointing at the least significant word of A.

It's worth noting that when we make a subtraction at (105), the highest set bit of P[0]:P[1] gets cleared. That is, at all times P[0]:P[1] ≤ 2(hi:lo)-1. If this wasn't true, then we must have missed the subtraction opportunity in the previous iteration of the loop, which is impossible.

At this point, all the words except the least significant word of A has been zeroed out by the subtractions. Which leads me to the question, why modify A at all? Since we won't be using those zeroed-out words of A later on, we don't need to store our intermediate values in A. We can simply store them in some temporary variables just like we did for F. So, here is the updated algorithm:

100: div(A,F): 101: Fhi:Flo= F<<#leading-zeros(F):0 102: Q= 0, bit= |A|*bits_per_word-#bits(F) 103: P= A, Ahi:Alo= P[0]:P[1] 104: while P+1 < A + |A| 105: if Ahi:Alo ≥ Fhi:Flo then Ahi:Alo-= Fhi:Flo, Q(bit)= 1 106: bit-- 108: if Flo=F then (Fhi,Flo)= (F,0) 109: P= P+1, 110: (Ahi,Alo)= (Alo,P[1]) 111: Fhi:Flo >>= 1 112: R= AhiThe above algorithm has a slight error. Access to P[1] is not guaranteed at (103) and (110). These should be made after checking that P[1] is indeed within bounds.

- Wikipedia entry
- NIST.FIPS.186 Contains a lot of commonly used algorithms.