Basic Operations for Karatsuba Algorithm

We need to write the following basic operators:

Simple Multiplication

Here is some pseudo-code to help with implementation in C or assembly.
     // lX= length of X. X= address of LSW
100: function R = <A, lA> × <B, lB>
101:  for(iB=0;iB!=lB;iB++)
102:    Hp= 0
103:    C= 0
104:    for(iA=0;iA!=lA;iA++)
105:      H:L= A[iA] * B[iB]
106:      C:R[iA+iB]= R[iA+iB] + L + Hp + C
107:      Hp= H
108:    R[iA+iB]= Hp+C // actually R[lA+iB]= .. 
                       // since iA=lA at this point
Since the addition at (106) has four operands, the maximum value of the carry is 2. This requires 2 bits, which can't be handled by most CPUs flags register. Therefore, C needs its own register and addition instruction.

In (108), we directly set the value of R[iA+iB] without adding its previous value to Hp+C. This is because it's the first time we actually touch R[iA+iB]. The array R is assumed to be all zeros initially, therefore adding wouldn't make a difference.

At this point, we're guaranteed to not generate a carry to be propagated further. This is because R[lA+iB .. 0] is the result of A[lA-1..0]×B[iB..0]. This product is guaranteed to fit in lA+iB+1 words. Therefore, there can be no carry.

We can optimize this function by making use of the instruction "inc" for the x86-64 architecture. This instruction increments an index and simultaneously compares it to zero. For this to work, we need to translate A by lA, then use negative indices for iA. When we do that, we also need to translate R by lA.

100: function R = <A, lA> × <B, lB>
101:  A += lA
102:  R += lA
103:  for(iB=0;iB!=lB;iB++)
104:    Hp= 0
105:    C= 0
106:    for(iA=-lA;iA!=0;iA++)
107:      H:L= A[iA] * B[iB]
108:      C:R[iA+iB]= R[iA+iB] + L + Hp + C
109:      Hp= H
110:    R[iA+iB]= Hp+C 
At (110), since we know that iA=0, we can replace it with 0.
110:    R[iB]= Hp+C 
Now let's examine the outer loop body. In there, accesses to R always has an iB component. Instead of R[iA+iB], we could just easily write (R+iB)[iA]. This means that, we can increment R in parallel with iB, and everything would work the same.
100: function R = <A, lA> × <B, lB>
101:  A += lA
102:  R += lA
103:  for(iB=0;iB!=lB;iB++)
104:    Hp= 0
105:    C= 0
106:    for(iA=-lA;iA!=0;iA++)
107:      H:L= A[iA] * B[iB]
108:      C:R[iA]= R[iA] + L + Hp + C
109:      Hp= H
110:    R[0]= Hp+C 
111:    R++
Now we can do the "inc" trick for the outer loop as well. Nothing but B[iB] refers to iB. Therefore, simply updating the loop and the single access will suffice.
100: function R = <A, lA> × <B, lB>
101:  R += lA
102:  A += lA
103:  B += lB
104:  for(iB=-lB;iB!=0;iB++)
105:    Hp= 0
106:    C= 0
107:    for(iA=-lA;iA!=0;iA++)
108:      H:L= A[iA] * B[iB]
109:      C:R[iA]= R[iA] + L + Hp + C
110:      Hp= H
111:    R[0]= Hp+C
112:    R++
This can be further optimized into pseudo-assembly. The variable iB can be aliased to lB, since the value of lB is not needed after the initial setup.
100: function R = <A, lA> × <B, lB>
101:    R += lA
102:    A += lA
103:    B += lB
104:    lA = -lA
105:    lB = -lB
106:    lB--
107:    goto outer_next
108: outer:
109:    Hp= 0
110:    C= 0
111:    iA= lA
112:    iA--
113:    goto inner_next
114: inner:
115:      H:L= A[iA] * B[lB]
116:      C:R[iA]= R[iA] + L + Hp + C
117:      Hp= H
118: inner_next:
119:      iA++
120:      if (iA!=0) goto inner
121:    R[0]= Hp+C
122:    R++
123: outer_next:
124:    lB++
125:    if (lB!=0) goto outer
Here is an idiom used for optimizing the loops.
  for(x=-V;x!=0;x++)
    ...body...
is translated into:
200:   x= -V
201:   x--
202:   goto loop_next
203: loop:
204    .. body ..
205: loop_next:
206:   x++
207:   if (x!=0) goto loop
The decrement at (201) and the increment at (206) balance each other out. This eliminates the comparison against zero at the start of the loop, which also prevents the corruption of the carry flag. Although the carry flag is not used in this code, I think this is a good way of standardizing loops.

We could also re-arrange it as follows:

200:   x= -V
201:   x--
202: loop:
203:   x++
204:   if (x==0) goto loop_exit
205    .. body ..
206:   goto loop
206: loop_exit:
However, this has two jump instructions within the loop body: once for exit, once for iteration. This can be fixed by making a combination of the two, but our savings for the initial test is now gone, and the code is more complex.
200:   x= -V
201:   x--
202:   x++
203:   if (x==0) goto loop_exit
204: loop:
205    .. body ..
206:   x++
207:   if (x!=0) goto loop
208: loop_exit:

Addition to Accumulator

The accumulator R has been filled with some non-zero stuff beforehand. Thus we need to continue the carry until its last digit if necessary. If it still overflows there, we don't need to do anything because we're only interested in the given number of digits anyway.
function <R, lR> += <A, lA> 
  C= 0
  for(i=0;i<lA;i++)
    C:R[i]= R[i] + A[i] + C
  for(;C && i<lR;i++)
    C:R[i]= R[i] + C
Now, we know that |R|≥|A| because it's used in (8) directly and in (17) indirectly as part of ⊕=. In both cases this is true.

We can do the "inc trick" on the first loop.

function <R, lR> += <A, lA> 
  C= 0
  A+= lA
  R+= lA
  lR -= lA
  for(i=-lA;i!=0;i++)
    C:R[i]= R[i] + A[i] + C
  for(;C && i<lR;i++)
    C:R[i]= R[i] + C
Now, we can also do it for the second loop, but we need a different base for that.
function <R, lR> += <A, lA> 
  C= 0
  A+= lA
  R+= lA
  lR -= lA
  K = R + lR
  for(i=-lA;i!=0;i++)
    C:R[i]= R[i] + A[i] + C
  for(i=-lR;C && i!=0;i++)
    C:K[i]= K[i] + C
The carry is one bit since there is only one (carry-included) addition. Therefore, the carry bit in the flags register is perfectly suitable to store this value.

We can now optimize it to our pseudo-assembly code. As before, i can be aliased to lA and lR in the two loops.

function <R, lR> += <A, lA> 
  C= 0
  A+= lA
  R+= lA
  lR -= lA
  K = R + lR
  lA = -lA
  lR = -lR

  lA--
  goto loop1_next
loop1:
    C:R[lA]= R[lA] + A[lA] + C
loop1_next:
    lA++
    if (lA!=0) goto loop1

  lR--
  goto loop2_next
loop2:
    if (!C) return 
    C:K[lR]= K[lR] + C
loop2_next:
    lR++
    if (lR!=0) goto loop2

Subtraction from Accumulator

This is not present by itself, but will be used as part of the ⊕= operator. Its code is the same as addition, but the values are subtracted instead of being added.

Fancy Subtraction

The ⊖ operator needs to compute the absolute value and the sign of the difference of its operands. It's used in two places:
 13:   Sa,Ta = a0 ⊖ a1
 14:   Sb,Tb = b1 ⊖ b0
Here, we know that |a1|≥|a0| but |b0| maybe more or less than |b1|. Therefore, the ⊖ operator needs to make a comparison between these two. We can combine comparison and subtraction into one loop. After the subtraction is done, if we have a carry we negate the result and flip the sign bit.
100: function S,R = <A,lA> ⊖ <B,lB>
102:   S= 0
103:   if lA < lB  : <A,lA> ↔ <B,lB>, S=1
104:   C= 0
105:   for(i=0;i<lB;i++) C:R[i]= A[i]-B[i]-C
107:   for(;i<lA;i++)    C:R[i]= A[i]-C
109:   if !C return 
110    C= 0
110:   for(i=0;i<lA;i++) C:R[i]= 0 - R[i] - C
112:   S ^= 1
The carry value C is just one bit, carry bit of the flags register can handle it. Optimizing the loops is therefore beneficial. We now have two more pointers, K and U. These point to the end of A and R, respectively.
100: function S,R = <A,lA> ⊖ <B,lB>
101:        S= 0
102:        if lA < lB  : <A,lA> ↔ <B,lB>, S=1
103:        C= 0
104:        K = A + lA
105:        U = R + lA
106:        R += lB
107:        A += lB
108:        B += lB
109:        i= -lB
110:        i--
111:        goto loop1_next
112: loop1:
113:        C:R[i]= A[i]-B[i]-C
114: loop1_next:
115:        i++
116:        if (i!=0) goto loop1
117:        i= -(lA-lB) = lB-lA
118:        i--
119:        goto loop2_next
120: loop2:
121:        C:U[i]= K[i]-C 
122: loop2_next:
123:        i++
124:        if (i!=0) goto loop2
125:        if !C return 
126:        C= 0
127:        i= -lA
128:        i--
129:        goto loop3_next
130: loop3:
131:        C:U[i]= 0-U[i]-C
132: loop3_next:
133:        i++
134:        if (i!=0) goto loop3
135:        S ^= 1
There is a lot to be optimized in this code. First of all, assignments to i in various places will mess up the carry bit. These should be done before carry is set. Also, the variables A and R are never used after loop1 is done. Therefore, K and U can be aliased to these with some care.

In the first loop, R, B and A are all supposed to be advanced by lB words. For the rest of the loops, R and A are supposed to be further advanced by lA-lB words. If we store this difference somewhere, we can make good use of it.

100: function S,R = <A,lA> ⊖ <B,lB>
101:        S= 0
102:        if lA < lB  : <A,lA> ↔ <B,lB>, S=1
103:        D= lA-lB
104:        R += lB
105:        A += lB
106:        B += lB
107:        i= -lB
108:        C= 0
109:        i--
110:        goto loop1_next
111: loop1:
112:        C:R[i]= A[i]-B[i]-C
113: loop1_next:
114:        i++
115:        if (i!=0) goto loop1
116:        R+= D
117:        A+= D
118:        i= -(lA-lB) = lB-lA= -D
119:        i--
120:        goto loop2_next
121: loop2:
122:        C:R[i]= A[i]-C 
123: loop2_next:
124:        i++
125:        if (i!=0) goto loop2
126:        if !C return 
127:        C= 0
128:        i= -lA
129:        i--
130:        goto loop3_next
131: loop3:
132:        C:R[i]= 0-R[i]-C
133: loop3_next:
134:        i++
135:        if (i!=0) goto loop3
136:        S ^= 1
In the first loop, i can be replaced with lB, provided that it's negated first.

After line 128, there is no more reference lA. Therefore i can be replaced with lA. If we negate lA early on, around (105) for example, we can get rid of the carry-messing negation.

In the second loop, -D can be used instead of i. Since D also has to be known at this point, we need to store both the positive and negative values of D. With these in mind, let's rewrite:

100: function S,R = <A,lA> ⊖ <B,lB>
101:        S= 0
102:        if lA < lB  : <A,lA> ↔ <B,lB>, S=1
103:        D= lA-lB
104:        mD= -D
105:        R += lB
106:        A += lB
107:        B += lB
108:        lB= -lB
109:        lA= -lA
110:        C= 0
111:        lB--
112:        goto loop1_next
113: loop1:
114:        C:R[lB]= A[lB]-B[lB]-C
115: loop1_next:
116:        lB++
117:        if (lB!=0) goto loop1
118:        R+= D
119:        A+= D
120:        mD--
121:        goto loop2_next
122: loop2:
123:        C:R[mD]= A[mD]-C 
124: loop2_next:
125:        mD++
126:        if (mD!=0) goto loop2
127:        if !C return 
128:        C= 0
129:        lA--
130:        goto loop3_next
131: loop3:
132:        C:R[lA]= 0-R[lA]-C
133: loop3_next:
134:        lA++
135:        if (lA!=0) goto loop3
136:        S ^= 1

Fancy Addition

The ⊕= operator is pretty simple. If the given sign is positive (=0), then addition to accumulator is done. Otherwise (sign=1), subtraction from accumulator is done.

Mixed Sum

This is the most complicated of them all. The basic algorithm was outlined before. Let's remember it and discuss.
mixed_sum(R,M):
  carry_l= 0
  carry_h= 0
  for i = 0 .. M-1
    Yi= R[M+i]
    ui= R[2M+i]
    yi= R[i]
    Ui= R[3M+i]
    carry_l:R[M+i]= Yi+ui+yi+carry_l
    carry_h:R[2M+i]= Yi+ui+Ui+carry_h
  propagate carry_l to R{2M}
  propagate carry_h to R{3M}
This basic outline is wrong because the number of Ui terms isn't guaranteed to be the same as the others. The number of U terms can be anywhere between 1 and M+2, and this creates a problem: the loop should be divided into two (for short U case) and there should be another loop to process the remaining digits of U (for long U case).

Let's remember that we're trying to compute the following addition.

     R=   [UUUuuuuuuuuuuYYYYYYYYYYyyyyyyyyyy]
 z0<<M=   [   YYYYYYYYYYyyyyyyyyyy          ]  //spaces are 0s
 z2<<M=   [          UUUuuuuuuuuuu          ]
Instead of trying to add everything in one loop, we can first compute Y+u sum, and then add the U/y parts, like this:
     R=   [UUUuuuuuuuuuuYYYYYYYYYYyyyyyyyyyy]
          [   YYYYYYYYYYuuuuuuuuuu          ]
 add these, then add:
          [          UUU                    ]
 and then this one:
          [             yyyyyyyyyy          ]
In order to preserve the U values, we mustn't propagate the carry beyond R[3M-1] in the first addition. After doing the second addition, we can propagate that carry to R{3M}. So, the algorithm becomes:
mixed_sum(R, lR, M):
  Ci= 0
  Y= R + M
  u= R + 2M
  U= R + 3M
  for(i=0;i<M;i++) 
  {
    Ci:V= Y[i] + u[i] + Ci
    Y[i]= V
    u[i]= V
  }
  if (Ci) // propagate Ci to R{2M..3M-1}
  {
    C= 1
    for(i=0;C && i<M;i++)  C:u[i]= u[i]+C
  }
  // Ci will now be delayed until later
  lU= lR-3M
  C= 0
  for(i=0;i<lU;i++) C:R[2M+i]= R[2M+i]+U[i]+C
  if (Ci) // propagate Ci to R{3M}
  {
    C= 1
    for(i=0;C && i<lU;i++) C:R[3M+i]= R[3M+i]+C
  }
  // add in the y parts..
  C= 0
  for(i=0;i<M;i++) C:R[M+i]= R[M+i] + R[i]+ C
  // propagate the carry
  if (C)
  {
    C= 1
    for(i=0;C && i<lR-2M;i++) C:R[2M+i]= R[2M+i]+C
  }
As the algorithm suggests, a new basic operation called carry will simplify our lives. It should return the last carry bit generated.
mixed_sum(R, lR, M):
  Ci= 0
  Y= R + M
  u= R + 2M
  U= R + 3M
  for(i=0;i<M;i++) 
  {
    Ci:V= Y[i] + u[i] + Ci
    Y[i]= V
    u[i]= V
  }
  carry(R+2M, M, Ci)
  lU= lR-3M
  C= 0
  for(i=0;i<lU;i++) C:R[2M+i]= R[2M+i]+U[i]+C
  carry(R+3M, lU, Ci)
  C= 0
  for(i=0;i<M;i++) C:R[M+i]= R[M+i] + R[i]+ C
  carry(R+2M, lR-2M, C)
I can also convert the addition loops into further operators:
100:mixed_sum(R, lR, M):
101:  Y= R + M, u= R + 2M, U= R + 3M, lU= lR-3M
102:  Ci= add(Y, u, M)
103:  copy(u, Y, M)
104:  carry(u, M, Ci)
105:  C= add(u, U, lU) 
106:  carry(u+lU, lR-2M-lU, C)
107:  carry(U, lU, Ci)
108:  C= add(Y, y, M)
109:  carry(u, lR-2M, C)
In (105) and (106), we add U to u and then propagate the carry to the end. This is in fact the same thing as addition to an accumulator. The same thing applies to (108) and (109). Let's replace those.
100:mixed_sum(R, lR, M):
101:  Y= R + M, u= R + 2M, U= R + 3M, lU= lR-3M
102:  Ci= add(Y, u, M)
103:  copy(u, Y, M)
104:  carry(u, M, Ci)
105:  <u,lR-2M> += <U, lU> 
106:  carry(U, lU, Ci)
107:  <Y,lR-M> += <y, M>
Since the "add" operation is now used only once, we can combine it with the "copy" operation to improve cache performance.
100:mixed_sum(R, lR, M):
101:  Y= R + M, u= R + 2M, U= R + 3M, lU= lR-3M
102:  Ci= add_copy(Y, u, M)
103:  carry(u, M, Ci)
104:  <u,lR-2M> += <U, lU> 
105:  carry(U, lU, Ci)
106:  <Y,lR-M> += <y, M>

add_copy

This is pretty straightforward. Both the destination and source arrays are updated with the sum. The length L is the same for both arrays.
add_copy(D, S, L):  // destination, source, length
  C= 0
  D+= L, S+= L
  L= -L
  L--
  goto loop_next
loop:
  C:V= D[L] + S[L]+C
  D[L]= S[L]= V
loop_next
  L++
  if (L!=0) goto loop
  return C

Carry

This operator propagates a carry for a given number of words. It returns the last carry bit generated.
carry(D, lD, C)
  if (!C) return 0
  D+= lD
  lD= -lD
  C=1
  lD--
  goto loop_next
loop:
  if (!C) return 0
  D[lD] += C
loop_next:
  lD++
  if (lD!=0) goto loop
  return C
This can be simplified a little bit by removing the assignment and check to C in the initial part:
carry(D, lD, C)
  D+= lD
  lD= -lD
  lD--
  goto loop_next
loop:
  if (!C) return 0
  D[lD] += C
loop_next:
  lD++
  if (lD!=0) goto loop
  return C

Improvement to the Simple Multiplication

Here is the initial version I discussed above.
     // lX= length of X. X= address of LSW
100: function R = <A, lA> × <B, lB>
101:  for(iB=0;iB!=lB;iB++)
102:    Hp= 0
103:    C= 0
104:    for(iA=0;iA!=lA;iA++)
105:      H:L= A[iA] * B[iB]
106:      C:R[iA+iB]= R[iA+iB] + L + Hp + C
107:      Hp= H
108:    R[iA+iB]= Hp+C
The problem with this algorithm is the addition at (106). For correct operation, the array R has to be initialized to all zeros. Zeroing out such arrays will fill up our write buffers unnecessarily, I'd like to avoid it if possible.

At each iteration of iB, words R[iB..iB+lA-1] are read and words R[iB..iB+lA] are written to. The last write in the iteration is done at (108). There is a discussion about why this is correct in previous sections.

If I separate the very first iteration of iB, I can skip reading R[0..lA-1] because they are supposed to be zero. So, the equivalent of the statement at (106) becomes:

      C:R[iA+iB]= L + Hp + C
for this first iteration. This avoids reads from R and therefore initializes R[0..lA-1] correctly. The next statement which is the equivalent of (108) will initialize R[lA]. By the end of the first iteration R[0..lA] will be initialized correctly, which is all that is needed for the next iteration.

Therefore, the algorithm becomes:

100: function R = <A, lA> × <B, lB>
101:  Hp= 0
102:  C= 0
103:  for(iA=0;iA!=lA;iA++)
104:    H:L= A[iA] * B[0]
105:    C:R[iA]= L + Hp + C
106:    Hp= H
107:  R[iA]= Hp+C
108:  for(iB=0;iB!=lB;iB++)
109:    Hp= 0
110:    C= 0
111:    for(iA=0;iA!=lA;iA++)
112:      H:L= A[iA] * B[iB]
113:      C:R[iA+iB]= R[iA+iB] + L + Hp + C
114:      Hp= H
115:    R[iA+iB]= Hp+C 
Converting this to our pseudo-assembly code is pretty straightforward since we can just copy from the original conversion. I'll do that in a minute.

In (103) thru (107), a single bit is sufficient for the carry. However, I can't use the carry flag in the flags register on x86 because the multiplication instruction messes up the carry flag. If I could use mulx, then this wouldn't be a problem.

In any case, here is the optimized version in pseudo-assembly.

100: function R = <A, lA> × <B, lB>
101:    R += lA
102:    A += lA
103:    B += lB
104:    lA = -lA
105:    lB = -lB
106:    lB--
107:    lB++
108:    if (lB==0) return ;
109:    Hp= 0
110:    C= 0
111:    iA= lA
112:    iA--
113:    goto first_next
114: first:
115:    H:L= A[iA]*B[lB]
116:    C:R[iA]= L + Hp + C
117:    Hp= H
118: first_next:
119:    iA++
120:    if (iA!=0) goto first
121:    R[0]= Hp+C
122:    R++
123:    goto outer_next
124: outer:
125:    Hp= 0
126:    C= 0
127:    iA= lA
128:    iA--
129:    goto inner_next
130: inner:
131:      H:L= A[iA] * B[lB]
132:      C:R[iA]= R[iA] + L + Hp + C
133:      Hp= H
134: inner_next:
135:      iA++
136:      if (iA!=0) goto inner
137:    R[0]= Hp+C
138:    R++
139: outer_next:
140:    lB++
141:    if (lB!=0) goto outer
Notice that (121)..(123) is identical to (137)..(139). We can simply jump there from (121).
100: function R = <A, lA> × <B, lB>
101:    R += lA
102:    A += lA
103:    B += lB
104:    lA = -lA
105:    lB = -lB
106:    lB--
107:    lB++
108:    if (lB==0) return ;
109:    Hp= 0
110:    C= 0
111:    iA= lA
112:    iA--
113:    goto first_next
114: first:
115:    H:L= A[iA]*B[lB]
116:    C:R[iA]= L + Hp + C
117:    Hp= H
118: first_next:
119:    iA++
120:    if (iA!=0) goto first
121:    goto outer_continue
122: outer:
123:    Hp= 0
124:    C= 0
125:    iA= lA
126:    iA--
127:    goto inner_next
128: inner:
129:      H:L= A[iA] * B[lB]
130:      C:R[iA]= R[iA] + L + Hp + C
131:      Hp= H
132: inner_next:
133:      iA++
134:      if (iA!=0) goto inner
135: outer_continue:
136:    R[0]= Hp+C
137:    R++
138: outer_next:
139:    lB++
140:    if (lB!=0) goto outer
I tested this, but it has no effect whatsoever. Here is the comparison graph:

I tested with even input size at 1 million digits. There's still no difference. I will keep this as my × operator because it might have an effect on hardware with less cache.