// lX= length of X. X= address of LSW 100: function R = <A, lA> × <B, lB> 101: for(iB=0;iB!=lB;iB++) 102: Hp= 0 103: C= 0 104: for(iA=0;iA!=lA;iA++) 105: H:L= A[iA] * B[iB] 106: C:R[iA+iB]= R[iA+iB] + L + Hp + C 107: Hp= H 108: R[iA+iB]= Hp+C // actually R[lA+iB]= .. // since iA=lA at this pointSince the addition at (106) has four operands, the maximum value of the carry is 2. This requires 2 bits, which can't be handled by most CPUs flags register. Therefore, C needs its own register and addition instruction.
In (108), we directly set the value of R[iA+iB] without adding its previous value to Hp+C. This is because it's the first time we actually touch R[iA+iB]. The array R is assumed to be all zeros initially, therefore adding wouldn't make a difference.
At this point, we're guaranteed to not generate a carry to be propagated further. This is because R[lA+iB .. 0] is the result of A[lA-1..0]×B[iB..0]. This product is guaranteed to fit in lA+iB+1 words. Therefore, there can be no carry.
We can optimize this function by making use of the instruction "inc" for the x86-64 architecture. This instruction increments an index and simultaneously compares it to zero. For this to work, we need to translate A by lA, then use negative indices for iA. When we do that, we also need to translate R by lA.
100: function R = <A, lA> × <B, lB> 101: A += lA 102: R += lA 103: for(iB=0;iB!=lB;iB++) 104: Hp= 0 105: C= 0 106: for(iA=-lA;iA!=0;iA++) 107: H:L= A[iA] * B[iB] 108: C:R[iA+iB]= R[iA+iB] + L + Hp + C 109: Hp= H 110: R[iA+iB]= Hp+CAt (110), since we know that iA=0, we can replace it with 0.
110: R[iB]= Hp+CNow let's examine the outer loop body. In there, accesses to R always has an iB component. Instead of R[iA+iB], we could just easily write (R+iB)[iA]. This means that, we can increment R in parallel with iB, and everything would work the same.
100: function R = <A, lA> × <B, lB> 101: A += lA 102: R += lA 103: for(iB=0;iB!=lB;iB++) 104: Hp= 0 105: C= 0 106: for(iA=-lA;iA!=0;iA++) 107: H:L= A[iA] * B[iB] 108: C:R[iA]= R[iA] + L + Hp + C 109: Hp= H 110: R[0]= Hp+C 111: R++Now we can do the "inc" trick for the outer loop as well. Nothing but B[iB] refers to iB. Therefore, simply updating the loop and the single access will suffice.
100: function R = <A, lA> × <B, lB> 101: R += lA 102: A += lA 103: B += lB 104: for(iB=-lB;iB!=0;iB++) 105: Hp= 0 106: C= 0 107: for(iA=-lA;iA!=0;iA++) 108: H:L= A[iA] * B[iB] 109: C:R[iA]= R[iA] + L + Hp + C 110: Hp= H 111: R[0]= Hp+C 112: R++This can be further optimized into pseudo-assembly. The variable iB can be aliased to lB, since the value of lB is not needed after the initial setup.
100: function R = <A, lA> × <B, lB> 101: R += lA 102: A += lA 103: B += lB 104: lA = -lA 105: lB = -lB 106: lB-- 107: goto outer_next 108: outer: 109: Hp= 0 110: C= 0 111: iA= lA 112: iA-- 113: goto inner_next 114: inner: 115: H:L= A[iA] * B[lB] 116: C:R[iA]= R[iA] + L + Hp + C 117: Hp= H 118: inner_next: 119: iA++ 120: if (iA!=0) goto inner 121: R[0]= Hp+C 122: R++ 123: outer_next: 124: lB++ 125: if (lB!=0) goto outerHere is an idiom used for optimizing the loops.
for(x=-V;x!=0;x++) ...body...is translated into:
200: x= -V 201: x-- 202: goto loop_next 203: loop: 204 .. body .. 205: loop_next: 206: x++ 207: if (x!=0) goto loopThe decrement at (201) and the increment at (206) balance each other out. This eliminates the comparison against zero at the start of the loop, which also prevents the corruption of the carry flag. Although the carry flag is not used in this code, I think this is a good way of standardizing loops.
We could also re-arrange it as follows:
200: x= -V 201: x-- 202: loop: 203: x++ 204: if (x==0) goto loop_exit 205 .. body .. 206: goto loop 206: loop_exit:However, this has two jump instructions within the loop body: once for exit, once for iteration. This can be fixed by making a combination of the two, but our savings for the initial test is now gone, and the code is more complex.
200: x= -V 201: x-- 202: x++ 203: if (x==0) goto loop_exit 204: loop: 205 .. body .. 206: x++ 207: if (x!=0) goto loop 208: loop_exit:
function <R, lR> += <A, lA> C= 0 for(i=0;i<lA;i++) C:R[i]= R[i] + A[i] + C for(;C && i<lR;i++) C:R[i]= R[i] + CNow, we know that |R|≥|A| because it's used in (8) directly and in (17) indirectly as part of ⊕=. In both cases this is true.
We can do the "inc trick" on the first loop.
function <R, lR> += <A, lA> C= 0 A+= lA R+= lA lR -= lA for(i=-lA;i!=0;i++) C:R[i]= R[i] + A[i] + C for(;C && i<lR;i++) C:R[i]= R[i] + CNow, we can also do it for the second loop, but we need a different base for that.
function <R, lR> += <A, lA> C= 0 A+= lA R+= lA lR -= lA K = R + lR for(i=-lA;i!=0;i++) C:R[i]= R[i] + A[i] + C for(i=-lR;C && i!=0;i++) C:K[i]= K[i] + CThe carry is one bit since there is only one (carry-included) addition. Therefore, the carry bit in the flags register is perfectly suitable to store this value.
We can now optimize it to our pseudo-assembly code. As before, i can be aliased to lA and lR in the two loops.
function <R, lR> += <A, lA> C= 0 A+= lA R+= lA lR -= lA K = R + lR lA = -lA lR = -lR lA-- goto loop1_next loop1: C:R[lA]= R[lA] + A[lA] + C loop1_next: lA++ if (lA!=0) goto loop1 lR-- goto loop2_next loop2: if (!C) return C:K[lR]= K[lR] + C loop2_next: lR++ if (lR!=0) goto loop2
13: Sa,Ta = a0 ⊖ a1 14: Sb,Tb = b1 ⊖ b0Here, we know that |a1|≥|a0| but |b0| maybe more or less than |b1|. Therefore, the ⊖ operator needs to make a comparison between these two. We can combine comparison and subtraction into one loop. After the subtraction is done, if we have a carry we negate the result and flip the sign bit.
100: function S,R = <A,lA> ⊖ <B,lB> 102: S= 0 103: if lA < lB : <A,lA> ↔ <B,lB>, S=1 104: C= 0 105: for(i=0;i<lB;i++) C:R[i]= A[i]-B[i]-C 107: for(;i<lA;i++) C:R[i]= A[i]-C 109: if !C return 110 C= 0 110: for(i=0;i<lA;i++) C:R[i]= 0 - R[i] - C 112: S ^= 1The carry value C is just one bit, carry bit of the flags register can handle it. Optimizing the loops is therefore beneficial. We now have two more pointers, K and U. These point to the end of A and R, respectively.
100: function S,R = <A,lA> ⊖ <B,lB> 101: S= 0 102: if lA < lB : <A,lA> ↔ <B,lB>, S=1 103: C= 0 104: K = A + lA 105: U = R + lA 106: R += lB 107: A += lB 108: B += lB 109: i= -lB 110: i-- 111: goto loop1_next 112: loop1: 113: C:R[i]= A[i]-B[i]-C 114: loop1_next: 115: i++ 116: if (i!=0) goto loop1 117: i= -(lA-lB) = lB-lA 118: i-- 119: goto loop2_next 120: loop2: 121: C:U[i]= K[i]-C 122: loop2_next: 123: i++ 124: if (i!=0) goto loop2 125: if !C return 126: C= 0 127: i= -lA 128: i-- 129: goto loop3_next 130: loop3: 131: C:U[i]= 0-U[i]-C 132: loop3_next: 133: i++ 134: if (i!=0) goto loop3 135: S ^= 1There is a lot to be optimized in this code. First of all, assignments to i in various places will mess up the carry bit. These should be done before carry is set. Also, the variables A and R are never used after loop1 is done. Therefore, K and U can be aliased to these with some care.
In the first loop, R, B and A are all supposed to be advanced by lB words. For the rest of the loops, R and A are supposed to be further advanced by lA-lB words. If we store this difference somewhere, we can make good use of it.
100: function S,R = <A,lA> ⊖ <B,lB> 101: S= 0 102: if lA < lB : <A,lA> ↔ <B,lB>, S=1 103: D= lA-lB 104: R += lB 105: A += lB 106: B += lB 107: i= -lB 108: C= 0 109: i-- 110: goto loop1_next 111: loop1: 112: C:R[i]= A[i]-B[i]-C 113: loop1_next: 114: i++ 115: if (i!=0) goto loop1 116: R+= D 117: A+= D 118: i= -(lA-lB) = lB-lA= -D 119: i-- 120: goto loop2_next 121: loop2: 122: C:R[i]= A[i]-C 123: loop2_next: 124: i++ 125: if (i!=0) goto loop2 126: if !C return 127: C= 0 128: i= -lA 129: i-- 130: goto loop3_next 131: loop3: 132: C:R[i]= 0-R[i]-C 133: loop3_next: 134: i++ 135: if (i!=0) goto loop3 136: S ^= 1In the first loop, i can be replaced with lB, provided that it's negated first.
After line 128, there is no more reference lA. Therefore i can be replaced with lA. If we negate lA early on, around (105) for example, we can get rid of the carry-messing negation.
In the second loop, -D can be used instead of i. Since D also has to be known at this point, we need to store both the positive and negative values of D. With these in mind, let's rewrite:
100: function S,R = <A,lA> ⊖ <B,lB> 101: S= 0 102: if lA < lB : <A,lA> ↔ <B,lB>, S=1 103: D= lA-lB 104: mD= -D 105: R += lB 106: A += lB 107: B += lB 108: lB= -lB 109: lA= -lA 110: C= 0 111: lB-- 112: goto loop1_next 113: loop1: 114: C:R[lB]= A[lB]-B[lB]-C 115: loop1_next: 116: lB++ 117: if (lB!=0) goto loop1 118: R+= D 119: A+= D 120: mD-- 121: goto loop2_next 122: loop2: 123: C:R[mD]= A[mD]-C 124: loop2_next: 125: mD++ 126: if (mD!=0) goto loop2 127: if !C return 128: C= 0 129: lA-- 130: goto loop3_next 131: loop3: 132: C:R[lA]= 0-R[lA]-C 133: loop3_next: 134: lA++ 135: if (lA!=0) goto loop3 136: S ^= 1
mixed_sum(R,M): carry_l= 0 carry_h= 0 for i = 0 .. M-1 Yi= R[M+i] ui= R[2M+i] yi= R[i] Ui= R[3M+i] carry_l:R[M+i]= Yi+ui+yi+carry_l carry_h:R[2M+i]= Yi+ui+Ui+carry_h propagate carry_l to R{2M} propagate carry_h to R{3M}This basic outline is wrong because the number of Ui terms isn't guaranteed to be the same as the others. The number of U terms can be anywhere between 1 and M+2, and this creates a problem: the loop should be divided into two (for short U case) and there should be another loop to process the remaining digits of U (for long U case).
Let's remember that we're trying to compute the following addition.
R= [UUUuuuuuuuuuuYYYYYYYYYYyyyyyyyyyy] z0<<M= [ YYYYYYYYYYyyyyyyyyyy ] //spaces are 0s z2<<M= [ UUUuuuuuuuuuu ]Instead of trying to add everything in one loop, we can first compute Y+u sum, and then add the U/y parts, like this:
R= [UUUuuuuuuuuuuYYYYYYYYYYyyyyyyyyyy] [ YYYYYYYYYYuuuuuuuuuu ] add these, then add: [ UUU ] and then this one: [ yyyyyyyyyy ]In order to preserve the U values, we mustn't propagate the carry beyond R[3M-1] in the first addition. After doing the second addition, we can propagate that carry to R{3M}. So, the algorithm becomes:
mixed_sum(R, lR, M): Ci= 0 Y= R + M u= R + 2M U= R + 3M for(i=0;i<M;i++) { Ci:V= Y[i] + u[i] + Ci Y[i]= V u[i]= V } if (Ci) // propagate Ci to R{2M..3M-1} { C= 1 for(i=0;C && i<M;i++) C:u[i]= u[i]+C } // Ci will now be delayed until later lU= lR-3M C= 0 for(i=0;i<lU;i++) C:R[2M+i]= R[2M+i]+U[i]+C if (Ci) // propagate Ci to R{3M} { C= 1 for(i=0;C && i<lU;i++) C:R[3M+i]= R[3M+i]+C } // add in the y parts.. C= 0 for(i=0;i<M;i++) C:R[M+i]= R[M+i] + R[i]+ C // propagate the carry if (C) { C= 1 for(i=0;C && i<lR-2M;i++) C:R[2M+i]= R[2M+i]+C }As the algorithm suggests, a new basic operation called carry will simplify our lives. It should return the last carry bit generated.
mixed_sum(R, lR, M): Ci= 0 Y= R + M u= R + 2M U= R + 3M for(i=0;i<M;i++) { Ci:V= Y[i] + u[i] + Ci Y[i]= V u[i]= V } carry(R+2M, M, Ci) lU= lR-3M C= 0 for(i=0;i<lU;i++) C:R[2M+i]= R[2M+i]+U[i]+C carry(R+3M, lU, Ci) C= 0 for(i=0;i<M;i++) C:R[M+i]= R[M+i] + R[i]+ C carry(R+2M, lR-2M, C)I can also convert the addition loops into further operators:
100:mixed_sum(R, lR, M): 101: Y= R + M, u= R + 2M, U= R + 3M, lU= lR-3M 102: Ci= add(Y, u, M) 103: copy(u, Y, M) 104: carry(u, M, Ci) 105: C= add(u, U, lU) 106: carry(u+lU, lR-2M-lU, C) 107: carry(U, lU, Ci) 108: C= add(Y, y, M) 109: carry(u, lR-2M, C)In (105) and (106), we add U to u and then propagate the carry to the end. This is in fact the same thing as addition to an accumulator. The same thing applies to (108) and (109). Let's replace those.
100:mixed_sum(R, lR, M): 101: Y= R + M, u= R + 2M, U= R + 3M, lU= lR-3M 102: Ci= add(Y, u, M) 103: copy(u, Y, M) 104: carry(u, M, Ci) 105: <u,lR-2M> += <U, lU> 106: carry(U, lU, Ci) 107: <Y,lR-M> += <y, M>Since the "add" operation is now used only once, we can combine it with the "copy" operation to improve cache performance.
100:mixed_sum(R, lR, M): 101: Y= R + M, u= R + 2M, U= R + 3M, lU= lR-3M 102: Ci= add_copy(Y, u, M) 103: carry(u, M, Ci) 104: <u,lR-2M> += <U, lU> 105: carry(U, lU, Ci) 106: <Y,lR-M> += <y, M>
add_copy(D, S, L): // destination, source, length C= 0 D+= L, S+= L L= -L L-- goto loop_next loop: C:V= D[L] + S[L]+C D[L]= S[L]= V loop_next L++ if (L!=0) goto loop return C
carry(D, lD, C) if (!C) return 0 D+= lD lD= -lD C=1 lD-- goto loop_next loop: if (!C) return 0 D[lD] += C loop_next: lD++ if (lD!=0) goto loop return CThis can be simplified a little bit by removing the assignment and check to C in the initial part:
carry(D, lD, C) D+= lD lD= -lD lD-- goto loop_next loop: if (!C) return 0 D[lD] += C loop_next: lD++ if (lD!=0) goto loop return C
// lX= length of X. X= address of LSW 100: function R = <A, lA> × <B, lB> 101: for(iB=0;iB!=lB;iB++) 102: Hp= 0 103: C= 0 104: for(iA=0;iA!=lA;iA++) 105: H:L= A[iA] * B[iB] 106: C:R[iA+iB]= R[iA+iB] + L + Hp + C 107: Hp= H 108: R[iA+iB]= Hp+CThe problem with this algorithm is the addition at (106). For correct operation, the array R has to be initialized to all zeros. Zeroing out such arrays will fill up our write buffers unnecessarily, I'd like to avoid it if possible.
At each iteration of iB, words R[iB..iB+lA-1] are read and words R[iB..iB+lA] are written to. The last write in the iteration is done at (108). There is a discussion about why this is correct in previous sections.
If I separate the very first iteration of iB, I can skip reading R[0..lA-1] because they are supposed to be zero. So, the equivalent of the statement at (106) becomes:
C:R[iA+iB]= L + Hp + Cfor this first iteration. This avoids reads from R and therefore initializes R[0..lA-1] correctly. The next statement which is the equivalent of (108) will initialize R[lA]. By the end of the first iteration R[0..lA] will be initialized correctly, which is all that is needed for the next iteration.
Therefore, the algorithm becomes:
100: function R = <A, lA> × <B, lB> 101: Hp= 0 102: C= 0 103: for(iA=0;iA!=lA;iA++) 104: H:L= A[iA] * B[0] 105: C:R[iA]= L + Hp + C 106: Hp= H 107: R[iA]= Hp+C 108: for(iB=0;iB!=lB;iB++) 109: Hp= 0 110: C= 0 111: for(iA=0;iA!=lA;iA++) 112: H:L= A[iA] * B[iB] 113: C:R[iA+iB]= R[iA+iB] + L + Hp + C 114: Hp= H 115: R[iA+iB]= Hp+CConverting this to our pseudo-assembly code is pretty straightforward since we can just copy from the original conversion. I'll do that in a minute.
In (103) thru (107), a single bit is sufficient for the carry. However, I can't use the carry flag in the flags register on x86 because the multiplication instruction messes up the carry flag. If I could use mulx, then this wouldn't be a problem.
In any case, here is the optimized version in pseudo-assembly.
100: function R = <A, lA> × <B, lB> 101: R += lA 102: A += lA 103: B += lB 104: lA = -lA 105: lB = -lB 106: lB-- 107: lB++ 108: if (lB==0) return ; 109: Hp= 0 110: C= 0 111: iA= lA 112: iA-- 113: goto first_next 114: first: 115: H:L= A[iA]*B[lB] 116: C:R[iA]= L + Hp + C 117: Hp= H 118: first_next: 119: iA++ 120: if (iA!=0) goto first 121: R[0]= Hp+C 122: R++ 123: goto outer_next 124: outer: 125: Hp= 0 126: C= 0 127: iA= lA 128: iA-- 129: goto inner_next 130: inner: 131: H:L= A[iA] * B[lB] 132: C:R[iA]= R[iA] + L + Hp + C 133: Hp= H 134: inner_next: 135: iA++ 136: if (iA!=0) goto inner 137: R[0]= Hp+C 138: R++ 139: outer_next: 140: lB++ 141: if (lB!=0) goto outerNotice that (121)..(123) is identical to (137)..(139). We can simply jump there from (121).
100: function R = <A, lA> × <B, lB> 101: R += lA 102: A += lA 103: B += lB 104: lA = -lA 105: lB = -lB 106: lB-- 107: lB++ 108: if (lB==0) return ; 109: Hp= 0 110: C= 0 111: iA= lA 112: iA-- 113: goto first_next 114: first: 115: H:L= A[iA]*B[lB] 116: C:R[iA]= L + Hp + C 117: Hp= H 118: first_next: 119: iA++ 120: if (iA!=0) goto first 121: goto outer_continue 122: outer: 123: Hp= 0 124: C= 0 125: iA= lA 126: iA-- 127: goto inner_next 128: inner: 129: H:L= A[iA] * B[lB] 130: C:R[iA]= R[iA] + L + Hp + C 131: Hp= H 132: inner_next: 133: iA++ 134: if (iA!=0) goto inner 135: outer_continue: 136: R[0]= Hp+C 137: R++ 138: outer_next: 139: lB++ 140: if (lB!=0) goto outerI tested this, but it has no effect whatsoever. Here is the comparison graph: