Conversion from YCbCr to RGB

YCbCr is a digital video format used by cameras. RGB is the format used for display devices. How do we convert from one to another? The answer is in [1]. That document specifies how to compute YCbCr from RGB, but we can convert back, can't we?

[1] gives the definitions for conversion from RGB to YCbCr as follows:

        E'_Y= 0.299*E'_R + 0.587*E'_G + 0.114*E'_B
  (E'_R-E'_Y)= 0.701*E'_R-0.587*E'_G-0.114*E'_B
  (E'_B-E'_Y)= -0.299*E'_R-0.587*E'_G+0.886*E'_B

Here, E'_R, E'_B and E'_G are gamma corrected RGB intensities within the range [0.0,1.0]. E'_Y is the intensity channel and it has the range [0.0,1.0]. The other two channels are color channels. These end up with ranges [-0.701,+0.701] and [-0.886,+0.886]. We normalize them to fit within [-0.5,0.5].

   E'_Y= 0.299*E'_R + 0.587*E'_G + 0.114*E'_B
  E'_Cr= (E'_R-E'_Y) / 1.402
  E'_Cb= (E'_B-E'_Y) / 1.772

 ..

   E'_Y= 0.299*E'_R + 0.587*E'_G + 0.114*E'_B
  E'_Cr= 0.701/1.402*E'_R-0.587/1.402*E'_G-0.114/1.402*E'_B
  E'_Cb= -0.299/1.772*E'_R-0.587/1.772*E'_G+0.886/1.772*E'_B

 ..

   E'_Y= 0.299*E'_R + 0.587*E'_G + 0.114*E'_B
  E'_Cr= 0.5*E'_R-0.41869*E'_G-0.08131*E'_B
  E'_Cb= -0.16874*E'_R-0.33127*E'_G+0.5*E'_B

This can be rewritten as a matrix operation.

   | E'_Y  |            | E'_R |
   | E'_Cr |  =    A x  | E'_G |
   | E'_Cb |            | E'_B |

where A is

   |  0.299     0.587      0.114    |
   |  0.5      -0.41869   -0.08131  |
   | -0.16874  -0.33127    0.5      |

If we multiply each side with the inverse of A from the left, we end up with:

        | E'_Y  |               | E'_R |
 A^-1  x | E'_Cr |  =  A^-1 x A x  | E'_G |
        | E'_Cb |               | E'_B |


   | E'_R |              | E'_Y  |
   | E'_G |  =    A^-1 x  | E'_Cr |
   | E'_B |              | E'_Cb |

The matrix Q = A^-1 is:

   | 1   1.402     0       |
   | 1  -0.71414  -0.34414 |
   | 1   0         1.772   |

So, when we get a F= (E'_Y,E'_Cr,E'_Cb) vector, we multiply it with Q and get a K= (E'_R,E'_G,E'_B) vector.

Quantized Values

R,G and B are quantized forms of E'_R, E'_G and E'_B. Similarly, Y, Cr and Cb are quantized forms of E'_Y, E'_Cr and E'_Cb. The ITU recommendation specifies the following ranges for these values:

Value	Min	Max	Levels
Y	16	235	220
Cr	16	240	225
Cb	16	240	225
R	16	235	220
G	16	235	220
B	16	235	220

When converting Y,Cb,Cr values to the corresponding intensities, we must make sure that they are within the above ranges. Also during output, we should clamp the values to the R,G,B ranges.

Since E'_Cr and E'_Cb have the range [-0.5,+0.5], the value Cr=128 is mapped to E'_Cr=0.0. The same applies to Cb=128 and E'_Cb=0.0. Here are the conversion equations after the input has been clamped to the proper ranges:

  E'_Y =   (Y-16)/219
  E'_Cr=   (Cr-128)/224
  E'_Cb=   (Cb-128)/224

        CONV.1

Implementation with Fixed Point Arithmetic

A 16 bit fixed point format should be enough for my purposes. I will have 3 integer bits and 13 fractional bits. The three integer bits give me a range of [-4,3], which will never overflow.

The number 1 will be represented as:

         iiiF FFFF FFFF FFFF
  1 =    0010 0000 0000 0000  =  0x2000  = Fp1

Using this value, I can easily convert any floating point number to a fixed point number:

#define Fp(X) ((int16_t)((X)*Fp1))

Now, we will deal with two kinds of numbers: fixed point numbers (F) and integer numbers (I). Here are the rules governing their use:

     F + I        ==> incorrect
     F + I * Fp1  ==> F
     F - I        ==> incorrect
     F - I * Fp1  ==> F
     F + F        ==> F
     F - F        ==> F
     F * I        ==> F
     F / I        ==> F
     F * F        ==> incorrect
     F * F / Fp1  ==> F
     F / F        ==> incorrect
     (F* Fp1)/F   ==> F
     F / Fp1      ==> I
     I * Fp1      ==> F

We could easily put CONV.1 in tables to save ourselves three divisions and six comparisons for each conversion.

  int16_t  Ytab[256];
  int16_t  Ctab[256];
  for(Y=0;Y<256;Y++)
  {
    if (Y<16) Yp= 16; else if (Y>235) Yp= 235; else Yp= Y;
    Ytab[Y]=  (Yp-16)*Fp1 / 219;
  }
  for(C=0;C<256;C++)
  {
    if (C<16) Cp= 16; else if (C>240) Cp= 240; else Cp= C;
    Ctab[C]= (Cp-128)*Fp1 / 224;
  }

Given these tables, now it's very easy to do the conversion:


#define CLAMP(X,L,H) ( (X)<(L)? (L) : ( (X)>(H) ? (H) : (X) ) )

void YCbCr_to_RGB(uint8_t Y, uint8_t Cb, uint8_t Cr,
                  uint8_t *R, uint8_t *G, uint8_t *B)
{
  int16_t fY, fCb, fCr;
  int32_t fR, fG, fB, iR, iG, iB;

  fY= Ytab[Y]; fCb= Ctab[Cb]; fCr= Ctab[Cr];
  
  fR= fY + Fp(1.402)   *fCr / Fp1;  
  fG= fY - Fp(0.71414) *fCr / Fp1 - Fp(0.34414)*fCb / Fp1; 
  fB= fY + Fp(1.772)   *fCb / Fp1;                

  iR= fR * 255 / Fp1;
  iG= fG * 255 / Fp1;
  iB= fB * 255 / Fp1;

  *R=  CLAMP(iR,RGBlo,RGBhi);
  *G=  CLAMP(iG,RGBlo,RGBhi);
  *B=  CLAMP(iB,RGBlo,RGBhi);
}

Here, (RGBlo,RGBhi) can be defined to either (0,255) or (16,235). The second pair will comply with the ITU recommendation. The first one may provide more colors.

This version works well, but it can be optimized. In the computation of (fY,fCb,fCr) we have 3 memory accesses. For (fR,fG,fB) we have 4 multiplications. If we use 4 tables instead of 1 for the color channels, we can reduce the overall usage to 5 memory accesses and 0 multiplications. I don't know whether this would be faster or not. Theoretically, the memory accesses would hit the cache a lot because the kind of images I'm processing have a lot of smooth areas (real life images from a camera). But this needs to be investigated of course. So, the new function would be:

void YCbCr_to_RGB(uint8_t Y, uint8_t Cb, uint8_t Cr,
                  uint8_t *R, uint8_t *G, uint8_t *B)
{
  int16_t fY, fCb, fCr;
  int32_t fR, fG, fB, iR, iG, iB;

  fY= Ytab[Y]; 
  
  fR= fY + CrTAB1[Cr]; 
  fG= fY - CrTAB2[Cr] - CbTAB1[Cb]; 
  fB= fY + CbTAB2[Cb]; 

  iR= fR * 255 / Fp1;
  iG= fG * 255 / Fp1;
  iB= fB * 255 / Fp1;

  *R=  CLAMP(iR,RGBlo,RGBhi);
  *G=  CLAMP(iG,RGBlo,RGBhi);
  *B=  CLAMP(iB,RGBlo,RGBhi);
}

The initialization function and the tables would look like:

  int16_t  Ytab[256], CrTAB1[256], CrTAB2[256],
                      CbTAB1[256], CbTAB2[256];
void init_tables()
{
  int Y, Yp, C, Cp;
  for(Y=0;Y<256;Y++)
  {
    if (Y<16) Yp= 16; else if (Y>235) Yp= 235; else Yp= Y;
    Ytab[Y]=  (Yp-16)*Fp1 / 219;
  }
  for(C=0;C<256;C++)
  {
    if (C<16) Cp= 16; else if (C>240) Cp= 240; else Cp= C;
    CrTAB1[C]= Fp(1.402   * (Cp-128)/224);
    CrTAB2[C]= Fp(0.71414 * (Cp-128)/224);
    CbTAB1[C]= Fp(0.34414 * (Cp-128)/224);
    CbTAB2[C]= Fp(1.772   * (Cp-128)/224);
  }

In the new function, access to CrTAB1[Cr] will bring in CrTAB1[Cr+1] or CrTAB1[Cr-1] into the cache. This could be useful for the next pixel, or it may not. Instead, I could interleave CrTAB1 and CrTAB2 so that when I access CrTAB1[Cr], CrTAB2[Cr] is already in the cache. I will later make some sort of a test function to see which performs the best. So the new code is:


int16_t  Ytab[256], CrTAB[512], CbTAB[512];

void init_tables()
{
  int Y, Yp, C, Cp;
  for(Y=0;Y<256;Y++)
  {
    if (Y<16) Yp= 16; else if (Y>235) Yp= 235; else Yp= Y;
    Ytab[Y]=  (Yp-16)*Fp1 / 219;
  }
  for(C=0;C<256;C++)
  {
    if (C<16) Cp= 16; else if (C>240) Cp= 240; else Cp= C;
    CrTAB[2*C+0]= Fp(1.402   * (Cp-128)/224);
    CrTAB[2*C+1]= Fp(0.71414 * (Cp-128)/224);
    CbTAB[2*C+0]= Fp(0.34414 * (Cp-128)/224);
    CbTAB[2*C+1]= Fp(1.772   * (Cp-128)/224);
  }
}

void YCbCr_to_RGB(uint8_t Y, uint8_t Cb, uint8_t Cr,
                  uint8_t *R, uint8_t *G, uint8_t *B)
{
  int16_t fY, fCb, fCr;
  int32_t fR, fG, fB, iR, iG, iB;

  fY= Ytab[Y]; 
  
  fR= fY + CrTAB[2*Cr+0]; 
  fG= fY - CrTAB[2*Cr+1] - CbTAB[2*Cb+0]; 
  fB= fY + CbTAB[2*Cb+1]; 

  iR= fR * 255 / Fp1;
  iG= fG * 255 / Fp1;
  iB= fB * 255 / Fp1;

  *R=  CLAMP(iR,RGBlo,RGBhi);
  *G=  CLAMP(iG,RGBlo,RGBhi);
  *B=  CLAMP(iB,RGBlo,RGBhi);
}

Here, it looks like we have more operations because array indices have become more complicated. However, an optimizing compiler will optimize those out since 2*Cr and 2*Cb are common expressions and the +1 operation can be done on the array address instead of the index. Whether my compiler does it needs to be investigated but I think it will.

What I'm interested in is whether the clamping operations are necessary in case I set (RGBlo,RGBhi) to (0,255). If not, speed should skyrocket. It turns out that the clamping is strong. I looped thru all possible values with these additional functions and got some discouraging results:

int clamped_lo, clamped_hi;
int clamp_min= RGBhi; int clamp_max= RGBlo;

int imin2(int a,int b) { return a<b ? a:b; }
int imax2(int a,int b) { return a>b ? a:b; }

int CLAMP(int X,int L,int H)
{
  if (X<L)
    { clamped_lo++; clamp_min=imin2(clamp_min,X); return L; }
  if (X>H)
    { clamped_hi++; clamp_max=imax2(clamp_max,X); return H; }
  return X;
}

int main (int argc, char **argv)
{
  int Y,Cb,Cr;
  uint8_t R,G,B;
  init_tables();

  for(Y=0;Y<256;Y++)
  for(Cb=0;Cb<256;Cb++) 
  for(Cr=0;Cr<256;Cr++)
    conv_YCbCr_to_RGB(Y,Cb,Cr, &R, &G, &B);

  printf("clamped: lo=%d hi=%d\n", clamped_lo, clamped_hi);
  printf("        min=%d max=%d\n", clamp_min, clamp_max);
  return 0;
}

clamped: lo=9514671 hi=9840912
        min=-225 max=480

What's worse is that even gcc-O3 generates jumps for the clamping statements. Here is an excerpt:

	xorl	%eax, %eax
	testl	%edi, %edi
	js	.L25
	cmpl	$255, %edi
	movl	$255, %eax
	cmovle	%edi, %eax
.L25:
# the next instruction is for the next component. ignore it.
	testl	%edx, %edx
	movb	%al, (%r8)

Maybe I can find another way to do the clamping for the range (0,255). If a 32 bit signed number is outside this range, it certainly has some bits set in the upper 24 bits. If the highest bit is set, then the number is negative and the result should be 0. Otherwise, the number is positive and the result should be 255.

Let's shift the number to clamped, say X, 24 times to the right:

   Z =  X >> 24;

If X is negative, Z will be 0xFFFFFFFu, otherwise it will be 0x0000000u where u is the highest nibble of X. In our case the values subject to clamping are already shifted to the right by 13 bits so u will be 0xF for negative X and 0 otherwise. So we have:

negative X, Z= 0xFFFF_FFFF, result should be 0
non-negative X, Z= 0, result should be either X or 0xFF

For non-negative X, the maximum value we got is 480. Therefore, only bit 8 can be set in the upper 24 bits if X is too large. I can propagate this bit to the lower 8 bits:

  D= (X<<23) >> 31;

For all negative values of X we're interested in, bit 8 is also set. These run from -225 to -1, which all fit in 9 bits. Therefore, bit 8 is always 1 for these values. Let's consider only the lower 8 bits of our values from now on. Now we have:

case 1: X < 0, Z= 0xFF, D= 0xFF, R= 0
case 2: 0≤ X ≤ 255, Z= 0, D=0, R= X
case 3: X > 255, Z= 0, D=0xFF, R= 0xFF

  case 1:  D ^ Z  = 0        D & Z = 0xFF   R = 0
  case 2:  D ^ Z  = 0        D & Z = 0      R = X
  case 3:  D ^ Z  = 0xFF     D & Z = 0      R = 0xFF

          R=  (D ^ Z)  |  (X & ~ (D&Z))
  case 1:        0     |  (X & ~ 0xFF )  =  (X & 0) = 0
  case 2:        0     |  (X & ~ 0    )  =  (X & 0xFF) = X
  case 3:       0xff   |  (X & ~ 0    )  =  0xFF

I'm really curious about this now.

#define CLAMPBYTE(T, X, D, Z) \
    Z = (X) >> 24; \
    D = ((X)<<23) >> 31; \
    T = (D^Z) | ((X)&~(D&Z))

void YCbCr_to_RGB(uint8_t Y, uint8_t Cb, uint8_t Cr,
                  uint8_t *R, uint8_t *G, uint8_t *B)
{
  int16_t fY;
  int32_t fR, fG, fB, iR, iG, iB;
  int32_t Dr, Dg, Db, Zr, Zg, Zb;

  fY= Ytab[Y];

  fR= fY + CrTAB[2*Cr+0];
  fG= fY - CrTAB[2*Cr+1] - CbTAB[2*Cb+0];
  fB= fY + CbTAB[2*Cb+1];

  iR= fR * 255 / Fp1;
  iG= fG * 255 / Fp1;
  iB= fB * 255 / Fp1;

  CLAMPBYTE(*R, iR, Dr, Zr);
  CLAMPBYTE(*G, iG, Dg, Zg);
  CLAMPBYTE(*B, iB, Db, Zb);
}

It does work. I don't see any overflowed colors or anything. Nice. Here, D* and Z* are declared once for each component even though it's not really necessary. I made it so that the compiler doesn't miss a chance to realize that computations of different components are independent of each other.

I noticed one thing though. gcc-O3 couldn't optimize out +1s in the array indices. Either it's not allowed or gcc simply didn't realize that it would work. In any case, here is the improved version:

void YCbCr_to_RGB(uint8_t Y, uint8_t Cb, uint8_t Cr,
                  uint8_t *R, uint8_t *G, uint8_t *B)
{
  int16_t fY;
  int32_t fR, fG, fB, iR, iG, iB;
  int32_t Dr, Dg, Db, Zr, Zg, Zb;

  fY= Ytab[Y];

  fR= fY + CrTAB[2*Cr+0];
  fG= fY - (CrTAB+1)[2*Cr] - CbTAB[2*Cb];
  fB= fY + (CbTAB+1)[2*Cb];

  iR= fR * 255 / Fp1;
  iG= fG * 255 / Fp1;
  iB= fB * 255 / Fp1;

  CLAMPBYTE(*R, iR, Dr, Zr);
  CLAMPBYTE(*G, iG, Dg, Zg);
  CLAMPBYTE(*B, iB, Db, Zb);
}

gcc-O3 already optimizes the computations of (iR,iG,iB). These should have been written as:

  iR=  ((fR<<8)-fR) >> 13;
  ..

However, gcc already does exactly the same thing.

Speed Tests

Here are some speed results. Each algorithm was tested with a captured image as well as random data (with the same seed). 1000 conversions were done and the average time for the conversions are shown below. Tested algorithms are as follows.

spd1: Original algorithm with only one color table.
spd2: Four color tables.
spd3: Four color tables interleaved into two.
spd4: Same as spd3, but with branchless clamping.
spd5: Same as spd3, but array accesses and iR,iG,iB computations are optimized.
spd6: Same as spd4, but array accesses and iR,iG,iB computations are optimized.

  image spd1  2800 usec
  image spd2  2608 usec
  image spd3  2454 usec
  image spd4  3233 usec
  image spd5  2018 usec
  image spd6  2723 usec
 random spd1  2829 usec
 random spd2  2606 usec
 random spd3  2492 usec
 random spd4  3264 usec
 random spd5  2039 usec
 random spd6  2761 usec

Surprisingly, branchless code executes slower than the branched one. I guess I have a pretty good branch predictor in my CPU. The elimination of multiplications had a good effect. The first optimization is 7% faster than the original. The second one is even better, 15% faster.

Another surprise comes from the fact that replacing

  iR = fR * 255 / Fp1;

with

  iR= ((fR<<8)-fR)>>13;

Made a huge difference. This is the difference between spd3 and spd5. We are talking about a 20% speed up. Normally, the compiler did figure out that these were equivalent and generated code similar to the one I wrote even for the first form. I don't understand how writing it explicitly helped things. Anyway, here is the winner: spd5.


int16_t  Ytab[256], CrTAB[512], CbTAB[512];

#define Fp1   0x2000
#define Fp(X) ((int16_t)((X)*Fp1))
#define RGBlo 0
#define RGBhi 255

void init_tables()
{
  int Y, Yp, C, Cp;
  for(Y=0;Y<256;Y++)
  {
    if (Y<16) Yp= 16; else if (Y>235) Yp= 235; else Yp= Y;
    Ytab[Y]=  (Yp-16)*Fp1 / 219;
  }
  for(C=0;C<256;C++)
  {
    if (C<16) Cp= 16; else if (C>240) Cp= 240; else Cp= C;
    CrTAB[2*C+0]= Fp(1.402   * (Cp-128)/224);
    CrTAB[2*C+1]= Fp(0.71414 * (Cp-128)/224);
    CbTAB[2*C+0]= Fp(0.34414 * (Cp-128)/224);
    CbTAB[2*C+1]= Fp(1.772   * (Cp-128)/224);
  }
}

#define CLAMP(X,L,H) ( (X)<(L)? (L) : ( (X)>(H) ? (H) : (X) ) )

void conv_YCbCr_to_RGB(uint8_t Y, uint8_t Cb, uint8_t Cr,
                  uint8_t *R, uint8_t *G, uint8_t *B)
{
  int16_t fY;
  int32_t fR, fG, fB, iR, iG, iB;

  fY= Ytab[Y]; 
  
  fR= fY + CrTAB[2*Cr]; 
  fG= fY - (CrTAB+1)[2*Cr] - CbTAB[2*Cb]; 
  fB= fY + (CbTAB+1)[2*Cb]; 

  iR = ((fR<<8)-fR)>>13;
  iG = ((fG<<8)-fG)>>13;
  iB = ((fB<<8)-fB)>>13;

  *R=  CLAMP(iR,RGBlo,RGBhi);
  *G=  CLAMP(iG,RGBlo,RGBhi);
  *B=  CLAMP(iB,RGBlo,RGBhi);
}

Further Investigation So Far Fruitless

I did another test (spd7), where I computed all the tables prior to compilation and stored them as static const int16_t[]. This was a modification of the winner spd5. The result was no different at all. The compiler isn't able to make use of that information.

Another one was spd8. This was a modification of spd6. It turns out that my intuitive expression for clamping was grossly inefficient.

  I had used:
    (D^Z) | ( X & ~(D&Z) )
  A much better one is:
    ~Z & (X|D)

However, the jumping code still outperforms this. Amazing. I guess all the shifting eats up so many cycles, it's just better run a compact jumping code. Here are the latest results:

  image spd1  2814 usec
  image spd2  2615 usec
  image spd3  2454 usec
  image spd4  3233 usec
  image spd5  2019 usec
  image spd6  2727 usec
  image spd7  2022 usec
  image spd8  2327 usec
 random spd1  2828 usec
 random spd2  2611 usec
 random spd3  2495 usec
 random spd4  3268 usec
 random spd5  2042 usec
 random spd6  2765 usec
 random spd7  2014 usec
 random spd8  2365 usec

Anyway, the whole excursion took me from 2800 usecs to 2000 usecs for a 640x480 image. That's a 40% increase in speed. Not that bad.

A New Champion

I really wanted to make the clamping faster so I tried to eliminate the jumps from clamping code by utilizing conditional moves. I knew that gcc-O3 uses CMOV instructions for the conditional operator. Thus, a new champion spd9 was born from the previous one spd5.

int16_t  Ytab[256], CrTAB[512], CbTAB[512];

#define Fp1   0x2000
#define Fp(X) ((int16_t)((X)*Fp1))

void init_tables()
{
  int Y, Yp, C, Cp;
  for(Y=0;Y<256;Y++)
  {
    if (Y<16) Yp= 16; else if (Y>235) Yp= 235; else Yp= Y;
    Ytab[Y]=  (Yp-16)*Fp1 / 219;
  }
  for(C=0;C<256;C++)
  {
    if (C<16) Cp= 16; else if (C>240) Cp= 240; else Cp= C;
    CrTAB[2*C+0]= Fp(1.402   * (Cp-128)/224);
    CrTAB[2*C+1]= Fp(0.71414 * (Cp-128)/224);
    CbTAB[2*C+0]= Fp(0.34414 * (Cp-128)/224);
    CbTAB[2*C+1]= Fp(1.772   * (Cp-128)/224);
  }
}

void conv_YCbCr_to_RGB(uint8_t Y, uint8_t Cb, uint8_t Cr,
                  uint8_t *R, uint8_t *G, uint8_t *B)
{
  int16_t fY;
  int32_t fR, fG, fB, iR, iG, iB;

  fY= Ytab[Y]; 
  
  fR= fY + CrTAB[2*Cr]; 
  fG= fY - (CrTAB+1)[2*Cr] - CbTAB[2*Cb]; 
  fB= fY + (CbTAB+1)[2*Cb]; 

  iR = ((fR<<8)-fR)>>13;
  iG = ((fG<<8)-fG)>>13;
  iB = ((fB<<8)-fB)>>13;

  iR = iR < 0 ? 0 : iR;
  iG = iG < 0 ? 0 : iG;
  iB = iB < 0 ? 0 : iB;

  iR = iR > 255 ? 255 : iR;
  iG = iG > 255 ? 255 : iG;
  iB = iB > 255 ? 255 : iB;

  *R= iR; *G= iG; *B= iB; 
}

Here are the results:

       image     random 
spd1   2808      2826
spd2   2606      2610
spd3   2454      2491
spd4   3233      3257
spd5   2021      2037
spd6   2739      2757
spd7   2022      2013
spd8   2521      2544
spd9   1820      1830

This is 53% faster compared to the original algorithm. All the jumps are gone now. Following is the whole clamping and output code. iR,iG and iB are in %esi, %edx and %eax. I can't believe how clever is gcc-O3 (after I've told it what exactly to do, that is).

	xorl	%edi, %edi
#this instruction is leftover from computation of iB
	sarl	$13, %eax
	testl	%esi, %esi
	cmovs	%edi, %esi
	testl	%edx, %edx
	cmovs	%edi, %edx
	testl	%eax, %eax
	cmovs	%edi, %eax
	movb	$-1, %dil
	cmpl	$255, %esi
	cmovg	%edi, %esi
	cmpl	$255, %edx
	cmovg	%edi, %edx
	cmpl	$255, %eax
	movb	%sil, (%rcx)
	cmovle	%eax, %edi
	movb	%dl, (%r8)
	movb	%dil, (%r9)

I'll stop now. I think this is enough optimization. Here is the source code I used for testing these things. You, dear internet surfer, don't need it. There is nothing special in there. You can simply copy the above code for spd9 and be done with it. On the other hand, I might use that package if I need to revise something.

References

[1]: International Telecommunication Union Recommendation ITU-R BT.601-7