Downscaling Images

This is a solved problem, but the code is often in places hard to reach. I will make a single file solution for it, and then use it whenever necessary.

The problem I'm interested in is the case where an image is downscaled uniformly, using the same factor for both dimensions. I'm going to assume that both input and output pixels are square.

Application Scope

I'm implementing this functionality for the purpose of lost anti-aliasing information. My program renders an image at high resolution without anti-aliasing and then displays it on screen at full-screen resolution. The downscaling provides an effect similar to anti-aliasing when done accurately.

Rendering images at 3 to 4 times the size of the display device is usually sufficient to provide visually satisfactory output. Therefore, I will aim my calculations towards a maximum scale factor of 4.

Implementing the same algorithm for different maximum scaling factors is trivial. Only the table sizes and the number of integral bits in a fractional value changes by the maximum scale factor.

Starting Simple: One Dimensional Downscaling

I will use the same algorithm for both dimensions. Let's discover the one dimensional algorithm and then extend it to two.

The limunosity of an output pixel should be the weighted sum of input pixels it covers. The contribution of an input pixel (its weight) should be the ratio of its length covered by the output pixel to the length of the whole output pixel. For instance, if an output pixel P is 100 units long, and only 15 units of an input pixel K is covered by P, then contribution of K into P should be 15/100.

Let's make a picture to better understand this. Below is a downscaling situation where the length of an output pixel is 1.8 units and the length of an input pixel is 1.

We can easily calculate the values of output pixels as follows:
  out0= ( inp0 *   1 + inp1 * 0.8 ) / 1.8
  out1= ( inp1 * 0.2 + inp2 * 1  + inp3 * 0.6 ) / 1.8
  out2= ( inp3 * 0.4 + inp4 * 1  + inp5 * 0.4 ) / 1.8
  out3= ( inp5 * 0.6 + inp6 * 1  + inp7 * 0.2 ) / 1.8
At each step, if the weight used for the last input pixel was less than one, we use the rest of that last input pixel for the new output pixel. i.e. We used 0.8 of inp1 for out0. The remaining 0.2 of inp1 is added into out1.

This can be implemented pretty easily with an accumulator that gets reset at every pixel output.

  output_size= 1.8   // (in this case)
  total_weight=0
  accum=0
  output_index= 0
  for x = 0 .. input_width-1

    do_output= false

    if total_weight + 1 ≥ output_size
       pixel_weight = output_size - total_weight
       do_output= true
    else
       pixel_weight = 1
    endif

    accum += input[x] * pixel_weight
    
    if do_output
       output[output_index++]=  accum/output_size
       total_weight= 1-pixel_weight
       accum= total_weight*input[x]
    else
       total_weight += pixel_weight
    endif

  endfor
Here, we're doing one division per pixel. We can scale the constant 1 and output size by 1/output_size before the algorithm begins. This way, each weight will have already been divided by output size when being multiplied and accumulated.
  weight= 1/output_size // will be used instead of 1
  total_weight=0
  accum=0
  output_index= 0
  for x = 0 .. input_width-1

    do_output= false

    if total_weight + weight ≥ 1
       pixel_weight = 1 - total_weight
       do_output= true
    else
       pixel_weight = weight
    endif

    accum += input[x] * pixel_weight
    
    if do_output
       output[output_index++]= accum
       total_weight= weight-pixel_weight
       accum= total_weight*input[x]
    else
       total_weight += pixel_weight
    endif

  endfor
We can now make it prettier by putting almost everything inside the if statement.
  weight= 1/output_size // the scale factor
  total_weight=0
  accum=0
  output_index= 0
  for x = 0 .. input_width-1
    if total_weight + weight ≥ 1
       pixel_weight = 1 - total_weight
       output[output_index++]= accum + input[x] * pixel_weight
       total_weight= weight-pixel_weight
       accum= total_weight*input[x]
    else
       pixel_weight = weight
       accum +=input[x] * pixel_weight 
       total_weight += pixel_weight
    endif
  endfor
Between the green and red lines, we don't have any updates to total_weight or pixel_weight. Therefore, we can replace the red line with
  total_weight= weight - (1-total_weight)
      simplified:
  total_weight= total_weight + weight -1
We can also replace the pixel_weight variable with weight variable in the 'else' part of the 'if' statement.
       accum +=input[x] * pixel_weight 
       total_weight += pixel_weight
The variable pixel_weight has now only one use. Therefore, we can replace it with its value in the output statement
       output[output_index++]= accum + input[x] * (1-total_weight)
All said and done, here is the algorithm then
  weight= 1/scale_factor 
  total_weight=0
  accum=0
  output_index= 0
  for x = 0 .. input_width-1
    if total_weight + weight ≥ 1
       output[output_index++]= accum + input[x] * (1-total_weight)
       total_weight= total_weight+weight-1
       accum= total_weight*input[x]
    else
       accum +=input[x] * weight 
       total_weight += weight
    endif
  endfor

Choice of Numbers

The above algorithm can be implemented by either floating point or fixed point arithmetic. This choice will affect a couple of things in the final output. The variables that need to hold fractions are: A fixed point implementation offers various advantages regarding the accuracy of intermediate values. It also tends to be faster. Therefore, I will use fixed point numbers. All variables are limited to the range [0,1).

The choice of maximum scale factor and the number of bits used for representing each color in the linear color space determines how many bits are required for the accumulator. However, I can't do that math, it's too complicated for me. I will simply use 15 bits of fractional digits and 1 integral bit for everything.

Input Translation

The input colorspace is assumed to be sRGB. Values in this colorspace can not be used for interpolation directly. We need to convert them to linear values. There are only 256 distinct values for our case, which is 8-bit per channel.

I will use a lookup table for this, since the pow() function is quite slow.

  #define LINEAR_BITS  10
  static uint16_t srgb_s2l[256];
  static void init_s2l()
  {
    for(int i=0;i<256;i++)
      srgb_s2l[i]= (1<<LINEAR_BITS) * pow( i/255.0, 2.2 );
  }
If we had used 16 linear bits, our table entries wouldn't fit into an uint16_t because of the last entry. In that case, subtracting 1 from the last entry can be a solution.

Output Translation

The output is also sRGB, therefore we need another table to lookup sRGB values corresponding to linear values.

The more bits in the table we have, the more accurate our results will be. The required accuracy depends on the scaling factor. Given a scaling factor, increasing the accuracy beyond a certain limit doesn't benefit at all.

I don't know how to make this calculation though. I will simply use a 10-bit table and call it a day. Such a table is small enough to fit in a cache and hopefully big enough to provide visually acceptable output.

static unsigned char srgb_l2s[(1<<LINEAR_BITS)+1];
static void init_l2s()
{
  for(int i=0;i<=(1<<LINEAR_BITS);i++)
    srgb_l2s[i]= pow( (double)i / (1<<LINEAR_BITS), 1/2.2)*255;
}

Output Size

The final output size is determined by the number of times the output statement
       output[output_index++]= ..
is executed. This happens every time total_weight exceeds 1. At each step, we increment total_weight by weight. Therefore, the length of the output array is floor(weight*input_width).

When a given width is required, we can do the following:

  weight= ((double) output_width / input_width) * (1<<FRAC_BITS);
  
  while ((weight*input_width >> FRAC_BITS)<output_width)
    weight++;

Output Range

The value accumulated within the accumulator is supposed to be within the range [0,1]. All input values are within the range [0,1] and the sum of the weights applied to each is also 1. However, let's analyse if the sum can exceed 1 due to some errors regarding the number system.

The only place this can happen is during the output stage, within the statement:

       output[output_index++]= accum + input[x] * (1-total_weight)

                     let's simplify it as

       P= a*I1 + weight*(I2+I3+...IN-1) + (1-a-weight)*IN
This is an increasing function with regard to Ix. We will get the maximum value for P at Ix=1. If we replace all input values with 1, we get
   P= a+ weight + (1-a-weight)
When this expression is calculated using fixed point arithmetic, the result is exactly 1. Therefore, we don't need to test for inclusion in the range when doing lookup.

The same thing can't be said for floating point implementations. The expression (1-total_weight) overshoots the result when total_weight is close to, but not quite 1.

Extending to Two Dimensions

This is a trivial modification. Basicly, the input will be the intermediate row we computed above and the accumulator will be the preliminary output row. Height of the output image is computed similarly to the width.

Here is the final code for a single channel, with no padding around input and output rows.

#include <stdint.h>
#include <stdlib.h>
#include <math.h>


#define FRAC_BITS 15
#define COLOR_BITS 10
#define FIXED_ONE (1<<FRAC_BITS)

static uint16_t s2l[256];
static unsigned char l2s[(1<<COLOR_BITS)+1];
static void init_tables()
{
  static int init_ok;
  if (!init_ok)
  {
    for(int i=0;i<256;i++)
      s2l[i]= (1<<COLOR_BITS) * pow( i/255.0, 2.2 );
    for(int i=0;i<=(1<<COLOR_BITS);i++)
      l2s[i]= pow( (double)i / (1<<COLOR_BITS), 1/2.2)*255;
    init_ok= 1;
  }
} 

void resize_single_channel(uint16_t weight, int iW, int iH, 
                           uint8_t *input, uint8_t *output)
{
  int oW = weight*iW >> FRAC_BITS;

  // we need iW pixels for translated input
  //         oW pixels for the downscaled row
  //         oW pixels for the accumulator
  uint32_t  data_size= 2*oW*sizeof(uint32_t)+iW*sizeof(uint16_t);
  uint8_t  *data= malloc(data_size);
  uint32_t *const sS= (void*) data,    *const sE= sS + oW, *sp;
  uint16_t *const fS= (void*) sE,      *const fE= fS+iW, *fp; 
  uint16_t *const cS= (void*) fE,      *const cE= cS+oW, *cp;
  uint32_t  totaly, totalx;
  uint32_t  P, pix;

  init_tables();

  for(sp=sS;sp<sE;sp++) *sp= 0;
  for(totaly=0;iH;iH--)
  {
    for(fp=fS;fp<fE;fp++,input++) *fp= s2l[*input];
    for(fp=fS, totalx=pix=0,cp=cS;fp<fE && (P=*fp,1);fp++)
      if (totalx+weight>=FIXED_ONE)
      {
        *cp++= (pix + P * (FIXED_ONE-totalx)) >> FRAC_BITS; 
        totalx = totalx + weight - FIXED_ONE;
        pix= totalx * P;
      }
      else
      {
        pix += weight * P;
        totalx += weight;
      }

    if (totaly+weight>=FIXED_ONE)
    {
      uint32_t parwe= FIXED_ONE-totaly;
      totaly = totaly + weight - FIXED_ONE;
      for(sp=sS,cp=cS;sp<sE;sp++, cp++)
      {
        *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; 
        *sp= *cp * totaly;
      }
    }
    else
    {
      for(sp=sS,cp=cS;sp<sE;sp++, cp++) *sp += *cp * weight;
      totaly += weight;
    }
  }

  free(data);
}
The above code needs a little bit of explanation. We have three arrays: Pointers to the beginning and end of these are defined as *S and *E, respectively. Iterators are called *p.

totalx and totaly are the total_weight variables for each dimension.

Handling Multiple Color Planes

The above algorithms assumed a continuous single color image. However, my final result will be 3 or 4 color image with the 4th color being alpha.

For a production ready library, there are several image formats that need to be implemented:

Also, support for different pixel formats would be needed.

Implementing all this for the purpose of a resizing function seems overkill. It's better I leave the format conversions to the user and provide a couple of functions for the most common cases.

My main interest is to downscale a 24-bit image to 32-bit with dummy alpha. The alpha value is usually at the beginning or end of a pixel, at offset 0 or 3. The order of red, green and blue components don't really matter because they are all processed symmetrically. Therefore, I shall have 4 functions:

It would be possible to use the single channel function for each channel separately but it would thrash the cache. Doing all channels in parallel should be much faster. Here is the code so far for one the functions
// rgb -> rgba : fill alpha with 0xff
void resize_24_32a(uint16_t weight,
                   int iW, int iH, uint8_t *input, int input_bpl,
                   uint8_t *output, int output_bpl)
{
  int oW = weight*iW >> FRAC_BITS;

  uint32_t data_size=   oW*3*sizeof(uint32_t) + 
                      2*iW*3*sizeof(uint16_t);
  uint8_t  *data= malloc(data_size);
  uint32_t *const sS= (void*) data,    *const sE= sS + 3*oW, *sp;
  uint16_t *const fS= (void*) sE,      *const fE= fS + 3*iW, *fp; 
  uint16_t *const cS= (void*) fE,      *const cE= cS + 3*oW, *cp;
  uint32_t  totaly, totalx;
  uint32_t  P0, P1, P2, pix0, pix1, pix2;
  const uint32_t input_rowdelta= input_bpl - 3*iW;
  const uint32_t output_rowdelta= output_bpl - 4*oW;

  init_tables();

  for(sp=sS;sp<sE;sp++) *sp= 0;
  for(totaly=0;iH;iH--)
  {
    for(fp=fS;fp<fE;fp++,input++) *fp= s2l[*input];
    input += input_rowdelta;
    for(fp=fS, totalx=pix0=pix1=pix2=0,cp=cS;
        fp<fE && (P0=*fp++,P1=*fp++,P2=*fp++,1);)
      if (totalx+weight>=FIXED_ONE)
      {
        *cp++= (pix0 + P0 * (FIXED_ONE-totalx)) >> FRAC_BITS; 
        *cp++= (pix1 + P1 * (FIXED_ONE-totalx)) >> FRAC_BITS; 
        *cp++= (pix2 + P2 * (FIXED_ONE-totalx)) >> FRAC_BITS; 
        totalx = totalx + weight - FIXED_ONE;
        pix0= totalx * P0;
        pix1= totalx * P1;
        pix2= totalx * P2;
      }
      else
      {
        pix0 += weight * P0;
        pix1 += weight * P1;
        pix2 += weight * P2;
        totalx += weight;
      }

    if (totaly+weight>=FIXED_ONE)
    {
      uint32_t parwe= FIXED_ONE-totaly;
      totaly = totaly + weight - FIXED_ONE;
      for(sp=sS,cp=cS;sp<sE;)
      {
        *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; 
        *sp++= *cp++ * totaly;
        *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; 
        *sp++= *cp++ * totaly;
        *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; 
        *sp++= *cp++ * totaly;
        *output++= 0xff;
      }
      output += output_rowdelta;
    }
    else
    {
      for(sp=sS,cp=cS;sp<sE;sp++, cp++) *sp += *cp * weight;
      totaly += weight;
    }
  }

  free(data);
}
The above functions call malloc and free for the temporary rows they need. This can be avoided by allocating the space once and then reusing it by passing it inside a struct. The allocator functions could also determine the possible output widths, heights and weights.

Summing It Up

In any case, here is the code so far. It turned out to be pretty modular and easy to modify. If you need a function for a different format image, just look at the places where output is done and pix0..pix3 values are used. Don't forget to modify the allocation part :)