The problem I'm interested in is the case where an image is downscaled uniformly, using the same factor for both dimensions. I'm going to assume that both input and output pixels are square.

Rendering images at 3 to 4 times the size of the display device is usually sufficient to provide visually satisfactory output. Therefore, I will aim my calculations towards a maximum scale factor of 4.

Implementing the same algorithm for different maximum scaling factors is trivial. Only the table sizes and the number of integral bits in a fractional value changes by the maximum scale factor.

The limunosity of an output pixel should be the weighted sum of input pixels it covers. The contribution of an input pixel (its weight) should be the ratio of its length covered by the output pixel to the length of the whole output pixel. For instance, if an output pixel P is 100 units long, and only 15 units of an input pixel K is covered by P, then contribution of K into P should be 15/100.

Let's make a picture to better understand this. Below is a downscaling situation where the length of an output pixel is 1.8 units and the length of an input pixel is 1.

out0= ( inp0 * 1 + inp1 * 0.8 ) / 1.8 out1= ( inp1 * 0.2 + inp2 * 1 + inp3 * 0.6 ) / 1.8 out2= ( inp3 * 0.4 + inp4 * 1 + inp5 * 0.4 ) / 1.8 out3= ( inp5 * 0.6 + inp6 * 1 + inp7 * 0.2 ) / 1.8At each step, if the weight used for the last input pixel was less than one, we use the rest of that last input pixel for the new output pixel. i.e. We used 0.8 of inp1 for out0. The remaining 0.2 of inp1 is added into out1.

This can be implemented pretty easily with an accumulator that gets reset at every pixel output.

output_size= 1.8 // (in this case) total_weight=0 accum=0 output_index= 0 for x = 0 .. input_width-1 do_output= false if total_weight + 1 ≥ output_size pixel_weight = output_size - total_weight do_output= true else pixel_weight = 1 endif accum += input[x] * pixel_weight if do_output output[output_index++]= accum/output_size total_weight= 1-pixel_weight accum= total_weight*input[x] else total_weight += pixel_weight endif endforHere, we're doing one division per pixel. We can scale the constant 1 and output size by 1/output_size before the algorithm begins. This way, each weight will have already been divided by output size when being multiplied and accumulated.

weight= 1/output_size // will be used instead of 1 total_weight=0 accum=0 output_index= 0 for x = 0 .. input_width-1 do_output= false if total_weight + weight ≥ 1 pixel_weight = 1 - total_weight do_output= true else pixel_weight = weight endif accum += input[x] * pixel_weight if do_output output[output_index++]= accum total_weight= weight-pixel_weight accum= total_weight*input[x] else total_weight += pixel_weight endif endforWe can now make it prettier by putting almost everything inside the if statement.

weight= 1/output_size // the scale factor total_weight=0 accum=0 output_index= 0 for x = 0 .. input_width-1 if total_weight + weight ≥ 1 pixel_weight = 1 - total_weight output[output_index++]= accum + input[x] * pixel_weight total_weight= weight-pixel_weight accum= total_weight*input[x] else pixel_weight = weight accum +=input[x] * pixel_weight total_weight += pixel_weight endif endforBetween the green and red lines, we don't have any updates to total_weight or pixel_weight. Therefore, we can replace the red line with

total_weight= weight - (1-total_weight) simplified: total_weight= total_weight + weight -1We can also replace the pixel_weight variable with weight variable in the 'else' part of the 'if' statement.

accum +=input[x] * pixel_weight total_weight += pixel_weightThe variable pixel_weight has now only one use. Therefore, we can replace it with its value in the output statement

output[output_index++]= accum + input[x] * (1-total_weight)All said and done, here is the algorithm then

weight= 1/scale_factor total_weight=0 accum=0 output_index= 0 for x = 0 .. input_width-1 if total_weight + weight ≥ 1 output[output_index++]= accum + input[x] * (1-total_weight) total_weight= total_weight+weight-1 accum= total_weight*input[x] else accum +=input[x] * weight total_weight += weight endif endfor

- total_weight
- weight
- accumulator

The choice of maximum scale factor and the number of bits used for representing each color in the linear color space determines how many bits are required for the accumulator. However, I can't do that math, it's too complicated for me. I will simply use 15 bits of fractional digits and 1 integral bit for everything.

I will use a lookup table for this, since the pow() function is quite slow.

#define LINEAR_BITS 10 static uint16_t srgb_s2l[256]; static void init_s2l() { for(int i=0;i<256;i++) srgb_s2l[i]= (1<<LINEAR_BITS) * pow( i/255.0, 2.2 ); }If we had used 16 linear bits, our table entries wouldn't fit into an uint16_t because of the last entry. In that case, subtracting 1 from the last entry can be a solution.

The more bits in the table we have, the more accurate our results will be. The required accuracy depends on the scaling factor. Given a scaling factor, increasing the accuracy beyond a certain limit doesn't benefit at all.

I don't know how to make this calculation though. I will simply use a 10-bit table and call it a day. Such a table is small enough to fit in a cache and hopefully big enough to provide visually acceptable output.

static unsigned char srgb_l2s[(1<<LINEAR_BITS)+1]; static void init_l2s() { for(int i=0;i<=(1<<LINEAR_BITS);i++) srgb_l2s[i]= pow( (double)i / (1<<LINEAR_BITS), 1/2.2)*255; }

output[output_index++]= ..is executed. This happens every time total_weight exceeds 1. At each step, we increment total_weight by weight. Therefore, the length of the output array is floor(weight*input_width).

When a given width is required, we can do the following:

weight= ((double) output_width / input_width) * (1<<FRAC_BITS); while ((weight*input_width >> FRAC_BITS)<output_width) weight++;

The only place this can happen is during the output stage, within the statement:

output[output_index++]= accum + input[x] * (1-total_weight) let's simplify it as P= a*IThis is an increasing function with regard to I_{1}+ weight*(I_{2}+I_{3}+...I_{N-1}) + (1-a-weight)*I_{N}

P= a+ weight + (1-a-weight)When this expression is calculated using fixed point arithmetic, the result is exactly 1. Therefore, we don't need to test for inclusion in the range when doing lookup.

The same thing can't be said for floating point implementations. The expression (1-total_weight) overshoots the result when total_weight is close to, but not quite 1.

Here is the final code for a single channel, with no padding around input and output rows.

#include <stdint.h> #include <stdlib.h> #include <math.h> #define FRAC_BITS 15 #define COLOR_BITS 10 #define FIXED_ONE (1<<FRAC_BITS) static uint16_t s2l[256]; static unsigned char l2s[(1<<COLOR_BITS)+1]; static void init_tables() { static int init_ok; if (!init_ok) { for(int i=0;i<256;i++) s2l[i]= (1<<COLOR_BITS) * pow( i/255.0, 2.2 ); for(int i=0;i<=(1<<COLOR_BITS);i++) l2s[i]= pow( (double)i / (1<<COLOR_BITS), 1/2.2)*255; init_ok= 1; } } void resize_single_channel(uint16_t weight, int iW, int iH, uint8_t *input, uint8_t *output) { int oW = weight*iW >> FRAC_BITS; // we need iW pixels for translated input // oW pixels for the downscaled row // oW pixels for the accumulator uint32_t data_size= 2*oW*sizeof(uint32_t)+iW*sizeof(uint16_t); uint8_t *data= malloc(data_size); uint32_t *const sS= (void*) data, *const sE= sS + oW, *sp; uint16_t *const fS= (void*) sE, *const fE= fS+iW, *fp; uint16_t *const cS= (void*) fE, *const cE= cS+oW, *cp; uint32_t totaly, totalx; uint32_t P, pix; init_tables(); for(sp=sS;sp<sE;sp++) *sp= 0; for(totaly=0;iH;iH--) { for(fp=fS;fp<fE;fp++,input++) *fp= s2l[*input]; for(fp=fS, totalx=pix=0,cp=cS;fp<fE && (P=*fp,1);fp++) if (totalx+weight>=FIXED_ONE) { *cp++= (pix + P * (FIXED_ONE-totalx)) >> FRAC_BITS; totalx = totalx + weight - FIXED_ONE; pix= totalx * P; } else { pix += weight * P; totalx += weight; } if (totaly+weight>=FIXED_ONE) { uint32_t parwe= FIXED_ONE-totaly; totaly = totaly + weight - FIXED_ONE; for(sp=sS,cp=cS;sp<sE;sp++, cp++) { *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; *sp= *cp * totaly; } } else { for(sp=sS,cp=cS;sp<sE;sp++, cp++) *sp += *cp * weight; totaly += weight; } } free(data); }The above code needs a little bit of explanation. We have three arrays:

- fresh: an input row mapped thru the s2l array. i.e. linear color values for the current input row
- cooked: the result of downscaling the current input row, using the one dimensional algorithm.
- served: the row accumulator. it's used as the second dimension accumulator. it's an array because the input to the second dimension is a row (cooked).

totalx and totaly are the total_weight variables for each dimension.

For a production ready library, there are several image formats that need to be implemented:

- Grayscale only
- Grayscale + alpha
- Alpha only: like grayscale, but no lookups
- RGB
- RGB + alpha
- RGB + dummy alpha: the alpha value is never used or computed, but occupies space within a pixel.

Implementing all this for the purpose of a resizing function seems overkill. It's better I leave the format conversions to the user and provide a couple of functions for the most common cases.

My main interest is to downscale a 24-bit image to 32-bit with dummy alpha. The alpha value is usually at the beginning or end of a pixel, at offset 0 or 3. The order of red, green and blue components don't really matter because they are all processed symmetrically. Therefore, I shall have 4 functions:

- rgb -> argb : alpha channel skipped
- rgb -> rgba : alpha channel skipped
- rgb -> argb : alpha channel filled with dummy byte
- rgb -> rgba : alpha channel filled with dummy byte

// rgb -> rgba : fill alpha with 0xff void resize_24_32a(uint16_t weight, int iW, int iH, uint8_t *input, int input_bpl, uint8_t *output, int output_bpl) { int oW = weight*iW >> FRAC_BITS; uint32_t data_size= oW*3*sizeof(uint32_t) + 2*iW*3*sizeof(uint16_t); uint8_t *data= malloc(data_size); uint32_t *const sS= (void*) data, *const sE= sS + 3*oW, *sp; uint16_t *const fS= (void*) sE, *const fE= fS + 3*iW, *fp; uint16_t *const cS= (void*) fE, *const cE= cS + 3*oW, *cp; uint32_t totaly, totalx; uint32_t P0, P1, P2, pix0, pix1, pix2; const uint32_t input_rowdelta= input_bpl - 3*iW; const uint32_t output_rowdelta= output_bpl - 4*oW; init_tables(); for(sp=sS;sp<sE;sp++) *sp= 0; for(totaly=0;iH;iH--) { for(fp=fS;fp<fE;fp++,input++) *fp= s2l[*input]; input += input_rowdelta; for(fp=fS, totalx=pix0=pix1=pix2=0,cp=cS; fp<fE && (P0=*fp++,P1=*fp++,P2=*fp++,1);) if (totalx+weight>=FIXED_ONE) { *cp++= (pix0 + P0 * (FIXED_ONE-totalx)) >> FRAC_BITS; *cp++= (pix1 + P1 * (FIXED_ONE-totalx)) >> FRAC_BITS; *cp++= (pix2 + P2 * (FIXED_ONE-totalx)) >> FRAC_BITS; totalx = totalx + weight - FIXED_ONE; pix0= totalx * P0; pix1= totalx * P1; pix2= totalx * P2; } else { pix0 += weight * P0; pix1 += weight * P1; pix2 += weight * P2; totalx += weight; } if (totaly+weight>=FIXED_ONE) { uint32_t parwe= FIXED_ONE-totaly; totaly = totaly + weight - FIXED_ONE; for(sp=sS,cp=cS;sp<sE;) { *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; *sp++= *cp++ * totaly; *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; *sp++= *cp++ * totaly; *output++= l2s[ (*sp + *cp * parwe) >> FRAC_BITS ]; *sp++= *cp++ * totaly; *output++= 0xff; } output += output_rowdelta; } else { for(sp=sS,cp=cS;sp<sE;sp++, cp++) *sp += *cp * weight; totaly += weight; } } free(data); }The above functions call malloc and free for the temporary rows they need. This can be avoided by allocating the space once and then reusing it by passing it inside a struct. The allocator functions could also determine the possible output widths, heights and weights.