c++ - Modifying data from uint8_t array very slow?


I'm trying to optimize my code to run a bit faster. It is currently taking 30+ ms to update 3776000 bytes. If I remove the outpx updates inside the function, it runs in about 3 ms, meaning that the updates to outpx are what make the function slow.

Any feedback on how to improve the speed of the function below would be appreciated.

uint8_t* outpx = (uint8_t*)out.data;
for (int px=0; px<pxsize; px+=4)
{
    newtopalpha = (alpha*inpx[px+3]);

    if (0xff == newtopalpha)
    {
        // top is opaque, covers entire bottom

        // copy over BGR colors
        outpx[px] = inpx[px];
        outpx[px+1] = inpx[px+1];
        outpx[px+2] = inpx[px+2];
        outpx[px+3] = 0xff; // fully opaque
    }
    else if (0x00 != newtopalpha)
    {
        // top is not transparent
        topalpha = newtopalpha/(float)0xff;
        bottomalpha = outpx[px+3]/(float)0xff;
        newalpha = topalpha + bottomalpha*(1-topalpha);
        alphachange = bottomalpha*(1-topalpha);

        outpx[px] = (uint8_t)((inpx[px]*topalpha + outpx[px]*alphachange)/newalpha);
        outpx[px+1] = (uint8_t)((inpx[px+1]*topalpha + outpx[px+1]*alphachange)/newalpha);
        outpx[px+2] = (uint8_t)((inpx[px+2]*topalpha + outpx[px+2]*alphachange)/newalpha);
        outpx[px+3] = (uint8_t)(newalpha*0xff);
    }
}

OK, if this really is your bottleneck, and you can't use the GPU or built-in methods for some random reason, there is a lot you can do:

uint8_t *outpx = (uint8_t*) out.data;
const int calpha = (int) (alpha * 256.0f + 0.5f);
for( int px = 0; px < pxsize; px += 4 ) {
    // note | not +: valid because px is always a multiple of 4; tiny speed boost
    const int topalpha = (calpha * (int) inpx[px|3]) >> 8;

    if( topalpha == 255 ) {
        // needs <cstring>; might be slower than per-component copying; benchmark!
        memcpy( &outpx[px], &inpx[px], 4 );
    } else if( topalpha ) {
        const int bottomalpha = (int) outpx[px|3];
        const int alphachange = (bottomalpha * (255 - topalpha)) / 255;
        const int newalpha    = topalpha + alphachange;

        outpx[px  ] = (uint8_t) ((inpx[px  ]*topalpha + outpx[px  ]*alphachange) / newalpha);
        outpx[px|1] = (uint8_t) ((inpx[px|1]*topalpha + outpx[px|1]*alphachange) / newalpha);
        outpx[px|2] = (uint8_t) ((inpx[px|2]*topalpha + outpx[px|2]*alphachange) / newalpha);
        outpx[px|3] = (uint8_t) newalpha;
    }
}

The main change is that there is no floating-point arithmetic any more (I might have missed a /255 or something, but you get the idea). I also removed repeated calculations and used bit operators where possible. Another optimisation would be to use fixed-precision arithmetic to change the 3 divides into a single divide and 3 multiply/bitshifts, but you'd have to benchmark to confirm that it actually helps. The memcpy might be faster. Again, you need to benchmark.
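As a rough sketch of that fixed-precision idea (not tested against your data): compute a 16.16 fixed-point reciprocal of newalpha once, then replace each per-channel divide with a multiply and a shift. The helper name blend_pixel_fixed is made up for illustration, and it assumes the caller has already handled the topalpha == 0 and topalpha == 255 cases as in the loop above. Because the reciprocal is rounded down, a channel can come out 1 lower than with the exact divide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: blends one BGRA pixel `in` over `out`, in place.
// Assumes topalpha is in 1..254; 0 and 255 are handled by the caller.
inline void blend_pixel_fixed(const uint8_t* in, uint8_t* out, int topalpha) {
    const int bottomalpha = out[3];
    const int alphachange = (bottomalpha * (255 - topalpha)) / 255;
    const int newalpha    = topalpha + alphachange;
    assert(newalpha > 0 && newalpha <= 255);

    // The single divide: 16.16 fixed-point reciprocal of newalpha (rounded down).
    const uint32_t inv = (1u << 16) / static_cast<uint32_t>(newalpha);

    for (int c = 0; c < 3; ++c) {
        // sum <= 255 * newalpha, so sum * inv <= 255 * 65536 and fits in 32 bits.
        const uint32_t sum =
            static_cast<uint32_t>(in[c] * topalpha + out[c] * alphachange);
        out[c] = static_cast<uint8_t>((sum * inv) >> 16); // multiply + shift, no divide
    }
    out[3] = static_cast<uint8_t>(newalpha);
}
```

Whether this beats three integer divides depends on the compiler and CPU, so measure both versions.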

Finally, if you know something about the images, you can give the compiler hints about the branching. For example, in GCC you can write if( __builtin_expect( topalpha == 255, 1 ) ) if you know that most of the image is solid colour and alpha is 1.0.
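To show the shape of such a hint in isolation, here is a minimal sketch. The LIKELY macro and the copy_opaque function are names invented for this example; the macro falls back to the plain condition on compilers that don't provide __builtin_expect. It only handles the opaque branch and counts the pixels that would need the full blend.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper macro: uses the GCC/Clang builtin when available,
// otherwise evaluates the plain condition.
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#  define LIKELY(x) (x)
#endif

// Copies fully opaque BGRA pixels from `in` to `out` and returns the number
// of non-opaque pixels, which are left untouched in `out`.
inline std::size_t copy_opaque(const uint8_t* in, uint8_t* out, std::size_t pxsize) {
    std::size_t skipped = 0;
    for (std::size_t px = 0; px < pxsize; px += 4) {
        if (LIKELY(in[px + 3] == 255)) {   // hint: most pixels are expected to be opaque
            out[px]     = in[px];
            out[px + 1] = in[px + 1];
            out[px + 2] = in[px + 2];
            out[px + 3] = 255;
        } else {
            ++skipped;                     // the full blend would go here
        }
    }
    return skipped;
}
```

The hint only changes how the compiler lays out the branches, not the result, so it is safe to try and then drop if the benchmark shows no gain.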


Update based on comments:

And for the love of sanity, never (never) benchmark with optimisations turned off.

