I'm trying to optimize my code to run a bit faster. Currently it is taking +30ms to update about 3776000 bytes. If I remove the outpx
updates inside my function, it runs at about 3ms, meaning that the updates to outpx
are what is making the function slower.
Any feedback on how to improve the speed of my function below would be appreciated.
    uint8_t* outpx = (uint8_t*)out.data;
    for (int px=0; px<pxsize; px+=4) {
        newtopalpha = (alpha*inpx[px+3]);
        if (0xff == newtopalpha) {
            // top is opaque and covers the entire bottom
            // copy the BGR colors
            outpx[px] = inpx[px];
            outpx[px+1] = inpx[px+1];
            outpx[px+2] = inpx[px+2];
            outpx[px+3] = 0xff; // fully opaque
        } else if (0x00 != newtopalpha) {
            // top is not completely transparent
            topalpha = newtopalpha/(float)0xff;
            bottomalpha = outpx[px+3]/(float)0xff;
            newalpha = topalpha + bottomalpha*(1-topalpha);
            alphachange = bottomalpha*(1-topalpha);
            outpx[px] = (uint8_t)((inpx[px]*topalpha + outpx[px]*alphachange)/newalpha);
            outpx[px+1] = (uint8_t)((inpx[px+1]*topalpha + outpx[px+1]*alphachange)/newalpha);
            outpx[px+2] = (uint8_t)((inpx[px+2]*topalpha + outpx[px+2]*alphachange)/newalpha);
            outpx[px+3] = (uint8_t)(newalpha*0xff);
        }
    }
OK, if this really is your bottleneck, and you can't use the GPU or built-in methods for some reason, there is still a lot you can do:
    uint8_t *outpx = (uint8_t*) out.data;
    const int calpha = (int) (alpha * 256.0f + 0.5f);
    for( int px = 0; px < pxsize; px += 4 ) {
        // note | rather than +, for a tiny speed boost (px is always a multiple of 4)
        const int topalpha = (calpha * (int) inpx[px|3]) >> 8;
        if( topalpha == 255 ) {
            // might be slower than per-component copying; benchmark!
            memcpy( &outpx[px], &inpx[px], 4 );
        } else if( topalpha ) {
            const int bottomalpha = (int) outpx[px|3];
            const int alphachange = (bottomalpha * (255 - topalpha)) / 255;
            const int newalpha = topalpha + alphachange;
            outpx[px  ] = (uint8_t) ((inpx[px  ]*topalpha + outpx[px  ]*alphachange) / newalpha);
            outpx[px|1] = (uint8_t) ((inpx[px|1]*topalpha + outpx[px|1]*alphachange) / newalpha);
            outpx[px|2] = (uint8_t) ((inpx[px|2]*topalpha + outpx[px|2]*alphachange) / newalpha);
            outpx[px|3] = (uint8_t) newalpha;
        }
    }
The main change is that there is no floating-point arithmetic any more (I might have missed a /255
or something, but you get the idea). I also removed repeated calculations and used bit operators where possible. Another optimisation would be to use fixed-precision arithmetic to change the 3 divides into a single divide and 3 multiply/bitshifts, but you'd have to benchmark to confirm that it actually helps. The memcpy
might be faster. Again, you need to benchmark.
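To make the "single divide and 3 multiply/bitshifts" idea concrete, here is a rough sketch of the partially transparent branch. The names `reciprocal_q24` and `blend_pixel_q24` are made up for illustration, and the 24-bit scale is my own choice: with `newalpha` in 1..255 and each channel sum no larger than `255 * newalpha`, the rounded-up reciprocal gives the same result as the integer division and the product still fits in 32 bits. Treat it as something to benchmark, not as a guaranteed win.

```c
#include <stdint.h>

/* Hypothetical helper: 2^24 / newalpha, rounded up, for newalpha in 1..255.
   This is the only divide by newalpha per pixel. */
static inline uint32_t reciprocal_q24(uint32_t newalpha)
{
    return ((1u << 24) + newalpha - 1u) / newalpha;
}

/* Hypothetical helper: blend one 4-byte pixel of `in` over `out`.
   Intended for the partially transparent case (topalpha in 1..254). */
static inline void blend_pixel_q24(uint8_t *out, const uint8_t *in, uint32_t topalpha)
{
    const uint32_t alphachange = ((uint32_t)out[3] * (255u - topalpha)) / 255u;
    const uint32_t newalpha = topalpha + alphachange;
    const uint32_t recip = reciprocal_q24(newalpha);

    for (int c = 0; c < 3; ++c) {
        /* sum <= 255 * newalpha, so sum * recip stays below 2^32 */
        const uint32_t sum = in[c] * topalpha + out[c] * alphachange;
        out[c] = (uint8_t)((sum * recip) >> 24);   /* multiply + shift instead of a divide */
    }
    out[3] = (uint8_t)newalpha;
}
```

The divide by the constant 255 is left alone because compilers normally turn that into a multiply and shift on their own when optimisations are enabled.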
Finally, if you know something about the images, you can give the compiler hints about the branching. For example, in GCC you can write
    if( __builtin_expect( topalpha == 255, 1 ) )
if you know that most of the image is solid colour, and alpha will be
1.0.
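As a sketch of how the hint could sit in the loop above, assuming GCC or Clang: `LIKELY` and `blend_bgra` are names I made up here, and the macro falls back to a plain condition on compilers without `__builtin_expect`. Whether the hint changes anything measurable depends on your images and compiler, so time it both ways.

```c
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__)
#  define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#  define LIKELY(x) (x)   /* no hint available; behaves like a plain condition */
#endif

/* Hypothetical wrapper around the integer blend loop; pxsize is in bytes. */
static void blend_bgra(uint8_t *outpx, const uint8_t *inpx, int pxsize, float alpha)
{
    const int calpha = (int)(alpha * 256.0f + 0.5f);
    for (int px = 0; px < pxsize; px += 4) {
        const int topalpha = (calpha * (int)inpx[px | 3]) >> 8;
        if (LIKELY(topalpha == 255)) {
            /* expected common case: opaque top pixel replaces the bottom one */
            memcpy(&outpx[px], &inpx[px], 4);
        } else if (topalpha) {
            const int bottomalpha = (int)outpx[px | 3];
            const int alphachange = (bottomalpha * (255 - topalpha)) / 255;
            const int newalpha = topalpha + alphachange;
            outpx[px    ] = (uint8_t)((inpx[px    ] * topalpha + outpx[px    ] * alphachange) / newalpha);
            outpx[px | 1] = (uint8_t)((inpx[px | 1] * topalpha + outpx[px | 1] * alphachange) / newalpha);
            outpx[px | 2] = (uint8_t)((inpx[px | 2] * topalpha + outpx[px | 2] * alphachange) / newalpha);
            outpx[px | 3] = (uint8_t)newalpha;
        }
    }
}
```

If most of your pixels are instead fully transparent or partially transparent, the hint should go on that branch, or be left out.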
Update based on the comments:
And for the love of sanity, never (never) benchmark with optimisations turned off.
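For the benchmarking itself, something as simple as the following can be enough. It is only a sketch: `blend_fn`, `time_blend_ms` and the stand-in `copy_pixels` are hypothetical names, and `clock()` measures CPU time at whatever resolution your platform gives it, so use plenty of runs.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature for the function being timed. */
typedef void (*blend_fn)(uint8_t *out, const uint8_t *in, int pxsize);

/* Stand-in workload so the harness compiles on its own;
   replace it with the real blend function. */
static void copy_pixels(uint8_t *out, const uint8_t *in, int pxsize)
{
    memcpy(out, in, (size_t)pxsize);
}

/* Average CPU time per call in milliseconds over `runs` calls. */
static double time_blend_ms(blend_fn fn, uint8_t *out, const uint8_t *in,
                            int pxsize, int runs)
{
    const clock_t start = clock();
    for (int i = 0; i < runs; ++i) {
        fn(out, in, pxsize);
    }
    const clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / runs;
}
```

Build it the way you build the release version, for example with `gcc -O2`, and compare the variants with the same flags and the same input image.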