I'm using the code below to draw 24-bit BMPs (1024x768 or higher res) for a slideshow-type app. Each BMP is over 2 MB and the drawing is kinda slow. Is there any way I can speed it up? Both src and dest are aligned on a 16-byte boundary.

mov  ecx, X*Y
mov  edi,
mov  esi,

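drawbmp:
    zero eax, ebx, edx

    mov  al, [esi]       ; first byte of the source pixel
    mov  bl, [esi+1]     ; second byte
    mov  dl, [esi+2]     ; third byte

    shl  bx, 8
    shl  edx, 16

    or  edx, ebx
    or  edx, eax

    mov  [edi], edx      ; store one 32-bit destination pixel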

    add  edi, 4
    add  esi, 3
    sub  ecx, 1
    jnz  short drawbmp

Posted on 2009-10-07 23:05:34 by lone_samurai5
Start by moving your data 4 bytes at a time. Forget that it's 24-bit data: it is tightly packed and we are simply block-copying it, so we don't need to access pixels individually and can ignore the pixel boundaries. I've taken the liberty of eliminating some unnecessary opcodes by using an indexed addressing mode for the esi/edi accesses.

mov  ecx, X*Y
mov  edi,
mov  esi,

drawbmp:
    mov  eax, dword ptr [esi+ecx*4]
    mov  dword ptr [edi+ecx*4], eax
    dec ecx
    jnc  short drawbmp    ;jnz would miss pixel #0

That should be slightly more than 4 times faster.

Posted on 2009-10-08 00:18:23 by Homer
That doesn't draw the image properly. It's b,g,r,b,g,r etc. in memory, so I can't copy 4 bytes at a time.
Posted on 2009-10-08 00:35:23 by lone_samurai5
Oh, that's bad :D

You're copying it from a BMP image?
We need to swap the endianness of the dwords:

mov  ecx, X*Y
mov  edi,
mov  esi,

drawbmp:
   mov  eax, dword ptr [esi]
   shl eax,8            ;eliminate unwanted high order byte
   bswap eax          ;swap endian-ness
   mov  dword ptr [edi], eax
   dec ecx
   add esi,3
   add edi,3
   jnc  short drawbmp

Note that this code will write one extra zero byte at the end of the destination data which was not present in the source data - not a big deal.
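For reference, here is a minimal C sketch of what the shl/bswap pair does to a single dword, next to a byte-wise loop that produces the same per-pixel result without any 4-byte load or store. The names bswap32, swap_low3 and swap_rb_packed24 are made up for illustration, and the remarks about memory order assume a little-endian machine such as x86. Note that the 4-byte access on the last pixel not only writes one extra byte, it also reads one byte past the end of the source; the byte-wise loop avoids both.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the x86 bswap instruction. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Same arithmetic as "shl eax,8 / bswap eax": the top byte is discarded,
   the low three bytes are reversed, and the top byte of the result is 0. */
static uint32_t swap_low3(uint32_t v)
{
    return bswap32(v << 8);
}

/* Hypothetical helper: reverse the byte order inside each packed 3-byte
   pixel, touching only bytes that belong to the two buffers. */
static void swap_rb_packed24(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; ++i) {
        dst[3 * i + 0] = src[3 * i + 2];
        dst[3 * i + 1] = src[3 * i + 1];
        dst[3 * i + 2] = src[3 * i + 0];
    }
}
```

The byte-wise version will be slower per pixel than the dword trick, so it is probably best seen as a correctness reference rather than a drop-in replacement.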

You could use the first method if you preloaded all the images and reformatted the data to suit simple block image transfers ... then the first version would work too.
Posted on 2009-10-08 02:06:06 by Homer
All 24-bit bitmaps are padded to a 32-bit boundary on each scan line, so X*Y does not give the count you intended whenever a row carries padding. I'm also not sure why you're swapping bytes around. What happened to using memcpy or even BitBlt?
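To make the padding point concrete, here is a rough C sketch of a row-by-row conversion that steps through the source by its row stride instead of treating the image as X*Y packed pixels. The function name expand24to32 and its parameters are assumptions for illustration; the caller is expected to pass the real stride of the source rows, and the destination is assumed to be tightly packed 32-bit pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand a 24-bit image whose rows start src_stride
   bytes apart into tightly packed 32-bit pixels with the value
   0x00RRGGBB (source bytes are assumed to be in B, G, R order).
   src_stride must be at least width * 3. */
static void expand24to32(const uint8_t *src, size_t src_stride,
                         uint32_t *dst, size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        /* Jumping by the stride skips any padding at the end of a row. */
        const uint8_t *row = src + y * src_stride;
        for (size_t x = 0; x < width; ++x) {
            uint32_t b = row[3 * x + 0];
            uint32_t g = row[3 * x + 1];
            uint32_t r = row[3 * x + 2];
            *dst++ = b | (g << 8) | (r << 16);
        }
    }
}
```

When the stride happens to equal width * 3, this reduces to the plain X*Y loop from the first post.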

Posted on 2009-10-08 11:51:42 by Sparafusile
    dec ecx
    jnc  short drawbmp    ;jnz would miss pixel #0

For those who may not yet be aware of it, the "dec" opcode does NOT modify the carry flag. Because nothing else in the subject loop would modify the carry flag, the exit from the loop would depend on the condition of the carry flag on entrance, i.e. exit after first pass if set or endless loop if clear.

In this current context, the proper instruction would be:
  jns short drawbmp ;which would cover pixel #0

However, with ecx initialized with X*Y, the first pass of the loop would process data beyond the last pixel. It should be initialized to X*Y-1.
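The same counting logic, written out as a small C sketch: the index starts at count-1 and the loop keeps going while the index is still non-negative, which is the condition jns tests after the decrement. The function name copy_dwords_down is made up for illustration. One difference from the assembly loop is that C checks the condition before the first pass, so a count of zero copies nothing.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy count dwords, walking the index from count-1 down to 0.
   The test "i >= 0" plays the role of jns: it fails only once the
   decrement has taken the index below zero, so index 0 is included. */
static void copy_dwords_down(const uint32_t *src, uint32_t *dst, int32_t count)
{
    for (int32_t i = count - 1; i >= 0; --i) {
        dst[i] = src[i];
    }
}
```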
Posted on 2009-10-09 21:33:36 by Raymond
Thanks Ray, I guess I wasn't thinking.
That's a sign of early onset senility :|
Posted on 2009-10-11 03:54:10 by Homer
thanks for the help guys.. i pretty much just switched to using bmp that are padded to 32 bit and block copy it now..
Posted on 2009-10-12 22:02:22 by lone_samurai5
If I recall correctly, BMP images are ALWAYS padded to 32 bits per horizontal scan line. The padded row length is known as the image "stride", and it is equal to or slightly larger than the image width * the bytes per pixel. Intuitively, this means that the pixel data of the BMP image, as a whole, is also a multiple of 32 bits in size, irrespective of the pixel format.
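A small C sketch of the usual stride calculation, with the function name bmp_stride chosen just for illustration. For a 24-bit image that is 1024 pixels wide, a row is 1024 * 3 = 3072 bytes, which is already a multiple of 4, so no padding bytes are added; a 1023-pixel-wide image needs 3069 bytes per row and is padded up to 3072. That may be why the padding is not visible in files at the resolutions mentioned in this thread.

```c
#include <stdint.h>

/* Bytes per scan line in a BMP: the row's bit count rounded up to a
   whole number of 32-bit words, expressed in bytes. */
static uint32_t bmp_stride(uint32_t width, uint32_t bits_per_pixel)
{
    return ((width * bits_per_pixel + 31u) / 32u) * 4u;
}
```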

Posted on 2009-10-13 00:26:50 by Homer
hmm photoshop doesn't seem to do that when i save it as a 24 bit bmp.. 
Posted on 2009-10-13 01:01:17 by lone_samurai5