Hello !

I want to combine some asm with C to save a few milliseconds, and for the moment I have this code:

char buf[64]; // char is needed, so don't tell me to use unsigned long ;)


((unsigned long *)buf)[0] = 0;
((unsigned long *)buf)[1] = 0;
((unsigned long *)buf)[2] = 0;
((unsigned long *)buf)[3] = 0;
((unsigned long *)buf)[4] = 0;
((unsigned long *)buf)[5] = 0;
((unsigned long *)buf)[6] = 0;
((unsigned long *)buf)[7] = 0;
((unsigned long *)buf)[8] = 0;
((unsigned long *)buf)[9] = 0;
((unsigned long *)buf)[10] = 0;
((unsigned long *)buf)[11] = 0;
((unsigned long *)buf)[12] = 0;
((unsigned long *)buf)[13] = 0;
((unsigned long *)buf)[14] = 0;
((unsigned long *)buf)[15] = 0;
Ugh! It's awful, but it's much faster than a for loop!
The generated asm code is as follows:
xor edx,edx

mov [ebp-xxx], edx
and so on, 16 times


I think I could do this in asm in just a few lines, and maybe faster, using REPxx, but asm is not my native language ;)

Can you help me, please?
Thanks in advance :)
Posted on 2002-09-01 10:59:25 by Cooling
I think this should work, but I haven't tested it.



_asm{
push eax
lea edi, [ebp-xxx]
xor eax,eax
mov ecx, 16
rep STOSD
pop eax
}

I think you can replace "[ebp-xxx]" with the variable name; take a look at your compiler's manual.
Posted on 2002-09-01 11:15:41 by scientica
I'll try that and report back :)

Thks :)
Posted on 2002-09-01 11:16:45 by Cooling
This should be more readable (same as scientica's). :)
    _asm

{
xor eax, eax
mov ecx, 16
lea edi, buf
rep stosd
}
If you don't care about compatibility with older processors:
    _asm

{
pxor MM0, MM0
mov edx, 64
lea eax, buf
}

zero:

_asm
{
sub edx, 8
movntq [eax+edx], MM0
test edx, edx
jnz zero
emms
}
I have never tested or timed this one, so I cannot say whether the MMX/SSE version is much faster. But it does zero the buffer 8 bytes per iteration, which is probably useful for larger buffers. :grin:
Posted on 2002-09-01 11:37:07 by stryker
I tried your method, scientica (and stryker's more readable version), but it takes 3x longer :(

>stryker: I tried your MMX code but it won't compile. It's no big deal.

I wrote this code; it saves 20% of the execution time compared to my C code:
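        _asm{

lea edi,bloc
mov dword ptr [edi],0
mov dword ptr [edi+4],0
mov dword ptr [edi+8],0
mov dword ptr [edi+12],0
mov dword ptr [edi+16],0
mov dword ptr [edi+20],0
mov dword ptr [edi+24],0
mov dword ptr [edi+28],0
mov dword ptr [edi+32],0
mov dword ptr [edi+36],0
mov dword ptr [edi+40],0
mov dword ptr [edi+44],0
mov dword ptr [edi+48],0
mov dword ptr [edi+52],0
mov dword ptr [edi+56],0
mov dword ptr [edi+60],0
}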


Thank you anyway ;)
Posted on 2002-09-01 11:55:17 by Cooling
BTW, you need a P3 or higher, or another processor that supports MMX and SSE. scientica's code actually saves code size. :grin: :)
Posted on 2002-09-01 11:56:35 by stryker

BTW, you have to have P3 or Processor that supports MMX and SSE.
Yes, of course, I'm not that stupid ;)
Posted on 2002-09-01 11:57:44 by Cooling
#include <stdio.h>

#include <string.h>

int main(void)
{
char buf[65] = "1234567890123456789012345678901234567890123456789012345678901234"; /* 64 characters plus the terminating NUL */
buf[64] = 0;

printf("\n%s\n\n", buf);

_asm
{
xor eax, eax
mov ecx, 16
lea edi, buf
rep stosd
}

printf("\n%s\n\n", buf);

strcpy(buf, "1234567890123456789012345678901234567890123456789012345678901234");

_asm
{
pxor MM0, MM0
mov edx, 64
lea eax, buf
}

zero:

_asm
{
sub edx, 8
movntq [eax+edx], MM0
test edx, edx
jnz zero
emms
}

printf("\n%s\n\n", buf);

return 0;
}
Just tested on MSVC 6. :) Unrolled loops are better; that is why your version is faster. But I'm too lazy to test the speed now. :grin:

BTW there are no spaces in between " and 1 on the strcpy code above. It's the board doing magic on the code I posted. :grin:
Posted on 2002-09-01 11:58:57 by stryker
        _asm{

pxor mm0,mm0
movq bloc+0, mm0
movq bloc+8, mm0
movq bloc+16, mm0
movq bloc+24, mm0
movq bloc+32, mm0
movq bloc+40, mm0
movq bloc+48, mm0
movq bloc+56, mm0
}
The fastest code is the code that does not exist - try to find a way to write the program without zeroing the buffer.

stryker, he is going to use the buffer right after clearing it, so MOVNTQ shouldn't be used in this case. He would like the data in the cache.
Posted on 2002-09-01 12:07:56 by bitRAKE
stryker, he is going to use the buffer right after clearing it, so MOVNTQ shouldn't be used in this case. He would like the data in the cache.
Silly me :grin: I forgot about temporal versus non-temporal stores. :grin: Replace it with movq then. :)
Posted on 2002-09-01 12:19:09 by stryker
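For reference, the cached versus non-temporal distinction discussed above can be sketched in C with compiler intrinsics. This is only an illustration under assumptions: it uses the 16-byte SSE2 stores (`_mm_store_si128` and `_mm_stream_si128`) rather than the 8-byte MMX `movq`/`movntq` from the thread, the function names `zero64_cached` and `zero64_streaming` are made up here, and the buffer is assumed to be 64 bytes long and 16-byte aligned. On targets without SSE2 it simply falls back to `memset`.

```c
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: zero 64 bytes with ordinary (temporal) stores.
   The lines stay in the cache, which suits a buffer that is used right away.
   Assumes buf is 16-byte aligned. */
void zero64_cached(void *buf)
{
#if HAVE_SSE2
    __m128i z = _mm_setzero_si128();
    __m128i *p = (__m128i *)buf;
    _mm_store_si128(p + 0, z);
    _mm_store_si128(p + 1, z);
    _mm_store_si128(p + 2, z);
    _mm_store_si128(p + 3, z);
#else
    memset(buf, 0, 64);   /* portable fallback */
#endif
}

/* Hypothetical helper: zero 64 bytes with non-temporal (streaming) stores.
   These hint that the data should bypass the cache, which tends to help only
   for large buffers that are not read back soon. Assumes 16-byte alignment. */
void zero64_streaming(void *buf)
{
#if HAVE_SSE2
    __m128i z = _mm_setzero_si128();
    __m128i *p = (__m128i *)buf;
    _mm_stream_si128(p + 0, z);
    _mm_stream_si128(p + 1, z);
    _mm_stream_si128(p + 2, z);
    _mm_stream_si128(p + 3, z);
    _mm_sfence();         /* make the streaming stores globally visible */
#else
    memset(buf, 0, 64);   /* portable fallback */
#endif
}
```

Both functions leave the buffer in the same state; they differ only in the caching hint, so for a 64-byte buffer that is read immediately the cached variant is the one bitRAKE's advice points to.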
Cooling,

The way you have inlined the assignments is the fastest way you can do it; loop code will be slower because it has loop overhead. Your choice is between the fastest inline code and smaller loop code, and in most instances the size difference does not matter, whereas the speed difference may.

Regards,

hutch@movsd.com
Posted on 2002-09-02 03:37:22 by hutch--
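To make the loop versus inline trade-off concrete, here is a minimal C sketch of both forms. The names `store_zero32`, `zero_loop` and `zero_unrolled` are hypothetical, and the 4-byte store goes through `memcpy` (which compilers normally turn into a single `mov`) so that it does not depend on the alignment of the `char` buffer. Whether the unrolled form is actually faster depends on the compiler and its settings, since an optimizing compiler may unroll the loop on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 64   /* same size as the buffer in the thread */

/* Hypothetical helper: one 4-byte zero store at p. */
static void store_zero32(char *p)
{
    uint32_t zero = 0;
    memcpy(p, &zero, sizeof zero);
}

/* Loop form: small code, but each iteration also pays for the
   counter update and the branch. */
void zero_loop(char *buf)
{
    size_t i;
    for (i = 0; i < BUF_SIZE; i += 4)
        store_zero32(buf + i);
}

/* Unrolled form: 16 stores in a row, no loop overhead, larger code. */
void zero_unrolled(char *buf)
{
    store_zero32(buf + 0);  store_zero32(buf + 4);
    store_zero32(buf + 8);  store_zero32(buf + 12);
    store_zero32(buf + 16); store_zero32(buf + 20);
    store_zero32(buf + 24); store_zero32(buf + 28);
    store_zero32(buf + 32); store_zero32(buf + 36);
    store_zero32(buf + 40); store_zero32(buf + 44);
    store_zero32(buf + 48); store_zero32(buf + 52);
    store_zero32(buf + 56); store_zero32(buf + 60);
}
```

Calling either function on a `char buf[64]` leaves all 64 bytes zero; comparing the generated assembly of the two on your own compiler is the most reliable way to see which one it handles better.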
Yes, I verified that! :)

I went back to my first solution (in C), which produces clean, optimal code with gcc, unlike the Borland C++ Builder compiler.
Posted on 2002-09-02 03:49:59 by Cooling
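A note on the pure C route: the cast in the first post assumes that `buf` is suitably aligned for `unsigned long` and that `unsigned long` is 4 bytes wide, which holds on 32-bit x86 but not everywhere. The sketch below shows one way to keep the `char` view while sidestepping both assumptions; the union type `block64` and the function `clear_block` are hypothetical names introduced only for illustration. A `memset` with a small constant size is commonly expanded inline by optimizing compilers, though that is worth confirming in the generated assembly.

```c
#include <string.h>

/* Hypothetical wrapper type: the char array is what the program uses,
   while the unsigned long member only forces suitable alignment. */
typedef union {
    char          bytes[64];
    unsigned long words[64 / sizeof(unsigned long)];
} block64;

/* Zero the whole block. The size is a compile-time constant, so the
   compiler is free to replace the call with a few plain stores. */
void clear_block(block64 *b)
{
    memset(b->bytes, 0, sizeof b->bytes);
}
```

Usage would look like `block64 blk; clear_block(&blk);`, after which `blk.bytes` can be passed anywhere a `char *` of 64 bytes is expected.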