Which is the best algorithm for memset?

mov edi, dest
mov ecx, count
mov al, c
rep stosb


mov ecx, count
mov al, c
mov ebx, dest
AA:
mov byte ptr [ebx], al
inc ebx
dec ecx
jnz AA

Which is the fastest? Is there a better algorithm than these two?

Posted on 2002-07-01 18:10:01 by DarkEmpire
A better approach is to align your data to a DWORD boundary and use:
mov edi, offset _data

mov ecx, sizeof _data / 4
mov eax,0c0c0c0ch
rep stosd
Or use an MMX fill and align the data to a QWORD boundary.
Posted on 2002-07-01 18:30:35 by bitRAKE
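For readers who want to see where a constant like 0c0c0c0ch comes from, the dword fill suggested above can be sketched in C. This is only an illustration, assuming the buffer is 4-byte aligned and its size is a multiple of 4; `splat_byte` and `fill_dwords` are hypothetical names that do not appear in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replicate one byte into all four bytes of a 32-bit word,
   e.g. 0x0C -> 0x0C0C0C0C (the kind of value loaded into eax). */
static uint32_t splat_byte(unsigned char c)
{
    return (uint32_t)c * 0x01010101u;
}

/* Store n_dwords copies of the pattern, one dword per iteration.
   This mirrors what "rep stosd" does with ecx = n_dwords.
   Assumption: dest is 4-byte aligned. */
static void fill_dwords(uint32_t *dest, unsigned char c, size_t n_dwords)
{
    uint32_t pattern = splat_byte(c);
    assert(((uintptr_t)dest & 3u) == 0);
    for (size_t i = 0; i < n_dwords; i++)
        dest[i] = pattern;
}
```

The multiplication by 0x01010101 is one common way to broadcast a byte; it works because each byte of the product is a copy of `c` with no carries between bytes.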
But how can I align my data to a DWORD or QWORD boundary?

For example: a 3-byte string.

Posted on 2002-07-02 00:33:04 by DarkEmpire
If the size is known at assemble time then:
and DWORD PTR [string], 0FF000000h

; put another instruction here
or DWORD PTR [string], 0xyxyxyh ; xy is the fill byte
If the size is only known at runtime then use either "rep stosb" or a combination of "stosb/stosd". Here are other threads regarding this subject:
Posted on 2002-07-02 07:09:54 by bitRAKE
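One way to read the "stosb/stosd" combination for a size known only at runtime is sketched below in C: byte stores until the pointer is 4-byte aligned, dword stores for the bulk, then byte stores for whatever is left. The name `fill_mixed` is hypothetical and the split into head, body, and tail is an assumption about what was meant, not code from the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill `count` bytes at `dest` with the byte `c`. */
static void *fill_mixed(void *dest, int c, size_t count)
{
    unsigned char *p = (unsigned char *)dest;
    unsigned char byte = (unsigned char)c;
    uint32_t pattern = (uint32_t)byte * 0x01010101u;

    /* Head: single bytes until p is 4-byte aligned (the "stosb" part). */
    while (count > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = byte;
        count--;
    }

    /* Body: one dword per iteration (the "stosd" part). memcpy of a
       fixed 4 bytes avoids aliasing problems and normally compiles
       to a single 32-bit store. */
    while (count >= 4) {
        memcpy(p, &pattern, 4);
        p += 4;
        count -= 4;
    }

    /* Tail: the remaining 0..3 bytes. */
    while (count > 0) {
        *p++ = byte;
        count--;
    }
    return dest;
}
```

Calling `fill_mixed(buf + 1, 0x5A, 13)` on a byte buffer exercises all three phases: a misaligned start, a few whole dwords, and a short tail.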
I tried the suggested solutions. Here are my conclusions:

I measured performance with cpuid / rdtsc. The solutions using stosb, or a combination of stosd and stosb, are at least 4 times slower than the memset in libc.lib.

Have I done something wrong? I thought the library memset would be less optimized than a hand-written one, but it is not.

Is there a trick?
Posted on 2002-07-08 05:17:30 by DarkEmpire
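A comparison like the one described above is sensitive to how the timing loop is set up. The sketch below is a portable C harness that uses `clock()` from the standard library instead of cpuid / rdtsc, so it measures CPU time rather than cycles; `time_fill` and `fill_fn` are hypothetical names, and the warm-up call and repeat count are assumptions about a reasonable setup, not the poster's method.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any routine with memset's signature can be timed. */
typedef void *(*fill_fn)(void *, int, size_t);

/* Run `fn` over `buf` for `rounds` iterations and return the elapsed
   CPU time in seconds. */
static double time_fill(fill_fn fn, void *buf, size_t size, int rounds)
{
    /* Calling through a volatile pointer keeps the compiler from
       inlining the routine and discarding repeated stores. */
    fill_fn volatile call = fn;

    call(buf, 0, size);                 /* warm-up: touch the buffer once */

    clock_t start = clock();
    for (int i = 0; i < rounds; i++)
        call(buf, i & 0xFF, size);      /* vary the fill byte per round */
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For example, `time_fill(memset, buf, size, 100)` and the same call with a hand-written routine give two numbers measured under identical conditions. Trying several buffer sizes is worthwhile, since results can differ depending on whether the buffer fits in cache.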
Assuming that you're using libc.lib from VC - it is probably optimized. What size of memory are you trying to set? IIRC, STOSD works best with sizes greater than the L2 cache. I get the best performance with a MOVNTQ loop - with the side benefit of not polluting the cache.
Posted on 2002-07-08 07:14:17 by bitRAKE
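The idea of a non-temporal store loop can be sketched in C with compiler intrinsics. Note the substitution: MOVNTQ is the 8-byte MMX form, while this sketch uses `_mm_stream_si128`, the 16-byte SSE2 counterpart (MOVNTDQ), because it is the variant commonly available through `<emmintrin.h>`. The name `fill_stream` is hypothetical, and on targets without SSE2 the sketch simply falls back to `memset`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

/* Hypothetical helper: fill memory with stores that bypass the cache. */
static void *fill_stream(void *dest, int c, size_t count)
{
#ifdef HAVE_SSE2_STREAM
    unsigned char *p = (unsigned char *)dest;
    unsigned char byte = (unsigned char)c;

    /* Head: byte stores until p is 16-byte aligned, which the
       streaming store requires. */
    while (count > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = byte;
        count--;
    }

    /* Body: 16 bytes per non-temporal store. */
    __m128i pattern = _mm_set1_epi8((char)byte);
    while (count >= 16) {
        _mm_stream_si128((__m128i *)p, pattern);
        p += 16;
        count -= 16;
    }
    _mm_sfence();   /* make the streamed data visible before returning */

    /* Tail: the remaining 0..15 bytes. */
    while (count > 0) {
        *p++ = byte;
        count--;
    }
    return dest;
#else
    /* Assumption: no SSE2 on this target, so use the library routine. */
    return memset(dest, c, count);
#endif
}
```

Non-temporal stores tend to pay off only for buffers that are large relative to the cache; for small fills that will be read back soon, ordinary stores are usually the better choice.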
If you are looking for a fast memory fill algorithm, using a library like libc would be contrary to your objective. Calling another routine adds overhead, so you no longer get the maximum speed. I really think a simple rep stosd should be sufficient for most needs.
Posted on 2002-07-08 12:08:53 by comrade
I tried with a 1,000,000-byte array.

I was surprised that on a Celeron, rep stosd is faster than the libc memset, but on my PIII 800 it is 4 times slower...

Here is my code.
It is inline asm in VC++:

void * __cdecl memsetASM(void *dest, int c, size_t count)
{
    __asm {
        // Get the parameters for stos
        mov eax, c
        mov edi, dest
        mov ecx, count

        // Fill eax with the byte c (assumes c is in the range 0..255)
        shl eax, 8
        or eax, c
        mov edx, eax
        bswap eax
        or eax, edx

        // Store count / 4 dwords, then the remaining count % 4 bytes
        shr ecx, 2
        rep stosd
        mov ecx, count
        and ecx, 3
        rep stosb
    }

    return dest;
}

What's the problem?
Posted on 2002-07-08 13:23:21 by DarkEmpire