I had time today to have a dabble on the P4, as I had been doing some memory tests on it anyway and one of the things I needed for them was a memory fill.

I did the tests with a STOSD algo, but as I was interested to see how it clocked on a P4, I wrote a set of different methods that used incremented pointers, MMX and XMM in a similarly simple configuration.

Contrary to my testing on my PIII, the incremented pointers were faster than the STOSD algo. I did a similar algo using MMX loading with MOVQ and writing to memory with MOVNTQ and it was measurably faster.

I did the next one using XMM loading with MOVUPS and writing with MOVNTPS but it produced identical times to the MMX version.

Filling 128 MB of 16 byte aligned memory, the times were as follows:

STOSD (MemFill)                  602 ms
incremented pointers (MemFill1)  580 ms
MMX (MemFill2)                   453 ms
XMM (MemFill3)                   453 ms

What I found interesting was that none of the three algos that could be unrolled were any faster at all when unrolled by a factor of 8 or less.
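
For anyone who wants to repeat the unrolling test, a 4x unroll of the XMM fill posted further down would look something like the following. This is only a sketch and not one of the timed test pieces: the name MemFill3u is made up for illustration, and it assumes the target is 16 byte aligned and the length is a non-zero multiple of 64.

```asm
; Hypothetical 4x unrolled variant of MemFill3 (illustration only).
; Assumes lpTarget is 16 byte aligned and lnth is a non-zero multiple of 64.

MemFill3u proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD

LOCAL sixteenbyte:OWORD

push esi

mov esi, lpTarget
mov ecx, lnth
shr ecx, 6 ; div by 64, four 16 byte stores per pass
mov eax, fillchar

mov DWORD PTR sixteenbyte[0], eax
mov DWORD PTR sixteenbyte[4], eax
mov DWORD PTR sixteenbyte[8], eax
mov DWORD PTR sixteenbyte[12], eax

movups xmm(0), sixteenbyte ; load xmm(0) with the 4 DWORD fill values

@@:
movntps [esi], xmm(0)
movntps [esi+16], xmm(0)
movntps [esi+32], xmm(0)
movntps [esi+48], xmm(0)
add esi, 64
dec ecx
jnz @B

pop esi

ret

MemFill3u endp
```

The only saving over the simple loop is three out of every four ADD/DEC/JNZ groups, which fits with the unrolled versions clocking no faster if the stores themselves are what takes the time.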

The algos below are test pieces and have no trickery with cache or prefetch.


; ########################################################################

MemFill3 proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD

LOCAL sixteenbyte:OWORD

push esi

mov esi, lpTarget
mov ecx, lnth
shr ecx, 4 ; div by 16
mov eax, fillchar

mov DWORD PTR sixteenbyte[0], eax
mov DWORD PTR sixteenbyte[4], eax
mov DWORD PTR sixteenbyte[8], eax
mov DWORD PTR sixteenbyte[12], eax

movups xmm(0), sixteenbyte ; load xmm(0) with the 4 DWORD fill values

@@:
movntps [esi], xmm(0)
add esi, 16
dec ecx
jnz @B

pop esi

ret

MemFill3 endp

; ########################################################################

MemFill2 proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD

LOCAL eightbyte:QWORD

push esi

mov esi, lpTarget
mov ecx, lnth
shr ecx, 3 ; div by 8
mov eax, fillchar

mov DWORD PTR eightbyte[0], eax
mov DWORD PTR eightbyte[4], eax

movq mm(0), eightbyte ; load mm(0) with the 2 DWORD fill values

@@:
movntq [esi], mm(0)
add esi, 8
dec ecx
jnz @B

emms

pop esi

ret

MemFill2 endp

; ########################################################################

MemFill1 proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD

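mov edx, lpTarget
mov ecx, lnth
add ecx, edx ; ecx = one past the end of the buffer
sub ecx, 4 ; ecx = address of the last DWORD to write
mov eax, fillchar
sub edx, 4

@@:
add edx, 4
mov [edx], eax
cmp edx, ecx
jb @B ; unsigned compare for addresses, stops after the last DWORD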

ret

MemFill1 endp

; ########################################################################

MemFill proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD

push edi

mov edi, lpTarget
mov ecx, lnth
shr ecx, 2
mov eax, fillchar
rep stosd

pop edi

ret

MemFill endp

; ##########################################################################
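
The timing code is not shown above. One way to get millisecond figures like these is sketched below, assuming GetTickCount from kernel32 and the usual .model flat, stdcall setup with kernel32.inc and kernel32.lib included. The TimeFill wrapper is a hypothetical helper, not part of the test program.

```asm
; Hypothetical timing wrapper (illustration only).
; lpFill is the address of one of the fill procs, which all take
; the same three DWORD arguments and use the stdcall convention.

TimeFill proc lpFill:DWORD,lpTarget:DWORD,lnth:DWORD

LOCAL tStart:DWORD

invoke GetTickCount
mov tStart, eax ; tick count before the fill

push 0 ; fillchar
push lnth
push lpTarget
call lpFill ; stdcall, the callee cleans up the stack

invoke GetTickCount
sub eax, tStart ; eax = elapsed milliseconds

ret

TimeFill endp
```

It could be called as "invoke TimeFill, OFFSET MemFill3, pMem, 1024*1024*128", where pMem is whatever pointer the 128 MB allocation returned. GetTickCount typically only resolves to 10 ms or so, so differences of a few ms between runs mean nothing.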

About all I can think of is that the memory in this box, which is PC133 SDRAM, may be compressing the results into a narrow range.

Regards,

hutch@movsd.com
Posted on 2002-08-13 06:37:48 by hutch--
Could you please attach a binary test program? I am at work here on a P4 but without MASM :)
Posted on 2002-08-13 06:48:51 by bazik
Attached.

One more piece of wisdom: with the memory testing algo, I again tried to replace a CMP/JMP with a CMOVcc, but as with every instance where I have done it before, it was measurably slower than the CMP/JMP.
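
The code from the memory tester is not shown here, so as a stand-in, the sort of replacement in question looks like this on a simple "larger of two unsigned values" test. Both procs and their names are hypothetical and only there to show the two forms side by side.

```asm
; Hypothetical examples (illustration only), each returns the larger
; of two unsigned values in eax. CMOVcc needs .686 or later.

MaxJcc proc val1:DWORD,val2:DWORD

mov eax, val1
mov edx, val2
cmp eax, edx
jae @F ; val1 is already the larger, skip the move
mov eax, edx
@@:
ret

MaxJcc endp

MaxCmov proc val1:DWORD,val2:DWORD

mov eax, val1
cmp eax, val2
cmovb eax, val2 ; take val2 only if val1 is below it, no branch
ret

MaxCmov endp
```

The CMOVcc form removes the branch but always does the compare and the conditional move, so it is only likely to pay off where the branch is badly predicted.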

Regards,

hutch@movsd.com
Posted on 2002-08-13 07:45:58 by hutch--
I get the following times on my P4 with RIMM RAM. It is faster than PC133, but the results seem consistent with what you got.

435
421
367
364

clicking on the buttons from left to right :-)
Posted on 2002-08-13 10:14:01 by Terab

The algos below are test pieces and have no trickery with cache or prefetch.

"The Pentium 4 processor provides hardware prefetching, in addition to software
prefetching. The hardware prefetcher operates transparently to fetch data and
instruction streams from memory, without requiring programmer's intervention."
Posted on 2002-08-13 10:46:35 by Nexo