I had time today to have a dabble on the P4, as I had been doing some memory tests on it anyway and one of the things I needed on it was a memory fill.
I did the tests with a STOSD algo but as I was interested to see what it clocked like on a P4, I wrote a set of different methods that used incremented pointers, MMX and XMM in a simple similar configuration.
Contrary to my testing on my PIII, the incremented pointers were faster than the STOSD algo. I did a similar algo using MMX loading with MOVQ and writing to memory with MOVNTQ and it was measurably faster.
I did the next one using XMM loading with MOVUPS and writing with MOVNTPS but it produced identical times to the MMX version.
On 128 meg, 16 byte aligned, the times were as follows,
STOSD 602 ms
incremented pointers 580 ms
MMX 453 ms
XMM 453 ms
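For comparison with other boxes, the times above can be turned into fill rates. This is only a rough sketch: it assumes "128 meg" means 128 MiB (134,217,728 bytes), and the helper name fill_rate_mib_per_s is made up for illustration. On that assumption STOSD works out to roughly 213 MiB/s, incremented pointers to roughly 221 MiB/s, and the MMX and XMM versions to roughly 283 MiB/s.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert a timed fill into MiB per second.
   bytes = size of the filled block, ms = elapsed time in milliseconds. */
static double fill_rate_mib_per_s(size_t bytes, double ms)
{
    assert(ms > 0.0);
    /* MiB filled, divided by the elapsed time in seconds */
    return ((double)bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}
```

Calling it as fill_rate_mib_per_s((size_t)128 * 1024 * 1024, 602.0) gives the STOSD figure; swap in 580.0 or 453.0 for the other methods.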
What I found interesting was that with the three algos that could be unrolled, none were any faster at all when unrolled by a factor of 8 or smaller.
The algos below are test pieces and have no trickery with cache or prefetch.
; ########################################################################
MemFill3 proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD
LOCAL sixteenbyte:OWORD
push esi
mov esi, lpTarget
mov ecx, lnth
shr ecx, 4 ; div by 16
mov eax, fillchar
mov DWORD PTR sixteenbyte[0], eax
mov DWORD PTR sixteenbyte[4], eax
mov DWORD PTR sixteenbyte[8], eax
mov DWORD PTR sixteenbyte[12], eax
movups xmm(0), sixteenbyte ; load xmm(0) with the 4 DWORD fill values
@@:
movntps [esi], xmm(0)
add esi, 16
dec ecx
jnz @B
pop esi
ret
MemFill3 endp
; ########################################################################
MemFill2 proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD
LOCAL eightbyte:QWORD
push esi
mov esi, lpTarget
mov ecx, lnth
shr ecx, 3 ; div by 8
mov eax, fillchar
mov DWORD PTR eightbyte[0], eax
mov DWORD PTR eightbyte[4], eax
movq mm(0), eightbyte ; load mm(0) with the 2 DWORD fill values
@@:
movntq [esi], mm(0)
add esi, 8
dec ecx
jnz @B
emms
pop esi
ret
MemFill2 endp
; ########################################################################
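MemFill1 proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD
mov edx, lpTarget
mov ecx, lnth
add ecx, edx
sub ecx, 4 ; ecx = address of the last DWORD to write
mov eax, fillchar
sub edx, 4
@@:
add edx, 4
mov [edx], eax
cmp edx, ecx
jb @B ; unsigned compare, stops once the last DWORD is written
ret
MemFill1 endp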
; ########################################################################
MemFill proc lpTarget:DWORD,lnth:DWORD,fillchar:DWORD
push edi
mov edi, lpTarget
mov ecx, lnth
shr ecx, 2
mov eax, fillchar
rep stosd
pop edi
ret
MemFill endp
; ##########################################################################
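For anyone without MASM handy, the MOVNTPS method can be approximated in C with the SSE intrinsics. This is a sketch under assumptions, not the code that was timed: fill_dwords_nt is a made-up name, it falls back to a plain DWORD loop when SSE is not available at compile time, and unlike the test pieces it handles a length that is not a multiple of 16 and issues an SFENCE after the non-temporal stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Hypothetical helper: fill `bytes` bytes at `dst` with a 32-bit pattern.
   Assumes `dst` is 16-byte aligned and `bytes` is a multiple of 4. */
static void fill_dwords_nt(void *dst, size_t bytes, uint32_t pattern)
{
    unsigned char *p = (unsigned char *)dst;
    size_t done = 0;

    assert(((uintptr_t)dst & 15u) == 0);
    assert((bytes & 3u) == 0);

#ifdef HAVE_SSE
    {
        uint32_t four[4];
        float as_float[4];
        __m128 v;

        four[0] = four[1] = four[2] = four[3] = pattern;
        memcpy(as_float, four, sizeof as_float); /* bit copy, no conversion */
        v = _mm_loadu_ps(as_float);              /* MOVUPS: load 4 fill DWORDs */
        for (; done + 16 <= bytes; done += 16)
            _mm_stream_ps((float *)(p + done), v); /* MOVNTPS: non-temporal store */
        _mm_sfence(); /* make the streamed stores visible before returning */
    }
#endif
    /* Tail, or the whole block when SSE is not available */
    for (; done + 4 <= bytes; done += 4)
        memcpy(p + done, &pattern, 4);
}
```

With a 40 byte block, the SSE path writes two 16 byte stores and the tail loop writes the last two DWORDs. A guard DWORD placed after the block is a cheap way to check that a fill routine does not run past the end.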
About all I can think of is that the memory in this box, which is PC133 SDRAM, may be the bottleneck and is compressing the results.
Regards,
hutch@movsd.com
Could you please attach a binary test program? I am at work here on a P4 but without MASM :)
Attached.
One more piece of wisdom: with the memory testing algo, I again tried to replace a CMP/JMP with a CMOVcc, but as in every instance where I have tried it before, it was measurably slower than the CMP/JMP.
Regards,
hutch@movsd.com
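To make the two shapes being compared concrete, here is a small C sketch of a select done with a branch and done branch-free. It is an illustration only, not the memory testing code: the function names are made up, the branch-free version uses a mask rather than a conditional move, and a compiler is free to emit a jump or a conditional move for either one, so any timing would have to be checked against the generated code.

```c
#include <stdint.h>

/* Branching form: the compare-and-jump shape. */
static uint32_t select_branch(int cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a;
    return b;
}

/* Branch-free form: build an all-ones or all-zeros mask from the
   condition and blend the two values with it. */
static uint32_t select_mask(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

Both functions return a when cond is non-zero and b otherwise, so one can be swapped for the other in a timing loop.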
I get the following times on my P4 with RIMM RAM, which is faster than PC133, but the results seem consistent with what you got.
435
421
367
364
clicking on the buttons from left to right :-)
The algos below are test pieces and have no trickery with cache or prefetch.
"The Pentium 4 processor provides hardware prefetching, in addition to software
prefetching. The hardware prefetcher operates transparently to fetch data and
instruction streams from memory, without requiring programmer's intervention."