Since I do not have access to a P4, I thought I could ask this here.

Which of these code pieces is faster on a P4?



REPT 2
mov eax, [esi]
add esi, 4
mov [edi], eax
add edi, 4
ENDM

mov eax, [esi]
add esi, sizeof(VERTEX) - 8
mov [edi], eax
add edi, 8


or



i = 0
REPT 3
mov eax, [esi + i]
mov [edi + i], eax
i = i + 4
ENDM
add esi, sizeof(VERTEX)
add edi, 16
Posted on 2002-01-02 14:57:36 by dxantos
I don't think anyone will be able to tell you, because it's hard to tell how fast a piece of code is. What you could do is add up the time each instruction takes. You'll find the values in a manual from the Intel site.
Posted on 2002-01-02 17:37:14 by ChimpFace9000
The fastest way, if possible:


mov ecx,-1*(LoopCount * (SIZEOF WhatEver))
@@:mov eax,[esi+ecx+(LoopCount * (SIZEOF WhatEver))]
mov [edi+ecx+(LoopCount * (SIZEOF WhatEver))],eax
add ecx,SIZEOF WhatEver
jne @B
This is a little tricky, so work through it slowly. ;)
Unroll it some and use prefetch, else use movsd.
Might want to read { this PDF }?
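To work through the loop above with concrete numbers, here is a minimal sketch. It assumes plain dword elements and MASM 6-style syntax; COUNT, src, dst and start are illustrative names, not from the post. The index starts at minus the total byte size and counts up to zero, so the add that advances the index also sets the zero flag that ends the loop, and no separate cmp is needed.

```asm
; Sketch only: COUNT, src, dst and start are hypothetical names.
.386
.model flat, stdcall
option casemap:none

COUNT equ 4                         ; number of dwords to copy (assumed)

.data
src dd 11111111h, 22222222h, 33333333h, 44444444h
dst dd COUNT dup (0)

.code
start:
    mov esi, offset src
    mov edi, offset dst
    mov ecx, -(COUNT * 4)           ; ecx = -16, i.e. minus the total size in bytes
@@: mov eax, [esi+ecx+COUNT*4]      ; first pass reads [esi+0]
    mov [edi+ecx+COUNT*4], eax      ; first pass writes [edi+0]
    add ecx, 4                      ; ecx goes -12, -8, -4, 0
    jne @B                          ; add sets ZF when ecx reaches 0, ending the loop
    ret                             ; a real program would exit properly here
end start
```

The constant COUNT*4 is folded into the displacement at assembly time, so each memory operand is still a single base + index + displacement address.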
Posted on 2002-01-02 21:52:06 by bitRAKE
Hi Bitrake !

Perhaps this one's a bit faster, because the use of eax as destination and then as source does not follow directly:

(SIZEOF WhatEver) has to be a multiple of 4!



mov ecx,-1*(LoopCount * (SIZEOF WhatEver))
@@:mov eax,[esi+ecx+(LoopCount * (SIZEOF WhatEver))]
add ecx, 4
mov -4[edi+ecx+(LoopCount * (SIZEOF WhatEver))],eax
jne @B



Greetings, CALEB
Posted on 2002-01-03 07:37:56 by Caleb
Caleb, you're right, but it really should be unrolled. Use two registers like EAX & EDX and alternate their usage. If the memory is contiguous then certainly use movsd, but sometimes you want to move just part of a structure, and then you'd need to construct a loop like this.
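A rough sketch of that alternation, unrolled by two. It assumes contiguous dwords and an even COUNT; COUNT, src, dst and start are illustrative names only, and nothing here has been timed.

```asm
; Sketch only: COUNT, src, dst and start are hypothetical names.
.386
.model flat, stdcall
option casemap:none

COUNT equ 4                         ; dwords to copy; must be even for this unroll

.data
src dd 1, 2, 3, 4
dst dd COUNT dup (0)

.code
start:
    mov esi, offset src
    mov edi, offset dst
    mov ecx, -(COUNT * 4)           ; negative byte count, counts up to 0
@@: mov eax, [esi+ecx+COUNT*4]      ; load first dword into eax
    mov edx, [esi+ecx+COUNT*4+4]    ; load second dword into edx
    mov [edi+ecx+COUNT*4], eax      ; store first dword
    mov [edi+ecx+COUNT*4+4], edx    ; store second dword
    add ecx, 8                      ; two dwords handled per pass
    jne @B                          ; loop until ecx reaches 0
    ret
end start
```

Both loads are issued before either store, so no store has to wait on the load immediately before it.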
Posted on 2002-01-03 10:07:26 by bitRAKE
Maybe this can explain what I want to do. I have these structures:



V3D struct
_x real4 ?
_y real4 ?
_z real4 ?
V3D ends

V4D struct
_x real4 ?
_y real4 ?
_z real4 ?
_w real4 ?
V4D ends

UV struct
_u real4 ?
_v real4 ?
UV ends

RGBA struct
_red byte ?
_green byte ?
_blue byte ?
_alpha byte ?
RGBA ends

VERTEX struct
_pos V3D<>
_color RGBA<>
_normal V3D<>
_uv UV<>
_lm_uv UV<>
VERTEX ends


I wish to extract the position from an array of "VERTEX" to an array of "V4D".

Since the sizes of the source and destination structures are not equal, I cannot use rep movsd. What I am not sure about is whether



movsd
movsd
movsd
add esi, sizeof(VERTEX)-12
add edi, 4



is faster than:



REPT 2
mov eax, [esi]
add esi, 4
mov [edi], eax
add edi, 4
ENDM

mov eax, [esi]
add esi, sizeof(VERTEX)-8
mov [edi], eax
add edi, 8



or is it:

[code]
mov eax, [esi]
mov [edi], eax
mov edx, [esi + 4]
mov [edi + 4], edx
mov eax, [esi + 8]
mov [edi + 8], eax
add esi, sizeof(VERTEX)
add edi, 16
[/code]

The first one is the shortest one. The last one is the largest one; I'm sure of this.

Anyway, thanks everyone for your help. I will time every different loop option I can think of on a P3 and hope that the best one on the P3 is still the best one on the P4.

bitRAKE,
I must say, this is an interesting way of moving memory. And thanks for the PDF link. I will read it right away.

BTW, on which x86 processor did "prefetch" first appear?
Posted on 2002-01-04 16:41:15 by dxantos
mov eax, [esi]
mov edx, [esi + 4]
mov ecx, [esi + 8]
add esi, sizeof(VERTEX)
mov [edi], eax
mov [edi + 4], edx
mov [edi + 8], ecx
add edi, 16
The fewest forward dependencies. ;)
Prefetch came along with P3, but it's on K6 (some), K7 too.
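For reference, the three-register copy above could be wrapped in a loop over the whole array roughly as follows. This is a sketch under assumptions: NUM_VERTS, verts, out4, next_vertex and start are hypothetical names, the structures are the ones from the earlier post, and the loop leaves the _w field of each V4D untouched, just as the snippets in this thread do.

```asm
; Sketch only: NUM_VERTS, verts, out4, next_vertex and start are hypothetical names.
.386
.model flat, stdcall
option casemap:none

V3D struct
  _x real4 ?
  _y real4 ?
  _z real4 ?
V3D ends

V4D struct
  _x real4 ?
  _y real4 ?
  _z real4 ?
  _w real4 ?
V4D ends

UV struct
  _u real4 ?
  _v real4 ?
UV ends

RGBA struct
  _red   byte ?
  _green byte ?
  _blue  byte ?
  _alpha byte ?
RGBA ends

VERTEX struct
  _pos    V3D <>
  _color  RGBA <>
  _normal V3D <>
  _uv     UV <>
  _lm_uv  UV <>
VERTEX ends

NUM_VERTS equ 8                     ; number of vertices (assumed)

.data
verts VERTEX NUM_VERTS dup (<>)     ; source array, 44 bytes per element
out4  V4D    NUM_VERTS dup (<>)     ; destination array, 16 bytes per element

.code
start:
    push ebx                        ; ebx is used as the loop counter
    mov esi, offset verts
    mov edi, offset out4
    mov ebx, NUM_VERTS
next_vertex:
    mov eax, [esi]                  ; _pos._x
    mov edx, [esi + 4]              ; _pos._y
    mov ecx, [esi + 8]              ; _pos._z
    add esi, sizeof VERTEX          ; step to the next source vertex
    mov [edi], eax
    mov [edi + 4], edx
    mov [edi + 8], ecx              ; _w at [edi + 12] is not written
    add edi, sizeof V4D             ; step to the next destination element
    sub ebx, 1
    jnz next_vertex
    pop ebx
    ret
end start
```

The source and destination strides differ (44 versus 16 bytes), which is why two separate pointers are advanced instead of sharing one index register.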
Posted on 2002-01-04 17:04:16 by bitRAKE
For the K6, I believe it came along with 3DNow!, so it will be the K6-2 and K6-III ...
unless some weird plain K6 with 3DNow! has been released :).
Posted on 2002-01-05 00:01:46 by f0dder
Celerons (the newest ones that have a p3 core) should have it, too.

You can forget the K6s unless you're specifically tuning code for the K7, as the prefetch instructions aren't the same ones as on the Intel chips. But the K7 has both.
Posted on 2002-01-05 00:10:35 by bitRAKE