Was practicing my optimization techniques when all of a sudden, I got stuck... I was hoping someone could help me solve this "simple" stall problem:



mov bh, 0
add bx, ax   ; stall: reads BX right after the partial write to BH
inc ebx      ; stall: reads EBX right after the partial write to BX


How would you avoid the stalls?

and bx, 000FFh
actually does more harm than good...

and I'm not even sure how to stop the 5-6 clocks I lose on the stall at inc ebx...

Any help with this would be appreciated.

Sliver
Posted on 2002-03-24 21:52:51 by Sliver
Sliver,

Try the Intel-documented technique of XOR EBX, EBX or SUB EBX, EBX before you use the partial register BX. I have used this method for some time and it solves the problem.
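
A minimal sketch of that zero-first idiom, assuming a P6-family CPU (Pentium Pro/II/III), where XOR or SUB of a register with itself is reported to mark the upper bits as zero. The procedure name and the use of CL and EAX as inputs are made up for illustration and are not from the thread.

```asm
; Illustrative MASM module; ZeroThenPartial is a hypothetical name.
.686
.model flat, stdcall
option casemap:none

.code
; in:  cl  = byte value, eax = running total (both assumed for the example)
; out: eax = eax + zero-extended cl
ZeroThenPartial proc uses ebx
    xor  ebx, ebx      ; zero the full register before any partial write
    mov  bl, cl        ; partial write to BL
    add  eax, ebx      ; full-register read of EBX; expected not to stall after the XOR
    ret
ZeroThenPartial endp
end
```

MOV EBX, 0 in place of the XOR also zeroes the register, but as far as I know it does not give the same protection against the stall on these processors, so it is worth timing both.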

Regards,

hutch@movsd.com
Posted on 2002-03-24 22:02:44 by hutch--
But what if I need to preserve (for whatever reason) bl and the top 16 bits of ebx?

I can't just xor ebx in that case...

Unfortunately I'm just practicing this and don't have a good example of when you'd want to do this, but I was hoping for a solution to this stall problem...
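
One possible approach, sketched under assumptions that are not stated in the thread: the upper 16 bits of EAX are known to be zero, a carry out of bit 15 into the upper half of EBX is acceptable, and the flags from INC are not needed afterwards. Under those assumptions the whole sequence can stay on full 32-bit registers. The procedure name is hypothetical. If the 16-bit wrap of ADD BX, AX matters, this is not a drop-in replacement.

```asm
; Illustrative MASM module; ClearBhAddInc is a hypothetical name.
.686
.model flat, stdcall
option casemap:none

.code
; Register-based helper (not a stdcall function): in EBX, EAX; out EBX.
; Assumes bits 16-31 of EAX are zero.
ClearBhAddInc proc
    and  ebx, 0FFFF00FFh     ; clear BH with a full-register write; BL and bits 16-31 kept
    lea  ebx, [ebx+eax+1]    ; add and increment in one 32-bit operation (flags untouched)
    ret
ClearBhAddInc endp
end
```

Because every write here covers all of EBX, no later read has to merge a partial register. Whether it is actually faster than the original three instructions is something to time rather than assume.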

Another problem -- while I have your attention :)



DIV EBX
mov edx, eax
MOV EAX, 0 ; break dependency
XOR EAX, EAX ; prevent partial register stall
MOV AL, CL
ADD EBX, EAX


In my timings I don't gain anything from the use of "mov eax, 0", and it actually hurts (by 1 clock cycle) if I store to a variable instead of edx (i.e. mov Temp, eax)

---EDIT---
From Agner Fog

Setting EAX to zero twice here seems redundant, but without the MOV EAX,0 the last instructions would have to wait for the slow DIV to finish, and without XOR EAX,EAX you would have a partial register stall.
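
For comparison, a hedged sketch of a commonly suggested alternative on the same P6-family processors: MOVZX writes the whole of EAX from CL alone, so it should neither wait on the DIV result nor leave a partial write behind. This is general guidance rather than something timed in this thread, and on the original Pentium MOVZX is comparatively slow. The procedure name and register roles are illustrative.

```asm
; Illustrative MASM module; DivThenAddByte is a hypothetical name.
.686
.model flat, stdcall
option casemap:none

.code
; Register-based fragment: in EDX:EAX = dividend, EBX = divisor, CL = byte to add.
; The caller must ensure EBX is nonzero and EDX < EBX, or DIV faults.
DivThenAddByte proc
    div   ebx            ; slow; quotient in EAX, remainder in EDX
    mov   edx, eax       ; keep the quotient (overwrites the remainder, as in the code above)
    movzx eax, cl        ; full write of EAX that depends only on ECX
    add   ebx, eax       ; reads the full EAX; no partial write precedes it
    ret
DivThenAddByte endp
end
```

This replaces the MOV EAX, 0 / XOR EAX, EAX / MOV AL, CL triple with a single instruction; the three-instruction form remains the one quoted from Agner Fog.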
Posted on 2002-03-24 22:11:59 by Sliver
for whatever reason,
replacing
mov bh,0
with
xor bh,bh yields a rather large speed diff on my PIII.

inserting a nop also helps, i found.

xor bh,bh
nop

ADD BX, AX ;stall
inc ebx ; stall

if you inc ax before, you can do some pairing, but it is not identical in function (fails in certain cases):

inc ax
xor bh,bh

add bx,ax
dec ax
Posted on 2002-03-24 22:18:14 by jademtech
What are you guys using to get your instruction timings--VTune?
Posted on 2002-03-24 22:51:22 by grv575

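quote:
--------------------------------------------------------------------------------
Originally posted by grv575
What are you guys using to get your instruction timings--VTune?
--------------------------------------------------------------------------------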

whatever works :) i myself call GetTickCount /w the function looping like 0x0FFFFFFF times, then subtract the start time from the end time. accurate enough... a change of more than 5% is probably no fluke (i run it a couple of times anyway...) i got changes of 10% (/w the nop) and 50% from the xor change... definitely no coincidence.

actually, i just wrote this code for this question specifically. i normally wrap my stuff in some other lang, but i only have masm on this comp.

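; assumes the MASM32 package is installed in \masm32 (not stated in the post)
.386
.model flat, stdcall
option casemap:none
include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\user32.inc
includelib \masm32\lib\kernel32.lib
includelib \masm32\lib\user32.lib
.radix 16                        ; all numeric literals below are hex
.data
Hello db "%lu ms",0              ; format string; assumed, the post never defined Hello
.data?
timer DWORD ?
Buffer db 16 dup (?)             ; 16h = 22 bytes
.code
main:
invoke GetTickCount
mov ebx,eax                      ; start time
push ebx                         ; saved so the timed code may clobber EBX
mov ecx,0FFFFFFF                 ; loop count
@@:
push ecx
;stuff goes here
pop ecx
dec ecx
jne @B
pop ebx
invoke GetTickCount
sub eax,ebx                      ; elapsed ms = end - start
mov timer,eax
invoke wsprintf,addr Buffer,addr Hello,timer
invoke MessageBox,0,addr Buffer,0,0
invoke ExitProcess,0
end main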
Posted on 2002-03-24 23:12:18 by jademtech
I think you can get the total clocks of your program with the /cl option for the linker.
If you are looking for the clocks of each instruction, then you need the Intel instruction reference, or any book that covers it.
Like the one I have: Intel Microprocessors
Posted on 2002-03-24 23:41:21 by amr
In my own understanding: (I'm not sure if I'm correct :) )

mov eax, 0 breaks dependency chains because whenever you use div a dependency chain starts. Why? I don't have a technical explanation, but my guess is that div uses the eax register a lot. :)

Remember the thread "How DIV works"? It subtracts from the destination operand multiple times?


sub eax, edx
sub eax, edx
sub eax, edx
Maybe this is how DIV operates...That's why a dependency chain starts.

Let's ask Intel/AMD... :grin:
Posted on 2002-03-25 00:32:48 by stryker


quote:
--------------------------------------------------------------------------------
Originally posted by grv575
What are you guys using to get your instruction timings--VTune?
--------------------------------------------------------------------------------

whatever works i myself call GetTickCount /w the function looping like 0x0FFFFFFF times, then subtract the start time from the end time. accurate enough... a change of more than 5% is probably no fluke (i run it a couple of times anyway...) i got changes of 10% (/w the nop) and 50% from the xor change... definitely no coincidence.


I was stimulated by this post to make another one, here
Posted on 2002-03-25 06:16:53 by Maverick

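quote:
--------------------------------------------------------------------------------
Originally posted by amr
I think you can get the total clocks of your program with the /cl option for the linker.
If you are looking for the clocks of each instruction, then you need the Intel instruction reference, or any book that covers it.
Like the one I have: Intel Microprocessors
--------------------------------------------------------------------------------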


i use ml.exe with the /Sc option when i want that. but sometimes, theoretical speed != actual speed. and besides... it doesn't do a good job of handling pairing or stalls (even if you define .586 or .686 at the top of your program).
Posted on 2002-03-25 17:30:29 by jademtech