This code subtracts a fixed number (32*4) from esi when ecx is a multiple of 32. Any ideas on how to optimize this? I'm looking for a way to remove the jump.

EDIT: And I have only eax available as a temp register....

test ecx,32-1
jnz @F
sub esi,32*4
@@:
Posted on 2002-11-17 02:26:26 by Qweerdy
lea eax, [ecx + 32 - 1]
; (space for instruction)
xor eax, ecx
; (space for instruction)
and eax, 32
; (space for instruction)
lea esi, [esi + eax*4][-128]
EAX is zero only in the case that ECX is a multiple of 32. ;)
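
A rough way to check the claim is to model both versions in C with 32-bit unsigned arithmetic and compare them. This is only a sketch: the function names are made up for illustration, and uint32_t stands in for the registers.

```c
#include <stdint.h>

/* Original sequence: test ecx,32-1 / jnz @F / sub esi,32*4 */
uint32_t esi_with_branch(uint32_t esi, uint32_t ecx)
{
    if ((ecx & 31u) == 0)          /* low 5 bits clear: multiple of 32 */
        esi -= 32u * 4u;
    return esi;
}

/* Branchless sequence from the post above */
uint32_t esi_branchless(uint32_t esi, uint32_t ecx)
{
    uint32_t eax = ecx + 31u;      /* lea eax,[ecx+32-1]          */
    eax ^= ecx;                    /* xor eax,ecx                 */
    eax &= 32u;                    /* and eax,32  -> 0 or 32      */
    return esi + eax * 4u - 128u;  /* lea esi,[esi+eax*4][-128]   */
}
```

When ecx is a multiple of 32 the mask is 0 and esi drops by 128; otherwise the mask is 32, so the +128 and -128 cancel and esi is unchanged.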
Posted on 2002-11-17 02:51:38 by bitRAKE
Ok, I've made the changes. Now my inner loop looks like this:

mov eax,[esi]
inc ecx
mov [edi],eax
lea eax,[ecx+32-1]
add esi,4
xor eax,ecx
add edi,4
and eax,32
cmp ecx,[ebx]
lea esi,[esi+eax*4][-128]
jne CellLoop

Is this optimal?

Great piece of work btw, I couldn't have figured that out myself... wow.
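
For anyone following along, the loop above can be modelled in C roughly like this. It is a sketch under assumptions that are not stated in the thread: cell_loop and its parameters are hypothetical names, the source is treated as a 32-dword row, and esi is assumed to start at the row position matching ecx mod 32.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of CellLoop. row = 32-dword source, dst = destination,
   count = ecx on entry, limit = the value read from [ebx]. */
void cell_loop(const uint32_t *row, uint32_t *dst, uint32_t count, uint32_t limit)
{
    ptrdiff_t si = (ptrdiff_t)(count & 31u);   /* assumed starting offset */
    do {
        *dst++ = row[si];                      /* mov eax,[esi] / mov [edi],eax / add edi,4 */
        count++;                               /* inc ecx */
        si++;                                  /* add esi,4 */
        uint32_t mask = ((count + 31u) ^ count) & 32u;  /* lea / xor / and */
        si += (ptrdiff_t)mask - 32;            /* lea esi,[esi+eax*4][-128] */
    } while (count != limit);                  /* cmp ecx,[ebx] / jne CellLoop */
}
```

As in the assembly, the body runs once before the limit is checked, and the source index wraps back by 32 dwords each time the incremented counter reaches a multiple of 32.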
Posted on 2002-11-17 04:40:38 by Qweerdy
bitRake, Qweerdy,
Treat me as a newbie on this one.

bitRAKE, how does your algo work? And why do you leave spaces for instructions in between?

Also, how does "test ecx,32-1" check ecx for being a multiple of 32?

Whatever I am asking may sound completely stupid, but I am pretty new to asm.
Posted on 2002-11-17 05:08:50 by clippy

Also, how does "test ecx,32-1" check ecx for being a multiple of 32?

1100000 = 32 * 3 in binary
100000 = 32 in binary
11111 = 32-1 in binary

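So if the lower 5 bits of a number are clear, the number is a multiple of 32. "test" ANDs ecx with 11111 without storing the result, and the zero flag is set only when all five of those bits are clear, so the jnz is taken for every ecx that is not a multiple of 32.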
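
The same check written in C, as a small sketch (is_multiple_of_32 is just an illustrative name):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of "test ecx,32-1": true exactly when the AND result is zero,
   i.e. when the zero flag would be set and the jnz would fall through. */
bool is_multiple_of_32(uint32_t ecx)
{
    return (ecx & (32u - 1u)) == 0;
}
```

For example, 96 is 1100000 in binary, so the low five bits are clear and the function returns true; 33 is 100001, so it returns false.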
Posted on 2002-11-17 06:13:24 by Qweerdy
gladiator, it is binary math. It works by finding whether bit 5 has changed after adding 31. If bit 5 changes, then ECX is not a multiple of 32 (e.g. 32/32 = 1; 32+31 = 63; 63/32 = 1; but 33/32 = 1; 33+31 = 64; 64/32 = 2). The spaces are left for older processors that can't execute instructions out of order: putting unrelated instructions there breaks up the dependencies between consecutive instructions. Agner Fog's optimization guide would be a good read for you.
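
The division examples and the bit-5 test say the same thing, which a short C sketch can make concrete. The function names are illustrative only, and the division version assumes x is small enough that x + 31 does not wrap around 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* The division phrasing: does adding 31 push x into the next block of 32? */
bool quotient_changes(uint32_t x)
{
    return (x + 31u) / 32u != x / 32u;
}

/* The lea / xor / and phrasing: did bit 5 change after adding 31? */
bool bit5_flips(uint32_t x)
{
    return (((x + 31u) ^ x) & 32u) != 0;
}
```

For x = 32 both return false (63 is still in the same block of 32); for x = 33 both return true (64 starts the next block, and bit 5 goes from 1 to 0).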

Qweerdy, [ebx] should not be accessed in the loop - the number of loops needed can be calculated. I didn't assume ECX is zero on entry to your snippet. This could be trimmed up a little if the left side is always aligned.
lea eax, [ecx+32-1]
and ecx, 32-1
and eax, -32
lea esi, [esi + ecx*4]
sub eax, [ebx]
neg eax
rep movsd
mov ecx, 32
sub esi, 128
sub eax, 32
jnc CellLoop
; do right unaligned dwords
add ecx, eax
rep movsd
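
The general idea of computing the counts up front and copying in blocks can be sketched in C as below. This is not a translation of the snippet above; copy_in_chunks and its parameters are hypothetical, and it assumes a 32-dword source row that is reread from the start at every multiple of 32.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy positions [start, limit) from a repeating
   32-dword row, one block copy per chunk instead of one dword per pass. */
void copy_in_chunks(const uint32_t *row, uint32_t *dst, uint32_t start, uint32_t limit)
{
    uint32_t pos = start;
    while (pos < limit) {
        uint32_t off = pos & 31u;        /* offset inside the row          */
        uint32_t n = 32u - off;          /* dwords up to the next boundary */
        if (n > limit - pos)
            n = limit - pos;             /* trailing partial chunk         */
        memcpy(dst, row + off, n * sizeof *row);  /* plays the role of rep movsd */
        dst += n;
        pos += n;
    }
}
```

The first chunk handles an unaligned left side, the middle chunks are full 32-dword copies, and the last chunk covers whatever is left on the right.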
Posted on 2002-11-17 11:41:38 by bitRAKE
I'm sorry, but I couldn't get your snippet to work :(

Since you've apparently already downloaded the complete source from my website, could you please post the whole proc?
Posted on 2002-11-17 12:08:15 by Qweerdy
Thanks, bitRAKE and Qweerdy.
I understand now.

P.S. Where can I learn these tricks about binary math?
Posted on 2002-11-18 13:57:37 by clippy

P.S. Where can I learn these tricks about binary math?

I learnt from my first ASM book and I use binary operations all the time. There really aren't any tricks - it looks that way sometimes, but it is just experience.
Posted on 2002-11-18 14:19:02 by bitRAKE