This code subtracts a fixed number (32*4) from esi when ecx is a multiple of 32. Any ideas on how to optimize this? I'm looking for a way to remove the jump.

EDIT: And I have only eax available as a temp register....

```
test ecx,32-1
jnz @F
sub esi,32*4
@@:
```

```
lea eax, [ecx + 32 - 1]
; (space for instruction)
xor eax, ecx
; (space for instruction)
and eax, 32
; (space for instruction)
lea esi, [esi + eax*4][-128]
```

EAX is zero only in the case that ECX is a multiple of 32. ;)

Ok, I've made the changes. Now my inner loop looks like this:

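```
CellLoop:
mov eax,[esi]               ; load source dword
inc ecx                     ; advance the counter
mov [edi],eax               ; store to destination
lea eax,[ecx+32-1]          ; eax = ecx + 31
add esi,4
xor eax,ecx                 ; bits changed by the add
add edi,4
and eax,32                  ; 0 if ecx is a multiple of 32, else 32
cmp ecx,[ebx]               ; sets flags for jne (lea leaves flags alone)
lea esi,[esi+eax*4][-128]   ; esi -= 128 only when ecx is a multiple of 32
jne CellLoop
```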

Is this optimal?

Great piece of work btw, I couldn't have figured that out myself... wow.

bitRake, Qweerdy,

Treat me as a newbie on this one.

bitRake, how does your algo work? And why do you leave spaces for instructions in between?

Also, how does "test ecx,32-1" check ecx for being a multiple of 32?

Whatever I am asking may sound completely stupid, but I am pretty new to asm.

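```
1100000 = 32 * 3 in binary
 100000 = 32 in binary
  11111 = 32-1 in binary
```

So if the lower 5 bits of a number are clear, the number is a multiple of 32. "test ecx,32-1" ANDs ecx with that mask without storing the result, so the zero flag ends up set exactly when those 5 bits are clear.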
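For anyone who wants to try the mask test outside an assembler, here is a small C sketch of the same check. The function names are made up for illustration and are not from the thread. It compares the AND-with-31 test against the remainder operator over a range of values.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero when x is a multiple of 32, i.e. its low five bits are all clear.
   This mirrors "test ecx,32-1" followed by a check of the zero flag.
   (Hypothetical helper name, for illustration only.) */
int is_multiple_of_32(uint32_t x)
{
    return (x & (32u - 1u)) == 0;
}

/* Compare the mask test against the remainder operator for a range of values. */
void check_mask_test(void)
{
    for (uint32_t x = 0; x < 4096u; ++x)
        assert(is_multiple_of_32(x) == (x % 32u == 0));
}
```

Calling check_mask_test() from a main() should pass silently. The same idea works for any power of two: mask with (n - 1) instead of dividing by n.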

**gladiator**, it is binary math. It works by checking whether bit 5 changes when 31 is added. If bit 5 changes, then ECX is not a multiple of 32 (e.g. 32/32 = 1; 32+31 = 63; 63/32 = 1; but 33/32 = 1; 33+31 = 64; 64/32 = 2). The spaces are left for older processors that can't execute instructions out of order: unrelated instructions can go there, which eliminates dependencies between adjacent instructions. Agner Fog's optimization guide would be a good read for you.
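The bit-5 argument can also be checked mechanically. Below is a rough C model of the four-instruction sequence next to the original test/jnz/sub version. It is only a sketch: the function names are invented here, and esi is modelled as a plain 32-bit integer rather than a real pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version with the branch: subtract 32*4 when count is a
   multiple of 32 (models test ecx,32-1 / jnz / sub esi,32*4). */
uint32_t adjust_with_branch(uint32_t ptr, uint32_t count)
{
    if ((count & 31u) == 0)
        ptr -= 32u * 4u;
    return ptr;
}

/* Branchless version, one statement per instruction of the snippet. */
uint32_t adjust_branchless(uint32_t ptr, uint32_t count)
{
    uint32_t t = count + 31u;    /* lea eax,[ecx+32-1]                     */
    t ^= count;                  /* xor eax,ecx : bits changed by the add  */
    t &= 32u;                    /* and eax,32  : keep only bit 5          */
    return ptr + t * 4u - 128u;  /* lea esi,[esi+eax*4][-128]              */
}

/* Both versions should agree for every count in the tested range. */
void check_adjust(void)
{
    for (uint32_t count = 0; count < 100000u; ++count)
        assert(adjust_branchless(0x10000u, count) ==
               adjust_with_branch(0x10000u, count));
}
```

When count is not a multiple of 32, adding 31 carries into bit 5, so t becomes 32 and the +128 cancels the -128. When count is a multiple of 32, there is no carry, t is 0, and only the -128 remains.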

**Qweerdy**, [ebx] should not be accessed in the loop - the number of loops needed can be calculated. I didn't assume ECX is zero on entry to your snippet. This could be trimmed up a little if the left side is always aligned.

```
lea eax, [ecx+32-1]
and ecx, 32-1
and eax, -32
lea esi, [esi + ecx*4]
sub eax, [ebx]
neg eax
CellLoop:
rep movsd
mov ecx, 32
sub esi, 128
sub eax, 32
jnc CellLoop
; do right unaligned dwords
add ecx, eax
rep movsd
```

I'm sorry, but I couldn't get your snippet to work :(

Since you've apparently already downloaded the complete source from my website, could you please post the whole proc?

Thanks, bitRake and Qweerdy.

I understand now.

P.S. Where can I learn these tricks about binary math?

http://www.math.grin.edu/~rebelsky/Courses/152/97F/Readings/student-binary.html

http://www.learnbinary.com/Binary2Dec.html

I learnt from my first ASM book and I use binary operations all the time. There really aren't any tricks - it looks that way sometimes, but it is just experience.