Hi,

I have 4 values in 4 registers. I would like to test if at least one is set to zero using only one conditional jump. I have this method :

It works but maybe there is a better solution. Thanks for your help.

(edited after bitRake's correction)

I have 4 values in 4 registers. I would like to test if at least one is set to zero using only one conditional jump. I have this method :

```
```

xor reg0, reg0

sub reg1, 1

adc reg0, 0

sub reg2, 1

adc reg0, 0

sub reg3, 1

adc reg0, 0

sub reg4, 1

adc reg0, 0

jnz ZeroFound

It works but maybe there is a better solution. Thanks for your help.

(edited after bitRake's correction)

(this doesn't work :o )

and reg0, reg2

and reg1, reg3

and reg0, reg1

je ZeroFound

and reg0, reg2

and reg1, reg3

and reg0, reg1

je ZeroFound

```
mov eax, 0
```

mov ebx, 1

mov ecx, 1

mov edx, 2

xor esi, esi

inc esi

test eax, eax

cmovz esi, eax

test ebx, ebx

cmovz esi, ebx

test ecx, ecx

cmovz esi, ecx

test edx, edx

cmovz esi, edx

test esi, esi

jnz @F

;Zero Detected

invoke MessageBox, 0, 0, 0, 0

@@:

A rather slow one, but at least we have a variation. ;)Thanks for the answer.

stryker, your method works but I'd rather not use the cmov instruction.

bitRake, your method doesn't work. Try reg1 = 001h and reg3 = 010h for example. I made the same mistake :)

stryker, your method works but I'd rather not use the cmov instruction.

bitRake, your method doesn't work. Try reg1 = 001h and reg3 = 010h for example. I made the same mistake :)

**Dr. Manhattan**, back to the drawing board...

How will your method work?

'sub reg,0' will never set carry flag.

You're right, it's sub reg, 1.

```
xor reg0, reg0
```

cmp reg1, 1

adc reg0, reg0

cmp reg2, 1

adc reg0, reg0

cmp reg3, 1

adc reg0, reg0

cmp reg4, 1

adc reg0, reg0

jnz ZeroFound

This one is kind of cool in that is doesn't destroy the register values, and you can see which registers were zero, and four bytes smaller. :) I couldn't think of anything else right now, but I know I've seen this problems before - just can't remember the other solution.
You could save an instruction with:

```
cmp reg1, 1
```

sbb reg0, reg0

cmp reg2, 1

adc reg0, reg0

cmp reg3, 1

adc reg0, reg0

cmp reg4, 1

adc reg0, reg0

jnz ZeroFound

...big deal. :grin:```
;Test Cases
```

mov esi, 10

mov ebx, 10

mov ecx, 0

mov edx, 10

;Core Algo

xor eax, eax

test esi, esi

setz al

shl eax, 8

test ebx, ebx

setz al

shl eax, 8

test ecx, ecx

setz al

shl eax, 8

test edx, edx

setz al

and eax, 00000001000000010000000100000001b

jz @F

;Zero Detected

invoke MessageBox, 0, 0, 0, 0

@@:

Another one. :)Might be some penalties using al/eax like that? :tongue:

Yep! :grin: Anyway, that was just a variation, besides set?? is "slow" for me. We'll just use your version, it's shorter and faster. :tongue:

Thanks for your help :alright:

Having 4 conditional jumps would be considerably more efficient.

I discussed this in detail in several posts, even recently.

If you need one more variation (ecx,edx,esi,edi are values to be tested):

(24 bytes)

Using different methods in your code can always give more fun for someone who will be debugging it... ;)

```
```

or ecx,ecx

lahf

xchg ebx,eax

or edx,edx

lahf

or bh,ah

or esi,esi

lahf

or bh,ah

or edi,edi

lahf

or bh,ah

test bh,1000000b

jnz zero_detected

(24 bytes)

Using different methods in your code can always give more fun for someone who will be debugging it... ;)

Having 4 conditional jumps would be considerably more efficient.

I discussed this in detail in several posts, even recently.

I guess it's when no branch is taken (so all the instructions have to be executed

Well, I could argue that you can omit the whole routine then. ;)

So, to be serious (as I like to be in these issues) we should study the specific case statistically, and then produce the optimal code for that specific case. I recall I wrote these words even when I mentioned that the 4 branches in line were averagely much better, in those posts I mentioned.

Anyway, happy coding.. and let's not waste days on 10 lines of asm coding, or at the end of our life we'll have produced less than half a program. :grin:

*anyway*).Well, I could argue that you can omit the whole routine then. ;)

So, to be serious (as I like to be in these issues) we should study the specific case statistically, and then produce the optimal code for that specific case. I recall I wrote these words even when I mentioned that the 4 branches in line were averagely much better, in those posts I mentioned.

Anyway, happy coding.. and let's not waste days on 10 lines of asm coding, or at the end of our life we'll have produced less than half a program. :grin:

To estimate truth in the discussion more or less

precisely we need additional data to calculate

diviation,probability and aproximation through

probability in given conditions and enviroment.

But it's clear without even additional data that Meveric is right

about 4 test 4 jumps in:

1. Size.

If label within near jmp then size may be (with 4 test and 4 jcc)

4*(2+2)=16

If label out of near jmp then 3 jcc near + 1 short solution may be done

with result of 3*2(near jcc)+5(short jcc)+2*4(tests)=19

if label zerocase out of near jcc

Author of the first post didn't say about size of reg and if label is near

but anyway with 32 bit regs

its 3*4(cmp reg,1)+3*4(adc reg,0)+2(xor)+{2,5}(near or short jcc) = {28,31}

2. Speed

-for any value of additional data if processor <= PMMX

-for any value and any processor if first time execution

-for any value and any processor if predicted as not taken

-for any value and any processor if none of regs is zero (predicted of misspredicted)

Additional data we need is

1. Processor (to estimate penalty value)

2. Environment (to calculate probability of any possible case and make aproximations)

Now lets p is value of penalty for given processor

With PPlain and PMMX it's {4,5}

Let's assume max penalty 5.

Then time of execution:

Case predicted as not taken:

OneJccAlgo = 9

4JccAlgo = 4

9-4=5 in favor of 4JccAlgo

Case OneJccAlgo misspredicted as taken(4JccAlgo will be misspredicted anyway then):

OneJccAlgo = 9+p

4JccAlgo = 4+p

(9+p)-(4+p) = 5 in favor of 4JccAlgo

For processor <=PMMX when OneJccAlgo jcc is predicted but 4JccAlgo is misspredicted.

It's possible if in interatin at least one of regs was 0, and in iteration one of regs is 0

but it's different reg

so regx<>regx and reg is not 0 in iteration

where regx is register that was cought as 0 in 4JccAlgo in iteration and then

in iteration was changed to not zero value,

and reg is the first of registers that have zero value in iteration

Then:

OneJccAlgo=9

4JccAlgo=i+p where i=1based index of reg {1,2,3,4}

if processor <=PMMX then p(max)=5 (maximum pipeline flashed)

so i+p=5+{1,2,3,4)={6,7,8,9}

You can see that even in the worst case 4JccAlgo = OneJccAlgo, for the rest 3 cases it's shorter

even though 4JccAlgo has penalty and OneJccAlgo doesn't.

For processor > PMMX penalty is from 10 to 20 and we need aditional data to calculate

probability of the last case execution.

Then we can aproximate advantages of 4JccAlgo in not last case execution and make table of

diviation between the two algos.

And I might say that we need to have very special conditions in wich OneJccAlgo will be more or

at least equally effective as 4JccAlgo.

But let us not speculate about it, there are no such thing as "normaly" discussing probability of

unknown task, when task is unknown then probability is also unknown.

So for further discussion let's ask the author of the first post if he meant some particular task.

precisely we need additional data to calculate

diviation,probability and aproximation through

probability in given conditions and enviroment.

But it's clear without even additional data that Meveric is right

about 4 test 4 jumps in:

1. Size.

If label within near jmp then size may be (with 4 test and 4 jcc)

4*(2+2)=16

If label out of near jmp then 3 jcc near + 1 short solution may be done

with result of 3*2(near jcc)+5(short jcc)+2*4(tests)=19

if label zerocase out of near jcc

```
```

test reg1,reg1

je pzero ;near 2b

test reg2,reg2

je pzero ;near 2b

test reg3,reg3

je pzero ;near 2b

test reg4,reg4

pzero:

je casezero ;short 5b

Author of the first post didn't say about size of reg and if label is near

but anyway with 32 bit regs

its 3*4(cmp reg,1)+3*4(adc reg,0)+2(xor)+{2,5}(near or short jcc) = {28,31}

2. Speed

-for any value of additional data if processor <= PMMX

-for any value and any processor if first time execution

-for any value and any processor if predicted as not taken

-for any value and any processor if none of regs is zero (predicted of misspredicted)

Additional data we need is

1. Processor (to estimate penalty value)

2. Environment (to calculate probability of any possible case and make aproximations)

Now lets p is value of penalty for given processor

With PPlain and PMMX it's {4,5}

Let's assume max penalty 5.

Then time of execution:

Case predicted as not taken:

OneJccAlgo = 9

4JccAlgo = 4

9-4=5 in favor of 4JccAlgo

Case OneJccAlgo misspredicted as taken(4JccAlgo will be misspredicted anyway then):

OneJccAlgo = 9+p

4JccAlgo = 4+p

(9+p)-(4+p) = 5 in favor of 4JccAlgo

For processor <=PMMX when OneJccAlgo jcc is predicted but 4JccAlgo is misspredicted.

It's possible if in interatin at least one of regs was 0, and in iteration one of regs is 0

but it's different reg

so regx<>regx and reg is not 0 in iteration

where regx is register that was cought as 0 in 4JccAlgo in iteration and then

in iteration was changed to not zero value,

and reg is the first of registers that have zero value in iteration

Then:

OneJccAlgo=9

4JccAlgo=i+p where i=1based index of reg {1,2,3,4}

if processor <=PMMX then p(max)=5 (maximum pipeline flashed)

so i+p=5+{1,2,3,4)={6,7,8,9}

You can see that even in the worst case 4JccAlgo = OneJccAlgo, for the rest 3 cases it's shorter

even though 4JccAlgo has penalty and OneJccAlgo doesn't.

For processor > PMMX penalty is from 10 to 20 and we need aditional data to calculate

probability of the last case execution.

Then we can aproximate advantages of 4JccAlgo in not last case execution and make table of

diviation between the two algos.

And I might say that we need to have very special conditions in wich OneJccAlgo will be more or

at least equally effective as 4JccAlgo.

But let us not speculate about it, there are no such thing as "normaly" discussing probability of

unknown task, when task is unknown then probability is also unknown.

So for further discussion let's ask the author of the first post if he meant some particular task.

Hi

PS: just to know: have you received my PM (sent also in email) dated 21st July?

**The Svin**, since you're not new to this kind of utilities, another good project would be to collect these "frequently asked code snippets" and make a program that, for each of them, with the help of additional specific options, given the target CPU and some "statistical info", produces the most optimized asm code snippet, to be included in one's program. If this utility/program was modular (e.g. plug-ins based, with a solid and standardized common interface) then this could even become a big, long term project to which many of us could contribute. It would be a good development tool for every one that wants very optimized asm code. I suggest though to not count cycles on paper, but to always profile real situations, since e.g. cache issues can have a very dramatic effect that may not be easily spotted on paper.PS: just to know: have you received my PM (sent also in email) dated 21st July?

Another pointless way of doing it with only 1 conditional jump. ;)

```
```

push 1

push reg0

push reg1

push reg2

push reg3

mov edi, esp

xor eax, eax

mov ecx, 05h

cld

repnz scasd

add esp, 14h

or ecx, ecx

jz no_regs_are_zero

Thank you all for your suggestions. I need this test for a Connect-4 game. It's the game where the winner is the first player who aligns 4 square horizontally, diagonally or vertically.

The question was about the horizontal test. The row to test is stored in a dword with 2 bit for each square. I want to test if there are 4 aligned squares for a player 'p'. In this case there are only four possibilities :

ppppxxx

xppppxx

xxppppx

xxxpppp

So I copy the row 'abcdefg' in four registers. Then I xor each register with the mask 'ppppppp'.

reg1 = abcdefg XOR ppppppp

reg2 = abcdefg XOR ppppppp

reg3 = abcdefg XOR ppppppp

reg4 = abcdefg XOR ppppppp

Then I do :

reg1 = reg1 AND 1111000

reg2 = reg2 AND 0111100

reg3 = reg3 AND 0011110

reg4 = reg4 AND 0001111

Now player 'p' wins if at least one register is zero.

I did this way to avoid dependencies and jump. It seems complicated but I don't know branch prediction very well. I've read there can be a big penalty in case of misprediction so I only try to avoid jumps. Here is the actual code before the test :

The question was about the horizontal test. The row to test is stored in a dword with 2 bit for each square. I want to test if there are 4 aligned squares for a player 'p'. In this case there are only four possibilities :

ppppxxx

xppppxx

xxppppx

xxxpppp

So I copy the row 'abcdefg' in four registers. Then I xor each register with the mask 'ppppppp'.

reg1 = abcdefg XOR ppppppp

reg2 = abcdefg XOR ppppppp

reg3 = abcdefg XOR ppppppp

reg4 = abcdefg XOR ppppppp

Then I do :

reg1 = reg1 AND 1111000

reg2 = reg2 AND 0111100

reg3 = reg3 AND 0011110

reg4 = reg4 AND 0001111

Now player 'p' wins if at least one register is zero.

I did this way to avoid dependencies and jump. It seems complicated but I don't know branch prediction very well. I've read there can be a big penalty in case of misprediction so I only try to avoid jumps. Here is the actual code before the test :

```
```

; row (abcdefg) in eax

; player mask (ppppppp) in edi

mov ecx, eax

mov ebx, eax

mov edx, eax

xor eax, edi

xor ecx, edi

xor ebx, edi

xor edx, edi

xor ebp, ebp

and eax, 011111111b

and ecx, 011111111b SHL 2

and ebx, 011111111b SHL 4

and edx, 011111111b SHL 6

Well, in that case, I'd keep the player data separate - this way you can use the same algorithm for both players and not have to use a player mask. Also, the typical board is 7x6, which means an MMX version of the algo could be designed with all the possitions in one register. Now reverse the state of the bits - so that one means no piece and zero means a piece is at that location. Now you can just use a series of TEST instruction for all possiblities:

4x6 horizontal matches

3x7 vertical matches

12x2 diagonal matches

=69 tests (most will be on one register)

Also, the bits could be aranged non-regularly to speed up the exclusion of possible matches. Test only needs to be made of player who just placed a piece, or both if loading a game is allowed.

This will be fastest because all jumps are predicted not taken and there is no access to memory. The computer could easily test all possible combinations in under a sec, but you'll most likely want mulitple levels of look-ahead for the CPU player.

```
; EAX:EDX = 7x6 board
```

test eax, 1111y

je _0

test eax, 11110y

je _1

test eax, 111100y

je _2

test eax, 1111000y

je _3

test eax, 11110000000y

je _4

; etc...

This could be done with macros to generate the code. ;)
4x6 horizontal matches

3x7 vertical matches

12x2 diagonal matches

=69 tests (most will be on one register)

Also, the bits could be aranged non-regularly to speed up the exclusion of possible matches. Test only needs to be made of player who just placed a piece, or both if loading a game is allowed.

This will be fastest because all jumps are predicted not taken and there is no access to memory. The computer could easily test all possible combinations in under a sec, but you'll most likely want mulitple levels of look-ahead for the CPU player.