Hi,

I have 4 values in 4 registers. I would like to test if at least one is set to zero using only one conditional jump. I have this method :



xor reg0, reg0
sub reg1, 1
adc reg0, 0
sub reg2, 1
adc reg0, 0
sub reg3, 1
adc reg0, 0
sub reg4, 1
adc reg0, 0
jnz ZeroFound


It works but maybe there is a better solution. Thanks for your help.

(edited after bitRake's correction)
Posted on 2002-07-23 14:08:32 by Dr. Manhattan
(this doesn't work :o )
and reg0, reg2
and reg1, reg3
and reg0, reg1
je ZeroFound
Posted on 2002-07-23 14:21:59 by bitRAKE
    mov     eax, 0

mov ebx, 1
mov ecx, 1
mov edx, 2
xor esi, esi
inc esi
test eax, eax
cmovz esi, eax
test ebx, ebx
cmovz esi, ebx
test ecx, ecx
cmovz esi, ecx
test edx, edx
cmovz esi, edx
test esi, esi
jnz @F

;Zero Detected

invoke MessageBox, 0, 0, 0, 0

@@:
A rather slow one, but at least we have a variation. ;)
Posted on 2002-07-23 14:32:32 by stryker
Thanks for the answer.
stryker, your method works but I'd rather not use the cmov instruction.
bitRake, your method doesn't work. Try reg1 = 001h and reg3 = 010h for example. I made the same mistake :)
Posted on 2002-07-23 15:12:46 by Dr. Manhattan
Dr. Manhattan, back to the drawing board...
How will your method work?
'sub reg,0' will never set carry flag.
Posted on 2002-07-23 17:00:59 by bitRAKE
You're right, it's sub reg, 1.
Posted on 2002-07-23 17:10:21 by Dr. Manhattan
	xor reg0, reg0

cmp reg1, 1
adc reg0, reg0
cmp reg2, 1
adc reg0, reg0
cmp reg3, 1
adc reg0, reg0
cmp reg4, 1
adc reg0, reg0
jnz ZeroFound
This one is kind of cool in that is doesn't destroy the register values, and you can see which registers were zero, and four bytes smaller. :) I couldn't think of anything else right now, but I know I've seen this problems before - just can't remember the other solution.

You could save an instruction with:
	cmp reg1, 1

sbb reg0, reg0
cmp reg2, 1
adc reg0, reg0
cmp reg3, 1
adc reg0, reg0
cmp reg4, 1
adc reg0, reg0
jnz ZeroFound
...big deal. :grin:
Posted on 2002-07-23 18:42:23 by bitRAKE
    ;Test Cases


mov esi, 10
mov ebx, 10
mov ecx, 0
mov edx, 10

;Core Algo

xor eax, eax
test esi, esi
setz al
shl eax, 8
test ebx, ebx
setz al
shl eax, 8
test ecx, ecx
setz al
shl eax, 8
test edx, edx
setz al
and eax, 00000001000000010000000100000001b
jz @F

;Zero Detected

invoke MessageBox, 0, 0, 0, 0

@@:
Another one. :)
Posted on 2002-07-23 19:19:16 by stryker
Might be some penalties using al/eax like that? :tongue:
Posted on 2002-07-23 20:48:32 by bitRAKE
Yep! :grin: Anyway, that was just a variation, besides set?? is "slow" for me. We'll just use your version, it's shorter and faster. :tongue:
Posted on 2002-07-23 20:52:34 by stryker
Thanks for your help :alright:
Posted on 2002-07-23 23:55:25 by Dr. Manhattan


Having 4 conditional jumps would be considerably more efficient.

I discussed this in detail in several posts, even recently.
Posted on 2002-07-24 02:12:56 by Maverick
If you need one more variation (ecx,edx,esi,edi are values to be tested):


or ecx,ecx
lahf
xchg ebx,eax
or edx,edx
lahf
or bh,ah
or esi,esi
lahf
or bh,ah
or edi,edi
lahf
or bh,ah
test bh,1000000b
jnz zero_detected

(24 bytes)

Using different methods in your code can always give more fun for someone who will be debugging it... ;)
Posted on 2002-07-24 06:13:42 by Tomasz Grysztar



Having 4 conditional jumps would be considerably more efficient.

I discussed this in detail in several posts, even recently.
I could argue a case for the one branch method, but I wont. :tongue:
Posted on 2002-07-24 08:13:40 by bitRAKE
I guess it's when no branch is taken (so all the instructions have to be executed anyway).

Well, I could argue that you can omit the whole routine then. ;)

So, to be serious (as I like to be in these issues) we should study the specific case statistically, and then produce the optimal code for that specific case. I recall I wrote these words even when I mentioned that the 4 branches in line were averagely much better, in those posts I mentioned.

Anyway, happy coding.. and let's not waste days on 10 lines of asm coding, or at the end of our life we'll have produced less than half a program. :grin:
Posted on 2002-07-24 11:07:47 by Maverick
To estimate truth in the discussion more or less
precisely we need additional data to calculate
diviation,probability and aproximation through
probability in given conditions and enviroment.

But it's clear without even additional data that Meveric is right
about 4 test 4 jumps in:
1. Size.
If label within near jmp then size may be (with 4 test and 4 jcc)
4*(2+2)=16
If label out of near jmp then 3 jcc near + 1 short solution may be done
with result of 3*2(near jcc)+5(short jcc)+2*4(tests)=19
if label zerocase out of near jcc


test reg1,reg1
je pzero ;near 2b
test reg2,reg2
je pzero ;near 2b
test reg3,reg3
je pzero ;near 2b
test reg4,reg4
pzero:
je casezero ;short 5b

Author of the first post didn't say about size of reg and if label is near
but anyway with 32 bit regs
its 3*4(cmp reg,1)+3*4(adc reg,0)+2(xor)+{2,5}(near or short jcc) = {28,31}
2. Speed
-for any value of additional data if processor <= PMMX
-for any value and any processor if first time execution
-for any value and any processor if predicted as not taken
-for any value and any processor if none of regs is zero (predicted of misspredicted)

Additional data we need is
1. Processor (to estimate penalty value)
2. Environment (to calculate probability of any possible case and make aproximations)

Now lets p is value of penalty for given processor
With PPlain and PMMX it's {4,5}
Let's assume max penalty 5.

Then time of execution:
Case predicted as not taken:
OneJccAlgo = 9
4JccAlgo = 4
9-4=5 in favor of 4JccAlgo

Case OneJccAlgo misspredicted as taken(4JccAlgo will be misspredicted anyway then):
OneJccAlgo = 9+p
4JccAlgo = 4+p
(9+p)-(4+p) = 5 in favor of 4JccAlgo

For processor <=PMMX when OneJccAlgo jcc is predicted but 4JccAlgo is misspredicted.
It's possible if in interatin at least one of regs was 0, and in iteration one of regs is 0
but it's different reg
so regx<>regx and reg is not 0 in iteration

where regx is register that was cought as 0 in 4JccAlgo in iteration and then
in iteration was changed to not zero value,
and reg is the first of registers that have zero value in iteration

Then:
OneJccAlgo=9
4JccAlgo=i+p where i=1based index of reg {1,2,3,4}
if processor <=PMMX then p(max)=5 (maximum pipeline flashed)
so i+p=5+{1,2,3,4)={6,7,8,9}
You can see that even in the worst case 4JccAlgo = OneJccAlgo, for the rest 3 cases it's shorter
even though 4JccAlgo has penalty and OneJccAlgo doesn't.

For processor > PMMX penalty is from 10 to 20 and we need aditional data to calculate
probability of the last case execution.
Then we can aproximate advantages of 4JccAlgo in not last case execution and make table of
diviation between the two algos.

And I might say that we need to have very special conditions in wich OneJccAlgo will be more or
at least equally effective as 4JccAlgo.
But let us not speculate about it, there are no such thing as "normaly" discussing probability of
unknown task, when task is unknown then probability is also unknown.

So for further discussion let's ask the author of the first post if he meant some particular task.
Posted on 2002-07-25 16:06:34 by The Svin
Hi The Svin, since you're not new to this kind of utilities, another good project would be to collect these "frequently asked code snippets" and make a program that, for each of them, with the help of additional specific options, given the target CPU and some "statistical info", produces the most optimized asm code snippet, to be included in one's program. If this utility/program was modular (e.g. plug-ins based, with a solid and standardized common interface) then this could even become a big, long term project to which many of us could contribute. It would be a good development tool for every one that wants very optimized asm code. I suggest though to not count cycles on paper, but to always profile real situations, since e.g. cache issues can have a very dramatic effect that may not be easily spotted on paper.

PS: just to know: have you received my PM (sent also in email) dated 21st July?
Posted on 2002-07-26 08:06:45 by Maverick
Another pointless way of doing it with only 1 conditional jump. ;)



push 1
push reg0
push reg1
push reg2
push reg3
mov edi, esp
xor eax, eax
mov ecx, 05h
cld
repnz scasd
add esp, 14h
or ecx, ecx
jz no_regs_are_zero
Posted on 2002-07-26 11:16:05 by iblis
Thank you all for your suggestions. I need this test for a Connect-4 game. It's the game where the winner is the first player who aligns 4 square horizontally, diagonally or vertically.
The question was about the horizontal test. The row to test is stored in a dword with 2 bit for each square. I want to test if there are 4 aligned squares for a player 'p'. In this case there are only four possibilities :

ppppxxx
xppppxx
xxppppx
xxxpppp

So I copy the row 'abcdefg' in four registers. Then I xor each register with the mask 'ppppppp'.

reg1 = abcdefg XOR ppppppp
reg2 = abcdefg XOR ppppppp
reg3 = abcdefg XOR ppppppp
reg4 = abcdefg XOR ppppppp

Then I do :

reg1 = reg1 AND 1111000
reg2 = reg2 AND 0111100
reg3 = reg3 AND 0011110
reg4 = reg4 AND 0001111

Now player 'p' wins if at least one register is zero.
I did this way to avoid dependencies and jump. It seems complicated but I don't know branch prediction very well. I've read there can be a big penalty in case of misprediction so I only try to avoid jumps. Here is the actual code before the test :




; row (abcdefg) in eax
; player mask (ppppppp) in edi

mov ecx, eax
mov ebx, eax
mov edx, eax

xor eax, edi
xor ecx, edi
xor ebx, edi
xor edx, edi
xor ebp, ebp

and eax, 011111111b
and ecx, 011111111b SHL 2
and ebx, 011111111b SHL 4
and edx, 011111111b SHL 6
Posted on 2002-07-26 13:41:20 by Dr. Manhattan
Well, in that case, I'd keep the player data separate - this way you can use the same algorithm for both players and not have to use a player mask. Also, the typical board is 7x6, which means an MMX version of the algo could be designed with all the possitions in one register. Now reverse the state of the bits - so that one means no piece and zero means a piece is at that location. Now you can just use a series of TEST instruction for all possiblities:
; EAX:EDX = 7x6 board

test eax, 1111y
je _0
test eax, 11110y
je _1
test eax, 111100y
je _2
test eax, 1111000y
je _3
test eax, 11110000000y
je _4

; etc...
This could be done with macros to generate the code. ;)

4x6 horizontal matches
3x7 vertical matches
12x2 diagonal matches
=69 tests (most will be on one register)

Also, the bits could be aranged non-regularly to speed up the exclusion of possible matches. Test only needs to be made of player who just placed a piece, or both if loading a game is allowed.

This will be fastest because all jumps are predicted not taken and there is no access to memory. The computer could easily test all possible combinations in under a sec, but you'll most likely want mulitple levels of look-ahead for the CPU player.
Posted on 2002-07-26 15:42:25 by bitRAKE