I'm working on a knight tours program, and I need to count 0 bytes in a table of 36 integers (or chars, if you prefer).
Here is the C code:

count=!t[0]+!t[1]+!t[2]+!t[3]+!t[4]+!t[5]+!t[6]+!t[7]+!t[8]+!t[9]+!t[10]+!t[11]+!t[12]+!t[13]+!t[14]+!t[15]+!t[16]+!t[17]+!t[18]+!t[19]+!t[20]+!t[21]+!t[22]+!t[23]+!t[24]+!t[25]+!t[26]+!t[27]+!t[28]+!t[29]+!t[30]+!t[31]+!t[32]+!t[33]+!t[34]+!t[35];

t[] may vary from -16 to +8

The Intel compiler generates a terrible code, like this (I cleaned it up):

xor ebx,ebx
xor edx, edx
mov eax,
sete bl
mov eax,
test eax,eax
sete dl
add ebx,edx
etc...

Since I need a very fast program, what is the fastest way ?

JC
Posted on 2003-04-03 05:23:13 by MCoder
You mean that it is 8bits? Okay then.



xor edx,edx
lea esi, t
mov ecx,9
_loop:
lodsd
test eax,eax
jnz @F
inc edx
@@:
test eax,0FFFFFFh
jnz @F
inc edx
@@:
test eax,0FFFFh
jnz @F
inc edx
@@:
test eax,0FFh
jnz @F
inc edx
@@:
dec ecx
jnz _loop


Result is stored in edx. Hopefully it works
made a mistake
Posted on 2003-04-03 05:47:27 by roticv
Wait it seems to me that t is an arrary of 32bit ... hmm.. so



xor edx,edx
mov ecx,36
lea esi, t
_loop:
lodsd
test eax,eax
jnz @F
inc edx
@@:
add esi,4
dec ecx
jnz _loop
Posted on 2003-04-03 05:52:03 by roticv
Roticv, thanks for the reply, but I doubt replacing a routine by a loop is faster, or am I wrong ?

Your trick about loading 4 bytes in eax is simple, but fits well on my problem !

Does using branches is faster than using set ?

JC
Posted on 2003-04-03 07:17:48 by MCoder
Speed I am not sure, but definitely size. I would rather go into branches to save space. Hopefully the other experts can answer your question, I am still an intermediate.
Posted on 2003-04-03 07:55:33 by roticv
Roticv:
The lodsd already increases ESI by 4. You should not increase ESI by another 4 in the loop.

MCoder:
You may want to try the following:

lea edi,t
xor edx,edx
mov ecx,36
mov al,0

@@:
repnz scasb
jnz finish
inc edx
or ecx,ecx
jnz @B
finish:
;the count of 0's is in edx

Just ask if you need some explanation for the suggested code.

Raymond
Posted on 2003-04-03 11:17:24 by Raymond


lea eax, Table
mov edx, 36
xor ecx, ecx

@@:
cmp DWORD PTR [eax + edx*4 - 4], 1
adc ecx, 0
dec edx
jnz @B



Mirno
Posted on 2003-04-03 12:34:49 by Mirno
MCoder:

Mirno's post made me look at your original post in more detail.

You had mentionned that your table had 36 integers (or chars, if you prefer). Since "char" is usually interpreted as meaning a "byte", I used that interpretation for my previous posting.

However, looking at the code generated by your C code compiler, it would appear that the compiler is treating the table as an array of dwords!

Which one is right?????

If they are dwords, then my posted code would need to be altered accordingly.

mov al,0 ==> xor eax,eax
scasb ==> scasd

Raymond
Posted on 2003-04-03 13:08:24 by Raymond
Raymond, Mirno and Roticv, thank you !

I simply needed to count 0 bytes (or dwords, if faster) in an array.
All the tricks you posted are elegant, and it's great to see that so many different ideas have been submitted.

I finally managed to compute the number of zeroes incrementally, so I've solved my problem.

I'm working on a magical knight tour program (you put one chess knight on a chessboard, and you try to jump and reach all squares and filling every square with the number of the jump, and when the board is full of numbers, you compute every sum of horizontal and vertical lines to verify that they match the magic sum 260), and I hope to finish it by the end of this week.
The problem is that I have to run the program on a distributed network, so it has to be EXTREMELY optimized because the total run time of the program is expected to be around one year of P4 1Ghz. In this case, saving a couple of cycles saves a couple of weeks of computation !

I realized that a second look is VERY important, so I'll post my full code here, to get your advice about possible assembly optimizations I could have missed (I'm currently optimizing my algorithm).

Thanks again !

JC
Posted on 2003-04-03 14:33:44 by MCoder
quick testy stuff while drinking beers ^_^
made for bytes (your range is well within byte range, so let's stick with those).
have fun, this can probably be optimized even more.
Posted on 2003-04-03 14:41:57 by f0dder
should be noted that the stuff is done on 32 bytes, as that divides evently with 8... should be trivial to add the last couple bytes. New version some more stuff. And some results, from my 2.53ghz P4 that ran a whole lot of other stuff at the same time:



cz1 - simple C code
cz2 - TryToBeClever C code
cz3 - mirno SimpleLoop
cz4 - MMX 1 (with *cough* help from scali)
cz5 - MMX 2 (butchered from scali)
cz6 - scali pplain
cz7 - scali pplain (SLOW ANSI C)

cz1 cz2 cz3 cz4 cz5 cz6 cz7
run 1: (000266)(000468)(000313)(000062)(000063)(000172)(000125)
run 2: (000266)(000469)(000312)(000063)(000062)(000172)(000125)
run 3: (000250)(000453)(000312)(000078)(000063)(000156)(000109)


included the .objs in the .zip this time so you can disasm if you want to. compiled with vs.net, wonder how intel c++ v7 would do.
Posted on 2003-04-03 15:12:49 by f0dder
compiled with intel c++ compiler v7. Plain ANSI C version is down to 78 ticks on my box now, too.

Somebody find a faster algorithm :)
Posted on 2003-04-03 15:35:39 by f0dder
forgot the attachment
Posted on 2003-04-03 15:36:13 by f0dder
A tip for the knight's tour problem:
Jump to the square where the number of jumps after that is as small as possible first. Doing so, you will fill the whole board in O(N^2) where N is 8 on a standard chess board. I don't know if the magic sum will be 260 though but I guess that it's just to continue recursivly to the squares which has more jumps.
Posted on 2003-04-03 23:45:55 by gliptic
Nice work Mirno,

You made used to CF flag and adc to remove the need for a branch.
Posted on 2003-04-04 00:07:34 by roticv
Mirnos solution is short and elegant, but slower than what vc7 and icl7 generate from na?ve C++ code :)
Posted on 2003-04-04 00:28:58 by f0dder
Intel P3 - 800 mhz, 256 mb SDRAM
[size=9]BITS 32


SECTION .text
EXTERN _table
GLOBAL _cz8

ALIGN 32
_cz8:

movq MM0, [_table]
movq MM1, [_table+8]
movq MM2, [_table+16]
movq MM3, [_table+24]
pxor MM4, MM4

pcmpeqb MM0, MM4
pcmpeqb MM1, MM4
pcmpeqb MM2, MM4
pcmpeqb MM3, MM4

pcmpeqb MM4, MM4

psubb MM0, MM4
psubb MM1, MM4
psubb MM2, MM4
psubb MM3, MM4

paddd MM0, MM1
paddd MM2, MM3
paddd MM0, MM2

movq MM1, MM0
psrlq MM0, 32
paddb MM0, MM1

movq MM1, MM0
psrlq MM0, 16
paddb MM0, MM1

movq MM1, MM0
psrlq MM0, 8
paddb MM0, MM1

movd eax, MM0
and eax, 0FFh
sub eax, 32
neg eax

ret[/size]
Posted on 2003-04-04 02:10:13 by arkane
Like what they would say, someone is bound to come up with MMX code that would do things in a shorter time and in an unknown lanugage :grin:
Posted on 2003-04-04 02:14:42 by roticv
arkane, I'll check your stuff when I get home - looks promising. Which compiler did you use?
Posted on 2003-04-04 02:17:02 by f0dder
umm VC 6.

I tried to design that so you can have 2 types of answers by removing the two instructions.




here's the output of cz_izl7 (attached on previous post)...

E:\Temp\cz_icl7>test
doing tests...done
(000741 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000751 ticks) - cz3 - mirno SimpleLoop
(000150 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000160 ticks) - cz5 - MMX 2 (butchered from scali)
(000511 ticks) - cz6 - scali pplain
(000280 ticks) - cz7 - scali pplain (SLOW ANSI C)

E:\Temp\cz_icl7>test
doing tests...done
(000751 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000741 ticks) - cz3 - mirno SimpleLoop
(000161 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000150 ticks) - cz5 - MMX 2 (butchered from scali)
(000501 ticks) - cz6 - scali pplain
(000280 ticks) - cz7 - scali pplain (SLOW ANSI C)

E:\Temp\cz_icl7>test
doing tests...done
(000741 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000741 ticks) - cz3 - mirno SimpleLoop
(000160 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000150 ticks) - cz5 - MMX 2 (butchered from scali)
(000511 ticks) - cz6 - scali pplain
(000271 ticks) - cz7 - scali pplain (SLOW ANSI C)
I have a weird computer. :grin:
Posted on 2003-04-04 02:21:19 by arkane