I'm working on a knight tours program, and I need to count 0 bytes in a table of 36 integers (or chars, if you prefer).
Here is the C code:
count=!t[0]+!t[1]+!t[2]+!t[3]+!t[4]+!t[5]+!t[6]+!t[7]+!t[8]+!t[9]+!t[10]+!t[11]+!t[12]+!t[13]+!t[14]+!t[15]+!t[16]+!t[17]+!t[18]+!t[19]+!t[20]+!t[21]+!t[22]+!t[23]+!t[24]+!t[25]+!t[26]+!t[27]+!t[28]+!t[29]+!t[30]+!t[31]+!t[32]+!t[33]+!t[34]+!t[35];
t[] may vary from -16 to +8
The Intel compiler generates a terrible code, like this (I cleaned it up):
xor ebx,ebx
xor edx, edx
mov eax,
sete bl
mov eax,
test eax,eax
sete dl
add ebx,edx
etc...
Since I need a very fast program, what is the fastest way ?
JC
Here is the C code:
count=!t[0]+!t[1]+!t[2]+!t[3]+!t[4]+!t[5]+!t[6]+!t[7]+!t[8]+!t[9]+!t[10]+!t[11]+!t[12]+!t[13]+!t[14]+!t[15]+!t[16]+!t[17]+!t[18]+!t[19]+!t[20]+!t[21]+!t[22]+!t[23]+!t[24]+!t[25]+!t[26]+!t[27]+!t[28]+!t[29]+!t[30]+!t[31]+!t[32]+!t[33]+!t[34]+!t[35];
t[] may vary from -16 to +8
The Intel compiler generates a terrible code, like this (I cleaned it up):
xor ebx,ebx
xor edx, edx
mov eax,
sete bl
mov eax,
test eax,eax
sete dl
add ebx,edx
etc...
Since I need a very fast program, what is the fastest way ?
JC
You mean that it is 8bits? Okay then.
Result is stored in edx. Hopefully it works
made a mistake
xor edx,edx
lea esi, t
mov ecx,9
_loop:
lodsd
test eax,eax
jnz @F
inc edx
@@:
test eax,0FFFFFFh
jnz @F
inc edx
@@:
test eax,0FFFFh
jnz @F
inc edx
@@:
test eax,0FFh
jnz @F
inc edx
@@:
dec ecx
jnz _loop
Result is stored in edx. Hopefully it works
made a mistake
Wait it seems to me that t is an arrary of 32bit ... hmm.. so
xor edx,edx
mov ecx,36
lea esi, t
_loop:
lodsd
test eax,eax
jnz @F
inc edx
@@:
add esi,4
dec ecx
jnz _loop
Roticv, thanks for the reply, but I doubt replacing a routine by a loop is faster, or am I wrong ?
Your trick about loading 4 bytes in eax is simple, but fits well on my problem !
Does using branches is faster than using set ?
JC
Your trick about loading 4 bytes in eax is simple, but fits well on my problem !
Does using branches is faster than using set ?
JC
Speed I am not sure, but definitely size. I would rather go into branches to save space. Hopefully the other experts can answer your question, I am still an intermediate.
Roticv:
The lodsd already increases ESI by 4. You should not increase ESI by another 4 in the loop.
MCoder:
You may want to try the following:
lea edi,t
xor edx,edx
mov ecx,36
mov al,0
@@:
repnz scasb
jnz finish
inc edx
or ecx,ecx
jnz @B
finish:
;the count of 0's is in edx
Just ask if you need some explanation for the suggested code.
Raymond
The lodsd already increases ESI by 4. You should not increase ESI by another 4 in the loop.
MCoder:
You may want to try the following:
lea edi,t
xor edx,edx
mov ecx,36
mov al,0
@@:
repnz scasb
jnz finish
inc edx
or ecx,ecx
jnz @B
finish:
;the count of 0's is in edx
Just ask if you need some explanation for the suggested code.
Raymond
lea eax, Table
mov edx, 36
xor ecx, ecx
@@:
cmp DWORD PTR [eax + edx*4 - 4], 1
adc ecx, 0
dec edx
jnz @B
Mirno
MCoder:
Mirno's post made me look at your original post in more detail.
You had mentionned that your table had 36 integers (or chars, if you prefer). Since "char" is usually interpreted as meaning a "byte", I used that interpretation for my previous posting.
However, looking at the code generated by your C code compiler, it would appear that the compiler is treating the table as an array of dwords!
Which one is right?????
If they are dwords, then my posted code would need to be altered accordingly.
mov al,0 ==> xor eax,eax
scasb ==> scasd
Raymond
Mirno's post made me look at your original post in more detail.
You had mentionned that your table had 36 integers (or chars, if you prefer). Since "char" is usually interpreted as meaning a "byte", I used that interpretation for my previous posting.
However, looking at the code generated by your C code compiler, it would appear that the compiler is treating the table as an array of dwords!
Which one is right?????
If they are dwords, then my posted code would need to be altered accordingly.
mov al,0 ==> xor eax,eax
scasb ==> scasd
Raymond
Raymond, Mirno and Roticv, thank you !
I simply needed to count 0 bytes (or dwords, if faster) in an array.
All the tricks you posted are elegant, and it's great to see that so many different ideas have been submitted.
I finally managed to compute the number of zeroes incrementally, so I've solved my problem.
I'm working on a magical knight tour program (you put one chess knight on a chessboard, and you try to jump and reach all squares and filling every square with the number of the jump, and when the board is full of numbers, you compute every sum of horizontal and vertical lines to verify that they match the magic sum 260), and I hope to finish it by the end of this week.
The problem is that I have to run the program on a distributed network, so it has to be EXTREMELY optimized because the total run time of the program is expected to be around one year of P4 1Ghz. In this case, saving a couple of cycles saves a couple of weeks of computation !
I realized that a second look is VERY important, so I'll post my full code here, to get your advice about possible assembly optimizations I could have missed (I'm currently optimizing my algorithm).
Thanks again !
JC
I simply needed to count 0 bytes (or dwords, if faster) in an array.
All the tricks you posted are elegant, and it's great to see that so many different ideas have been submitted.
I finally managed to compute the number of zeroes incrementally, so I've solved my problem.
I'm working on a magical knight tour program (you put one chess knight on a chessboard, and you try to jump and reach all squares and filling every square with the number of the jump, and when the board is full of numbers, you compute every sum of horizontal and vertical lines to verify that they match the magic sum 260), and I hope to finish it by the end of this week.
The problem is that I have to run the program on a distributed network, so it has to be EXTREMELY optimized because the total run time of the program is expected to be around one year of P4 1Ghz. In this case, saving a couple of cycles saves a couple of weeks of computation !
I realized that a second look is VERY important, so I'll post my full code here, to get your advice about possible assembly optimizations I could have missed (I'm currently optimizing my algorithm).
Thanks again !
JC
quick testy stuff while drinking beers ^_^
made for bytes (your range is well within byte range, so let's stick with those).
have fun, this can probably be optimized even more.
made for bytes (your range is well within byte range, so let's stick with those).
have fun, this can probably be optimized even more.
should be noted that the stuff is done on 32 bytes, as that divides evently with 8... should be trivial to add the last couple bytes. New version some more stuff. And some results, from my 2.53ghz P4 that ran a whole lot of other stuff at the same time:
included the .objs in the .zip this time so you can disasm if you want to. compiled with vs.net, wonder how intel c++ v7 would do.
cz1 - simple C code
cz2 - TryToBeClever C code
cz3 - mirno SimpleLoop
cz4 - MMX 1 (with *cough* help from scali)
cz5 - MMX 2 (butchered from scali)
cz6 - scali pplain
cz7 - scali pplain (SLOW ANSI C)
cz1 cz2 cz3 cz4 cz5 cz6 cz7
run 1: (000266)(000468)(000313)(000062)(000063)(000172)(000125)
run 2: (000266)(000469)(000312)(000063)(000062)(000172)(000125)
run 3: (000250)(000453)(000312)(000078)(000063)(000156)(000109)
included the .objs in the .zip this time so you can disasm if you want to. compiled with vs.net, wonder how intel c++ v7 would do.
compiled with intel c++ compiler v7. Plain ANSI C version is down to 78 ticks on my box now, too.
Somebody find a faster algorithm :)
Somebody find a faster algorithm :)
forgot the attachment
A tip for the knight's tour problem:
Jump to the square where the number of jumps after that is as small as possible first. Doing so, you will fill the whole board in O(N^2) where N is 8 on a standard chess board. I don't know if the magic sum will be 260 though but I guess that it's just to continue recursivly to the squares which has more jumps.
Jump to the square where the number of jumps after that is as small as possible first. Doing so, you will fill the whole board in O(N^2) where N is 8 on a standard chess board. I don't know if the magic sum will be 260 though but I guess that it's just to continue recursivly to the squares which has more jumps.
Nice work Mirno,
You made used to CF flag and adc to remove the need for a branch.
You made used to CF flag and adc to remove the need for a branch.
Mirnos solution is short and elegant, but slower than what vc7 and icl7 generate from na?ve C++ code :)
Intel P3 - 800 mhz, 256 mb SDRAM
[size=9]BITS 32
SECTION .text
EXTERN _table
GLOBAL _cz8
ALIGN 32
_cz8:
movq MM0, [_table]
movq MM1, [_table+8]
movq MM2, [_table+16]
movq MM3, [_table+24]
pxor MM4, MM4
pcmpeqb MM0, MM4
pcmpeqb MM1, MM4
pcmpeqb MM2, MM4
pcmpeqb MM3, MM4
pcmpeqb MM4, MM4
psubb MM0, MM4
psubb MM1, MM4
psubb MM2, MM4
psubb MM3, MM4
paddd MM0, MM1
paddd MM2, MM3
paddd MM0, MM2
movq MM1, MM0
psrlq MM0, 32
paddb MM0, MM1
movq MM1, MM0
psrlq MM0, 16
paddb MM0, MM1
movq MM1, MM0
psrlq MM0, 8
paddb MM0, MM1
movd eax, MM0
and eax, 0FFh
sub eax, 32
neg eax
ret[/size]
Posted on 2003-04-04 02:10:13 by arkane
Like what they would say, someone is bound to come up with MMX code that would do things in a shorter time and in an unknown lanugage :grin:
arkane, I'll check your stuff when I get home - looks promising. Which compiler did you use?
umm VC 6.
I tried to design that so you can have 2 types of answers by removing the two instructions.
here's the output of cz_izl7 (attached on previous post)...
I tried to design that so you can have 2 types of answers by removing the two instructions.
here's the output of cz_izl7 (attached on previous post)...
E:\Temp\cz_icl7>test
doing tests...done
(000741 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000751 ticks) - cz3 - mirno SimpleLoop
(000150 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000160 ticks) - cz5 - MMX 2 (butchered from scali)
(000511 ticks) - cz6 - scali pplain
(000280 ticks) - cz7 - scali pplain (SLOW ANSI C)
E:\Temp\cz_icl7>test
doing tests...done
(000751 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000741 ticks) - cz3 - mirno SimpleLoop
(000161 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000150 ticks) - cz5 - MMX 2 (butchered from scali)
(000501 ticks) - cz6 - scali pplain
(000280 ticks) - cz7 - scali pplain (SLOW ANSI C)
E:\Temp\cz_icl7>test
doing tests...done
(000741 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000741 ticks) - cz3 - mirno SimpleLoop
(000160 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000150 ticks) - cz5 - MMX 2 (butchered from scali)
(000511 ticks) - cz6 - scali pplain
(000271 ticks) - cz7 - scali pplain (SLOW ANSI C)
I have a weird computer. :grin:doing tests...done
(000741 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000751 ticks) - cz3 - mirno SimpleLoop
(000150 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000160 ticks) - cz5 - MMX 2 (butchered from scali)
(000511 ticks) - cz6 - scali pplain
(000280 ticks) - cz7 - scali pplain (SLOW ANSI C)
E:\Temp\cz_icl7>test
doing tests...done
(000751 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000741 ticks) - cz3 - mirno SimpleLoop
(000161 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000150 ticks) - cz5 - MMX 2 (butchered from scali)
(000501 ticks) - cz6 - scali pplain
(000280 ticks) - cz7 - scali pplain (SLOW ANSI C)
E:\Temp\cz_icl7>test
doing tests...done
(000741 ticks) - cz1 - simple C code
(000721 ticks) - cz2 - TryToBeClever C code
(000741 ticks) - cz3 - mirno SimpleLoop
(000160 ticks) - cz4 - MMX 1 (with *cough* help from scali)
(000150 ticks) - cz5 - MMX 2 (butchered from scali)
(000511 ticks) - cz6 - scali pplain
(000271 ticks) - cz7 - scali pplain (SLOW ANSI C)