I wrote a C code generator to explore knight tours exhaustively, and I need some help to optimize it in ASM.

For the moment, my program generates a C program for the 6*6 board.
The final goal of this program is to explore semi-magic knight tours (a special case of knight tours) on the chessboard.
Computing such tours will take several years, so every saved cycle is important !

My program generates one routine for every square of the board.
Here is an extract of the code:



Init:
static int mobility[36]={1,2,3,3,2,1,2,3,5,5,3,2,3,5,7,7,5,3,3,5,7,7,5,3,2,3,5,5,3,2,1,2,3,3,2,1};

d4 square:
static void d4(int num)
{
int count=0;
if (!--mobility[34]) ++count;
if (!--mobility[32]) ++count;
if (!--mobility[10]) ++count;
if (!--mobility[8]) ++count;
if (!--mobility[29]) ++count;
if (!--mobility[17]) ++count;
if (!--mobility[25]) ++count;
if (!--mobility[13]) ++count;
if (count<2)
{
mobility[21]-=8;
++num;
if (mobility[34]>=0) if (!mobility[34] || !count) {e6(num);}
if (mobility[32]>=0) if (!mobility[32] || !count) {c6(num);}
if (mobility[10]>=0) if (!mobility[10] || !count) {e2(num);}
if (mobility[8]>=0) if (!mobility[8] || !count) {c2(num);}
if (mobility[29]>=0) if (!mobility[29] || !count) {f5(num);}
if (mobility[17]>=0) if (!mobility[17] || !count) {f3(num);}
if (mobility[25]>=0) if (!mobility[25] || !count) {b5(num);}
if (mobility[13]>=0) if (!mobility[13] || !count) {b3(num);}
--num;
mobility[21]+=8;
}
++mobility[13];
++mobility[25];
++mobility[17];
++mobility[29];
++mobility[8];
++mobility[10];
++mobility[32];
++mobility[34];
}


The code seems very simple, but it was hard to generate !

Can you help me especially on this part:


int count=0;
if (!--mobility[34]) ++count;
if (!--mobility[32]) ++count;


This can be done as follows:



xor eax,eax ;eax=count
dec dword ptr [mobility+34*4]
jne skip1
inc eax
skip1:
dec dword ptr [mobility+32*4]
jne skip2
inc eax
skip2:


You can assume that mobility[] contains values between -16 and 8.

Any idea ?

JC
Posted on 2003-06-02 12:31:34 by MCoder
Here's my first attempt (not tried it or anything), but it should be faster at least with the if(...) ++count section (see the macro for details).

At least it should give you some ideas ;)



decNinc MACRO num:REQ
dec mob[num * 4]
setnz al
add ah, al
ENDM


d4 PROC num:DWORD
LOCAL count:BYTE
LOCAL pads:BYTE dup 3 ; Not the right syntax, can't remember it

;Put in numerical order for cacheing reasons
dec mob[8*4]
setnz ah

decNinc 10
decNinc 13
decNinc 17
decNinc 25
decNinc 29
decNinc 32
decNinc 34

cmp ah, 2
jge end

mov count, ah
sub mob[21*4], 8
inc num

cmp mob[34*4], 0
jl next1 ; Less than zero, no go
je @F ; Is zero don't worry about count
or ah, ah ; Count must be zero
jnz next1
@@:
invoke e6, num

next1:
cmp mob[32*4], 0
jl next2
je @F
cmp count, 0
jne next2
@@:
invoke c6, num

next2:
cmp mob[10*4], 0
jl next3
je @F
cmp count, 0
jne next3
@@:
invoke e2, num

next3:
cmp mob[8*4], 0
jl next4
je @F
cmp count, 0
jne next4
@@:
invoke c2, num

next4:
cmp mob[29*4], 0
jl next5
je @F
cmp count, 0
jne next5
@@:
invoke f5, num

next5:
cmp mob[17*4], 0
jl next6
je @F
cmp count, 0
jne next6
@@:
invoke f3, num

next6:
cmp mob[25*4], 0
jl next7
je @F
cmp count, 0
jne next7
@@:
invoke b5, num

next7:
cmp mob[13*4], 0
jl next8
je @F
cmp count, 0
jne next8
@@:
invoke b3, num

next8:
inc num ; Not really needed because its incrementing a local that's never used again...
add mob[21*4], 8

end:
inc mob[ 8*4]
inc mob[10*4]
inc mob[13*4]
inc mob[17*4]
inc mob[25*4]
inc mob[29*4]
inc mob[32*4]
inc mob[34*4]

ret
d4 ENDP


Mirno
Posted on 2003-06-02 16:54:17 by Mirno
If you bias the mobility list down by 1 (subtract 1 from each of the original values), you can use the following instead (same number of uOps on a P2/3 but should be faster on older PPlains).
Check for code density as well, if this is smaller I'd go with it.



LOCAL count:DWORD
xor eax, eax

sub mob[blah * 4], 1 ; Can't use dec it doesn't set the carry flag
adc eax, 0


Just replace the "cmp mob, 0" with "cmp mob, -1" and that should all work as expected too.

Mirno
Posted on 2003-06-02 17:21:41 by Mirno
Mirno: good ideas !

I was thinking about the following code:




table: dd 16 dup (0)
dd 1
dd 15 dup(0)

xor eax,eax
mov ebx,[mob+a*4] ;u
mov ecx,[mob+b*4] ;v
dec ebx ;u
dec ecx ;v
add eax, [table+16*4+ebx*4] ;u
mov [mob+a*4],ebx ;v
add eax,[table+16*4+ecx*4] ;u
mov [mob+b*4],ecx ;v


but checking for 0 with a table seems pretty expensive.

JC
Posted on 2003-06-02 17:29:58 by MCoder
I managed to generate full (unoptimized) ASM code.
Here is the a1 routine:


a1:
mov [Knight+edi*4],0
inc ebp
push eax
xor eax,eax
dec [mobility+8*4]
jne skip1
inc eax
skip1:
dec [mobility+13*4]
jne skip2
inc eax
skip2:
cmp eax,2
jae skipa1
mov ebx, [mobility+0*4]
sub ebx,8
mov [mobility+0*4],ebx
inc edi
mov ebx,[mobility+8*4]
test ebx,ebx
jl skip3
je ok3
test eax,eax
jne skip3
ok3:
call c2
skip3:
mov ebx,[mobility+13*4]
test ebx,ebx
jl skip4
je ok4
test eax,eax
jne skip4
ok4:
call b3
skip4:
dec edi
mov ebx, [mobility+0*4]
add ebx,8
mov [mobility+0*4],ebx
skipa1:
inc [mobility+8*4]
inc [mobility+13*4]
pop eax
ret


I get the same speed as my C version with the above routine.
ebp is the Nodes counter, and edi is num.

JC
Posted on 2003-06-02 18:26:46 by MCoder
What's knight tours?
Posted on 2003-06-02 18:36:45 by comrade
Check this site for complete references:
http://www.ktn.freeuk.com/

I don't need Warnsdorf's heuristic.

JC
Posted on 2003-06-02 18:57:22 by MCoder