Just a try to find compromise between size and speed in filling subj.
Source and test app. included
Posted on 2004-07-05 19:06:46 by The Svin
little improvement


sbb eax,eax
jc @r
inc eax

instead of


sbb eax,eax
jc @r
movri eax
Posted on 2004-07-05 19:15:59 by The Svin

Just a try to find compromise between size and speed in filling subj.
Source and test app. included


The rar file produced errors when I tried extracting it.

And what the heck is a "movri" instruction?
Posted on 2004-07-05 20:53:22 by mark_larson
it's macro, you'll understand it when look at source
Posted on 2004-07-05 21:01:15 by The Svin
Another zipped version
Posted on 2004-07-05 21:02:09 by The Svin

Another zipped version


Just some constructive criticism. Nothing personal. I program in assembler for a living.

1) Having the macro "movri" do multiple things is very confusing.

2) If the person doesn't know the trick about doing "XOR REG,REG", doing it within that macro tends to hide the fact.

3) You have a bug in your code. You do a "pusha" at the top of the PascalD routine, but then you use 32-bit registers. "pusha" doesn't push 32-bit registers, you want to use "pushad".

4) Generally windows doesn't care about eax, ecx, or edx. So using a PUSHAD is probably excessive. Probably best to just push the remaining registers you use.

5) Using LEA is very slow on the P4, try and avoid it.

6) Using LEA to set a register to 2, and another one to -2 is very confusing. It's hard to follow confusing code. Since it runs slower on a P4 anyways, might as well use a MOV instruction. Or if you want to be fancy you can do it with an OR instruction since ECX at the time is already 0. This is unless you are doing some sort of optimization. Sometimes optimziations make the code more confusing. But in the case of optimizations, it is always best to write the code initially so that it is easy to understand. That way it makes it easier to modify it later using different optimizations.


rep stosd
; edi=B
lea ebx,[ecx][2] ;ebx = 2
lea edx,[ecx][-2] ;edx = -2



7) I have no clue how Pascal triangles work. Reading your comments I should be able to figure it out. Try commenting what the "algorithm" does and not so much what a particular assembler command does. It gives people a better overall picture, and let's them understand the code better.
Posted on 2004-07-05 21:29:32 by mark_larson
You do a "pusha" at the top of the PascalD routine, but then you use 32-bit registers
For masm if you compile 32 addr module pusha and pushad are the same.

Using LEA is very slow on the P4, try and avoid it.

You forgot to add "if you are aiming for P4" :)

Generally windows doesn't care about eax, ecx, or edx. So using a PUSHAD is probably excessive. Probably best to just push the remaining registers you use

Well, I'm quite aware of what Windows care.
As for what I as programmer could care also :)
As I stated it's a compromise between size and speed.
It's to fill Pascal tr. table, for altimate speed you could completly avoid using the proc at all, since the table itself could be included just as it is in some data section.
The aim of proc it to replace bytes in file image module for the table by including bytes the proc and calling it at start of app once, then using created in memory table as many times as needed. Thus pusha popa is used 'cause they cost just a byte each, instead of bytes for separate pushpop.
Since most time is spend in loops, I optimized them for speed (I replaced speed consuming n!/(n-x)!x! with just one add :)) instead of prologue and epilogue code where used size optimizing.
About "readability and confusing" well... what can I say,
it's a matter of experience, I used code for living too, so nothing personal from my side also :)

Actually, it is the first version after I fiqured that triangle could be made in simple recurrent form as sum of 2 previously found elements. And I'm sure there must be room for both side and speed optimization. So one is wellcome to try.
As about Pascal triangle it can be used in many things,
for example in combinatorics (how many 1 x bits numbers in n bits digits for example) and algebra (binomal coefficients - I'm not sure about the term in English)
Posted on 2004-07-05 22:15:09 by The Svin
Or if you want to be fancy you can do it with an OR instruction since ECX at the time is already 0
How can use the fact that ecx = 0 using OR? to place
2 in ebx and -2 in edx?! Might be you mean something like


or ecx,2
mov ebx,ecx
mov edx,ecx
neg edx

Is that what you meant?
But it's 3 bytes longer! And I'm not sure it's faster in p4
for sure it's slower in p3.
If lea as slow as you stated in P4 yet I would use
push 2
push -2
pop edx
pop ebx
since it's 6 bytes instead of 9 that you proposed, and for sure it's would be faster since there are less dependences.
Posted on 2004-07-05 22:32:40 by The Svin

Or if you want to be fancy you can do it with an OR instruction since ECX at the time is already 0
How can use the fact that ecx = 0 using OR? to place
2 in ebx and -2 in edx?! Might be you mean something like


or ecx,2
mov ebx,ecx
mov edx,ecx
neg edx

Is that what you meant?
But it's 3 bytes longer! And I'm not sure it's faster in p4
for sure it's slower in p3.
If lea as slow as you stated in P4 yet I would use
push 2
push -2
pop edx
pop ebx
since it's 6 bytes instead of 9 that you proposed, and for sure it's would be faster since there are less dependences.


Naw. I meant just exchange the LEA's with ORs or MOVs. Since ECX is already 0. I am just doing the 2 part, the -2 would be the same.



or ecx,2
;OR
mov ecx,2
;OR
or cl,2 ; compiles smaller
;OR
mov cl,2
Posted on 2004-07-05 22:44:35 by mark_larson

You do a "pusha" at the top of the PascalD routine, but then you use 32-bit registers
For masm if you compile 32 addr module pusha and pushad are the same.

Using LEA is very slow on the P4, try and avoid it.

You forgot to add "if you are aiming for P4" :)

I do general optimizations for all processors, just so that my code will be fast on anything you run it on. Sometimes you can't do that. But for LEA it's easy because their are a number of alternatives. Unfortunately Intel really dropped the ball when doing the P4. A number of instructions that used to be fast that the assembler programmer used to make his code fast are now slow. Adc, sbb, shifts, rotates, and lea's. So I try to avoid them when possible or do other work arounds.

Originally posted by The Svin

Generally windows doesn't care about eax, ecx, or edx. So using a PUSHAD is probably excessive. Probably best to just push the remaining registers you use

Well, I'm quite aware of what Windows care.
As for what I as programmer could care also :)
As I stated it's a compromise between size and speed.
It's to fill Pascal tr. table, for altimate speed you could completly avoid using the proc at all, since the table itself could be included just as it is in some data section.
The aim of proc it to replace bytes in file image module for the table by including bytes the proc and calling it at start of app once, then using created in memory table as many times as needed. Thus pusha popa is used 'cause they cost just a byte each, instead of bytes for separate pushpop.
Since most time is spend in loops, I optimized them for speed (I replaced speed consuming n!/(n-x)!x! with just one add :)) instead of prologue and epilogue code where used size optimizing.
About "readability and confusing" well... what can I say,
it's a matter of experience, I used code for living too, so nothing personal from my side also :)

Actually, it is the first version after I fiqured that triangle could be made in simple recurrent form as sum of 2 previously found elements. And I'm sure there must be room for both side and speed optimization. So one is wellcome to try.
As about Pascal triangle it can be used in many things,
for example in combinatorics (how many 1 x bits numbers in n bits digits for example) and algebra (binomal coefficients - I'm not sure about the term in English)
Posted on 2004-07-05 22:58:10 by mark_larson



Naw. I meant just exchange the LEA's with ORs or MOVs. Since ECX is already 0. I am just doing the 2 part, the -2 would be the same.



or ecx,2
;OR
mov ecx,2
;OR
or cl,2 ; compiles smaller
;OR
mov cl,2

But I need 2 in ebx and -2 in edx, not 2 or -2 in ecx
Posted on 2004-07-05 23:24:24 by The Svin
Another proc about Pascal triangle to x element in n diagonal. It's much slow then fast filling table and using then the table for many ocasions, yet it doesn't use as much memory and doesn't calc factorial directly,


xPosINnDiag proc x,n
;save registers if it is needed
mov ecx,n
test ecx,ecx
je @r1
cmp eax,x
sbb eax,eax
jc @r
inc eax
mov ebx,eax

@@: mul ecx
dec ecx
div ebx
inc ebx
dec x
jne @r
@r:
ret
@r1: inc eax
jmp @r
xPosINnDiag endp
Posted on 2004-07-05 23:28:28 by The Svin
Another version PascalD by BlackMirror


PascalD2:;(lpMemory, dCount)
pusha
mov esi,[32+esp+4];lpMemory
mov edi,esi
jmp _l2
_l0:
lodsd
xchg eax,edx
add eax,edx
stosd
cmp esi,ebp
jb _l0
_l2:
xor edx,edx
xor eax,eax
inc eax
stosd
mov ebp,edi
dec dword ptr [32+esp+8];dCount
jnz _l0
popa
retn
Posted on 2004-07-05 23:40:45 by The Svin


But I need 2 in ebx and -2 in edx, not 2 or -2 in ecx


It was a typo. I meant to put EBX and EDX. Simply changing the code I posted above to use EBX or EDX is trivial.
Posted on 2004-07-05 23:45:00 by mark_larson
But ebx and edx is not 0!
You said that since ecx = 0 I could use or to place 2 or -2 but in code 0 is in ecx, and just happend because of rep, it couldn't happen because of rep in ebx or edx!
Posted on 2004-07-05 23:57:01 by The Svin
little improvement for BlackMirror proc
xchg eax,edx
add eax,edx
we change to
xadd eax,edx
Posted on 2004-07-05 23:58:11 by The Svin
ret in BlackMirror code need to be replaced by ret 8
Posted on 2004-07-06 01:08:01 by The Svin

or ecx,2
mov ecx,2
;OR
or cl,2 ; compiles smaller

or ecx,2 or cl,2
it's the same size 'cause opcode format include bit s.
Posted on 2004-07-06 01:10:38 by The Svin
bitRake, have you posted something here or it's just something wrong with my email borad notification?
Posted on 2004-07-06 05:10:35 by The Svin
I didn't notice all the posts on this thread until lunch time. I sped up your code by 20% on the P4, but I am not done yet. I am going to post my unfinished result, so you can see what I am talking about. I haven't done heavy testing so their might be a bug. I also noticed one bug in your code while doing the optimizations. You never set up EDX before doing the comparison.



cmp edx,ecx ; count < 3?


Your code runs in 2796 cycles on my P4. The new code runs in 2330 cycles. That is 20% faster. I am not going to go into much extensive detail about what I speed up, because I have to get back to work. I am on my lunch break. I would be willing to bet the code actually runs slower on a P3 or older Intel processor. Things I removed: sbb, lea ( where possible), exchanged inc/dec for add/sub ( add/sub runs in 0.5 cycles, inc/dec runs in 1 cycle). I got rid of the REP STOSD. It is only writing 12 bytes, and when you are moving that little data often it is quicker to just do it manually ( saved me 8 cycles doing it this way). The biggest jumps in speed came from getting rid of the LEA for 2 reasons. 1) I was able to break setting up the cpu register up over multiple lines, and so was able to break up more dependencies with other calculations ( when you remove the LEA instruction it takes multiple ALU instructions to simulate what it does). 2) LEA runs slow on the P4.

for instance when I was setting up ESI, I broke up the dependencing with REP STOSD and EAX ( Before I removed the rep stosd code that is). I added which LEA I was getting rid of below in a comment.



mov eax,1
; lea esi,[edi][edx*4] ;esi=R
mov esi,edi ;set up ESI up here, to help break up dependencies between the MOV EAX,1 and the REP STOSD.
rep stosd
mov ebx,2
mov ecx,-2
; lea esi,[edi][edx*4] ;esi=R
sub esi,4


Here is the code. Again I am not done optimizing. There are still a few tricks I want to try. But I wanted to post this now.



pusha
movri ecx,3
mov edx,dCount
mov edi,lpMemory
cmp edx,ecx ;count < 3?
jl @r
movri eax,1
mov esi,edi
mov [edi],eax
mov [edi+4],eax
mov [edi+8],eax

; edi=B
mov ebx,2
mov ecx,-2
sub esi,4
align 4
@outer1:
mov ecx,1

sub edx,1 ;edx=-ebx-1
mov dword ptr [edi],ecx
sub esi,4
align 4
@inner2:
mov eax,[esi][ecx*4] ;eax = R[i]
add eax,[esi][ecx*4][4] ;eax =R[i]+R[i+1]
mov [edi][ecx*4],eax ;B[i]=R[i]+R[i+1]
add ecx,1
cmp ecx,ebx
jne @inner2
mov dword ptr [edi][ecx*4],1
lea edi,[edi][ecx*4][4]
add ebx,1
cmp ebx,dCount
jne @outer1
@r:
popa


Happy Hannukah!
Posted on 2004-07-06 13:24:59 by mark_larson