Ehh, it's a well-known fact that the P4 doesn't like inc/dec.
But in this case I don't care.
You see, the algo itself is already fast enough, so I'd like to see speed optimizing here only if it doesn't make the opcode longer.
You can, for instance, just put ALIGN 16 at the loop start and that by itself would make the code faster.
If you make it 50% slower (not, of course, 200 times slower :)) yet 50% shorter - that would be better.
As I said before - it's a compromise between size and speed.
BlackMirror's version is ~10-15% slower than mine, yet it's almost twice as short. I took it as a great improvement.
It uses the same general algo, but better designed and with shorter instructions.
You see, it's one of those "table filling procs" used in apps that heavily use "table lookup" approaches.
In some of those apps the tables are just included directly in the binaries; in others, for size's sake, they are replaced by "table filling procs". Those procs run just once at app start (there might be a hundred of such procs), and if they themselves are big, though fast - it makes no sense to use them at all, since the fastest way to fill a table is just to include it in the bin file :)
Of course they shouldn't be too slow, but without changing the algo itself it's just impossible to make the proc slow; 2000 ticks or 3000 ticks - it doesn't matter here, just don't make it 2,000,000 ticks :)
For example, your changing rep stosd for 3 movs OUTSIDE the loop!! - absolute nonsense.
What did you do it for - paying a byte for 1 clock?

Again - speed improvements are welcome, but only if they don't increase size.
Posted on 2004-07-07 03:04:52 by The Svin


I completely disagree. To me optimization is all about speed. If you have run your C code through a profiler and found that such-and-such routine runs slow, then I want to make it run as fast as possible to make the overall app run faster. Now, you might avoid using lookup tables in your optimization if your app churns through a lot of data. Trading a rep stosd for 3 MOVs is a much better solution. Adding a few bytes doesn't really affect the overall app that much, and it does make it run faster. I have developed large apps completely in assembler for my job (I write BIOS), and I write assembler code daily for BIOS.
Posted on 2004-07-07 07:48:25 by mark_larson

bitRAKE, have you posted something here, or is it just something wrong with my email board notification?
I reply XADD. ;)

I totally agree with the middle way - fast but small. It is not an easy way to code, but it would be no fun if it was. The idea that resources are not limited is absurd. All resources are limited and in general we should code accordingly.

IMHO, of course.
Posted on 2004-07-07 08:16:47 by bitRAKE


I didn't say resources were unlimited (look at the comment about not using lookup tables if you churn through a lot of memory); however, saving 4 or 5 bytes in the code by using rep stosd when it is slower isn't, to me, a good reason. Now, saving a 64MB lookup table because the user doesn't have that much memory, or because you go through a lot of data, is different.

EDIT: Let me clarify that some more

1) To me, adding 4 or 5 bytes to a routine to make it faster is definitely OK and worth it.
2) I don't think you should program like you have unlimited resources. Definitely not. I actually talked about this on masmforum, where I wrote a tutorial on optimizing an ASCII to radix40 routine. In the end I had it running at 300 cycles on a P4. However, I showed an alternate version, as a joke, that used a 64MB lookup table and ran in 16 cycles. I joked about who in their right mind would use a 64MB lookup table to convert ASCII to radix40. It's a balancing act: you have to look at how many resources you have and program for that. A few bytes isn't going to make a difference. I actually keep 2 copies of all my routines that have lookup tables; the second copy does the same thing but without a lookup table. Lookup tables usually end up running faster but have a few problems: one is resources, and the other is that if you go through a lot of data in your app, you are going to get a lot of cache misses if you have a lot of lookup tables. As a side note, I also keep multiple copies of routines that use SSE2; I usually try to have an SSE or MMX version as well. For instance, I have a random number routine that uses SSE2 and generates a random number every 12 cycles on a P4. It actually does it in 4, but the other 8 cycles are used to scale it to a base; if you only need a 32-bit number, it runs in 4 cycles. You can't run it on systems without SSE2 support, so I have a similar MMX one that runs in 24 cycles.
Posted on 2004-07-07 09:40:37 by mark_larson
mark_larson,
Nothing personal, but those talks a la "how a professional asm programmer should code" would lead to nothing but flames here.
To me, as a former professional low-level coder, most of your statements look very strange as said by another "professional" in the field, but I kept those notions to myself and kept quiet, since it has nothing to do with the topic.
As to something related to the topic, let's state some facts to make the picture clear. Your statements about 64 MB lookup tables don't fit here:
1. The max table size possible to fill with the proc is 33 diagonals.
So it's 1+2+3+...+33 = 33*34/2 = 561 entries.
So it's not so big a lookup table, and making a very big proc to fill it could defeat the point of such a proc compared with keeping the table as binary data.
2. On the other hand, having the table itself as an alternative to calculating the heavily used (in many fields) formula
n!/((n-x)!x!) for element x in row n makes a lot of sense, in particular if the code is going to call the calculation frequently.
It has nothing to do with your unfortunate example about radix 40 conversion - this is exponentially harder to calculate. Nevertheless, I gave another proc that calculates it (element x in row n) without a table, and you are welcome to optimize it or offer another algorithm to calc it.

To All: I originally thought it would be interesting for those of us who are "math inclined" :)
I knew there are a lot of math-smart ones who could improve my x-in-n proc (which actually calcs n!/((n-x)!x!)), and we all could have fun :))
Posted on 2004-07-07 14:14:31 by The Svin


You misunderstood. I have no problem with your table size; it is not too big. I was using the 64MB as an example of a lookup table that IS too big.



That's what I was saying about doing 2 versions - having one that does a table lookup and one that doesn't. That way, if you have a program that goes through a lot of data, you can use the non-lookup-table one to help reduce cache misses.
Posted on 2004-07-07 16:02:04 by mark_larson
If you mean this particular task: a non-lookup-table proc, though it can be used to avoid cache misses, will in any case be much slower. Just compare the time to calc
n!/((n-x)!x!) with the penalty for a cache miss.
Enough to say that the biggest n! fitting in a 32-bit var is 12!.
So you'd also need, in that case, special time-consuming procs to work with very long integers
(for example 32! =~ 2.6313083693369353016721801216e+35).
Posted on 2004-07-07 18:22:29 by The Svin


I sped up the BlackMirror code. Here is the original code. It runs in 9252 cycles on my P4. On my P3 I think it ran in the 3000-cycle range; I will have to re-run it at work tomorrow to get an exact baseline.



;PascalD2:;(lpMemory, dCount)
pusha
mov esi,[32+esp+4];lpMemory
mov edi,esi
jmp _l2
_l0: lodsd
xadd eax, edx
stosd
cmp esi,ebp
jb _l0
_l2: xor eax,eax
xor edx,edx
inc eax
stosd
dec dword ptr [32+esp+8];dCount
mov ebp,edi
jnz _l0
popa
retn


To speed it up I changed a number of things around.

1) I got rid of the "complex" instructions - instructions that do more than one thing: stosd, lodsd, and XADD. That's where I got most of the speedup. When the Pentium came out, they recommended not using them anymore (except in the case of using REP with the string instructions), because you were supposed to write more RISC-like code.

2) Moved dCount to a register to get rid of a memory access on every loop iteration when the loop counter is decremented.

3) Got rid of one of the instructions in the sequence "xor eax,eax", "xor edx,edx", "inc eax", making it just 2 instructions: "mov eax,1", "xor edx,edx".

4) I was going to try unrolling the inner loop, but I haven't had a chance yet.

This new code runs in 2360 cycles. If I remember right, this version is also faster on a P3, but there you don't get as much of a speed improvement as in this case. The original unoptimized BlackMirror code ran in 9252 cycles, so that's 9252 / 2360 = 3.92 times faster. The reason it takes more clock cycles on a P4 than on a P3 (just looking at the cycle count; the higher clock speed of the P4 helps it run faster in wall time) has to do with how slow some ALU instructions were made on the P4, so I avoided them (the ones listed above). I didn't comment the code; instead I explained the differences above. Those are the only changes I made.



;PascalD2:;(lpMemory, dCount)
pusha
mov esi,[32+esp+4];lpMemory
mov ecx,[32+esp+8]
mov edi,esi
jmp _l2
_l0:
mov eax,[esi]
add esi,4
mov ebx,eax
add eax,edx
mov edx,ebx
mov [edi],eax
add edi,4

cmp esi,ebp
jb _l0
_l2:
mov eax,1
xor edx,edx
mov [edi],eax
add edi,4

dec ecx
mov ebp,edi
jnz _l0
popa
retn

Posted on 2004-07-07 20:01:42 by mark_larson
There is an additional optimization you can do which I have not done yet. My life is starting to get busy, so I probably won't be able to finish it, but it should give a great speedup (close to 2x). This is an "algorithmic" change. I finally looked at how a Pascal triangle is generated, from this web page. If you look at the webpage you will notice something interesting: every row is symmetric!!! You can use that to generate only HALF the entries of a row and just fill in the rest.

http://mathworld.wolfram.com/PascalsTriangle.html

I wrote some code for this, but it still has some bugs. Basically I freed up another register, and when I do a "mov" to memory to fill in one entry in a row, I also do a second "mov" to fill in the mirrored entry at the opposite end of the same row. I freed up ESP by moving it to an MMX register. Just make sure you aren't accessing anything on the stack or doing any push/pops after you save it. I had already moved dCount to a register, and that was the only stack access in the loop.



movd mm0,esp

... code

movd esp,mm0
Posted on 2004-07-08 10:36:35 by mark_larson
If you look at the webpage you will notice something interesting. Every row is symmetric!!!
You might have noticed it looking at the test app :)
The reason that fact hasn't been used to speed up the proc is - again - to not make the proc bigger.
Posted on 2004-07-08 12:09:50 by The Svin
; dCount = [0,34]

; 35 bytes
PascalD PROC lpMemory, dCount
xor eax, eax
mov edi, lpMemory
inc eax
mov esi, edi
jmp _3

_1: cdq
_2: lodsd
xadd eax, edx
cmp esi, ebp
stosd
jne _2
mov eax, edx
_3: stosd
dec dCount
mov ebp, edi
jns _1
retn 8
PascalD ENDP
This PROC was developed from the inside out (start at the inner loop and build around it), and it is almost the same as BlackMirror's. I was not able to make use of the half table. I wish it were 32 bytes or less (which is not a problem if it's inlined).
Posted on 2004-07-09 07:24:25 by bitRAKE

I think you guys are being a bit too sensitive about code size, for saving a few bytes. When you get into saving in the KB or MB range, then it matters. Heck, at these small byte counts my EXE comes out the same size in both cases. Let me show you what I mean; it has to do with how the linker turns the program into an EXE. Here is my optimized BlackMirror code (not the half-a-row version):



;PascalD2:;(lpMemory, dCount)
;smallest dCount you can use is 3. For 3 rows
pusha
cmp dCount,3
jl exit
mov esi,[32+esp+4];lpMemory
mov ecx,[32+esp+8]
mov edi,esi
mov eax,1
dec ecx ;we do 2 rows to begin with, so we want to start one lower.
mov [edi],eax
mov [edi+4],eax
add edi,8
add esi,4
jmp _l2
_l0:
mov eax,[esi]
add esi,4
mov ebx,eax
add eax,edx
mov edx,ebx
mov [edi],eax
add edi,4
cmp esi,ebp
jb _l0
_l2:
mov eax,1
xor edx,edx
mov [edi],eax
add edi,4
dec ecx
mov ebp,edi
jnz _l0
exit:
popa
retn


The OBJ is 58,400 bytes; the EXE is 54,272. Just so you know, I have more code in there than just these 2 routines - I use this program for optimizing all my code, so it has a lot of other code in it. But we only care about how much bigger the EXE gets from using the 2 different algorithms.

Here is the original BlackMirror code:


;PascalD2:;(lpMemory, dCount)
pusha
mov esi,[32+esp+4];lpMemory
mov edi,esi
jmp _l2
_l0:
lodsd
xadd eax, edx
stosd
cmp esi,ebp
jb _l0
_l2:
xor eax,eax
xor edx,edx
inc eax
stosd
dec dword ptr [32+esp+8];dCount
mov ebp,edi
jnz _l0
popa
retn


Its OBJ is 58,324 (mine was 58,400, which means the code I added is 76 bytes). The EXE size is 54,272. Wait, wasn't MY algorithm, which was a WHOLE 76 bytes longer, ALSO 54,272? I am stunnnnnnned. How could that happen? hehhe
Posted on 2004-07-09 09:29:45 by mark_larson
mark_larson, don't assume ignorance just because the coding style doesn't fit your preferences. Please try to entertain the point of the thread. It has already been stated that the algorithm is not even required, and that speed is not the primary goal.


I think we might be able to remove a branch if each line of the table is allowed to end with a zero. Or how about accessing the table with X(X-1)/2 and allowing the ones to overlap:

1,1,2,1,3,3,1,4,6,4,1,5,...

This seems to reduce the algorithm/table nicely. :)
Posted on 2004-07-09 09:41:56 by bitRAKE
Overlapping the ones seems a good idea -
including the situation when there is no table-filling proc and the table is included in the binary's data section.
Posted on 2004-07-09 13:56:25 by The Svin


I wasn't assuming ignorance. When I put in the "hehhe", I was doing it in a teasing fashion, not in a "you are ignorant" fashion. I was merely pointing out that worrying over 76 bytes doesn't net you anything, since the EXE size ends up being the same anyway. And I thought the purpose of the thread was to optimize the code and make it run fast. If you want to reduce the size, you can save an additional byte by using the LOOP instruction - it's one byte smaller than DEC/JNZ.
Posted on 2004-07-09 14:22:22 by mark_larson
In the case of X = 0, X(X-1)/2 gives the wrong offset.
I think one more 1 at the start could solve it.
Posted on 2004-07-09 14:34:32 by The Svin
is the proc recursive?
do you care about the size of the proc for the sport of it, or because it will be called multiple times and instantiate some stack/heap mem each time?

if that's the case, you could also make a non-recursive version of it with an associated data struct to keep track of the different calls... I've tried to explain what I meant here:

http://www.asmcommunity.net/board/showthread.php?threadid=18705

but it seems it wasn't very clear, since no one really answered. :)

bye
Posted on 2004-07-09 14:55:28 by HeLLoWorld

In the case of X = 0, X(X-1)/2 gives the wrong offset.
I think one more 1 at the start could solve it.
How so?

f(x) = x(x-1)/2

f(0) = 0

lpMemory[0] = 1

?
Posted on 2004-07-09 15:10:59 by bitRAKE

I wasn't assuming ignorance. When I put in the "hehhe", I was doing it in a teasing fashion, not in a "you are ignorant" fashion. I was merely pointing out that worrying over 76 bytes doesn't net you anything, since the EXE size ends up being the same anyway. And I thought the purpose of the thread was to optimize the code and make it run fast. If you want to reduce the size, you can save an additional byte by using the LOOP instruction - it's one byte smaller than DEC/JNZ.
I am saying that if someone wants to save some bytes, then saving them does net something. And the challenge can teach things you might not learn otherwise. Coding can be done for many reasons. Show me how LOOP makes the code smaller?
Posted on 2004-07-09 15:29:28 by bitRAKE

How so?

f(x) = x(x-1)/2

f(0) = 0

lpMemory[0] = 1

?


Stupid me :)
I was just naively baffled that X=1 and X=0 would give the same offset, which in the overlapping format is correct.
Posted on 2004-07-09 15:36:13 by The Svin