Please, remove my name from the test, damn you :)
I explicitly stated that I didn't give my own algo; I just changed KO's to
explain a couple of things to him.
As for my algo, I have submitted it several times already for MASM32 usage;
the question is for Hutch, why he still hasn't included it.
It does more things: besides the conversion itself, it handles a minus sign
and negates if needed.
Despite the extra work, it runs faster (on PMMX at least) than all the algos
used in KO's test.
Here is the module:


;#########################################################################
; -----------------------------------------
; This procedure was written by Tim Roberts and Svin
; -----------------------------------------
.386
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive
.code

; #########################################################################
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
atodw proc FORCENOFRAME
;uses edi esi String:PTR BYTE

;----------------------------------------
; Convert decimal string into dword value ; return value in eax
;----------------------------------------
;String equ [esp+4]
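mov eax,[esp+4] ; eax = pointer to the string
push esi
mov dl,[eax] ; dl = first character
xor ecx,ecx ; ecx = 0, cl will hold each digit
cmp dl,2Eh ; CF = 1 if the first character is below 2Eh ('-' is 2Dh)
mov esi,eax
sbb edx,edx ; edx = -1 for a minus sign, else 0 (CF is preserved)
mov eax,ecx ; eax = 0, the running value
adc esi,0 ; step over the minus sign if there was one
jmp @F
again: lea eax,[eax+4*eax] ; eax = eax*5
inc esi
lea eax,[ecx+2*eax] ; eax = eax*2 + digit
@@: mov cl,[esi]
sub cl,30h ; ASCII to digit, sign flag set for bytes below '0'
jns again
add eax,edx ; add then xor with edx = -1 negates eax
pop esi
xor eax,edx ; with edx = 0 both are no-ops
retn 4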
atodw endp
OPTION PROLOGUE:DefaultOption
OPTION EPILOGUE:DefaultOption
; #########################################################################

end
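
For readers who want to follow the logic without tracing flags, here is a rough C model of the routine above. It is only a sketch: the name atodw_model is made up for illustration, and it mirrors the posted code in treating any first byte below 2Eh as a sign to be skipped. The part worth noticing is the last line, where adding and then XORing with a mask of 0 or -1 negates without a branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C model of the atodw routine above (the name is illustrative only). */
uint32_t atodw_model(const char *s)
{
    assert(s != NULL);
    /* cmp dl,2Eh / sbb edx,edx: mask is all ones when the first byte is
       below 2Eh, which is how a leading '-' (2Dh) is detected. */
    uint32_t mask = ((unsigned char)s[0] < 0x2E) ? 0xFFFFFFFFu : 0u;
    s += mask & 1u;                         /* adc esi,0: step over the sign */

    uint32_t value = 0;
    for (;;) {
        uint8_t d = (uint8_t)(*s - 0x30);   /* sub cl,30h */
        if (d & 0x80)                       /* sign flag set: leave the loop */
            break;
        value = value * 10u + d;            /* the two LEA instructions */
        s++;
    }
    /* add eax,edx / xor eax,edx: identity for mask 0, two's-complement
       negation for mask -1. */
    return (value + mask) ^ mask;
}
```

With this model, atodw_model("123") gives 123 and atodw_model("-45") gives the 32-bit pattern of -45.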


Here is KO's test, where the module replaces what was in it for m32lib.
I don't know what you see on your systems; on mine it is the fastest.
Posted on 2002-08-11 19:29:51 by The Svin
Alex,

=================
As for my algo, I have submitted it several times already for MASM32 usage; the question is for Hutch, why he still hasn't included it.
=================

It's not a matter of laziness or indifference; it's a workload that would kill most people, and a bandwidth problem with MASM32, as it is taking up many gigabytes per month. With my own site, the log files are about 10 meg every couple of weeks and it is now going over its limits just as a referring site.

I have just finished writing and debugging 250k of code for the new version of Quick Editor and I shortly have to write a compact version of MASM32v7 to try and reduce the bandwidth it is attracting.

What I have in mind is producing a compact version and a service pack for people who have the current version so that there is no need to make the extra download.

The work that you have done already is a valuable contribution to MASM32 that has been useful to many people, and I would like to get all of your later versions into both so that everybody can easily get access to them.

I also need to get Vladimir Kim's latest debugging library so that it can be used by people who use MASM32.

It's just the normal problem that 48 hour days are not long enough to get everything done as fast as I would like.

Regards,

hutch@movsd.com
Posted on 2002-08-11 20:08:19 by hutch--

Please, remove my name from the test, damn you :)
I explicitly stated that I didn't give my own algo; I just changed KO's to
explain a couple of things to him.
As for my algo, I have submitted it several times already for MASM32 usage;
the question is for Hutch, why he still hasn't included it.
It does more things: besides the conversion itself, it handles a minus sign
and negates if needed.
Despite the extra work, it runs faster (on PMMX at least) than all the algos
used in KO's test.
Here is the module:
<..snip..>
Here is KO's test, where the module replaces what was in it for m32lib.
I don't know what you see on your systems; on mine it is the fastest.


Hear ye, for this is the law (July 14th, 2001 until December 31st, 2030)
--------------------------------------------------------------------------------
There will be no cursing or swearing, not at each other nor casual. You come here to have fun or as a professional and swearing doesn't fit that category. Also words of certain nature like: w*rez, p*rn, f*ck, etc. could trigger 'webguards' at public institutions to block access to the site, which is something we don't wish to happen of course. (don't try to avoid the swearfilter by using substitute letters such as '@' for a etc... unless you want us to terminate your membership swiftly)
Posted on 2002-08-11 22:23:30 by Qages

Please, remove my name from the test, damn you :)
I explicitly stated that I didn't give my own algo; I just changed KO's to explain a couple of things to him.
Hahaha... :) Certainly - I'll change it to Not Svin. :)

Here is buliaNaza's previously posted effort:


atodw push esi
mov eax,String
mov dl,[eax]
xor ecx,ecx
cmp dl,2Eh
lea esi,[eax-1]
sbb edx,edx
mov eax,ecx
@@: adc esi,1
lea eax,[eax+4*eax]
lea eax,[ecx+2*eax]
mov cl,[esi] ; I'd change this to MOVZX
sub ecx,30h
jns @B
add eax,edx
pop esi
xor eax,edx
ret
So far, this gives the best time once it's in the cache:
	OPTION PROLOGUE:NONE

OPTION EPILOGUE:NONE
AsciiToDw5 proc lpAscii:DWORD
mov ecx, [esp+4]
movzx eax, BYTE PTR [ecx]

movzx edx, BYTE PTR [ecx+1]
cmp edx, '0'
jb _2
lea eax, [eax*4+eax]
lea eax, [eax*2+edx-'0'*11]

movzx edx, BYTE PTR [ecx+2]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]

movzx edx, BYTE PTR [ecx+3]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]

movzx edx, BYTE PTR [ecx+4]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]

movzx edx, BYTE PTR [ecx+5]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]

movzx edx, BYTE PTR [ecx+6]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]

movzx edx, BYTE PTR [ecx+7]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]

movzx edx, BYTE PTR [ecx+8]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]

movzx edx, BYTE PTR [ecx+9]
sub edx, '0'
jb @F
lea eax, [eax*4+eax]
lea eax, [eax*2+edx]
@@:
retn 4
_2:
sub eax,'0'
retn 4
AsciiToDw5 endp
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
:grin: (It is interesting to note Nexo's PROC performs better than the unrolled version above, but has a higher initial cost - which in comparison, it is unable to overcome in ten characters.)
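
A note on the constant '0'*11 in the second step of the unrolled version: as far as I can tell, it removes the ASCII bias from the first two characters in one go, since the first character was loaded without subtracting '0'. The C sketch below only illustrates that identity; the function names are invented for the example and do not come from the test package.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert exactly two ASCII digits, removing the
   ASCII bias from both with a single constant, as AsciiToDw5 does. */
uint32_t two_digits_folded(const char *s)
{
    assert(s != NULL);
    uint32_t a = (unsigned char)s[0];   /* first digit, still biased by '0' */
    uint32_t b = (unsigned char)s[1];   /* second digit, still biased by '0' */
    /* a*10 + b - '0'*11 == (a - '0')*10 + (b - '0') */
    return a * 10u + b - (uint32_t)'0' * 11u;
}

/* Reference version that removes the bias from each digit separately. */
uint32_t two_digits_plain(const char *s)
{
    assert(s != NULL);
    return (uint32_t)(s[0] - '0') * 10u + (uint32_t)(s[1] - '0');
}
```

Both functions agree for any pair of digits, for example "42" gives 42 either way. The folded form saves one subtraction on the first character.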
Posted on 2002-08-11 23:19:17 by bitRAKE
Hi all

Well, I am going to use it in a real app with real users, and users are well known for not doing what you want them to do. I need rigid testing, no assumptions, and I also need overflow testing. The app imports text lines into a grid. I am testing with 20 000 lines, each with two text cells and two integer cells. This means 40 000 integer conversions. Let's say I optimize more and am able to save 20 cycles on average. Total saved: 800 000 cycles. Today's processors execute 100 000 000 cycles a second, so that is about 8 ms. Need I say that as long as I don't use masm32.lib it does not really matter? Here is the result.



DecToBin proc lpStr:DWORD
LOCAL fNeg:DWORD

mov ecx,lpStr
xor eax,eax
xor edx,edx
mov fNeg,eax
cmp byte ptr [ecx],'-'
jne @1
inc ecx
inc fNeg
jmp @1
@@:
cmp eax,214748365
jnb Err
lea eax,[eax*4+eax]
lea eax,[eax*2+edx]
inc ecx
@1:
mov dl,[ecx]
xor dl,'0'
cmp dl,10
jb @b
dec fNeg
je @f
or eax,eax
js Err
ret
@@:
dec eax
js Err
inc eax
neg eax
clc
ret
Err:
xor eax,eax
dec eax
stc
ret

DecToBin endp
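
The overflow handling above may be easier to see in C. The following is a sketch in the spirit of DecToBin, not a line-by-line translation: the name dec_to_bin_checked and the bool/out-parameter interface are assumptions made for the example (the assembly signals errors through the carry flag and eax = -1), and not every edge case of the original is reproduced. The constant 214748365 is the smallest running value for which multiplying by ten would pass 2^31.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Overflow-checked decimal parser. Returns true and stores the value on
   success, false when the number does not fit in a signed 32-bit integer. */
bool dec_to_bin_checked(const char *s, int32_t *out)
{
    assert(s != NULL && out != NULL);
    bool neg = (*s == '-');
    if (neg)
        s++;

    uint32_t v = 0;
    for (;;) {
        uint32_t d = (uint32_t)((unsigned char)*s ^ '0');  /* xor dl,'0' */
        if (d >= 10)                  /* cmp dl,10: not a digit, stop */
            break;
        if (v >= 214748365u)          /* v*10 would pass 2^31 */
            return false;
        v = v * 10u + d;
        s++;
    }
    /* Positive results must fit in 31 bits; negative ones may reach 2^31. */
    uint32_t limit = neg ? 0x80000000u : 0x7FFFFFFFu;
    if (v > limit)
        return false;
    if (neg && v != 0)
        *out = -(int32_t)(v - 1u) - 1;   /* avoids signed overflow at -2^31 */
    else
        *out = (int32_t)v;
    return true;
}
```

Checking the running value before the multiply keeps the arithmetic inside 32 bits, so no wider type is needed.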


KetilO
Posted on 2002-08-12 01:53:42 by KetilO
bitRAKE, thank you for the test! :)
Cheat? :) I spin in my CPU and forgot about upper :(
Ha. The test environment may have many cheats :)
OK, I made a correction:


AsciiToDw4 proc uses esi,lpAscii:DWORD
mov esi,lpAscii
inc esi
movzx eax,byte ptr [esi-1]
sub eax,'0'
@@: movzx ecx,byte ptr [esi+0]
movzx edx,byte ptr [esi+1]
sub ecx,'0'
sub edx,'0'
cmp ecx,9
ja @F
lea eax,[eax+4*eax]
lea eax,[ecx+2*eax]
cmp edx,9
ja @F
add esi,2
lea eax,[eax+4*eax]
lea eax,[edx+2*eax]
jmp @B
@@: ret
AsciiToDw4 endp
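
Roughly, the loop above handles two characters per iteration and uses one unsigned compare per character to catch anything that is not a digit. A C sketch of that shape follows. The name ascii_to_dw_pairs is made up, the first digit is folded into the loop instead of being handled separately, and the second character is read only after the first has been accepted, so unlike the assembly it never looks past the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two digits per loop iteration, in the style of AsciiToDw4. */
uint32_t ascii_to_dw_pairs(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    uint32_t value = 0;
    for (;;) {
        uint32_t d0 = (uint32_t)p[0] - '0';   /* wraps to a huge value below '0' */
        if (d0 > 9)                           /* cmp ecx,9 / ja */
            break;
        value = value * 10u + d0;
        uint32_t d1 = (uint32_t)p[1] - '0';
        if (d1 > 9)                           /* cmp edx,9 / ja */
            break;
        value = value * 10u + d1;
        p += 2;                               /* add esi,2 */
    }
    return value;
}
```

For instance, ascii_to_dw_pairs("123") returns 123 after one full pair and one half pair.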


Results (I revived the Celeron at last :) ):


---+-------------+--------------
   | Celeron400  | AthlonXP1700+
---+-------------+--------------
   |  K  T  B  N |  K  T  B  N
 1 | 18 14 16 18 | 19 19 14 15
 2 | 23 23 17 21 | 24 23 23 21
 3 | 26 25 26 25 | 30 27 25 23
 4 | 28 29 26 20 | 34 32 31 25
 5 | 49 43 29 26 | 38 36 33 30
 6 | 50 49 52 33 | 42 44 37 33
 7 | 56 49 54 35 | 46 50 43 39
 8 | 58 56 63 43 | 61 66 45 41
---+-------------+--------------


Intel has strange clock gaps on some algos :)
Posted on 2002-08-12 10:20:49 by Nexo
Nexo, your link doesn't work.
Could you be so kind as to explain what it is all about,
and what I did again?
Posted on 2002-08-12 15:54:55 by The Svin
Hi The Svin

It works. Just copy both lines and paste them into the Address field. It's in Russian, so I can't read anything other than the code. :rolleyes:

KetilO
Posted on 2002-08-12 16:40:41 by KetilO
Thanks Ketil.
It's a discussion with Stepan about my algo;
he suggested rearranging a couple of lines to make it smaller in
size (though a couple of clocks slower).
In the end I took his advice.
Stepan and I discuss a lot of things.

Nexo, you could find a better use for your brain than trying to smear me with plagiarism.
You will never succeed, though you keep repeating your "again"s.
Recalling old talks, Stepan did not even have a 64-bit decimal conversion before I gave him my code. Then we discussed it a lot, exchanging messages by mail and on FidoNet.

Your super library, maintained by Stepan, carries some procs with my changes and ideas, and I don't care that Stepan doesn't mention me, because the main ideas were his.

You were a witness when I changed his MMX conversion proc and he updated the super library with it, and I mailed him about it; you can ask him. Is my name there for those 2 changes? No, and I don't care. The main idea was his and I'm fine with that.
He helps me, I help him.
As for the 64-bit conversion and atodw, the main code and idea are mine, so my name is here.

He knows all about my activities, so if he cared he would say so.
So stop playing his lawyer here, because the only effect it would have is that I would stop submitting any code or stop contact with Stepan. I know you don't care, because you'll get the best of my code through his libs anyway.
Is that what you want?

Then you have almost succeeded: I am starting to regret that I ever shared my code with anybody.
Goodbye.
Posted on 2002-08-12 17:18:45 by The Svin
The Svin, you can use any code wherever and however you want. You can mark every line as your own (C) and so on. That is your business. I don't know what ideas you are talking about, because Stepan uses his own optimizing tools, which apply the basic rules of optimization and some trick methods. I want to create similar tools, but that needs a lot of experience. Stepan may give a suggestion in the form of a piece of code, but not finished optimized code. I am glad to have such a teacher. I really don't know why you posted this slow code here (~17-20% off your super-speed m32lib version). Either you are inconsistent in code optimization or it is something else. You can regret sharing code. There are many other good programmers here. The loss will be not so big. I am still here because I like assembler and good company.
Posted on 2002-08-12 22:02:09 by Nexo
Guys,

I know that both of you are very experienced programmers and I have seen some excellent code from both of you so I would hope that you both can find a way to get on with each other as the whole community benefits from the algorithm design that both of you have contributed.

Surely small differences are not things worth arguing about as there will always be range and variation in assembly coding. I would hate to see either of you offended from things that were said in this forum when both of you have much to contribute.

Regards,

hutch@movsd.com
Posted on 2002-08-13 00:45:44 by hutch--
A branch-minimized algo :) It is only for study, not for use ;)
The number range is 0 to 10^8-1.


align 8
n2F dq 2F2F2F2F2F2F2F2Fh
n39 dq 3939393939393939h
n30 dq 3030303030303030h

mult label qword
dw 1,0,0,0 ; 1
dw 10,1,0,0 ; 2
dw 100,10,1,0 ; 3
dw 1000,100,10,1 ; 4

mult2 dd 10,100,1000,10000

mov eax,[lpAscii]
pxor mm7,mm7
movq mm0,[eax]
movq mm1,mm0
movq mm2,mm0
pcmpgtb mm1,[n2F]
pcmpgtb mm2,[n39]
psubb mm0,[n30]
pandn mm2,mm1
pand mm0,mm2
pmovmskb eax,mm2
not eax
bsf ecx,eax
test eax,11110000b
je @F
punpcklbw mm0,mm7
pmaddwd mm0,[mult+8*ecx-8]
pshufw mm1,mm0,1110b
paddd mm0,mm1
movd eax,mm0
ret
@@: movq mm1,mm0
punpcklbw mm0,mm7
punpckhbw mm1,mm7
pmaddwd mm0,[mult+3*8]
pmaddwd mm1,[mult+8*ecx-8*5]
pshufw mm2,mm0,1110b
pshufw mm3,mm1,1110b
paddd mm0,mm2
paddd mm1,mm3
movd edx,mm0
movd eax,mm1
imul edx,[mult2+4*eax-4*5]
add eax,edx
ret


*Edit* little bug fixed :)
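
For anyone who does not read MMX, the arithmetic idea behind the listing can be sketched in scalar C: count the leading digits, then weight each digit by a power of ten picked from a table according to that count. This is only a model of the arithmetic under my own assumptions. It does not reproduce the MMX instruction sequence or its branch behaviour, and POW10 and ascii_to_dw_table are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Weights for up to eight digits, matching the stated range 0 to 10^8-1. */
static const uint32_t POW10[8] = {
    1u, 10u, 100u, 1000u, 10000u, 100000u, 1000000u, 10000000u
};

uint32_t ascii_to_dw_table(const char *s)
{
    assert(s != NULL);
    /* Count leading digits (the role played by pmovmskb/bsf above). */
    unsigned n = 0;
    while (n < 8 && (unsigned)(s[n] - '0') <= 9u)
        n++;

    /* Weighted sum: the first digit gets 10^(n-1), the last gets 1. */
    uint32_t value = 0;
    for (unsigned i = 0; i < n; i++)
        value += (uint32_t)(s[i] - '0') * POW10[n - 1u - i];
    return value;
}
```

Because the weights depend only on the digit count, the multiplications are independent of each other, which is what makes the packed multiply-add approach possible.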
Posted on 2002-08-13 09:57:53 by Nexo
The loss will be not so big.


:)
I just wondered when at the end you'd say it.

OK Mr. Nexo, I'll give you a taste of your own medicine when I have a little time.
That is a promise.
Posted on 2002-08-13 13:28:25 by The Svin
I modified the test.

Nexo,
in your modified test, running on my machine, your proc still shows THE SLOWEST results.
For all string lengths!
And the m32lib version replaced with what I posted here (you called it super slow) is the fastest.
Bearing in mind your age, I will not laugh at you; I'll just give you a possible
explanation:
Most likely you test (and as a consequence optimize) for Athlon.
And what is fastest on Athlon turned out to be slowest on the original Pentium.
I'll give you my position:
I write for Pentium; for me Athlon doesn't exist.
Not because I hate Athlon, oh NO, I am absolutely indifferent to it.
There is a very simple explanation:
all machines that run my code are Pentiums (from PMMX to PIII).
If in the future I need to write for Athlon, I'll study it. Until then I don't care.

So next time you are going to say anything about which is slow or fast,
just remember that you are looking at the world through horse-blinders named "Athlon".
And what is true for Athlon might not be true for a brand-name Pentium.

I mean, several times you gave me advice to change my code, and every time
your modifications ran slower on my machine than my original.

Take care :)
Posted on 2002-08-13 14:01:33 by The Svin
Nexo, I was thinking of the same algo with MMX. :) Very good.
Posted on 2002-08-14 00:11:41 by bitRAKE


all machines that run my code are Pentiums (from PMMX to PIII).
If in the future I need to write for Athlon, I'll study it. Until then I don't care.

That is your problem.


just remember that you are looking at the world through horse-blinders named "Athlon".
And what is true for Athlon might not be true for a brand-name Pentium.

I noted that my proc is special for Athlon. Be aware of the problem of test stability. But I also ran the test on an Intel Celeron. The Celeron 400 MHz belongs to the Pentium II family of processors (or don't you know that?). I don't understand why you posted this.

If you write code for Pentium II and Pentium III (as you wrote above), then you must know how those processors work. Pentium and Pentium MMX have very restricted superscalar execution. The other Pentium processors have more pipelines and more execution units working at the same time. To learn more, you can look at the results of the VTune Performance Analyzer (I use it for optimizing my algos!) and compare the two algos. The algo you presented will have many more stalls. That is true for all Intel CPUs from Pentium Pro to Pentium 4 (the majority of all Intel models). I am not talking about AMD. And I hardly consider such rare CPUs (Pentium and PMMX) in optimization. I threw my PMMX166 in the trash can.

In your last test (file ta2w.zip) you made a strange change. Everyone, look here:


AsciiToDw2 proc lpAscii:DWORD

mov ecx,lpAscii
xor eax,eax
xor edx,edx
; jmp @1
@@:
lea eax,[eax*4+eax] ;Multiply by 5
lea eax,[eax*2+edx] ;Multiply by 2 and add digit in edx
inc ecx
@1:
mov dl,[ecx]
sub dl,'0'
jb @f
cmp dl,10
jb @b
@@:
ret

AsciiToDw2 endp

I think it does not work correctly on ANY processor.
I am also more disappointed in you :(
You can continue to laugh at me.
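
One way to read the objection: with the jmp @1 commented out, execution falls straight into the loop body, so the pointer is advanced once before the first character is ever read. The C sketch below models both entry points under that reading. The function names are invented, and the second function assumes a non-empty string, because it steps forward before it reads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Loop entered at the read, as with the original jmp @1. */
uint32_t parse_with_jump(const char *s)
{
    assert(s != NULL);
    uint32_t v = 0;
    for (;;) {
        uint8_t d = (uint8_t)(*s - '0');    /* mov dl,[ecx] / sub dl,'0' */
        if (d >= 10)
            break;
        v = v * 10u + d;
        s++;
    }
    return v;
}

/* Loop entered at the multiply, as in the listing with jmp @1 removed. */
uint32_t parse_without_jump(const char *s)
{
    assert(s != NULL && *s != '\0');        /* the body advances before reading */
    uint32_t v = 0;
    uint8_t d = 0;                          /* xor edx,edx */
    do {
        v = v * 10u + d;
        s++;                                /* inc ecx runs before the first read */
        d = (uint8_t)(*s - '0');
    } while (d < 10);
    return v;
}
```

On the string "123" the first version returns 123, while the second returns 23, because the leading '1' is never looked at.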
Posted on 2002-08-14 11:34:04 by Nexo
Why the jb @f?
1 less instruction:

	mov	dl, [ecx]
	sub	dl, '0'
	cmp	dl, 10
	jb	@b
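
The reason the extra jb can go is that after the subtraction a single unsigned compare covers both ends of the range: a byte below '0' wraps around to a large value and fails the "below 10" test as well. A small C check of that claim, with function names that are only illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Range test with two comparisons. */
bool is_digit_two_tests(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* Range test with one subtraction and one unsigned comparison,
   which is what sub dl,'0' / cmp dl,10 / jb amounts to. */
bool is_digit_one_test(unsigned char c)
{
    return (unsigned char)(c - '0') < 10;
}

/* Exhaustive check that both tests agree for every byte value. */
void check_all_bytes(void)
{
    for (int c = 0; c < 256; c++)
        assert(is_digit_one_test((unsigned char)c) ==
               is_digit_two_tests((unsigned char)c));
}
```

Calling check_all_bytes() passes silently, since there are only 256 cases to try.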
Posted on 2002-08-14 12:39:10 by iblis
Nexo,

If I have learnt this much, if you have to write algorithms that are general purpose, you must test them on both Intel and AMD and average the end result across both.

I keep an older AMD here to test against my PIII as they were about similar period technology. I have done some work on my PIV and it is different code to the PIII and to get later design algos going properly, they would need to be tested on an Athlon as well. The PIII is the easiest code to mess up and get bad results for, it behaves differently to older Intel processors and different to the PIV as well.

When I last did some serious work that had to work on both, Intel processors had better branch prediction but had bad penalties for pipeline stalls, AMD had a shorter pipeline and a lower stall penalty but was slower in unpredictable branching.

Regards,

hutch@movsd.com
Posted on 2002-08-14 22:24:07 by hutch--

If I have learnt this much, if you have to write algorithms that are general purpose, you must test them on both Intel and AMD and average the end result across both.

Yes, you are right. You can look at the Intel and AMD test results in my posts in this thread. But I can't run tests on Intel 386, 486, Pentium, PMMX or on 386, 486, 586, K5 processors. In any case, I prefer to have several algos for different processors and to build the application for the target CPU (or blind). I don't support the oldest ones (see above), because they are rare (they have low frequency). Maybe you also optimize for Transmeta? I do when it is needed. Why do you choose Intel and AMD for averaging the tests? I think it is for the same reasons that testing on the oldest CPUs gets forgotten.

Thanks.
Posted on 2002-08-15 10:48:21 by Nexo