;Fastest DWORD to ASCII HEX code by The Svin//Russia
;fastadw
;in:  eax = number
;out: edx:eax = ASCII HEX ready to be stored into a string
;     (edx = digits of the high word, eax = digits of the low word)
;fastadwsz
;in:  eax = number
;     dest = address of the buffer that receives the ASCII HEX of the number
;fastadwszsp is the same as fastadwsz but inserts 20h between the high and low words:
;if eax == ffffffff then the string in dest will be 'FFFF FFFF',0
;note: the procs use ebx and ecx without saving them
; Test the speed ;) fastadw ~ 18 clocks.
;use: fastadwsz proto :DWORD
; mov eax,number ;if it's not already in eax
; invoke fastadwsz,addr dest ;or invoke fastadwszsp,addr dest
.586
.model flat,stdcall
option casemap:none
.code
fastadw proc
mov edx,eax
shl eax,4
and edx,0FFFF0000h
mov ebx,eax
shr edx,12
and eax,0ff0h
shr bh,4
shr al,4
and ebx,0f0f00h
mov ecx,edx
shl ebx,8
add eax,06060606h
and edx,0ff0h
add ebx,eax
shr ch,4
mov eax,ebx
shr dl,4
and ebx,10101010h
and ecx,0f0f00h
shr ebx,4
shl ecx,8
sub eax,ebx
add edx,06060606h
shl ebx,3
add eax,2a2a2a2ah
add ecx,edx
add eax,ebx
mov edx,ecx
bswap eax
;################################################
and ecx,10101010h
shr ecx,4
sub edx,ecx
shl ecx,3
add edx,2a2a2a2ah
add edx,ecx
bswap edx
ret
fastadw endp
fastadwsz proc dest:DWORD
call fastadw
mov ebx,dest
mov dword ptr [ebx],edx ;digits of the high word
mov dword ptr [ebx+4],eax ;digits of the low word
mov byte ptr [ebx+8],0 ;terminating zero
ret
fastadwsz endp
fastadws proc dest:DWORD
call fastadw
mov ebx,dest
mov dword ptr [ebx],edx
mov dword ptr [ebx+4],eax
ret
fastadws endp
fastadwszsp proc dest :DWORD
call fastadw
mov ebx,dest
mov dword ptr [ebx],edx
mov byte ptr [ebx+4],' '
mov dword ptr [ebx+5],eax
mov byte ptr [ebx+9],0
ret
fastadwszsp endp
end
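For anyone who wants to check the arithmetic outside an assembler, here is a minimal C model of what fastadw appears to do: spread each nibble into its own byte, add 6 so that the values 10..15 carry into bit 4, then use that carry bit to add the extra 7 that separates '9' from 'A'. The names spread16, nibbles_to_ascii and dword_to_hex are made up for this sketch, and it models only the result, not the instruction pairing or the clock count.

```c
#include <stdint.h>

/* Hypothetical helper: put the four nibbles of a 16-bit value into four bytes. */
static uint32_t spread16(uint32_t w)
{
    return (w & 0x000Fu) | ((w & 0x00F0u) << 4) |
           ((w & 0x0F00u) << 8) | ((w & 0xF000u) << 12);
}

/* Four nibbles (one per byte, each 0..15) to four ASCII hex digits at once. */
static uint32_t nibbles_to_ascii(uint32_t x)
{
    uint32_t t = x + 0x06060606u;           /* bytes holding 10..15 now have bit 4 set */
    uint32_t c = (t & 0x10101010u) >> 4;    /* 1 in every byte that must become A..F   */
    return t - c + 0x2A2A2A2Au + (c << 3);  /* nibble + 30h, plus 7 where c is 1       */
}

/* Model of fastadwsz: writes 8 hex digits and a terminating zero. */
void dword_to_hex(uint32_t n, char out[9])
{
    uint32_t hi = nibbles_to_ascii(spread16(n >> 16));
    uint32_t lo = nibbles_to_ascii(spread16(n & 0xFFFFu));
    for (int i = 0; i < 4; i++) {
        /* most significant digit first: same effect as bswap plus a dword store */
        out[i]     = (char)((hi >> (24 - 8 * i)) & 0xFFu);
        out[4 + i] = (char)((lo >> (24 - 8 * i)) & 0xFFu);
    }
    out[8] = '\0';
}
```

Calling dword_to_hex(0x1234ABCD, buf) should leave the text 1234ABCD in buf.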
Svin,
Compliments on the published procedures. Now here is the offer: tidy them up so they are reliable, make a module out of each one and document them so everyone can understand and use them, and I will put them in the next version of the MASM32 library so that they are available to any MASM programmer who needs them.
Put your copyright and email address at the top of each module so that everyone knows you are the author of the code, and send them to me at my email address.
regards,
hutch@pbq.com.au
Thank you for the offer.
It will be a privilege. I'll do my best :)
To get the work done I have some questions to ask.
Is it OK to discuss them here, or should I send them by e-mail?
For now I want to say just a few words:
I'm a grateful user of the MASM32 package.
Many thanks to you and Iczelion for the work you've done for Win32asm :)
I dared to post some messages here after I'd downloaded the new SP for MASM32.
I found a lot of new procs in the M32LIB directory and thought maybe it was time
to share some ideas and notes I'd never shared with anybody.
I love the idea of making a Win32asm standard library the way it was done in MASM32,
and I make use of most of the procs.
But I have optimized most of them, so some of them run 2-3 times faster now.
I didn't change whole procedures, only small parts of them.
Critical parts :)
I always clock (test the speed of) a proc before I accept it for use.
I hope others do the same.
Two short examples:
Look at dwtoa.asm by Tim Roberts.
I like the proc, because it's easy to use and universal.
But on a close look we can see that the most clock-consuming part is:
; mov ecx, 10
; .while (eax > 0) ; while there is more to convert...
; xor edx, edx
; div ecx ; put next digit in edx
; add dl, '0' ; convert to ASCII
; mov [edi], dl ; store it
; inc edi
; .endw
Why? Because DIV is still one of the slowest instructions of the 386+ instruction set.
On the Pentium it takes 41 clocks and is not pairable (NP).
So each iteration of the loop takes about 48 clocks, up to 48*9 in total, just to divide
the number by 10 and get its remainder.
But if we replace the code above with the following, it does the same job 4 times
faster:
mov ecx,429496730 ;ecx = 2^32/10 rounded up; the quotient below is exact for num < 2^30
mov ebx,eax ;eax = num
mul ecx ;edx = num/10
mov eax,edx ;eax = num/10
lea edx,[edx+edx*4] ;edx = (num/10)*5
add edx,edx ;edx = num - num mod 10
sub ebx,edx ;ebx = num mod 10
add bl,'0'
mov [edi],bl
inc edi
.while (eax > 0)
mov ebx,eax
mul ecx
mov eax,edx
lea edx,[edx+edx*4]
add edx,edx
sub ebx,edx
add bl,'0'
mov [edi],bl
inc edi
.endw
If you doubt it, test the speed.
After the change this part runs 4 times faster, and the whole proc 2.5 times
faster.
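The multiply trick is easy to check on paper or in a few lines of C. The sketch below models what edx holds after mul ecx with the constant 429496730, and compares it with a wider constant (0CCCCCCCDh followed by a shift right of edx by 3) that is often used for the same job. The function names are made up for the illustration. By my arithmetic the posted constant agrees with num/10 for every num below 2^30 and first goes wrong at 1073741829; from there on it is one too high for some inputs, so only the first division of a very large number is affected, because every later quotient is already below 2^30.

```c
#include <stdint.h>

/* edx after "mul ecx" with ecx = 429496730 (2^32/10 rounded up). */
uint32_t div10_posted(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 429496730u) >> 32);
}

/* Alternative for comparison: 0xCCCCCCCD is 2^35/10 rounded up, so the
   product is shifted by 35 (in MASM terms: mul, then shr edx,3). */
uint32_t div10_wide(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Remainder rebuilt the same way the loop does it: n - (n/10)*10. */
uint32_t mod10_from_quotient(uint32_t n, uint32_t q)
{
    uint32_t q10 = q + q * 4;   /* lea edx,[edx+edx*4] */
    q10 += q10;                 /* add edx,edx         */
    return n - q10;             /* sub ebx,edx         */
}
```

For example, mod10_from_quotient(1234, div10_posted(1234)) gives the digit 4, which the loop then turns into '4' by adding '0'.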
Another example: let's take a look at the new A2DW.ASM by Iczelion.
A good and comprehensive proc.
xor ecx, ecx
mov edi, String
invoke lstrlen, String
.while eax != 0
xor edx, edx
mov dl, byte ptr [edi]
sub dl, "0" ; subtract "0" from each digit to convert it to its numeric value
mov esi, eax
dec esi
push eax
mov eax, edx
push ebx
mov ebx, 10
.while esi > 0
mul ebx
dec esi
.endw
pop ebx
add ecx, eax
pop eax
inc edi
dec eax
.endw
mov eax, ecx
ret
Yet we can make it shorter and run 3 times faster:
xor ecx, ecx
mov edi, String
invoke lstrlen, String
.while eax != 0
xor edx, edx
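mov dl, byte ptr [edi]
sub dl, "0"
mov esi, eax
dec esi ; eax-1 digits follow this one, so multiply by 10 that many times
.while esi > 0
lea edx,[edx+edx*4] ; edx = edx*5
add edx,edx ; edx = edx*10
dec esi
.endw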
add ecx, edx
inc edi
dec eax
.endw
mov eax, ecx
ret
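To see that the lea/add pair really does the work of mul ebx here, the loop can be modelled in C. This is only a sketch with made-up names (times10, a2dw_model); it assumes, as the original proc does, that each digit has to be multiplied by 10 once for every digit that follows it, and it does no checking for non-digit characters or overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* x*10 without mul: lea edx,[edx+edx*4] then add edx,edx. */
static uint32_t times10(uint32_t x)
{
    x = x + x * 4;   /* x*5  */
    return x + x;    /* x*10 */
}

/* Model of the proc: every digit is scaled by 10 once per digit that follows it. */
uint32_t a2dw_model(const char *s)
{
    uint32_t total = 0;                     /* ecx */
    size_t remaining = strlen(s);           /* eax */
    for (; remaining != 0; s++, remaining--) {
        uint32_t d = (uint32_t)(*s - '0');  /* edx */
        for (size_t k = remaining - 1; k > 0; k--)  /* esi */
            d = times10(d);
        total += d;
    }
    return total;
}
```

With the string "1234" the inner loop runs 3, 2, 1 and 0 times, so the sum is 1000 + 200 + 30 + 4.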
Those procs are already in your MASM32 package.
And I don't pretend to be the author :)
I just sped them up a little bit ;)
So I wonder whether you or the authors would be interested in these optimizations,
would look over the changed parts, and might be persuaded to
replace the current versions of the procs with the optimized ones?
Excuse my ability to express myself in English.
I'm not a native English speaker :)
Svin,
Your optimisations look great. I am sure any of the authors would be pleased to see them made available so that other assembler programmers can use them as well. A lot of the reason why programmers write in assembler is to get the speed advantage, so any faster code will be appreciated.
The important thing with library modules is to get them reliable so that they work across the range that they are supposed to. The modules need to be stack parameter based so that they have a standard interface, as register-passed parameters are harder for many people to use.
The form used in the existing range of modules in the MASM32 library is the form that we need. One thing that is important is to put each procedure in its own module; this keeps the granularity of the library down so that unused code does not get included.
It is not a problem to replace an existing module with a faster one once it is reliable. I used one of Tim Roberts' modules to replace one of mine, I have fully rewritten some of mine and I have replaced some earlier versions with faster ones.
As with other contributed code, make sure you put your name and copyright at the top of each module so that everyone knows who wrote the code.
As far as posting code goes, I think most would be pleased to see the optimisations that you are doing, but it's fine to send them to me when you are satisfied with them and they are reliable. Just send them to my email address.
I would not worry about your English, it's a lot better than my Russian. :)
Regards,
hutch@pbq.com.au
Svin,
Please supply the code you use to measure the performance.
I have one from the MASM32 help, but it's not clear to me.
Please choose (at least for me) another alias.
I am Russian too, and it is very difficult to address
you by the present one.
Your fastadw is fantastic. I'm delighted.
DVA
;It should be corrected a little bit,
;but at least it can give you the basic idea.
;This code tests three string length algorithms.
;The first is mine (I put it in the worst place :),
;the 2nd is the macro from MASM32, and the 3rd is
;an old way to measure string length.
;Try to extend the program to measure the lstrlen API
;function.
;You'll be shocked how slowly it runs,
;esp. when you shorten the tested string.
;I love A.Fog and R.Hyde, but Mr. Hyde must drop his HLA
;and begin teaching us to use boolean algebra in the 386 flat model
;world :)
;Just joking ;) But I really need some fat book written just
;about algorithms in the 386 flat model with lots of exercises and
;examples.
;Do you know one?
.586
.model flat,stdcall
option casemap:none
include C:\masm32\include\windows.inc
include C:\masm32\include\user32.inc
include C:\masm32\include\kernel32.inc
include C:\masm32\include\masm32.inc
includelib C:\masm32\lib\kernel32.lib
includelib C:\masm32\lib\user32.lib
includelib C:\masm32\lib\masm32.lib
TimeTest_ON macro
db 0fh,31h ;rdtsc - read (TSC)
push edx ;save TSC
push eax
endm
TimeTest_OFF macro
db 0fh,31h ;rdtsc - new TSC
pop ebx
pop ecx
sub eax,ebx
sbb edx,ecx
endm
.data
buffer db 100 dup('*'),0
MT db 'One circle has taken: ',13,10
CT db 12 dup (0)
MC db '1000 circles',0
.code
start:
mov ecx,1000
TimeTest_ON
testcl: push ecx
lea edi,buffer
lea edx,buffer
ALIGN 4
again: mov al,[edi]
inc edi
or al,al
jnz again
sub edx,edi
not edx
pop ecx
dec ecx
jnz testcl
TimeTest_OFF
xor edx,edx
mov ebx,1000
div ebx
invoke dwtoa,eax,addr CT
invoke MessageBox,0,addr MT,addr MC,MB_OK
mov ecx,1000
TimeTest_ON
testcl2: push ecx
lea edi,buffer
xor eax, eax ; zero eax as counter
align 4
l: ; cycles
mov dl, [edi] ; 1
inc edi ; 1
inc eax ; 1
cmp dl, 0 ; 1
jne l ; 3
dec eax ; correct eax for extra digit
pop ecx
dec ecx
jnz testcl2
TimeTest_OFF
xor edx,edx
mov ebx,1000
div ebx
invoke dwtoa,eax,addr CT
invoke MessageBox,0,addr MT,addr MC,MB_OK
mov ecx,1000
TimeTest_ON
test3: push ecx
xor al,al
lea edi,buffer
mov ecx,-1
repne scasb
inc ecx
not ecx
pop ecx
dec ecx
jnz test3
TimeTest_OFF
xor edx,edx
mov ebx,1000
div ebx
invoke dwtoa,eax,addr CT
invoke MessageBox,0,addr MT,addr MC,MB_OK
invoke ExitProcess,0
end start
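The first loop in the program above has no counter of its own: it only advances edi and recovers the length at the end with sub edx,edi and not edx. A small C model of that ending, under the assumption that edx still holds the start address, shows why it works: the pointer stops one past the terminator, so start - end equals -(len+1), and the bitwise NOT of that is exactly len. The function name is made up for the sketch.

```c
#include <stdint.h>

/* Length from the pointer difference alone, as in the first test loop. */
uint32_t strlen_by_not(const char *s)
{
    uintptr_t start = (uintptr_t)s;   /* edx = start of the string */
    const char *p = s;                /* edi = running pointer     */
    char c;
    do {
        c = *p;                       /* mov al,[edi] */
        p++;                          /* inc edi      */
    } while (c != 0);                 /* or al,al / jnz again */
    /* p is now start + len + 1, so start - p = -(len + 1) and NOT gives len. */
    return (uint32_t)~(start - (uintptr_t)p);
}
```

That saves the inc eax in every iteration of the second loop, which is the difference between the two byte scanners being timed.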
Has anybody else tried testing the dw2hex proc written by f0dder in masm32? It seems it is about 20 times faster on my PC (PII or PIII, I can't remember). It's probably just me, or something to do with the way Pentiums operate, with dual pipes, extra caches and the rest, because I never got the same count twice.
I would also like to point out that the results of reading the clock counter can be severely corrupted if an interrupt occurs during the timing sequence (in Windows the timer interrupt could be called at least once every millisecond, if not more often), so beware. You can either clear the interrupt flag with a cli instruction (don't forget to set it again with sti) or repeat the test several times. Also be careful, because if your app gets stuck in an infinite loop it WILL crash Windows.
Do you mean his (f0dder) trick?
add al, 90h
daa
adc al, 40h
daa
That's a good one, but it can be replaced ;) with:
cmp al,10
sbb al,69h
das
Which is shorter and twice as fast :)
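If the three-instruction version looks like magic, it can be stepped through in C. The sketch below emulates cmp al,10 / sbb al,69h / das for a nibble in al, following the flag rules for SBB and DAS as I understand them from the Intel instruction reference; the function name is made up, and only the flags this sequence depends on are tracked.

```c
#include <stdint.h>

/* Emulate: cmp al,10 / sbb al,69h / das   for al = 0..15. */
uint8_t nibble_to_hex_das(uint8_t al)
{
    /* cmp al,10 : CF = 1 when al < 10 */
    unsigned cf = (al < 10);

    /* sbb al,69h : al = al - 69h - CF; AF is the borrow out of the low nibble */
    unsigned af = ((al & 0x0Fu) < (0x09u + cf));
    unsigned borrow = ((unsigned)al < (0x69u + cf));
    al = (uint8_t)(al - 0x69u - cf);
    cf = borrow;

    /* das : subtract 6 if the low nibble is above 9 or AF is set,
             then subtract 60h if the old al was above 99h or CF is set */
    uint8_t old_al = al;
    if ((al & 0x0Fu) > 9 || af)
        al = (uint8_t)(al - 6);
    if (old_al > 0x99u || cf)
        al = (uint8_t)(al - 0x60u);
    return al;
}
```

For al = 0 the steps give 96h, then 90h, then 30h ('0'); for al = 10 they give A1h and then 41h ('A').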
BTW: how could you execute cli from Win32 ring 3?
You simply cannot touch CLI from ring 3, it's a protected instruction.
I know, you want to stop multitasking while you test your code. You can't do it that way.
The SIMPLEST way is to boost your test code's thread priority so the OS gives it the biggest slice of time, do multiple runs measuring the run time, and take the lowest number.
The harder way involves writing a VxD or WDM driver. The investment in that depends on how serious you are about measuring speed. You still need to know intimately how Windows operates at its lowest levels.
Personally, I'd test speed in DOS before I tried the driver approach.
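The "multiple runs, take the lowest" part can be sketched like this. It is only an illustration: best_of_runs, clock_ticks and sample_work are made-up names, and the standard C clock() stands in for whatever counter is really used (on Win32 that would more likely be rdtsc, with the thread priority raised first through SetThreadPriority). The idea is that interruptions can only add time, so the smallest reading is the one least disturbed by the OS.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Tick source for this sketch: the standard C clock(). */
static uint64_t clock_ticks(void)
{
    return (uint64_t)clock();
}

/* Example workload: take the length of a 100-byte string 1000 times. */
static char buffer[101];
static volatile size_t sink;

static void sample_work(void)
{
    memset(buffer, '*', 100);
    buffer[100] = '\0';
    for (int i = 0; i < 1000; i++)
        sink = strlen(buffer);
}

/* Run the work several times and keep the lowest elapsed tick count. */
uint64_t best_of_runs(void (*work)(void), uint64_t (*now)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = now();
        work();
        uint64_t elapsed = now() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

A call such as best_of_runs(sample_work, clock_ticks, 20) returns the quietest of twenty readings; with a coarse counter like clock() the work has to be long enough to register at all.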
Of course, we cannot use the cli instruction from ring 3 :)
That's why I was surprised to read X's advice to use it :)
I'd love to hear anybody's ideas on the subject of speed testing, though I might not always agree with them.
So thank you for sharing your approach to the matter.
Actually I use a combined method to get a picture of performance.
First with pen and paper and an instruction clock reference (or maybe VTUNE).
Then in the flat 386 DOS world (if that's possible and doesn't
somehow involve Win32-specific testing).
And finally in a real Win32 environment with some other tasks running.
All of the above gives me some info to think about.
For now, as I've noted, the trickiest part is testing an algorithm that uses memory access to read or write in Win32.
I can never be 100 % sure what surprise is waiting for me in
every such case :)
Benchmarking in ring3 is a pain at best, my experience is that there is about 2-3% variation which is directly from Operating System interference. While the resolution from RDTSC is very good, it suffers the same variation in ring3 so it is of little use.
What I do to get an algorithm timed with some degree of accuracy is to run GetTickCount() on a test that is at least 0.5 to 1 second in duration, and this gives an accuracy of about 0.5 of 1 percent. This is a lot simpler to do from ring3 and it is reasonably hard to improve on in a hurry.
Depending on the type of algorithm to test, some need a very large buffer to run them properly. Testing of the two string length algorithms in the MASM32 library was done with a 100 meg buffer in OLE string memory.
The reason for the two types is that the classic byte scanner is better suited for recursive small string reads where the algorithm by Agner Fog is clearly faster on long linear string scans. SCAS is a lot slower than both.
The InString algo was also benchmarked in this manner and it performs reasonably well, it has overlapping performance with a classic Boyer Moore string scanner but is a lot less complicated. I am of the view that the InString algo could do with some more optimisation but I have not had time to do it.
regards,
hutch@pbq.com.au
Svin,
thanks for the example.
> But I really need some fat book written just
> about algorithms in 386 model flat with lots of exser. and
> examples. Do you know one?
Try the Intel site. It has a few libraries for math, image and
signal processing, etc. According to their readmes, they
contain highly optimised procs and are fully documented.
The libs are very big and I have not downloaded any so far, so this information
isn't checked.
DVA
It may be different on NT, but on my PC (Win98)
cli
@@:
jmp @B
is a grand way to make Windows crash.
And there's always the long way round: using GetThreadContext and SetThreadContext.