I've created a "mult" macro that replaces (in some cases)
the slower "imul" instruction (up to 5 times faster than imul).
Currently, it supports all numbers between 0 and 50, and some others.
You can get it from www.beotel.net/~astancic (click on "Multiply.inc")
IMHO, writing a separate algorithm for every multiplication factor is inane. We discussed the Multiplication Algorithm subject a while ago. Maybe that will help you create a better macro. Good luck.
I disagree.
Writing a 'separate' macro for every number needed is, AFAIK,
simply the fastest method to MUL a reg with a number.
I've created a list with all the numbers between 0-50, just
to see how it works; that doesn't mean that I need all of them.
In the future, I'll just add the numbers that I need.
For example, now I need to mul with 4000:
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
shl reg, 5          ; 5*5*5*32 = 4000
Can you make it faster than that? (It's 44% faster than IMUL.)
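The lea/lea/lea/shl decomposition can be sanity-checked in C; this is a sketch mirroring the macro's steps, not the macro itself:

```c
#include <assert.h>

/* Strength-reduced multiply by 4000 = 5*5*5*32:
   each "x + x*4" step multiplies by 5 (one lea reg,[reg+reg*4]),
   and the final shift by 5 multiplies by 32 (shl reg,5). */
static unsigned mul4000(unsigned x)
{
    x = x + x * 4;   /* x*5   */
    x = x + x * 4;   /* x*25  */
    x = x + x * 4;   /* x*125 */
    return x << 5;   /* x*4000 */
}
```

Whether this beats a single IMUL depends entirely on the target microarchitecture, which is exactly what the rest of the thread argues about.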
Note that on newer processors "IMUL reg, imm" takes only 3 cycles. Since your code always uses the same reg, nothing can be executed in parallel, so it takes no less than 4 cycles. Additionally, the Athlon64 manual says that LEA takes two cycles rather than one when scaling is used, so perhaps we are talking about code that takes 7 cycles.
Of course your efforts are still good if your target is processors prior to Pentium Pro.
4. LEA instructions have a latency of 1 when there are two source operands (as in the base + index form LEA EAX, [base + index]). Forms with a scale or more than two source operands will have a latency of 2 (LEA EAX, [base + index*scale]).
I have always been interested in the fractions and percentages that people give. 44%? Where did you get the numerator and denominator that produced that percentage?
Now, about the actual code you gave:
LEA EAX, [EAX+EAX*4]
LEA EAX, [EAX+EAX*4]
LEA EAX, [EAX+EAX*4]
SHL EAX, 5
Takes 3 clock cycles to execute on my PIII 800MHz with the code segment aligned on a DWORD boundary. The equivalent of this code using IMUL:
IMUL EAX , EAX , 4000
Takes 1 clock cycle on the same machine to execute. Now if the value that has to be multiplied is in a variable, your code will take 13 clock cycles to execute:
MOV EAX, DWORD PTR [var]
LEA EAX, [EAX+EAX*4]
LEA EAX, [EAX+EAX*4]
LEA EAX, [EAX+EAX*4]
SHL EAX, 5
MOV DWORD PTR [var], EAX
while the equivalent code using IMUL takes 12 clock cycles on the same machine under the same conditions:
IMUL EAX, DWORD PTR [var], 4000
MOV DWORD PTR [var], EAX
XCHG, how did you measure the time? Did you remember to serialize before reading the TSC?
I just don't believe, yet, that any CPU can MUL in just one clock with a number such as 4000.
DECODE in one clock is OK, but the actual MUL...
Parallel execution (probably, I'm not sure with the IMUL instruction)
lets you do other stuff while IMUL is executing, but in my current app there's simply no other stuff to do.
Actually, I've made a mistake; it is 30%.
The numerator and denominator were given by the test prog below (MASM),
namely 150ms for the IMUL and 105ms for the macro.
;___________________________________________________________________________________________________________________________
number equ 4000
Start proc public
local text[128]:char
local count:UINT, count1:UINT, count2:UINT
local rez1:dword, rez2:dword
;---------------------------------------------------------------------------------------------------------------------------
status [count], GetTickCount ;invoke GetTickCount: mov [count], eax
mov eax,3
mov ecx,2500000
@@: imul eax,number
djnz ecx, @B ;DEC, JNZ
mov [rez1], eax
invoke GetTickCount
sub eax, [count]
mov [count1], eax
;---------------------------------------------------------------------------------------------------------------------------
status [count], GetTickCount
mov eax,3
mov ecx,2500000
@@: mult eax,number
djnz ecx, @B
mov [rez2], eax
invoke GetTickCount
sub eax, [count]
mov [count2], eax
;---------------------------------------------------------------------------------------------------------------------------
% invoke wsprintf, a text, qt2(<'number=%u', cr,lf,cr,lf, 'imul = %u, mult = %u', cr,lf,cr,lf, 'rez1=%u', cr,lf, 'rez2=%u'>), number, [count1], [count2], [rez1], [rez2]
invoke MessageBox, null, a text, qt2('Test'), null
ret
Start endp
;___________________________________________________________________________________________________________________________
end Start
I like to put beliefs aside, and actually test algorithms side-by-side. Below are the results in milliseconds for each code
250,000,000 iterations (100 times more than what aleksaZR had put)
Sempron 3000+, 1GB DDR@400 (timings 3, 4/8/4). Thread-switching timer: 16.7ms.
1: 454 vs 312
2: 468 vs 313
3: 468 vs 297
4: 469 vs 313
5: 469 vs 312
6: 469 vs 468
7: 453 vs 1016
8: 469 vs 313
9: 453 vs 313
10: 468 vs 469
11: 453 vs 953
12: 453 vs 469
13: 469 vs 1031
14: 469 vs 1125
15: 469 vs 625
16: 469 vs 313
17: 453 vs 1016
18: 453 vs 469
19: 469 vs 937
20: 468 vs 453
21: 468 vs 1032
22: 469 vs 1140
23: 453 vs 1094
24: 454 vs 468
25: 453 vs 625
26: 469 vs 1031
27: 453 vs 625
28: 469 vs 1031
29: 468 vs 985
30: 453 vs 781
31: 468 vs 1032
32: 468 vs 313
33: 454 vs 1015
34: 454 vs 1109
35: 454 vs 1046
36: 468 vs 469
37: 469 vs 1031
38: 454 vs 1250
39: 469 vs 1031
40: 469 vs 469
41: 469 vs 1031
42: 469 vs 1235
43: 469 vs 1250
44: 469 vs 1031
45: 453 vs 625
46: 468 vs 1032
47: 453 vs 1031
48: 468 vs 469
49: 469 vs 1031
50: 469 vs 781
51: 468 vs 985
52: 468 vs 1235
53: 453 vs 1000
54: 468 vs 766
55: 453 vs 1000
56: 454 vs 1093
57: 468 vs 1282
58: 469 vs 453
59: 468 vs 1125
60: 468 vs 922
61: 468 vs 1125
62: 453 vs 1187
63: 453 vs 1016
64: 469 vs 313
65: 453 vs 1016
72: 469 vs 453
80: 469 vs 469
81: 469 vs 625
90: 453 vs 781
96: 469 vs 469
100: 468 vs 766
1000: 468 vs 1079
4000: 468 vs 625
aleksaZR, thanks for the macro. Just note that currently no end-user is on a 166MHz CPU; everyone's on a Sempron/AthlonXP/64 or a P4/C2D. Optimization for these beasts is usually a whole other matter.
P.S. the code to measure timing was:
Start2 proc
local count, count1,count2
local rez1:dword, rez2:dword
invoke Sleep,0
;---------------------------------------------------------------------------------------------------------------------------
mov count,$invoke(GetTickCount)
mov eax,3
mov ecx,250000000
align 16
@@: imul eax,number
dec ecx
jnz @B
mov [rez1], eax
invoke GetTickCount
sub eax, [count]
mov [count1], eax
;---------------------------------------------------------------------------------------------------------------------------
mov count,$invoke(GetTickCount)
mov eax,3
mov ecx,250000000
align 16
@@: mult eax,number
dec ecx
jnz @B
mov [rez2], eax
invoke GetTickCount
sub eax, [count]
mov [count2], eax
trace "%d: %d vs %d",number,count1,count2
ret
Start2 endp
P.P.S. : XCHG, percentages like that are (OldAlgoCycles/NewAlgoCycles - 1)*100 .
So, if the old algo took 36 cycles and the new algo took 25 cycles, (36/25-1)*100 = 44% improvement. A 100% improvement is obviously a doubling of speed.
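Ultrano's convention is easy to mechanize; a small C helper, for illustration only:

```c
#include <assert.h>

/* Improvement in percent: (OldAlgoCycles/NewAlgoCycles - 1) * 100,
   as defined in the post above. Works for cycles or milliseconds. */
static double improvement_pct(double old_cost, double new_cost)
{
    return (old_cost / new_cost - 1.0) * 100.0;
}
```

By this convention, aleksaZR's 150ms vs 105ms comes out to about 42.9%, while his revised 30% figure corresponds to the other common convention, (1 - new/old)*100.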
You forgot to set realtime priority first, and the number of iterations is too small to draw a conclusion. Besides, the loops are not aligned, so perhaps one of them is enjoying better alignment than the other.
The flat assembler program below gives me this
---------------------------
Test
---------------------------
number=4000
imul = 406, mult = 938
rez1=0
rez2=0
I got the red warning before posting telling me that Ultrano had already posted something, but I couldn't resist posting my version :P
Since f0dder disliked my laziness, it now also takes care of affinity ;)
Fixed a big mess I made (apparently I did :?) with the tags
include 'win32axp.inc'
number equ 4000
proc Start
local text[128]:BYTE,\
count:DWORD, count1:DWORD, count2:DWORD,\
rez1:DWORD, rez2:DWORD,\
ProcessAffinityMask:DWORD, SystemAffinityMask:DWORD
push ebx esi
invoke GetCurrentProcess
mov ebx, eax
invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
;;;; On multi-{core|CPU} systems it is required to set affinity as well, but I don't have one :D
;;;; f0dder insisted in adding it anyway so here it is :P
invoke GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
test eax, eax
jz .letThePartyBegin
; Let's choose only one of the processors allowed for this process
mov esi, 1
bsf ecx, [ProcessAffinityMask]
shl esi, cl
; Now, since Win9x/Me is smart enough to have GetProcessAffinityMask but not its counterpart, we must check for its existence first
invoke LoadLibrary, 'KERNEL32.DLL'
test eax, eax
jz .letThePartyBegin
invoke GetProcAddress, eax, 'SetProcessAffinityMask'
test eax, eax
jz .letThePartyBegin
stdcall eax, ebx, esi
.letThePartyBegin:
invoke Sleep, 20 ; Let's get a fresh slice
;---------------------------------------------------------------------------------------------------------------------------
invoke GetTickCount
mov [count], eax
mov eax,3
mov ecx, $10000000
align 16
@@:
imul eax,number
dec ecx
jnz @B
mov [rez1], eax
invoke GetTickCount
sub eax, [count]
mov [count1], eax
invoke Sleep, 20 ; Let's get a fresh slice
;---------------------------------------------------------------------------------------------------------------------------
invoke GetTickCount
mov [count], eax
mov eax,3
mov ecx, $10000000
@@:
reg equ eax
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
shl reg, 5 ;(5*5*5*32)
restore reg
dec ecx
jnz @B
mov [rez2], eax
invoke GetTickCount
sub eax, [count]
mov [count2], eax
;---------------------------------------------------------------------------------------------------------------------------
invoke wsprintf, addr text, fmt, number, [count1], [count2], [rez1], [rez2]
invoke MessageBox, NULL, addr text, 'Test', NULL
pop esi ebx
ret
endp
fmt db 'number=%u', 13, 10, 13, 10, 'imul = %u, mult = %u', 13, 10, 13, 10, 'rez1=%u', 13, 10, 'rez2=%u',0
;___________________________________________________________________________________________________________________________
.end Start
LocoDelAssembly: it's a good idea to set the affinity even though you don't have a multicore machine... it doesn't really cost you anything, and means us people with multicore machines don't have to add it ourselves :P
Yes, I remember your advice from the other forum but seems that my disclaimer was not enough to avoid the suggestion ;)
I added it, but I can't be sure it's right because I don't remember how you did it.
I think everybody should be more supportive; not about the actual code, but about people's ideas. We are told every day about the magic new hardware can do, etc. We do not need to be optimization freaks all the time. ASM is also about reinventing the wheel, since most of you, or at least I myself, write some code in another language (even on paper) before trying to do the same in asm.
XCHG: Maybe just the title of the topic was misunderstood. Maybe "alternative" instead of "replace" would be better, but anyway I believe you guys are sometimes too cold-hearted about simple ideas. I learn a lot from each line of code everybody puts up, and it is good to know people still put effort into trying to discover new ways of doing things (of all sorts); especially since programming is not my living, it is good to see it. Sometimes I see people writing asm for 16-bit too, and I think there is even a lot of information about it around somewhere. Well, I hope you understand my point. I also love Symphony X, but a community should be made of chances for everybody to show off work. Or so I think. Not sure if you guys agree with me or not.
I agree with codename. Imho aleksaZR's macro just needs polishing:
1) we should see the speed results of all supported numbers in mult, compared to imul, like the ones I posted. (Though, in my case I simply spent several minutes recompiling and running the app for each number ^^.) The data we need to obtain this way is: with which numbers mult is faster than imul on all/most CPUs.
2) change the mult macro to support only the faster-than-imul versions
3) also make a mult2 macro, which accepts 3 parameters (dest,temp,imm) - where "temp" will remove the need for stack-usage. This optimization won't be noticed as an improvement in the synthetic benchmarks, but can give an extra cycle or two (we save the write-back to memory, after all).
I've changed the mult macro to support only the faster-than-imul versions, as Ultrano suggested.
Same-as-imul versions are also included, for the sake of some older CPUs.
You can now get a new mult, an alternative to imul;) and use it if you like it.
Thanks, everyone
P.S.
Ultrano also suggested mult2 which would use a register instead of a stack.
That, at least on my CPU, didn't give much more speed:
60ms instead of 75ms (mult by 7).
Also, using the stack allows multiplying by, say, 77 = 7*11.
Both the 7 and the 11 would use the stack (esp-4 and esp-8), which would probably
be harder with the regs.
Anyway, since the newer CPUs do IMUL just fine, I've deleted all stack usage.
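The 77 = 7*11 composition can be sketched in C using only registers; this illustrates chaining two shift-add multiplies, and is not aleksaZR's actual macro:

```c
#include <assert.h>

/* x*77 = (x*7)*11, chained from two shift-add multiplies:
   x*7  = (x<<3) - x
   t*11 = (t<<3) + (t<<1) + t */
static unsigned mul77(unsigned x)
{
    unsigned t = (x << 3) - x;        /* x*7  */
    return (t << 3) + (t << 1) + t;   /* t*11 */
}
```

The temporary t plays the role of the extra register (or stack slot) the thread debates; whether the chain beats one IMUL is CPU-dependent.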
Hello Aleksandre,
I also wrote a mul-replacement recursive macro.
It might be useful:
http://www.asmcommunity.net/board/index.php?topic=20441.0