I've created a "mult" macro that replaces (in some cases)
the slower "imul" instruction (up to 5 times faster than imul).

Currently, it supports all numbers between 0 and 50, and some others.

You can get it from www.beotel.net/~astancic (click on "Multiply.inc")
Posted on 2007-07-28 06:17:15 by aleksaZR
IMHO, writing a separate algorithm for every multiplication factor is inane. We discussed the Multiplication Algorithm subject a while ago. Maybe that will help you create a better macro. Good luck.
Posted on 2007-07-28 11:53:41 by XCHG

IMHO, writing a separate algorithm for every multiplication factor is inane. We discussed the Multiplication Algorithm subject a while ago. Maybe that will help you create a better macro. Good luck.

I disagree.

Writing a 'separate' macro for every number needed is, AFAIK,
simply the fastest method to MUL reg with a number.

I've created a list with all the numbers between 0-50 just
to see how it works; that doesn't mean that I need all of them.

In the future, I'll just add the numbers that I need.

For example, now I need to mul with 4000:
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
shl reg, 5 ;(5*5*5*32 = 4000)
Can you make it faster than that? (It's 44% faster than IMUL.)
Posted on 2007-07-29 05:34:48 by aleksaZR
Note that on newer processors "IMUL reg, imm" takes only 3 cycles. Since your code seems to always use the same reg, nothing can be executed in parallel, and hence it takes no fewer than 4 cycles. Additionally, the Athlon64 manual says that LEA takes two cycles rather than one when scaling is used, so perhaps we could be talking about code that takes 7 cycles.

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index
form LEA EAX, [EBX+ECX]). Forms with a scale or more than two source operands will have a latency of 2 (LEA
EAX, [EBX+ECX*2]).

Of course your efforts are still good if your target is processors prior to Pentium Pro.
Posted on 2007-07-29 09:19:36 by LocoDelAssembly

IMHO, writing a separate algorithm for every multiplication factor is inane. We discussed the Multiplication Algorithm subject a while ago. Maybe that will help you create a better macro. Good luck.

I disagree.

Writing a 'separate' macro for every number needed is, AFAIK,
simply the fastest method to MUL reg with a number.

I've created a list with all the numbers between 0-50, just
to see how it works, that doesn't mean that I need all of them.

In the future, I'll just add the numbers that I need.

For example, now I need to mul with 4000:
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
shl reg, 5 ;(5*5*5*32 = 4000)
Can you make it faster than that? (It's 44% faster than IMUL.)

I have always been interested in the fractions and percentages that people give. 44%? Where did you get the numerator and the denominator that produced that percentage?

Now about the actual code, the code that you gave:

``
LEA     EAX, [EAX+EAX*4]
LEA     EAX, [EAX+EAX*4]
LEA     EAX, [EAX+EAX*4]
SHL     EAX, 5
``

Takes 3 clock cycles to execute on my PIII 800 MHz with the code segment aligned on a DWORD boundary. The equivalent of this code using IMUL:

``IMUL    EAX, EAX, 4000``

Takes 1 clock cycle on the same machine to execute. Now, if the value that has to be multiplied is in a variable, your code will take 13 clock cycles to execute:

``
MOV     EAX, DWORD PTR [value]
LEA     EAX, [EAX+EAX*4]
LEA     EAX, [EAX+EAX*4]
LEA     EAX, [EAX+EAX*4]
SHL     EAX, 5
MOV     DWORD PTR [value], EAX
``

This is while the equivalent of this code using IMUL will take 12 clock cycles on the same machine under the same conditions:

``
IMUL    EAX, DWORD PTR [value], 4000
MOV     DWORD PTR [value], EAX
``

Posted on 2007-07-29 12:19:51 by XCHG
XCHG, how did you measure the time? Did you remember to serialize before reading the TSC?
Posted on 2007-07-29 12:26:28 by LocoDelAssembly
I just don't believe, yet, that any CPU can MUL by a number such as 4000 in just one clock.
DECODE in one clock is OK, but the actual MUL...

Parallel execution (probably; I'm not sure about the IMUL instruction)
lets you do other stuff while IMUL is executing, but in my current app there's simply no other stuff to do.

Actually, I've made a mistake; it is 30%.
The numerator and denominator were given by the test program below (MASM),
namely 150 ms for IMUL and 105 ms for the macro.

;___________________________________________________________________________________________________________________________
number equ 4000

Start proc public
local text[128]:char
local count:UINT, count1:UINT, count2:UINT
local rez1:dword, rez2:dword

;---------------------------------------------------------------------------------------------------------------------------
status [count], GetTickCount ;invoke GetTickCount: mov [count], eax

mov eax,3
mov ecx,2500000
@@: imul eax,number
djnz ecx, @B ;DEC, JNZ

mov [rez1], eax

invoke GetTickCount
sub eax, [count]
mov [count1], eax

;---------------------------------------------------------------------------------------------------------------------------
status [count], GetTickCount

mov eax,3
mov ecx,2500000
@@: mult eax,number
djnz ecx, @B

mov [rez2], eax

invoke GetTickCount
sub eax, [count]
mov [count2], eax

;---------------------------------------------------------------------------------------------------------------------------
% invoke wsprintf, a text, qt2(<'number=%u', cr,lf,cr,lf, 'imul = %u, mult = %u', cr,lf,cr,lf, 'rez1=%u', cr,lf, 'rez2=%u'>), number, [count1], [count2], [rez1], [rez2]
invoke MessageBox, null, a text, qt2('Test'), null
ret
Start endp
;___________________________________________________________________________________________________________________________
end Start
Posted on 2007-07-29 14:59:57 by aleksaZR
I like to put beliefs aside, and actually test algorithms side-by-side. Below are the results in milliseconds for each code

250,000,000 iterations (100 times more than what aleksaZR had put)
Sempron 3000+, 1GB DDR@400 (timings 3, 4/8/4). Thread-switching timer: 16.7 ms.

``
number: imul vs mult (ms)
 1: 454 vs  312     2: 468 vs  313     3: 468 vs  297     4: 469 vs  313
 5: 469 vs  312     6: 469 vs  468     7: 453 vs 1016     8: 469 vs  313
 9: 453 vs  313    10: 468 vs  469    11: 453 vs  953    12: 453 vs  469
13: 469 vs 1031    14: 469 vs 1125    15: 469 vs  625    16: 469 vs  313
17: 453 vs 1016    18: 453 vs  469    19: 469 vs  937    20: 468 vs  453
21: 468 vs 1032    22: 469 vs 1140    23: 453 vs 1094    24: 454 vs  468
25: 453 vs  625    26: 469 vs 1031    27: 453 vs  625    28: 469 vs 1031
29: 468 vs  985    30: 453 vs  781    31: 468 vs 1032    32: 468 vs  313
33: 454 vs 1015    34: 454 vs 1109    35: 454 vs 1046    36: 468 vs  469
37: 469 vs 1031    38: 454 vs 1250    39: 469 vs 1031    40: 469 vs  469
41: 469 vs 1031    42: 469 vs 1235    43: 469 vs 1250    44: 469 vs 1031
45: 453 vs  625    46: 468 vs 1032    47: 453 vs 1031    48: 468 vs  469
49: 469 vs 1031    50: 469 vs  781    51: 468 vs  985    52: 468 vs 1235
53: 453 vs 1000    54: 468 vs  766    55: 453 vs 1000    56: 454 vs 1093
57: 468 vs 1282    58: 469 vs  453    59: 468 vs 1125    60: 468 vs  922
61: 468 vs 1125    62: 453 vs 1187    63: 453 vs 1016    64: 469 vs  313
65: 453 vs 1016    72: 469 vs  453    80: 469 vs  469    81: 469 vs  625
90: 453 vs  781    96: 469 vs  469   100: 468 vs  766  1000: 468 vs 1079
4000: 468 vs  625
``

aleksaZR, thanks for the macro. Just note that currently no end-user is on a 166MHz cpu; everyone's on a Sempron/AthlonXP/64 or a P4/C2D. Optimization for these beasts is usually a whole other matter.

P.S. the code to measure timing was:
``
Start2 proc
	local count, count1, count2
	local rez1:dword, rez2:dword
	invoke Sleep, 0
;---------------------------------------------------------------------------------------------------------------------------
	mov   count, $invoke(GetTickCount)
	mov   eax, 3
	mov   ecx, 250000000
align 16
@@:	imul  eax, number
	dec   ecx
	jnz   @B
	mov   [rez1], eax
	invoke GetTickCount
	sub   eax, [count]
	mov   [count1], eax
;---------------------------------------------------------------------------------------------------------------------------
	mov   count, $invoke(GetTickCount)
	mov   eax, 3
	mov   ecx, 250000000
	align 16
@@:	mult  eax, number
	dec   ecx
	jnz   @B
	mov   [rez2], eax
	invoke GetTickCount
	sub   eax, [count]
	mov   [count2], eax
	trace "%d:	%d vs %d", number, count1, count2
	ret
Start2 endp
``

P.P.S.: XCHG, percentages like that are (OldAlgoCycles/NewAlgoCycles - 1)*100.
So, if OldAlgo took 36 cycles and the new algo took 25 cycles, (36/25 - 1)*100 = 44% improvement. A 100% improvement is obviously a doubling of speed.
Posted on 2007-07-29 16:36:13 by Ultrano
You forgot to set Realtime priority first, and the number of iterations is too small to draw a conclusion. Besides, the loops are not aligned, so perhaps one of them is enjoying better alignment than the other.

The flat assembler program below gives me this

---------------------------
Test
---------------------------
number=4000
imul = 406, mult = 938
rez1=0
rez2=0

``
include 'win32axp.inc'

number equ 4000

proc Start
local text[128]:BYTE,\
      count:DWORD, count1:DWORD, count2:DWORD,\
      rez1:DWORD, rez2:DWORD,\
      ProcessAffinityMask:DWORD, SystemAffinityMask:DWORD
      push     ebx esi
      invoke   GetCurrentProcess
      mov      ebx, eax
      invoke   SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
      invoke   GetCurrentThread
      invoke   SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
; On multi-{core|CPU} systems it is required to set affinity as well, but I don't have one :D
; f0dder insisted on adding it anyway so here it is :P
      invoke   GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
      test     eax, eax
      jz       .letThePartyBegin
; Lets choose only one of the processors allowed for this process
      mov      esi, 1
      bsf      ecx, [ProcessAffinityMask]
      shl      esi, cl
; Now, since Win9x/Me is smart enough to have GetProcessAffinityMask but not its counterpart, we must check its existence first
      invoke   LoadLibrary, 'KERNEL32.DLL'
      test     eax, eax
      jz       .letThePartyBegin
      invoke   GetProcAddress, eax, 'SetProcessAffinityMask'
      test     eax, eax
      jz       .letThePartyBegin
      stdcall  eax, ebx, esi
.letThePartyBegin:
      invoke   Sleep, 20 ; Lets get a fresh slice
;---------------------------------------------------------------------------------------------------------------------------
      invoke   GetTickCount
      mov      [count], eax
      mov      eax, 3
      mov      ecx, $10000000
align 16
@@:
      imul     eax, number
      dec      ecx
      jnz      @B
      mov      [rez1], eax
      invoke   GetTickCount
      sub      eax, [count]
      mov      [count1], eax
      invoke   Sleep, 20 ; Lets get a fresh slice
;---------------------------------------------------------------------------------------------------------------------------
      invoke   GetTickCount
      mov      [count], eax
      mov      eax, 3
      mov      ecx, $10000000
@@:
reg equ eax
      lea      reg, [reg+reg*4]
      lea      reg, [reg+reg*4]
      lea      reg, [reg+reg*4]
      shl      reg, 5   ;(5*5*5*32)
restore reg
      dec      ecx
      jnz      @B
      mov      [rez2], eax
      invoke   GetTickCount
      sub      eax, [count]
      mov      [count2], eax
;---------------------------------------------------------------------------------------------------------------------------
      invoke   wsprintf, addr text, fmt, number, [count1], [count2], [rez1], [rez2]
      invoke   MessageBox, NULL, addr text, 'Test', NULL
      pop      esi ebx
      ret
endp

fmt db 'number=%u', 13, 10, 13, 10, 'imul = %u, mult = %u', 13, 10, 13, 10, 'rez1=%u', 13, 10, 'rez2=%u', 0
;___________________________________________________________________________________________________________________________
.end Start
``

I got the red warning before posting telling me that Ultrano had already posted something, but I couldn't resist posting my version :P

Since f0dder disliked my laziness, it now also takes care of affinity ;)

Fixed a big mess I did (apparently I did :?) with tags
Posted on 2007-07-29 16:52:50 by LocoDelAssembly
LocoDelAssembly: it's a good idea to set the affinity even though you don't have a multicore machine... it doesn't really cost you anything, and means us people with multicore machines don't have to add it ourselves :P
Posted on 2007-07-29 17:13:26 by f0dder
Yes, I remember your advice from the other forum, but it seems my disclaimer was not enough to avoid the suggestion ;)

I added it, but I can't be sure it is right because I don't remember how you did it.
Posted on 2007-07-29 17:51:18 by LocoDelAssembly
I think everybody should be more supportive: not about the actual code but about people's ideas. We are told every day about the magic new hardware can do, etc. We do not need to be optimization freaks all the time either. ASM is also about reinventing the wheel, since most of you, or at least I myself, write some code in another language (even on paper) before trying to do the same in asm.

XCHG: Maybe just the title of the topic was misunderstood. Maybe "alternative" instead of "replace" would be better, but anyway I believe you guys are sometimes too cold-hearted about simple ideas. I learn a lot from each line of code everybody puts up, and it is good to know people still put effort into trying to discover new ways of doing things (of all sorts). Especially since programming is not my living, it is good to see it. Sometimes I see people asming for 16-bit also, and I think there is even a lot of information about it around somewhere. Well, I hope you understand my point. I also love Symphony X, but a community should be made of chances for everybody to show off work. Or so I think. Not sure if you guys agree with me or not.
Posted on 2007-07-29 20:51:44 by codename
I agree with codename. Imho aleksaZR's macro just needs polishing:
1) we should see the speed results of all supported numbers in mult, compared to imul, like the ones I posted. Though, in my case I simply spent several minutes recompiling and running the app for each number ^^. The data we need to obtain this way is: with which numbers mult is faster than imul on all/most CPUs.
2) change the mult macro to support only the faster-than-imul versions
3) also make a mult2 macro, which accepts 3 parameters (dest, temp, imm), where "temp" removes the need for stack usage. This optimization won't show up as an improvement in the synthetic benchmarks, but can save an extra cycle or two (we avoid the write-back to memory, after all).
Posted on 2007-07-29 22:28:07 by Ultrano
I've changed the mult macro to support only the faster-than-imul versions, as Ultrano suggested.
Same-as-imul versions are also included, for the sake of some older CPUs.

You can now get a new mult, an alternative to imul;) and use it if you like it.

Thanks, everyone

P.S.
Ultrano also suggested mult2, which would use a register instead of the stack.
That, at least on my CPU, didn't give much more speed:
60 ms instead of 75 ms (mult with 7).

Also, using the stack allows multiplying by, say, 77 = 7*11.
Both 7 and 11 would use the stack (esp-4 and esp-8), which would probably
be harder with the regs.

Anyway, since the newer CPUs do IMUL just fine, I've deleted all stack usage.
Posted on 2007-07-30 02:37:59 by aleksaZR
hello Aleksandre,
i also wrote a mul replacement recursive macro.
might be useful.
http://www.asmcommunity.net/board/index.php?topic=20441.0
Posted on 2007-08-07 17:33:01 by drizz