I've created a "mult" macro that replaces (in some cases)
the slower "imul" instruction. (max 5 times faster than imul)

Currently, it supports all numbers between 0 and 50, and some others.

You can get it from www.beotel.net/~astancic (click on "Multiply.inc")
Posted on 2007-07-28 06:17:15 by aleksaZR
IMHO, writing a separate algorithm for every multiplication factor is inane. We discussed the Multiplication Algorithm subject a while ago. Maybe that will help you create a better macro. Good luck.
Posted on 2007-07-28 11:53:41 by XCHG

IMHO, writing a separate algorithm for every multiplication factor is inane. We discussed the Multiplication Algorithm subject a while ago. Maybe that will help you create a better macro. Good luck.

I disagree.

Writing a 'separate' macro for every number needed is, AFAIK,
simply the fastest method to MUL a reg by a number.

I've created a list with all the numbers between 0 and 50 just
to see how it works; that doesn't mean that I need all of them.

In the future, I'll just add the numbers that I need.

For example, now I need to mul by 4000:
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
shl reg, 5           ;(5*5*5*32 = 4000)
Can you make it faster than that? (It's 44% faster than IMUL.)
Posted on 2007-07-29 05:34:48 by aleksaZR
Note that on newer processors "IMUL reg, imm" takes only 3 cycles. Since your code seems to always use the same reg, nothing can be executed in parallel, and hence it takes no less than 4 cycles. Additionally, the Athlon64 manual says that LEA doesn't take one cycle but two when scaling is used, so perhaps we could be talking about code that takes 7 cycles.

4. LEA instructions have a latency of 1 when there are two source operands (as in the case of the base + index form LEA EAX, [base+index]). Forms with a scale or more than two source operands will have a latency of 2 (LEA EAX, [base+index*scale]).
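
(Spelling out the tally behind the 7-cycle estimate, using the latencies quoted above; this is Athlon64-specific and assumes SHL reg,imm has a 1-cycle latency:)

      lea   eax, [eax+eax*4]   ; 2 cycles (scaled LEA)
      lea   eax, [eax+eax*4]   ; 2 cycles, depends on the previous LEA
      lea   eax, [eax+eax*4]   ; 2 cycles, depends on the previous LEA
      shl   eax, 5             ; 1 cycle  -> 2+2+2+1 = 7 cycles on the dependency chain
      imul  eax, eax, 4000     ; versus 3 cycles for the single IMUL mentioned above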


Of course, your efforts are still good if your target is processors prior to the Pentium Pro.
Posted on 2007-07-29 09:19:36 by LocoDelAssembly


For example, now I need to mul by 4000:
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
lea reg, [reg+reg*4]
shl reg, 5           ;(5*5*5*32 = 4000)
Can you make it faster than that? (It's 44% faster than IMUL.)




I have always been interested in the fractions and percentages that people give. 44%? Where did you get the numerator and the denominator that make up that percentage?

Now, about the actual code. The code that you gave:

  LEA    EAX , [EAX+EAX*4]
  LEA    EAX , [EAX+EAX*4]
  LEA    EAX , [EAX+EAX*4]
  SHL    EAX , 5


Takes 3 clock cycles to execute on my PIII 800 MHz with the code segment aligned on a DWORD boundary. The equivalent of this code using IMUL:

   IMUL     EAX , EAX , 4000


Takes 1 clock cycle on the same machine to execute. Now, if the value that has to be multiplied is in a variable, your code will take 13 clock cycles to execute:

  MOV    EAX , DWORD PTR [Variable]
  LEA    EAX , [EAX+EAX*4]
  LEA    EAX , [EAX+EAX*4]
  LEA    EAX , [EAX+EAX*4]
  SHL    EAX , 5
  MOV    DWORD PTR [Variable] , EAX


This is while the equivalent of this code using IMUL will take 12 clock cycles on the same machine under the same conditions:

  IMUL   EAX , DWORD PTR [Variable] , 4000
  MOV    DWORD PTR [Variable] , EAX

Posted on 2007-07-29 12:19:51 by XCHG
XCHG, how did you measure the time? Did you remember to serialize before reading the TSC?
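
(For reference, a minimal sketch of what "serialize before reading the TSC" means; this is illustrative only, not code from the thread, and StartLow is a hypothetical DWORD variable:)

      xor   eax, eax
      cpuid                    ; serializing instruction: all earlier instructions retire first
      rdtsc                    ; EDX:EAX = time-stamp counter
      mov   [StartLow], eax    ; StartLow = hypothetical DWORD, low half of the start TSC
      ; ... code being measured ...
      xor   eax, eax
      cpuid
      rdtsc
      sub   eax, [StartLow]    ; elapsed cycles (low dword), assuming no wrap-around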
Posted on 2007-07-29 12:26:28 by LocoDelAssembly
I just don't believe, yet, that any CPU can MUL by a number such as 4000 in just one clock.
DECODE in one clock is OK, but the actual MUL...

Parallel execution (probably; I'm not sure about the IMUL instruction)
lets you do other stuff while IMUL is executing, but in my current app there's simply no other stuff to do.

Actually, I've made a mistake: it is 30%.
The numerator and denominator were given by the test prog below (MASM),
namely 150 ms for IMUL and 105 ms for the macro.

;___________________________________________________________________________________________________________________________
number equ 4000

Start proc public
local text[128]:char
local count:UINT, count1:UINT, count2:UINT
local rez1:dword, rez2:dword

;---------------------------------------------------------------------------------------------------------------------------
status [count], GetTickCount ;invoke GetTickCount: mov [count], eax

mov eax,3
mov ecx,2500000
@@: imul eax,number
djnz ecx, @B ;DEC, JNZ

mov [rez1], eax

invoke GetTickCount
sub eax, [count]
mov [count1], eax

;---------------------------------------------------------------------------------------------------------------------------
status [count], GetTickCount

mov eax,3
mov ecx,2500000
@@: mult eax,number
djnz ecx, @B

mov [rez2], eax

invoke GetTickCount
sub eax, [count]
mov [count2], eax

;---------------------------------------------------------------------------------------------------------------------------
% invoke wsprintf, a text, qt2(<'number=%u', cr,lf,cr,lf, 'imul = %u, mult = %u', cr,lf,cr,lf, 'rez1=%u', cr,lf, 'rez2=%u'>), number, [count1], [count2], [rez1], [rez2]
invoke MessageBox, null, a text, qt2('Test'), null
ret
Start endp
;___________________________________________________________________________________________________________________________
end Start
Posted on 2007-07-29 14:59:57 by aleksaZR
I like to put beliefs aside and actually test algorithms side by side. Below are the results in milliseconds for each multiplier, in the form "number: imul vs mult":

250,000,000 iterations (100 times more than what aleksaZR had used)
Sempron 3000+, 1GB DDR@400 (timings 3, 4/8/4). Thread-switching timer: 16.7 ms.


1: 454 vs 312
2: 468 vs 313
3: 468 vs 297
4: 469 vs 313
5: 469 vs 312
6: 469 vs 468
7: 453 vs 1016
8: 469 vs 313
9: 453 vs 313
10: 468 vs 469
11: 453 vs 953
12: 453 vs 469
13: 469 vs 1031
14: 469 vs 1125
15: 469 vs 625
16: 469 vs 313
17: 453 vs 1016
18: 453 vs 469
19: 469 vs 937
20: 468 vs 453
21: 468 vs 1032
22: 469 vs 1140
23: 453 vs 1094
24: 454 vs 468
25: 453 vs 625
26: 469 vs 1031
27: 453 vs 625
28: 469 vs 1031
29: 468 vs 985
30: 453 vs 781
31: 468 vs 1032
32: 468 vs 313
33: 454 vs 1015
34: 454 vs 1109
35: 454 vs 1046
36: 468 vs 469
37: 469 vs 1031
38: 454 vs 1250
39: 469 vs 1031
40: 469 vs 469
41: 469 vs 1031
42: 469 vs 1235
43: 469 vs 1250
44: 469 vs 1031
45: 453 vs 625
46: 468 vs 1032
47: 453 vs 1031
48: 468 vs 469
49: 469 vs 1031
50: 469 vs 781
51: 468 vs 985
52: 468 vs 1235
53: 453 vs 1000
54: 468 vs 766
55: 453 vs 1000
56: 454 vs 1093
57: 468 vs 1282
58: 469 vs 453
59: 468 vs 1125
60: 468 vs 922
61: 468 vs 1125
62: 453 vs 1187
63: 453 vs 1016
64: 469 vs 313
65: 453 vs 1016
72: 469 vs 453
80: 469 vs 469
81: 469 vs 625
90: 453 vs 781
96: 469 vs 469
100: 468 vs 766
1000: 468 vs 1079
4000: 468 vs 625


aleksaZR, thanks for the macro. Just note that currently no end-user is on a 166 MHz CPU; everyone's on a Sempron/AthlonXP/64 or a P4/C2D. Optimization for these beasts is usually a whole other matter.

P.S. the code to measure timing was:

Start2 proc
local count, count1,count2
local rez1:dword, rez2:dword
invoke Sleep,0

;---------------------------------------------------------------------------------------------------------------------------
mov count,$invoke(GetTickCount)

      mov  eax,3
      mov  ecx,250000000
align 16
@@:      imul  eax,number
dec ecx
jnz  @B

      mov  [rez1], eax

      invoke  GetTickCount
      sub  eax, [count]
      mov  [count1], eax

;---------------------------------------------------------------------------------------------------------------------------
      mov count,$invoke(GetTickCount)

      mov  eax,3
      mov  ecx,250000000
align 16
@@:      mult  eax,number
dec ecx
jnz @B

      mov  [rez2], eax

      invoke  GetTickCount
      sub  eax, [count]
      mov  [count2], eax
trace "%d: %d vs %d",number,count1,count2
ret
Start2 endp



P.P.S.: XCHG, percentages like that are (OldAlgoCycles/NewAlgoCycles - 1)*100.
So, if the old algo took 36 cycles and the new algo took 25 cycles, (36/25 - 1)*100 = 44% improvement. A 100% improvement is obviously a doubling of speed.
Posted on 2007-07-29 16:36:13 by Ultrano
You forgot to set Realtime priority first, and the number of iterations is too small to draw a conclusion. Besides, the loops are not aligned, so perhaps one of them is enjoying better alignment than the other.

The flat assembler program below gives me this

---------------------------
Test
---------------------------
number=4000
imul = 406, mult = 938
rez1=0
rez2=0


And here is the program itself:
include 'win32axp.inc'

number equ 4000

proc Start
local text[128]:BYTE,\
      count:DWORD, count1:DWORD, count2:DWORD,\
      rez1:DWORD, rez2:DWORD,\
      ProcessAffinityMask:DWORD, SystemAffinityMask:DWORD

      push     ebx esi

      invoke   GetCurrentProcess
      mov      ebx, eax
      invoke   SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
      invoke   GetCurrentThread
      invoke   SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL

;;;; On multi-{core|CPU} systems it is required to set the affinity as well, but I don't have one :D
;;;; f0dder insisted on adding it anyway, so here it is :P

      invoke   GetProcessAffinityMask, ebx, addr ProcessAffinityMask, addr SystemAffinityMask
      test     eax, eax
      jz       .letThePartyBegin

; Let's choose only one of the processors allowed for this process
      mov      esi, 1
      bsf      ecx, [ProcessAffinityMask]
      shl      esi, cl

; Now, since Win9x/Me is smart enough to have GetProcessAffinityMask but not its counterpart, we must check for its existence first

      invoke   LoadLibrary, 'KERNEL32.DLL'
      test     eax, eax
      jz       .letThePartyBegin

      invoke   GetProcAddress, eax, 'SetProcessAffinityMask'
      test     eax, eax
      jz       .letThePartyBegin

      stdcall  eax, ebx, esi

.letThePartyBegin:
      invoke Sleep, 20 ; Let's get a fresh slice
;---------------------------------------------------------------------------------------------------------------------------
      invoke GetTickCount
      mov    [count], eax

      mov   eax,3
      mov   ecx, $10000000

align 16
@@:
      imul   eax,number
      dec    ecx
      jnz    @B

      mov   [rez1], eax

      invoke   GetTickCount
      sub      eax, [count]
      mov      [count1], eax

      invoke Sleep, 20 ; Let's get a fresh slice
;---------------------------------------------------------------------------------------------------------------------------
      invoke GetTickCount
      mov    [count], eax

      mov   eax,3
      mov   ecx, $10000000

@@:
reg equ eax
      lea   reg, [reg+reg*4]
      lea   reg, [reg+reg*4]
      lea   reg, [reg+reg*4]
      shl   reg, 5         ;(5*5*5*32 = 4000)
restore reg

      dec    ecx
      jnz    @B

      mov   [rez2], eax

      invoke   GetTickCount
      sub   eax, [count]
      mov   [count2], eax

;---------------------------------------------------------------------------------------------------------------------------
      invoke   wsprintf, addr text, fmt, number, [count1], [count2], [rez1], [rez2]
      invoke   MessageBox, NULL, addr text, 'Test', NULL

      pop      esi ebx
      ret
endp

fmt db 'number=%u', 13, 10, 13, 10, 'imul = %u, mult = %u', 13, 10, 13, 10, 'rez1=%u', 13, 10, 'rez2=%u',0
;___________________________________________________________________________________________________________________________
.end Start

I got the red warning before posting telling me that Ultrano had already posted something, but I couldn't resist posting my version :P

Since f0dder disliked my laziness, it now also takes care of affinity ;)

Fixed a big mess I made (apparently I did :?) with the tags
Posted on 2007-07-29 16:52:50 by LocoDelAssembly
LocoDelAssembly: it's a good idea to set the affinity even though you don't have a multicore machine... it doesn't really cost you anything, and it means those of us with multicore machines don't have to add it ourselves :P
Posted on 2007-07-29 17:13:26 by f0dder
Yes, I remember your advice from the other forum, but it seems my disclaimer was not enough to avoid the suggestion ;)

I added it, but I can't be sure it is right because I don't remember how you did it.
Posted on 2007-07-29 17:51:18 by LocoDelAssembly
I think everybody should be more supportive, not about the actual code but about people's ideas. We are told every day about the magic new hardware can do, etc. We don't need to be optimization freaks all the time, either. ASM is also about reinventing the wheel, since most of you (or at least I myself) write some code in another language (even on paper) before trying to do the same in asm.

XCHG: Maybe just the title of the topic was misunderstood; maybe "alternative" instead of "replace" would have been better. But anyway, I believe you guys are sometimes too cold-hearted about simple ideas. I learn a lot from every line of code everybody puts up, and it is good to know people still put effort into trying to discover new ways of doing things (of all sorts); especially since programming is not my living, it is good to see it. Sometimes I also see people writing asm for 16-bit, and I think there is even a lot of information about it around somewhere. Well, I hope you understand my point. I also love Symphony X, but a community should be made of chances for everybody to show off their work. Or so I think. Not sure if you guys agree with me or not.
Posted on 2007-07-29 20:51:44 by codename
I agree with codename. IMHO, aleksaZR's macro just needs polishing:
1) we should see the speed results of all supported numbers in mult, compared to imul, like the ones I posted (though in my case I simply spent several minutes recompiling and running the app for each number ^^). The data to obtain this way is: for which numbers mult is faster than imul on all/most CPUs.
2) change the mult macro to support only the faster-than-imul versions
3) also make a mult2 macro, which accepts 3 parameters (dest, temp, imm), where "temp" removes the need for stack usage; see the sketch below. This optimization won't show up as an improvement in the synthetic benchmarks, but can give an extra cycle or two (we save the write-back to memory, after all).
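
A rough sketch of the kind of expansion point 3 has in mind, for one factor only (an illustration in MASM-style syntax, not Ultrano's or aleksaZR's actual code):

mult2_7 MACRO dest, temp       ;; dest = dest*7, temp = scratch register (no stack write-back)
        mov   temp, dest       ;; keep a copy of x in the scratch register
        shl   dest, 3          ;; dest = x*8
        sub   dest, temp       ;; dest = x*8 - x = x*7
ENDM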
Posted on 2007-07-29 22:28:07 by Ultrano
I've changed the mult macro to support only the faster-than-imul versions, as Ultrano suggested.
Same-as-imul versions are also included, for the sake of some older CPUs.

You can now get a new mult, an alternative to imul;) and use it if you like it.

Thanks, everyone

P.S.
Ultrano also suggested mult2, which would use a register instead of the stack.
That, at least on my CPU, didn't give much more speed:
60 ms instead of 75 ms (mult by 7).

Also, using the stack allows multiplying by, say, 77 = 7*11.
Both the 7 and the 11 steps would use the stack (esp-4 and esp-8), which would
probably be harder with registers.
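
(To illustrate what that stack usage could look like, here is a guess at the kind of expansion meant, not the actual macro: x*77 computed as (x*7)*11, with esp-4 and esp-8 as scratch slots:)

      mov   [esp-4], eax       ; scratch copy of x
      shl   eax, 3             ; x*8
      sub   eax, [esp-4]       ; x*7
      mov   [esp-8], eax       ; scratch copy of y = x*7
      lea   eax, [eax+eax*2]   ; y*3
      shl   eax, 2             ; y*12
      sub   eax, [esp-8]       ; y*12 - y = y*11 = x*77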

Anyway, since the newer CPUs do IMUL just fine, I've deleted all stack usage.
Posted on 2007-07-30 02:37:59 by aleksaZR
Hello Aleksandre,
I also wrote a recursive mul-replacement macro; it might be useful.
http://www.asmcommunity.net/board/index.php?topic=20441.0
Posted on 2007-08-07 17:33:01 by drizz