From memory, the P4 is slow with LEA, although I have not had problems using it myself. I asked you to try the second LEA version because I don't have a timing test bed set up, and I was trying to find out whether your box handles successive LEA instructions without a stall.

Posted on 2003-04-23 11:03:14 by hutch--
That takes 28.5 clocks compared to 11.5 clocks of my code with the 11-op mult-92.
Posted on 2003-04-23 11:05:15 by Ekted
I think I've tried everything posted. What do you mean by the second LEA version?
Posted on 2003-04-23 11:09:03 by Ekted
if you care, here's my testbed.
time1 = intel C++ compiled simple code
time2 = some of ekted
time3 = some of roticv

Be careful: thread and process priority are set to realtime, so your computer will seem to freeze while it's running. It takes a few seconds on my P4 2.53 GHz.

Attachment removed, please download latest ""
Posted on 2003-04-23 11:09:29 by f0dder
Hehehe... "a few seconds" is an understatement (it took me about a minute). I tested it on my 600 MHz Win2k SP3 box.

time1 = 21500 ticks
time2 = 29937 ticks
time3 = 20734 ticks

Seems like shifts and lea are not slower than mul on my computer :grin:
Posted on 2003-04-23 11:14:36 by roticv
roticv, neither are doing straight MULs... I posted the code ICL generates already. Which CPU class? P3?
Posted on 2003-04-23 11:15:20 by f0dder
Scali's XP1800+ (1533 real mhz)
yodel1: 9253 ticks
yodel2: 12308 ticks
yodel3: 6839 ticks

My P4 2.53ghz
yodel1: 3500 ticks
yodel2: 5281 ticks
yodel3: 4407 ticks

I'll add hutch's code shortly.
Posted on 2003-04-23 11:17:39 by f0dder
Yes, a P3. Yours looks like a monster compared with mine.
Posted on 2003-04-23 11:17:58 by roticv
I get:

time1 = 3735
time2 = 5360
time3 = 4562
Posted on 2003-04-23 11:18:19 by Ekted
hutch's version:

;1000000 iterations of 2048 muls (2048000000 total) took 7516 ticks
push esi
push edi

mov esi, [esp+12]
mov edi, 2048 - 1

.loop: ; target of the jnz below (label was missing from the listing)
movzx eax, word [esi + (edi*2)]
; hutch code begin
lea ecx, [eax*8]
lea edx, [ecx*4]
lea ecx, [edx+edx*2]
lea edx, [eax*4]
sub ecx, edx
; hutch code end

mov [esi + (edi*2)], cx ; note CX for hutch's version, not AX
dec edi
jnz .loop

pop edi
pop esi
ret 4
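
For anyone who wants to sanity-check the arithmetic without running the benchmark, here is a minimal C sketch that mirrors the LEA sequence above step by step and compares it with a plain multiply by 92. The names `hutch_mul92` and `check_hutch_mul92` are made up for illustration and are not from the attachment; the 16-bit truncation at the end is an assumption based on the `mov [...], cx` store.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the five instructions between "hutch code begin" and "hutch code end".
   Name and signature are illustrative only. */
uint16_t hutch_mul92(uint16_t n)
{
    uint32_t eax = n;          /* movzx eax, word [...]        */
    uint32_t ecx = eax * 8;    /* lea ecx, [eax*8]     ->  8n  */
    uint32_t edx = ecx * 4;    /* lea edx, [ecx*4]     -> 32n  */
    ecx = edx + edx * 2;       /* lea ecx, [edx+edx*2] -> 96n  */
    edx = eax * 4;             /* lea edx, [eax*4]     ->  4n  */
    ecx -= edx;                /* sub ecx, edx         -> 92n  */
    return (uint16_t)ecx;      /* only CX is stored back       */
}

/* Asserts that the sequence equals n*92 (mod 65536) for every 16-bit input. */
void check_hutch_mul92(void)
{
    for (uint32_t n = 0; n <= 0xFFFFu; ++n) {
        assert(hutch_mul92((uint16_t)n) == (uint16_t)(n * 92u));
    }
}
```

This only checks that the result is right; it says nothing about how fast the sequence is on any particular CPU.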

Attachment removed, please download latest ""
Posted on 2003-04-23 11:22:27 by f0dder
The Intel compiler is clever. It does (n*3)^4-n. Is there an algorithmic way to find the add/sub sequence that reaches a particular multiplier in the fewest ops?
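
One brute-force way to approach that question is an iterative-deepening search over shift/add/sub steps. The C sketch below is only an illustration under a simplified cost model that I am assuming: every step computes `(a << s) + b`, `(a << s) - b`, `b - (a << s)` or `a << s` from values already available, and each step costs one. Real LEA only scales by 1, 2, 4 or 8, and real CPUs do not price these steps equally, so this is not what the Intel or AMD tools do. `MAX_SHIFT`, `MAX_STEPS`, `reach` and `min_shift_add_steps` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SHIFT 7   /* largest shift tried per step (assumption) */
#define MAX_STEPS 4   /* search gives up beyond this many steps    */

/* vals[0..count-1] hold multiples of n computed so far (vals[0] = 1).
   Returns 1 if target can be produced using at most steps_left more steps. */
int reach(int64_t *vals, int count, int steps_left, int64_t target)
{
    if (vals[count - 1] == target) return 1;
    if (steps_left == 0) return 0;
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            for (int s = 0; s <= MAX_SHIFT; ++s) {
                int64_t shifted = vals[i] * ((int64_t)1 << s);
                int64_t cand[4] = { shifted + vals[j], shifted - vals[j],
                                    vals[j] - shifted, shifted };
                for (int k = 0; k < 4; ++k) {
                    if (cand[k] <= 0) continue;   /* keep values positive */
                    vals[count] = cand[k];
                    if (reach(vals, count + 1, steps_left - 1, target))
                        return 1;
                }
            }
        }
    }
    return 0;
}

/* Smallest number of steps that reaches target in this model, or -1. */
int min_shift_add_steps(int64_t target)
{
    int64_t vals[MAX_STEPS + 1];
    assert(target > 0);
    for (int steps = 0; steps <= MAX_STEPS; ++steps) {
        vals[0] = 1;
        if (reach(vals, 1, steps, target)) return steps;
    }
    return -1;
}
```

In this model 92 needs three steps, for example 3n = (n << 1) + n, then 23n = (3n << 3) - n, then 92n = 23n << 2. The search grows very quickly with depth, which is why the step limit is kept small.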
Posted on 2003-04-23 11:23:00 by Ekted
Wow, hutch's method is so much faster :alright:

I get 17687 ticks.
Posted on 2003-04-23 11:25:18 by roticv
perhaps... AMD has an app that does it with some instructions. Intel obviously has some clever algorithm too. And it's easier to change the constant in the C++ source than manually working out lea/add/whatever :)

Time for some MMX/SSE/whatever soon? :)
Posted on 2003-04-23 11:25:18 by f0dder
Haha.. later. Time for me to catch some sleep. This thread will probably race on while I'm gone.
Posted on 2003-04-23 11:28:49 by roticv
Hutch's is MUCH slower for me, taking over 35 clocks.
Posted on 2003-04-23 11:29:22 by Ekted
Yes, hutch's is very slow on the P4 - twice as slow as the fastest?
Here's a version of the C++ code compiled for P3, roticv please test! It seems faster than the P4 compiled code too, 3400 vs. 3500.

Attachment removed, please download latest ""
Posted on 2003-04-23 11:30:44 by f0dder
I get 3688 for that one.
Posted on 2003-04-23 11:33:09 by Ekted

P4 2.53:
yodel1: 3500 ticks ~4.33 clks
yodel2: 5281 ticks ~6.53 clks
yodel3: 4407 ticks ~5.45 clks

Scali's XP1800+ (1533 real mhz):
yodel1: 9253 ticks ~6.93 clks
yodel2: 12308 ticks ~9.21 clks
yodel3: 6839 ticks ~5.12 clks
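
The clock figures above can be reproduced from the tick counts with a small conversion. This C sketch assumes one tick is one millisecond (as with GetTickCount), takes the total of 2048000000 multiplies from the comment in hutch's listing, and uses 2533 MHz for the "2.53ghz" P4 because that value matches the posted numbers; all three are assumptions on my part. `clocks_per_op` is a made-up name.

```c
#include <assert.h>

/* Converts a benchmark time into clocks per operation.
   Assumes ticks are milliseconds; 1 MHz = 1000 clock cycles per millisecond. */
double clocks_per_op(double ticks_ms, double cpu_mhz, double total_ops)
{
    assert(total_ops > 0.0);
    return ticks_ms * cpu_mhz * 1000.0 / total_ops;
}
```

For example, `clocks_per_op(3500, 2533, 2048000000.0)` comes out at roughly 4.33 and `clocks_per_op(9253, 1533, 2048000000.0)` at roughly 6.93, in line with the yodel1 rows for the two machines.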

Scali 18:31: Freaky, one would not expect the P4 to be faster than an Athlon.
Scali 18:31: It's probably the cache.
Scali 18:32: the routines are so short, that you notice the extra clk of latency on the L1 cache of the Athlon?
Scali 18:32: (and with some magic, the P4 manages to hide its high-latency shifter completely.)
Posted on 2003-04-23 11:33:27 by f0dder
Okay, time to post my updated yodel benchmark.
Please read readme.txt and results.txt if you're going to comment.
If you're going to post results, please email them to me instead and I'll post a follow-up; that keeps things less cluttered. There are currently 13 benchmarks (I think) included, and more are welcome.

It should be easy for people to play with the benchmark suite: one of the tests allows loading a DLL. This is not at all optimal, for a bunch of reasons listed in the readme, but it's possible nevertheless.

Looking forward to hearing any comments.
Posted on 2003-04-23 17:41:12 by f0dder
attaching the zip helps.
Posted on 2003-04-23 17:41:36 by f0dder