From memory, a P4 is slow with LEA, although I have not found problems using it myself. I asked you to try the second LEA version because I don't have a timing test bed set up to do it, and I was trying to find out whether your box stalls on successive LEA instructions.
Regards,
hutch@movsd.com
That takes 28.5 clocks compared to 11.5 clocks of my code with the 11-op mult-92.
I think I've tried everything posted. What do you mean by the second LEA version?
if you care, here's my testbed.
time1 = intel C++ compiled simple code
time2 = some of ekted
time3 = some of roticv
Be careful: thread and process priority are set to realtime, so your computer will seem to freeze while it's running. Takes a few seconds on my P4 2.53 GHz.
Attachment removed, please download latest "yodel_whatever_final.zip"
hehehe... the "few seconds" part is an understatement (it took me about a minute). I tested it on my 600 MHz Win2k SP3 box.
Results
time1 = 21500 ticks
time2 = 29937 ticks
time3 = 20734 ticks
Seems like shifts and lea are not slower than mul on my computer :grin:
roticv, neither are doing straight MULs... I posted the code ICL generates already. Which CPU class? P3?
Scali's XP1800+ (1533 real mhz)
yodel1: 9253 ticks
yodel2: 12308 ticks
yodel3: 6839 ticks
My P4 2.53ghz
yodel1: 3500 ticks
yodel2: 5281 ticks
yodel3: 4407 ticks
I'll add hutch's code shortly.
Yes. P3. Yours looks like a monster compared with mine..
I get:
time1 = 3735
time2 = 5360
time3 = 4562
hutch's version:
Attachment removed, please download latest "yodel_whatever_final.zip"
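; 1000000 iterations of 2048 muls (2048000000 total) took 7516 ticks
_time4@4:
    push esi
    push edi
    mov esi, [esp+12]               ; pointer argument (past two pushes and the return address)
    mov edi, 2048 - 1               ; index of the last 16-bit element
.loop:
    movzx eax, word [esi + (edi*2)] ; load n, zero-extended
    ; hutch code begin
    lea ecx, [eax*8]                ; ecx = 8n
    lea edx, [ecx*4]                ; edx = 32n
    lea ecx, [edx+edx*2]            ; ecx = 96n
    lea edx, [eax*4]                ; edx = 4n
    sub ecx, edx                    ; ecx = 96n - 4n = 92n
    ; hutch code end
    mov [esi + (edi*2)], cx         ; note CX for hutch's version, not ax
    dec edi
    jnz .loop                       ; exits when edi reaches 0, so element 0 is not processed
    pop edi
    pop esi
    ret 4                           ; stdcall: callee pops the 4-byte argument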
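For anyone who wants to check the arithmetic without an assembler, here is a minimal C sketch that mirrors the five instructions between "hutch code begin" and "hutch code end". The function name `mul92_lea` is made up for illustration; it only confirms that the sequence yields n*92 truncated to 16 bits and says nothing about speed.

```c
#include <stdint.h>

/* Hypothetical helper: a C mirror of the LEA/SUB sequence above.
   The variable names follow the registers used in the posted code. */
static uint16_t mul92_lea(uint16_t n)
{
    uint32_t eax = n;          /* movzx eax, word [...]           */
    uint32_t ecx = eax * 8;    /* lea ecx, [eax*8]      ->  8n    */
    uint32_t edx = ecx * 4;    /* lea edx, [ecx*4]      -> 32n    */
    ecx = edx + edx * 2;       /* lea ecx, [edx+edx*2]  -> 96n    */
    edx = eax * 4;             /* lea edx, [eax*4]      ->  4n    */
    ecx -= edx;                /* sub ecx, edx          -> 92n    */
    return (uint16_t)ecx;      /* mov [...], cx keeps the low 16 bits */
}
```

Since only CX is stored, results wrap modulo 65536 for larger inputs, the same as a 16-bit multiply by 92 would.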
Intel compiler is clever. It does (n*3)^4-n. Is there an algorithmic way to get the best sequence of add/sub to obtain the shortest number of ops to get a particular multiplier?
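One way to approach that question is a brute-force search. The sketch below is only an illustration under a deliberately restricted model, not what any compiler is known to do: a single running value that starts at n, where each step is a left shift, an add or subtract of the original n, or a multiply by 3, 5 or 9 (the scales one LEA can produce). `min_ops` and `LIMIT` are made-up names, and the bound on intermediate values is an arbitrary assumption.

```c
#include <stddef.h>

#define LIMIT 4096u  /* assumed bound on intermediate multipliers */

/* Record multiplier w at depth d if it is in range and not seen yet. */
static void push(unsigned w, int d, int *dist, unsigned *queue, size_t *tail)
{
    if (w >= 1 && w < LIMIT && dist[w] < 0) {
        dist[w] = d;
        queue[(*tail)++] = w;
    }
}

/* Breadth-first search: fewest steps to turn n into target*n in the
   restricted model described above. Returns -1 if target is out of range
   or not reachable below LIMIT. */
static int min_ops(unsigned target)
{
    static int dist[LIMIT];
    static unsigned queue[LIMIT];  /* each value is queued at most once */
    size_t head = 0, tail = 0;

    if (target == 0 || target >= LIMIT)
        return -1;
    for (size_t i = 0; i < LIMIT; i++)
        dist[i] = -1;
    dist[1] = 0;
    queue[tail++] = 1;

    while (head < tail) {
        unsigned v = queue[head++];
        int d = dist[v] + 1;
        if (v == target)
            return dist[v];
        for (unsigned w = v << 1; w < LIMIT; w <<= 1)
            push(w, d, dist, queue, &tail);    /* shl v, k        */
        push(v + 1, d, dist, queue, &tail);    /* add v, n        */
        push(v - 1, d, dist, queue, &tail);    /* sub v, n        */
        push(v * 3, d, dist, queue, &tail);    /* lea [v+v*2]     */
        push(v * 5, d, dist, queue, &tail);    /* lea [v+v*4]     */
        push(v * 9, d, dist, queue, &tail);    /* lea [v+v*8]     */
    }
    return -1;
}
```

The model ignores two-register sequences such as 96n - 4n, and a shorter sequence is not necessarily a faster one, as the timings in this thread show.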
Wow, hutch's method is so much faster :alright:
I get 17687 ticks.
perhaps... AMD has an app that does it with some instructions. Intel obviously has some clever algorithm too. And it's easier to change the constant in the C++ source than manually working out lea/add/whatever :)
Time for some MMX/SSE/whatever soon? :)
Haha.. later. Time for me to catch some sleep. This thread will probably race on while I'm gone.
Hutch's is MUCH slower for me, taking over 35 clocks.
yes, hutch's is very slow on the P4 - twice as slow as the fastest?
Here's a version of the C++ code compiled for P3, roticv please test! It seems faster than the P4 compiled code too, 3400 vs. 3500.
Attachment removed, please download latest "yodel_whatever_final.zip"
I get 3688 for that one.
P4 2.53:
yodel1: 3500 ticks ~4.33 clks
yodel2: 5281 ticks ~6.53 clks
yodel3: 4407 ticks ~5.45 clks
XP1800+:
yodel1: 9253 ticks ~6.93 clks
yodel2: 12308 ticks ~9.21 clks
yodel3: 6839 ticks ~5.12 clks
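The "~clks" figures above can be reproduced with a small calculation. This sketch assumes that a tick is one millisecond (for example from a millisecond timer), that the total is the 2048000000 multiplies mentioned in the code comment earlier, and that the P4 runs at about 2533 MHz; `clocks_per_mul` is a made-up name.

```c
/* Hypothetical helper: convert a benchmark time into clocks per multiply.
   Assumes ticks_ms is in milliseconds; 1 MHz is 1000 cycles per millisecond. */
static double clocks_per_mul(double ticks_ms, double cpu_mhz, double total_muls)
{
    return ticks_ms * cpu_mhz * 1000.0 / total_muls;
}
```

Under those assumptions, `clocks_per_mul(3500, 2533, 2048000000.0)` comes out near the 4.33 quoted for yodel1 on the P4, and `clocks_per_mul(9253, 1533, 2048000000.0)` near the 6.93 quoted for the XP1800+.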
Scali 18:31: Freaky, one would not expect the P4 to be faster than an Athlon.
Scali 18:31: It's probably the cache.
Scali 18:32: the routines are so short, that you notice the extra clk of latency on the L1 cache of the Athlon?
Scali 18:32: (and with some magic, the P4 manages to hide its high-latency shifter completely.)
Okay, time to post my updated yodel benchmark.
Please read readme.txt and results.txt if you're going to comment.
If you're going to post results, please email them to me instead and I'll post a follow-up; that keeps things less cluttered. There are currently 13 benchmarks (I think) included, and more are welcome.
It should be easy for people to play with the benchmark suite: one of the tests allows loading a DLL. This is not at all optimal, for a bunch of reasons listed in the readme, but it's possible nevertheless.
Looking forward to hearing any comments.
attaching the zip helps.