Judging from the test results, LEA is very poor on a P4, given how large the gap is between the PIII and PIV numbers.

Most of the technical data indicates that older instructions like INC, DEC, SHL and SHR (and now, it seems, LEA) are best avoided on a P4 if you are writing processor-specific code.

This is unfortunate, as Intel processors from the early Pentiums through to the PIII were reliable on this type of code, so it makes producing general-purpose code a fair bit harder.

Regards,

hutch@movsd.com
Posted on 2003-04-23 20:34:14 by hutch--
It's a bit annoying indeed - means P4s will run relatively poorly on a lot of old optimized code. Didn't really come as a big surprise, though. But it also shows that you can get quite decent performance when you code for the P4. Which doesn't even appear to be that hard to do.

From what I can see, you're best off doing the C-style code if you're targeting p3/p4 and want the code to run on all boxes. The C-style code runs well on the P4 (beating all the "plain op" asm variants), competes well on P3, and isn't all that bad (though somewhat slower) on Athlons. If you're targeting P4, of course go SSE2 - the result speaks for itself :). If you're targeting pmmx/higher, go MMX - very good results on all platforms.
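For reference, the packed 16-bit multiply the MMX/SSE2 routines rely on can be sketched with intrinsics. This is a modern-compiler sketch of the general idea, not the code from the test suite; `mul92_sse2` and the fixed stride of 8 are illustration choices:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Multiply n 16-bit values by 92 in place, eight at a time.
   _mm_mullo_epi16 keeps the low word of each product, the same
   behaviour as pmullw. For brevity n must be a multiple of 8. */
static void mul92_sse2(uint16_t *buf, size_t n)
{
    const __m128i k = _mm_set1_epi16(92);
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        _mm_storeu_si128((__m128i *)(buf + i), _mm_mullo_epi16(v, k));
    }
}
```

On the original hardware this corresponds to one pmullw per eight elements, which is where the ~1 clk/mul numbers come from.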

Of course runtime dispatch (CPU detection + function pointers) can be used to select the optimal routine at runtime.
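A minimal sketch of that dispatch idea in C; the detection step is left out, and the names here (`mul92_fn`, `mul92_generic`) are illustration, not anything from the test suite:

```c
#include <stddef.h>
#include <stdint.h>

/* Pick an implementation once at startup and call through a function
   pointer afterwards. Real code would use CPUID to decide; here only
   the generic fallback is wired up. */
typedef void (*mul92_fn)(uint16_t *buf, size_t n);

static void mul92_generic(uint16_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (uint16_t)(buf[i] * 92u);
}

/* would be reassigned to an MMX/SSE2 routine when detection allows */
static mul92_fn mul92 = mul92_generic;
```

The per-call cost is one indirect call, which is negligible next to a 2048-element loop.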


Both C variants and the asm MMX/SSE2 have other large benefits: it's *very* easy to change the constant used, without having to recalc a whole bunch of code. Furthermore, MMX and SSE2 can easily be adapted to work with variable multipliers instead of constants. Also, other (larger) constants than 92 might prove harder to construct good plain-instruction add/lea/whatever algorithms for - it might turn out that IMUL would be the best plain instruction-set way of doing it then, even on P4.
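For reference, one way to build 92 out of shifts (92 = 128 - 32 - 4), sketched in C; the asm variants in the thread differ in which exact shl/lea/sub sequence they pick, but the arithmetic is of this shape:

```c
#include <stdint.h>

/* 92 = 128 - 32 - 4, so three shifts and two subtractions replace
   the multiply. Changing the constant means re-deriving this by hand,
   which is exactly the maintenance cost mentioned above. */
static uint32_t mul92_shifts(uint32_t x)
{
    return (x << 7) - (x << 5) - (x << 2);
}
```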
Posted on 2003-04-24 01:33:09 by f0dder
Of course it should be added that the current tests are very simple, and more routines should be written, especially by athlon owners :). Also, conformance tests should be written - MMX/SSE2 works on signed where the rest is unsigned, this might prove funny ;).
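On the signed/unsigned point: for the low 16 bits it should not actually matter, since two's-complement multiplication produces identical low-half bits either way. A quick C check of that claim (`low_words_agree` is just an illustration name):

```c
#include <stdint.h>

/* The low word of a product is the same whether the input is read as
   signed or unsigned, so a pmullw-style multiply (low word kept)
   agrees with the unsigned C reference on the low half; only a full
   32-bit result could differ between the two views. */
static int low_words_agree(uint16_t x)
{
    uint16_t s = (uint16_t)((int32_t)(int16_t)x * 92);  /* signed view */
    uint16_t u = (uint16_t)((uint32_t)x * 92u);         /* unsigned view */
    return s == u;
}
```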

Furthermore, I'm considering making the test suite "somewhat portable", so it can be run natively on linux - that way I can also test on a P3-cel-tualatin-1300. Test results for my P4-cel-1.7ghz will be included in next update.

Hm. I think it's funny the P4 does so well with the über-simple mmx/sse2 versions. clk/op gets better than the best performing routine on athlon. Somebody write something that runs better clk/op on the athlon, please :)
Posted on 2003-04-24 01:49:12 by f0dder
Once again the MMX and SSE code impresses me :)

C:\>"C:\Documents and Settings\Administrator\Desktop\yodel_sse2.exe"
--- Yodel version 0.3, 2003/04/23, 22:34
## Test parameters: 1000000 iterations of 2048 muls, total 2048000000 muls
## TIMECRITICAL: your computer will appear frozen. Don't panic.
## Retrieving (NT) or calculating (9x) clockspeed...598 MHz
## running performance tests
test01-simple C++ code ...021813 ticks (effective 6.369 clk/mul)
test02-Ekted 1 ...029844 ticks (effective 8.714 clk/mul)
test03-Roticv 1 ...020609 ticks (effective 6.018 clk/mul)
test04-Hutch 1 ...020594 ticks (effective 6.013 clk/mul)
test05-f0dder imul ...017172 ticks (effective 5.014 clk/mul)
test06-scali MMX ...003469 ticks (effective 1.013 clk/mul)
test07-scali SSE2 ...002172 ticks (effective -1.#IO clk/mul)
test08-DLL ...fail
Posted on 2003-04-24 02:13:39 by roticv
roticv, did you read the readme? And aren't you on a P3 system, why are you trying to run SSE2 then? Also, you should email me the results instead of posting here - and CPU type + OS wouldn't hurt, either.

I'm a bit surprised SSE2 gives the floating-point error instead of an invalid opcode exception :confused:
Posted on 2003-04-24 02:17:16 by f0dder
I've tested on a couple more machines now, including pmmx-200 and k6-2 350mhz. It seems that if you're only going to do simplistic add/lea stuff, you're probably better off using imul to get the most "stable" performer. If you don't care about pplain, go mmx.

I guess I do have to add the disclaimer that "more algorithms should be written", to avoid flames by know-better persons :grin:
Posted on 2003-04-24 03:40:23 by f0dder
I did have a play on my P4 and ended up with some unusual results.

On the two versions I tested, the line of code,


mov [edi],ax

produced a very bad stall that slowed both my second LEA version and the inline version that EkTed posted by 4 to 5 times. The LEA version was more severely affected by the stall.

You can drop the stall to some extent by adding the lines,


xor eax, eax
xor ecx, ecx
xor edx, edx

within the testing loop code, and this makes both versions I tested about 20% faster with the line of code that produces the stall. This may not matter in a normal application though, as I don't have any info on how the data is presented to the algo.

On my internet machine, an old K6-2 550, the lea version is about 40% faster with the line that creates the stall. If that line is removed, the inline version drops in its time by 75%, the LEA version drops by about 80%, so in direct comparison, the inline version is about 12% faster on the P4 than the LEA version.

Below is a simple floating point algo that does the multiplication by 92. It may be worth a try.



; ---------------------------------------------------------------------------

mul92a proc

LOCAL number:DWORD
LOCAL multiplier:DWORD

mov number, 10
mov multiplier, 92

push edi

lea edi, number

fild WORD PTR [edi] ; load source
fild multiplier ; load multiplier
fmul ; multiply source by multiplier
fistp WORD PTR [edi] ; store result in variable

ShowReturn hWnd, [edi]

pop edi

ret

mul92a endp

; ---------------------------------------------------------------------------


Regards,

hutch@movsd.com

PS: f0dder, your nose is running again.
Posted on 2003-04-24 06:11:02 by hutch--
Nose running, what? I think I clearly stated that the results were based on the current tests, and that more tests should be written. And also that a lot of the tests aren't exactly fair since they're not unrolled etc. I should think that I'm rather objective?

There's a few things the benchmark currently does show, though.
*) SSE2/MMX is great, and there's good reason to believe it will eat anything else. After all, this was what those instruction sets were written for.

*) It's easy to get manually built "simple instruction" stuff wrong, so that it will perform poorly on some machines. Hutch's original code on P4, Ekted's on non-P4. Furthermore, if the constant changes, you have to retime it by hand.

*) The intel C++ compiler _does_ do a good job of creating "simple instruction" versions, and it's definitely a bit easier changing a single line of CONSTANT than re-writing "simple-instruction" stuff with stalls and all in mind. Sure, the compiler generated code is not the best, but it performs reasonably, and avoids "really bad" situations on all processors.

This is NOT meant to be a C++ vs. asm debate, we're looking at a pretty isolated case. In _this isolated case_, the C++ compiler does a fairly reasonable job, while some of the hand-written versions perform very poorly on some platforms (and pretty well on others, of course).

Let's try and focus on writing some good algorithms and get them timed reliably (I'd say yodel with TIMECRITICAL and this amount of iterations is "reliable enough"). I'm working on making a nice and easy to use timing environment which also does conformance tests (yes, the yodel stuff) - the current version is very early yet, but I plan on spending some time on it today which should add conformance test, and the option to run only the DLL part. That way, everybody should be able to easily time their own code, in a way that is directly comparable with the other results posted.

I'll have a look at your floatingpoint routine later, I had planned to do one myself.
Posted on 2003-04-24 06:34:12 by f0dder
f0dder,

there is actually a little bit more to optimisation than dumping a C compiler and making noises about it. You do get a win sometimes but often you don't.

I have done very little optimisation work on a PIV so I don't claim to know the range, but it's different to a PIII and earlier, so as usual, some tuning would have to be done if the code was pointed specifically at a P4 and nothing else.

Testing directly on the P4 showed the big time loss was the stall on the end write and it was severe enough to make both algos run 4 times slower so that is really where the action is, not in trying to interpret the C compiler dump.

Regards,

hutch@movsd.com

PS: Your nose is still running.
Posted on 2003-04-24 06:42:15 by hutch--

there is actually a little bit more to optimisation than dumping a C compiler and making noises about it. You do get a win sometimes but often you don't.

Please re-read my post. It's not about C vs. Asm. It's about writing a bunch of algorithms and seeing how they perform - finding one that works well on all architectures, and some optimized special cases for certain models. I noticed the C code does well on all machines tested, while some of the asm versions vary a lot.


I have done very little optimisation work on a PIV so I don't claim to know the range but its different to a PIII and earlier so as usual, some tuning would have to be done if the code was pointed specifically at a P4 and nothing else.

Yep, it's quite different, and I'm not trying to flame you because your code performs poorly on P4 - it performs well (good end of middle range) on all other machines, pmmx, k6-2, etc. Scali immediately said "hutches code sucks", I didn't. Please try to show me the same level of objectivity, and refrain from silly remarks such as "your nose is running".

What I can see with the current results is that it's possible to write some generic code that runs fairly well on all architectures (around middle end), code that works really well on either P4 or Athlon, but performs poorly on certain other architectures. And that MMX/SSE2 currently seems to be the way to go.


Testing directly on the P4 showed the big time loss was the stall on the end write and it was severe enough to make both algos run 4 times slower so that is really where the action is, not in trying to interpret the C compiler dump.

Forget about the "trying to interpret the C compiler dump" part. Again, it's not about C vs. Asm, and I hope you will be mature enough not to try converting this discussion into such. C version is provided since:

*) the test-bed is written in C
*) it's very easy to verify the validity of the C code, and it can thus be used for conformance tests. Might as well have used a simple asm routine for this, but I wrote it in C. Big deal.

However, I don't see any reason not to include the C version in the benchmarks. It's "just another implementation". It performs reasonably well, benchmarks show that. Does that mean I'm saying "C 0wnz Asm suxx no more algoz have to be written"? NO.

I wish to find good routines, both generic and optimized for special architectures, and see how they perform across the different architectures. I am trying to set up a benchmarking framework that is easy to use, and everybody (who knows how to write a DLL) can plug stuff into, regardless of language, so that we have common grounds for comparing benchmarks.
Posted on 2003-04-24 06:58:07 by f0dder
Oh, and let's try to _cooperate_ this time, shall we not?
Posted on 2003-04-24 07:00:15 by f0dder
Waddya mean "we" paleface ? :tongue:

When you spare me the infantile lectures, I will show you how to use a tissue to blow your nose.

Regards,

hutch@movsd.com
Posted on 2003-04-24 07:03:54 by hutch--
...
Posted on 2003-04-24 07:11:54 by JCP
Scali immediately said "hutches code sucks", I didn't.

I suppose Scali meant that hutch's code does not work well on P4, but it seems that lea works pretty well on P3, as far as the tests told me. It seems that on P4, shifts and adds work as well as imul does, but on P3 the shifts and lea are on par. Weird, I would say. But seriously, the MMX code impressed me. Multiplication with MMX is so fast :grin:

there is actually a little bit more to optimisation than dumping a C compiler and making noises about it. You do get a win sometimes but often you don't.

Ah, spare me the flames about C compilers and such (leave it for another thread or whatsoever). Better to go test other opcodes than bickering about HLL and low-level language.
Posted on 2003-04-24 08:24:31 by roticv

I suppose Scali meant that hutch's code does not work well on P4, but it seems that lea works pretty alright on P3, as far as what the tests told me.

Actually, I think scali just saw the P4 results, and then classified hutch's code as "sucks", especially since it's from hutch. I noticed the code was bad on P4, and decided to test on a bunch of other platforms instead of just saying "it sucks".


It seems that on P4, shifts and adds work as good as imul does, but on P3 the shifts and lea are on par. Weird I would say. But seriously the MMX code impressed me. Multiplication on MMX is so fast :D

P4 is a bit weird compared to older processors - a lot of previously written code runs badly. However, it would seem that with properly written code, it should be able to perform _rather_ decently (MMX and SSE2 show this!) - whether Athlons or P4 can reach the highest performance for this task is not yet concluded - more code needs to be written.


Ah, spare me the flames about C compilers and such (leave it for another thread or whatsoever). Better to go test other opcodes than bickering about HLL and low-level language.

Thanks, I really appreciate that comment. Wish hutch was as mature.
Posted on 2003-04-24 08:32:10 by f0dder
Ok, modified hutch's fmul example a bit (see below); these were the timings.
P4-2.53ghz: test10-hutch fmul 1 ...008873 ticks (effective 3.033 clk/mul)
Athlon700: test10-hutch fmul 1 ...006469 ticks (effective 8.029 clk/mul)

Not surprisingly, fmul sucks on P4 and runs fine on athlon. This is
in default FPU setup - I should probably set precision to single,
and perhaps rounding to chop? Might improve the situation, worth a try.
Perhaps there's also better ways to do it with pure FPU? Too bad pure
x87 FPU is bad on P4 (still need to fiddle with control flags though).
Also, FPU timings will have to be made on more platforms (if FMUL is
to be considered a generic instruction-set version, it has to run
reasonably on pmmx and k6 machines).

code, somewhat modified from hutch's initial idea:


CONSTANT dd 92
_time10@4:
push esi
push edi

mov esi, [esp+12]
mov edi, 2048 - 1
fild DWORD [CONSTANT] ; load multiplier - outside loop, of course.
.loop:
; code somewhat modified from hutch begin
fild WORD [esi + (edi*2)] ; load source
fmul st0 ; multiply source by multiplier
fistp WORD [esi + (edi*2)] ; store result in variable
; code end

sub edi, 1
jnz .loop

fstp st0 ; clean up FP stack

pop edi
pop esi
mov eax, 1
ret 4
Posted on 2003-04-24 10:42:20 by f0dder
Scali's first attempt at Athlon optimizing. Beats the other
"simple instruction" versions, doesn't beat imul, and (of course :P)
is totally owned by the MMX version. Please join the fray if
you like tinkering with code optimization and can do a better
job at Athlon optimizing!

P4-2.53: test11 ...005359 ticks (effective 6.652 clk/mul)
Athlon700:test11 ...013399 ticks (effective 4.580 clk/mul)
XP1800: test11 ...006069 ticks (effective 4.525 clk/mul)



_time11@4:
push esi
push edi

mov esi, [esp+12]
mov edi, 2048 - 1


.loop:
movzx eax, word [esi + (edi*2)]

; scali code begin
lea edx, [eax*4] ; *4
shl eax, 7 ; *128
sub eax, edx
lea edx, [edx*8] ; *32
sub eax, edx
; scali code end

mov [esi + (edi*2)], ax

sub edi, 1
jnz .loop

pop edi
pop esi
mov eax, 1
ret 4
Posted on 2003-04-24 10:54:06 by f0dder
there's a bug in the FP code I posted, "fmul st0" should of course be "fmul st1".
I've implemented some conformance tests, and as I expected all the routines failed, because I did wrong array indexing - silly me. Fixed; performance changes very little, but it shouldn't change the overall view at all.

A thing to note: all the routines (even mmx and sse2) handle multiplication overflows in the same way as the C version - except for the FPU version. I assume this is because it stores a signed word value? By zero-extending the word value, storing to a temp dword, FMUL, storing to dword, loading eax, storing word, it yields correct results for overflows too - but goes from 8clk/op to 13.7 clk/op on my P4. There's probably a more efficient way of doing this, and perhaps fiddling with rounding modes can fix it.
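The widen-then-truncate fix described above, expressed in C terms (`mul92_wrap` is an illustration name):

```c
#include <stdint.h>

/* Compute the full product in a dword, then store only the low word;
   the truncating store wraps on overflow exactly like the integer and
   MMX routines do, which is what the dword-temp FPU fix achieves. */
static uint16_t mul92_wrap(uint16_t x)
{
    uint32_t wide = (uint32_t)x * 92u;  /* full 32-bit product */
    return (uint16_t)wide;              /* low word only: wraps */
}
```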
Posted on 2003-04-24 11:42:37 by f0dder
further tweaks:
Changing "complex effective address" to "simple effective address"
(with additional add esi, 2) made the code go from 5.140 clk/op to 5.120 clk/op
on my P4-2.53 ghz, but 4.073 clk/op to 4.077 clk/op on my athlon700.

- I guess all of this should be summed up and put in the readme or summat :)
Posted on 2003-04-24 12:05:07 by f0dder
Wow, all this from a simple question of a test case I was playing with. :) Thanks for all the effort you guys took to help me out! I don't think it's worth making a master's thesis over though. :)
Posted on 2003-04-24 12:09:34 by Ekted