I'm currently optimizing a program for the Pentium 4. A very weird (and new for me) thing has happened to me today, which I will tell you.

There is a function which does the multiplying of two numbers in a finite field (not normal multiplication)... whatever, that isn't important. The current version of the function is a MMX one. I made a SSE2 function which does twice the work of the MMX one, meaning that it does two multiplications at once (sometimes called 2-piped).

So, in the places where the MMX function was being used twice in a row, I replaced those 2 calls by a single call to the SSE2 function. Turned out the overall speed of computation was about 4% slower! I was of course expecting it to be faster than the previous one!

Now the funny part:

So, I went to benchmark the SSE2 function, to see if it was more than 2 times slower than the MMX one (which, if true, would explain the decrease in speed).

Results:
20 million calls of the MMX function - 4.5 seconds
10 million calls of the SSE2 function (does 20 million multiplications) - 4.1 seconds!!

Meaning -> I made a faster function, which makes the program actually slower! Has this ever happened to you? Do you have any idea why this happens?

Thanks for reading this stuff, I normally don't like to annoy people with my programming problems, but this is making me crazy...
Posted on 2003-01-21 18:08:41 by Knightmare
Knightmare, don't apologize! This is exactly the thing I like talking about! :grin: I'd have to know more about your program before saying much - I'd guess it is memory access slowing you down.
Posted on 2003-01-21 18:21:47 by bitRAKE
Yes i did happen to me many times :)

Never trust your "pure logic" for making something better/faster... trust it for creating theory but always experiment with it in bare practice. Also be fair in experimenting mind can (and will) play you nasty tricks.

For example your test is a good one: 20million calls is a fair overall test... but 10000 calls can generate a different result.

Also some algorithms might suggest that they are better from a LOGICAL POINT OF VIEW but prove to be very wrong in real world implementation because of memory/cache or CPU organization.
Posted on 2003-01-21 18:27:16 by BogdanOntanu
I'm :confused:

You said you converted MMX code to SSE2 code, which means you moved from integer operation to floating point operation. Without code, I don't know what you are trying to achieve, but, it is not hard to imagine that floating point op is slower than integer op (assuming all the other things are equal, like decoding pipeline, ...).

What confuses me most is that you said you made the throughput twice higher by moving from MMX to SSE2. Did you mean FPU rather than MMX?

If you meant FPU, I can imagine the result you have. Replacing FPU op by SSE2 op may have adverse effect. I don't have a concrete test result, but circumstantial evidence from ATLAS. ATLAS (if you don't know what it is, it is a library handling vectors and matrices) configured for P4 with SSE2 is observed to be slower than other configuration (like "optimized" for PIII) even on P4.
Posted on 2003-01-21 18:28:28 by Starless
Actually this is an open source program. It's for a distributed computing project called ecc2-109 (for a certicom challenge, breaking 109 bit elliptic curve encryption). The first version of the core was rather slow on Pentium 4. I stepped in the project and, by converting the multiply function from C with inline assembly code to pure asm (and applying a few more tricks), I managed to get about the following optimizations:

+70% on the P4
+10 to 15% on most other CPU's (all except K6's, which got slower with the new core)

You can get information and the source code of my first optimised version (only uses MMX) at:

http://www.ecompute.org/ecc2 (more information on http://www.ecc2.com too)

The function in cause is on the mult109_mmx.asm file, there is a C version (which is only used on computers without mmx) on the mult109_c.c file.

I thought about this possible problem - now that 2 different versions of the multiply func are used (SSE2 one only used when 2 mults are wanted), the cache sometimes has to load one, and sometimes other...


Starless - SSE2 is almost equal to MMX, it's not floating point. Maybe you're confusing with pure SSE (pentium 3) which is floating point. The SSE2 function is almost equal to the mmx one, except that it has 2 intermediate results on the two halves of the xmm 128-bit registers
Posted on 2003-01-21 18:31:43 by Knightmare
Knightmare,

I have not played with SIMD all that much but where I have written a few test pieces to benchmark the difference, you can get perforance improvements with some operations using MMX registers over the standard integer registers but I have not found SIMD to be any faster than MMX.

As Rickey mentioned, the access speed of memory may be a factor that flattens out the instruction's actual performance but I don't think that will change in the near future.

I am far more inclined to trust real time benchmarking than theory.

Regards,

hutch@movsd.com
Posted on 2003-01-21 22:30:03 by hutch--
Knightmare, I just took a quick look, but you should be able to reduce the number of shifts in the single mult version by using SSE2 - shifts are slow. I presume there is a multi-precision number in several (4?) MMX registers that could be contained in only a couple XMM registers. The dependancies would go up, but then the 2 mult version would be faster as well. Maybe, this is what you have already done?
Posted on 2003-01-22 00:00:41 by bitRAKE
Bitrake: Yes, I already did that. Using SSE2 for a single multiplication proved to be slower because there is no bit-by-bit shift instruction for the whole 128 bit-register, there's only pslldq and psrldq which shift byte-by-byte... So, making a 2-piped function proved to be best choice - the SSE2 function is almost equal to the mmx one, the only difference is that it's processing each half of the xmm register exactly the same as in the MMX function, except it contains it twice and does each operation on both halves of the register with the same number of instructions as the MMX... Example, where I used "psllq mm0, 1" now I use "psllq xmm0, 1" which does the same as the former instruction, but twice.

Hutch--:

I wish the discrepance was between theory and real benchmarking! The problem here is a function which, by benchmarks, shows to be faster than the previous one, yet, when used in the whole program, proves to make it slower! I've seen theory contradict practice many times when optimizing, but it's the first time I make a faster function which makes a program slower, which really makes me sad...
Posted on 2003-01-22 03:49:38 by Knightmare
One possible discrepancy is that your little benchmark doesn't jump around like your app code. Cache alignment is also important, and bad memory alignment can spoil cache performance. In addition to the L1 and L2 (memory) caches, there is also the disk cache (used when paging).

I used to use a 486 based machine, where it was worse -- code and data shared the same L1 cache. Adding dead code functions often improved program performance, simply because it changed the relative alignments of different sections of code and data.

Other than the size and number of cache lines, and the separation of code and data caches, I believe the Pentium caches work pretty much the same as the 486 cache.
Posted on 2003-01-22 16:14:15 by tenkey

Hi Knightmare:
SSE2 is more an investiment for the future, than an improvement of the present.
The Pentium 4 does *not* have any 128 bit unit inside itself, so when you do 128 bit math with SSE2 instructions, it simply split it internally into two 64 bit operations. Hence the "half speed" (or same speed of 64 bit instructions).
Nothing denies, though, that the Pentium 5 will have true 128 bit units, and thus the "old" Pentium 4 software will run twice as fast.

But, today, no..

PS: dump that Pentium 4.. the K7/K8 is better. :grin:
Posted on 2003-01-22 16:35:42 by Maverick
Maverick: That may be true in part, but I'm not complaining about SSE2 speed, since, when benchmarked individually, the SSE2 function proves to be faster... Some other weirdness like the cache problem tenkey talked about is causing this, I don't even think it has anything to do with SSE2 itself ;)
Posted on 2003-01-22 16:59:50 by Knightmare

Oki :)
Posted on 2003-01-23 01:40:20 by Maverick
Knightmare,

One of the things that I have learnt with algorithms over time is that code alignment rarely ever matters, there are too many other things that get in the road but just occasionally it will surprise you and you get a noticable jump in performance.

I would be tempted to put "align 16" at the front of a number of procedures to see if you get an improvement. It may not effect it at all but on occasions it will solve a problem that comes from the actual physical order of the source file.

I have just found it occasionally while benchmarking algos, change the code in one algo and one further down the file goes slower.

Regards,

hutch@movsd.com
Posted on 2003-01-23 04:35:08 by hutch--
Thanks hutch, I'll try playing with that :)
Posted on 2003-01-23 06:16:43 by Knightmare
I've found out where the guilty code is, I mean, the code which got slower because of using the faster function:
----------------------------------------------------
for (i=1; i<PARAL_POINTS; i++) {
c.t = c.t; c.h = c.h;
c.m = c.m; c.l = c.l;
multiply(&c, &x);
}

u.t = c.t; u.h = c.h;
u.m = c.m; u.l = c.l;
invert(&u);

for (i=PARAL_POINTS-1; i>0; i--) {
c.t = c.t; c.h = c.h;
c.m = c.m; c.l = c.l;
#ifdef USE_SSE2_MULTIPLY
multiply2 (&c, &u, &u, &x);
#else
multiply(&c, &u);
multiply(&u, &x);
#endif
}
----------------------------------------------------

FYI, PARAL_POINTS is 64, so the loops are not very short.

This is the original code, meaning that it only uses the single-multiply MMX function called "multiply". You can see that in the first for, only one multiplication is done, so on the SSE2 version, that first "for" loop hasn't been changed (and no, the loop can't be unrolled because each iteration depends on the previous one). So, to use the SSE2 function, I changed the two calls to multiply in the second for loop to a single call to the SSE2 2-piped multiply function called "multiply2". When I do this, the whole function gets slower, despite the fact that, when benchmarked individually, a call to multiply2 proves to be faster than 2 calls to multiply.

Having successfully found where the problem is, I then tried out to benchmark *only* the second for loop, which, as I expected, is faster using the new multiply2 function. So, the cause of the slowing down is to use both the multiply and the multiply2 function together in invertPar (though they are one on each loop), so I suspect this is a code cache problem (when the multiply2 function is used the first time, it has to be loaded on the cache). Any suggestions?
Posted on 2003-01-23 12:11:18 by Knightmare
I've met this!
I wrote a function, but it was much slower than the c++ version.
It was so much confused me. But I found it days after that
my asm function would cause float stack overflow exception
frequently. The exception handling routine eat up the CPU time!

After I modified my code to achieve a correct manipulation of float
stack, my asm function is much faster than it's c++ version.

Oh!! You can see, asm is so good!!

Wish you happy and success!
Posted on 2003-01-26 05:01:31 by alphasun